• Welcome to TechPowerUp Forums, Guest! Please check out our forum guidelines for info related to our community.

Programmer thoughts

Joined
Apr 24, 2020
Messages
2,781 (1.61/day)
I'd like a thread for low-effort to medium-effort programming thoughts. If your thought doesn't deserve a topic, feel free to post it here! Any subject, any language, any skill level.

---------

To start off the topic, I'm studying "constraint programming" for a Puyo Puyo AI. Its an obscure match-4 video game with a deeply dedicated community. I have a feeling that I can represent Puyo Puyo boards as a constraint solving problem (similar to 3-SAT). As such, I've been kinda playing around with this (assumed) NP-complete, or harder (#P-complete problems).

Anyone who has completed Comp. Sci. algorithms courses knows that NP Complete (or harder classes: such as #P Complete) are "unscalable", its assumed impossible to invent an algorithm that can solve this problem in a scalable manner. And as an undergraduate, I mistakenly thought that this means that if you ever proven (or assumed) a problem to be NP-complete or harder, that you should "give up" and find another problem.

This was a mistaken thought. NP Complete (and harder) problems scale very poorly. But that means that all problems below a certain size complete almost instantly.

Just because you have an NP-complete problem (or harder) doesn't mean you're hopeless. If you can prove that your problem's scope is limited, then the problem can suddenly become solvable.

Consider Chess exploration or Chess AIs. Trying to solve Chess is infeasible. But if you limit the problem's scope to a certain size, such as "All Board positions with 7 or fewer pieces", it suddenly becomes possible to solve. The 7-man Sygzy Tablebase has been completed, exhaustively proving and solving all possible chess positions of a certain size (7-pieces): https://syzygy-tables.info/

Another chess-example is the Perft problem: calculating and enumerating all possible openings. Perft(11) has been solved in less than an hour on modern hardware, and perft(15) has been claimed to be solved by some people. This problem scales poorly (exponentially more moves every perft(n+1)), but drop down just a few numbers to Perft(10, and you pretty much solve the problem instantly on modern hardware.

In the case of Puyo-puyo, I'm not necessarily trying to fully solve PuyoPuyo, but instead trying to make a super-human AI that can chain better than any human. (Which is rather difficult, humans are really, really good at making chains in this game).

I guess: my initial post is... "NP Complete" or harder (#P complete) problems are not hopeless! You can still make useful programs in the face of these unscalable problems. Learning how to write code that tackles this extremely hard class of problems can be a joy. Your goal is to solve as large of a problem as possible.
 
Last edited:

W1zzard

Administrator
Staff member
Joined
May 14, 2004
Messages
28,004 (3.71/day)
Processor Ryzen 7 5700X
Memory 48 GB
Video Card(s) RTX 4080
Storage 2x HDD RAID 1, 3x M.2 NVMe
Display(s) 30" 2560x1600 + 19" 1280x1024
Software Windows 10 64-bit
Joined
Apr 24, 2020
Messages
2,781 (1.61/day)
Another goal can be to be "good enough", or even "better than nothing", also AI

Good point.

A good demonstration of "good enough" is the following piece of code:

Code:
$ cat test.c
int main(){
    for(int i=1; i!=0; i++);
    // Time how long it takes to exhaustively explore a 32-bit number
}

$ gcc -O1 test.c; time ./a.out

real    0m2.107s
user    0m2.107s
sys    0m0.000s

That's right, you can exhaustively run through all 32-bit numbers in less than 5-seconds these days. It shouldn't be a surprise: the 32-bit space is ~4-billion, and modern CPUs have a heartbeat of 4GHz (4-billion per second).

If you're unable to exhaustively search the search-space entirely... randomly searching 4-billion spots and then picking "Best of 4-billion" is often "good enough" in these kinds of problems. Ex: Explore 4-billion random solutions to "Traveling Salesman", and pick the best solution you found. Its not likely to be the true optimal solution, but its probably good enough for most purposes.

---------

If you use multiple threads, or GPUs to explore the state-space, you might be surprised how good modern computers are at brute force.
 
Joined
Aug 20, 2007
Messages
21,624 (3.40/day)
Location
Olympia, WA
System Name Pioneer
Processor Ryzen R9 9950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon, Phanteks and Corsair Maglev blower fans...
Memory 64GB (2x 32GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel 5800X Optane 800GB boot, +2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11 Enterprise IoT 2024
I guess I should dump this here, this is more programmers humor than anything:

I was trying to correct a piece of code in a KSP related mod. The purpose of this code is to walk up a list/tree of bodies orbits to find and return a root star. I submitted the following in a stroke of latenight, braindead genius:


/// <summary>
/// Returns the host star directly from the given body.
/// </summary>
public static CelestialBody GetLocalStar(CelestialBody body)
{
while (body?.orbit?.referenceBody != null)
{
if (body.orbit.referenceBody.isStar)
{
body = body.orbit.referenceBody;
}
}

return body;
}

Yeah, that's pretty much the most messed up loop I've ever submitted. I may as well be drunk. First it checks if the reference body (the thing it's orbiting ) is not null (meaning body is not a star, not what we want). Then it enters a while loop that only advances under a select set of circumstances. In short, if you feed a moon to this code, it will infinite loop that while and crash the ENTIRE game.

And yes, I submitted it as an official pull commit to my hobby project, to the mocking of my peers:


Never code at work while sleep deprived. It's bad kids.
 
Joined
Apr 24, 2020
Messages
2,781 (1.61/day)
John Lakos's "Large Scale C++" was one of the best written 90s books on software engineering.

Hearing that Lakos has made a 2nd edition, updated to year 2020 sensibilities, made me instantly buy the new book. However, Lakos is going to split the new 2nd edition into two books, only the first book has been written so far. My copy arrived today, so I've been giving it a skim through.

Great book, would recommend to any C++ programmer. I eagerly await for the 2nd half for next year.
 
Joined
Dec 24, 2008
Messages
2,062 (0.35/day)
Location
Volos, Greece
System Name ATLAS
Processor Intel Core i7-4770 (4C/8T) Haswell
Motherboard GA-Z87X-UD5H , Dual Intel LAN, 10x SATA, 16x Power phace.
Cooling ProlimaTech Armageddon - Dual GELID 140 Silent PWM
Memory Mushkin Blackline DDR3 2400 997123F 16GB
Video Card(s) MSI GTX1060 OC 6GB (single fan) Micron
Storage WD Raptors 73Gb - Raid1 10.000rpm
Display(s) DELL U2311H
Case HEC Compucase CI-6919 Full tower (2003) moded .. hec-group.com.tw
Audio Device(s) Creative X-Fi Music + mods, Audigy front Panel - YAMAHA quad speakers with Sub.
Power Supply HPU-4M780-PE refurbished 23-3-2022
Mouse MS Pro IntelliMouse 16.000 Dpi Pixart Paw 3389
Keyboard Microsoft Wired 600
Software Win 7 Pro x64 ( Retail Box ) for EU
Consider Chess exploration or Chess AIs. Trying to solve Chess is infeasible. But if you limit the problem's scope to a certain size, such as "All Board positions with 7 or fewer pieces", it suddenly becomes possible to solve.

This is much close to the concept of proper thinking when repairing electronics, but still you need to be a master in awareness of its piece properties.
And the most important lesson that no school delivers, this is of how you to control the range of your own focus over a problem.
 
Joined
Apr 24, 2020
Messages
2,781 (1.61/day)
If you haven't programmed a GPU yet, its a lot of fun and I suggest everyone to try it out. Its actually very intimidating to try to reach the base level of utilization of a GPU. GPUs have a large number of "threads", for example: AMD's Vega64 has 64-compute units, and each compute unit has 64-vALUs in it, and each vALU requires 4-threads before its fully utilized.

That's right: you need 16384 Threads to fill up a Vega64 to occupancy 1.

But that's not all: Much like "Hyperthreading" on Intel Skylake or SMT on AMD Zen, GPUs (both NVidia and AMD) support multiple "software threads" per "core". It is recommended to reach at least occupancy 4 (that's 4-threads per core), meaning you really should be aiming at 65536 threads running on a Vega64.

Vega64 supports up to 10-threads per hardware resource, meaning a total of 163840 threads are possible. This is significantly oversubscribed however, and it seems unlikely to me that 40-threads per vALU would be efficient (even if the system supports it, it doesn't make it a good idea). Occupancy 4 seems to be the general number to aim at.

To emphasize how intimidating Occupancy 4 is on Vega64, there is 8GBs of VRAM. Occupancy 4 provides the programmer a total of 128kB of VRAM per thread. You're far below the "640kB is enough for everybody" line. Now of course, GPUs aren't designed to really have "independent" threads working on independent memory by themselves. Instead, GPUs are about SIMD-collaboration on the same problem sets (ex: pixel shaders working on the same 1080x1920 screenbuffer to display a video game).

As such: SIMD programming is a very different feel than normal programming. NVidia may market that "SIMT" allows you to think like "a bunch of threads", but the whole GPU methodology couldn't be more different! SIMD is strange, but wonderful. You have HUGE compute potential at your disposal, especially good for these "brute force" problems I've been talking about in this thread so far.
 
Joined
Jun 24, 2013
Messages
58 (0.01/day)
Don't read the manuals; front to back.
My VB5 search engine uses maybe 10 or 15% of the instruction set.
The other 85 or 90 percent are just a confusion.. With no need to study or remember.
Programming is about Extract, Sort and Report.
and keeping it simple stupid.
Flowcharting would have solved the Max 8 software bug. (By playing computer)
Don't think do..
 
Joined
Aug 20, 2007
Messages
21,624 (3.40/day)
Location
Olympia, WA
System Name Pioneer
Processor Ryzen R9 9950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon, Phanteks and Corsair Maglev blower fans...
Memory 64GB (2x 32GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel 5800X Optane 800GB boot, +2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11 Enterprise IoT 2024
So my mod got optimized a little:
BEFORE:

AFTER:

Yes, those are FPS counters... oof. What did I learn today?

It turns out even draw distance cullers should be rate limited, or your entire CPU will be dedicated to deciding what to cull. That is all.
 
Joined
Apr 24, 2020
Messages
2,781 (1.61/day)
It turns out even draw distance cullers should be rate limited, or your entire CPU will be dedicated to deciding what to cull. That is all.

Code too busy optimizing... so you can't run faster at all.

What does your mod do exactly? I don't play KSP, but I'd be interested in hearing more about your project. Despite talking about GPUs all the time, I don't actually do graphics programming.
 
Joined
Aug 20, 2007
Messages
21,624 (3.40/day)
Location
Olympia, WA
System Name Pioneer
Processor Ryzen R9 9950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon, Phanteks and Corsair Maglev blower fans...
Memory 64GB (2x 32GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel 5800X Optane 800GB boot, +2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11 Enterprise IoT 2024
Code too busy optimizing... so you can't run faster at all.

What does your mod do exactly? I don't play KSP, but I'd be interested in hearing more about your project. Despite talking about GPUs all the time, I don't actually do graphics programming.
It's a solar system replacer plugin. Doesn't do anything on it's own but allows config based solar system replacement in Kerbal Space Program.

I'm not the creator mind, just the only remaining maintainer. It's pretty big in scope. It pretty much rejects the stock solar system, and then proceeds to rewrite everything relating to it in a dll file that is used instead, allowing us to mod it using text config files. We also implement support for other things stock KSP doesn't have such as multistar (binary, trinary, etc) systems etc and the related solar flux logic.

The mod is Kopernicus:


Beta branch supports more versions:


This amounts to rewriting about 25% of the game, lol.
 
Last edited:
Joined
Apr 24, 2020
Messages
2,781 (1.61/day)

With my day-job finally slowing down, I'm able to start playing with code again at home. My last problem was GPU-memory management, so I've begun to research memory management algorithms. I'm sure people have already studied malloc / new, but... I actually think my use-case is a potentially useful situation for Mark-and-Sweep "Stop the World" garbage collectors.

In any case, the above garbage-collector blogpost was very useful at explaining mark-and-sweep and the tri-color invariant.
 
Joined
Apr 24, 2020
Messages
2,781 (1.61/day)

There are PDFs of this online, but I'm not sure about their legality. I have an ACM subscription and can easily get the legal version. If you really wanted to, you'd find the PDFs rather quickly :)

So there are about 4 styles of garbage collection common in 1992: Reference Counting, Mark-Sweep, Mark-Compact, and Copying. Reference Counting wasn't popular in `92, but is now popular in Rust, C++ (shared_ptr<>) and Python. Copying collectors seem to be the most popular in Java / C# these days (but with many intricate optimizations). Mark-Sweep is popular in Go.

Most of the complexities in GC seem to be in concurrent / realtime implementations. "Stop the World" is bad for most applications, so most modern algorithms hide the "Stop the World" in a concurrent thread, requiring multiple-read / single writer or multiple-read / multiple-writer concurrency (depending on the specific style chosen). As noted earlier: I plan to avoid these complications by simply embracing "Stop the World".

----------

In the GPU-world, you execute a kernel: spawning many blocks (composed of dozens of SIMD-threads). All blocks then finish execution as the kernel cleans up. That shares many similarities to 'Stop the World', once the kernel is finished, you know that no GPU-thread is executing on the data anymore, and that its safe to garbage collect.

However, I now have an issue: "Stop the World" collectors stop the world when they run out of memory on a "malloc" call. Since no memory is available, "all threads stop", and garbage collection begins. This is very useful in VMs where you can just "save state" (Java-registers or C# Bytecode), but GPU Kernels can't be interrupted like that. That leaves me with two choices:

1. Create a VM-like "save state" for all GPU threads: allow a kernel to "exit early" on any malloc() call as the malloc may not find enough memory (without a garbage collect anyway). In effect: GPU kernels will have 2 return codes: #1: malloc() failed, so please run garbage collect and rerun the kernel afterwards. Or #2: computation is finished, data is in the final state.

2. Predict the number of mallocs and garbage collect "earlier".

Neither seems easy. But both approaches seem efficient if I implement it correctly. #1 has similarities to continuations and other low-level details. The split between "VM" and 'Physical SIMD Thread' could be useful for other reasons. #2 is a more direct way to solve the problem, but I'm not entirely sure how to "predict the future". If I knew how much the threads were going to malloc, I wouldn't need a dynamic malloc call to begin with!

I guess that forces me to use #1, unless there's another approach...

EDIT: I just thought of #2: the AMB operator (http://www.randomhacks.net/2005/10/11/amb-operator/). You don't "guess", you actually execute. When a malloc() fails, you backtrack (just like the AMB operator) to the nearest savestate. Then, garbage collect, and then resume. That might actually be easier than doing #1 (which is basically creating savestates)... #2 you can have a simple checkpointing system at easy-to-define points.

----------

Section 6 was the most informative: discussing low-level details like how to lay out the headers and some CPU-time / Space tradeoffs. Because I plan to use a GPU: I'll have tons of CPU-time but very little memory-bandwidth available. More CPU-heavy approaches that use less space probably would be beneficial.

-----------

Why am I interested in GC on a GPU? Well, simple really. The malloc() and free() calls are going to be called in severely multithreaded situations. Dozens of concurrent mallocs() and/or frees() going on simultaneously. GC solves this concurrency problem by making a separate stage where the garbage is collected... this separate stage has better concurrency guarantees than when the code is running on its own.
 
Last edited:
Joined
Aug 20, 2007
Messages
21,624 (3.40/day)
Location
Olympia, WA
System Name Pioneer
Processor Ryzen R9 9950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon, Phanteks and Corsair Maglev blower fans...
Memory 64GB (2x 32GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel 5800X Optane 800GB boot, +2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11 Enterprise IoT 2024
I'll take "stuff the decompiler spit out" for $250, Trebek.

off.png

See all those reciprocals? Yeah, they simplify out to basically x * y, but it thought it was better to state it as 1 / ((1 / x) * (1 / y)). God that's weird.
 
Joined
Apr 24, 2020
Messages
2,781 (1.61/day)
I'll take "stuff the decompiler spit out" for $250, Trebek.

View attachment 187366
See all those reciprocals? Yeah, they simplify out to basically x * y, but it thought it was better to state it as 1 / ((1 / x) * (1 / y)). God that's weird.

I'm going to bet that SP.flowRate is often used as a rate, and therefore is probably being stored in reciprocal form.

You have a ton of "foo/flowRate" all over the code, to the point where the compiler has chosen to store 1/flowrate instead of flowrate.
 
Joined
Aug 20, 2007
Messages
21,624 (3.40/day)
Location
Olympia, WA
System Name Pioneer
Processor Ryzen R9 9950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon, Phanteks and Corsair Maglev blower fans...
Memory 64GB (2x 32GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel 5800X Optane 800GB boot, +2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11 Enterprise IoT 2024
I'm going to bet that SP.flowRate is often used as a rate, and therefore is probably being stored in reciprocal form.

You have a ton of "foo/flowRate" all over the code, to the point where the compiler has chosen to store 1/flowrate instead of flowrate.
Probably. It's not my code, but that is the likely origin.
 
Joined
Apr 24, 2020
Messages
2,781 (1.61/day)
Probably. It's not my code, but that is the likely origin.

I'm not an expert on floating-point arithmetic. But 1/sqrt(x) might be faster than sqrt(x). (And a bunch of other operations in the reciprocal). Something to do with speed of convergence, Newton's Method, error bounds and the like.

So my alternative answer is... "sqrt()" is somewhere in the code, and the compiler has decided that keeping things in reciprocal form for a slightly faster 1/sqrt() is more efficient than doing the normal sqrt() operations. Reciprocals aren't fast, but they aren't slow either. So its possible that 1/blah all over the place (and then doing 1/sqrt() in various locations) is superior over using sqrt(). If not faster, then possibly a lower error bound or something along those lines.

Yeah, floating point math is complicated and sucks.
 
Joined
Aug 20, 2007
Messages
21,624 (3.40/day)
Location
Olympia, WA
System Name Pioneer
Processor Ryzen R9 9950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon, Phanteks and Corsair Maglev blower fans...
Memory 64GB (2x 32GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel 5800X Optane 800GB boot, +2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11 Enterprise IoT 2024
boy, it's part of an astronomy sim, so you don't need to tell me floating point sucks. Single precision losses are unbearable there, doubles are even close to their limit at times...
 
Joined
Apr 24, 2020
Messages
2,781 (1.61/day)
boy, it's part of an astronomy sim, so you don't need to tell me floating point sucks. Single precision losses are unbearable there, doubles are even close to their limit at times...

Soooooo.... what do you think of using FP16 or BFloat16 for the size-fields of my memory manager (with garbage collection)?

:)

Seriously though: size-classes in memory management is pretty arbitrary. Google's tcmalloc implementation has a very arbitrary list:


Code:
const SizeClassInfo SizeMap::kExperimentalSizeClasses[SizeMap::kExperimentalSizeClassesCount] = {
    // <bytes>, <pages>, <batch size>    <fixed>
    {        0,       0,           0},  // +Inf%
    {        8,       1,          32},  // 0.59%
    {       16,       1,          32},  // 0.59%
    {       24,       1,          32},  // 0.68%
    {       32,       1,          32},  // 0.59%
    {       40,       1,          32},  // 0.98%
    {       48,       1,          32},  // 0.98%
    {       64,       1,          32},  // 0.59%
    {       72,       1,          32},  // 1.28%
    {       80,       1,          32},  // 0.98%
    {       88,       1,          32},  // 0.68%
    {       96,       1,          32},  // 0.98%
    {      104,       1,          32},  // 1.58%
    {      112,       1,          32},  // 0.78%
    {      120,       1,          32},  // 0.98%
    {      128,       1,          32},  // 0.59%
    {      136,       1,          32},  // 0.98%

SNIP


    {    13568,       5,           4},  // 0.75%
    {    14336,       7,           4},  // 0.08%
    {    16384,       2,           4},  // 0.29%
    {    20480,       5,           3},  // 0.12%
    {    24576,       3,           2},  // 0.20%
    {    28672,       7,           2},  // 0.08%
    {    32768,       4,           2},  // 0.15%
    {    40960,       5,           2},  // 0.12%
    {    49152,       6,           2},  // 0.10%
    {    57344,       7,           2},  // 0.08%
    {    65536,       8,           2},  // 0.07%
    {    73728,       9,           2},  // 0.07%
    {    81920,      10,           2},  // 0.06%
    {    90112,      11,           2},  // 0.05%
    {    98304,      12,           2},  // 0.05%
    {   114688,      14,           2},  // 0.04%
    {   131072,      16,           2},  // 0.04%
    {   147456,      18,           2},  // 0.03%
    {   163840,      20,           2},  // 0.03%
    {   180224,      22,           2},  // 0.03%
    {   204800,      25,           2},  // 0.02%
    {   229376,      28,           2},  // 0.02%
    {   262144,      32,           2},  // 0.02%

Etc. etc. You get the gist. 86 size classes, with a non-obvious difference between them (kinda-sorta exponential... but mostly +8 early on)

Anyway, why not formalize this into exactly 256-size classes by making an 8-bit floating point number that traverses the necessary size classes? 256kB is the biggest size class, so really a 4-bit exponent + 4-bit mantissa could represent all the numbers between 8 to 256kB in just a single 8-bit number.

Standard "2^exponent * 1.mantissa" that half/float/double all do, except fractions are not needed (so 2^exponent is literal. No need to do a "minus 15" or "minus 2048" exponent bias or whatever crazy stuff happens in halfs / float / double). 2^exponent literally is all you need (though exponent = 0 / mantissa = 0 results in a number of 1.0... ignore that little qualm, lol).

So yeah, 32-bit floats and 64-bit floats are not precise enough for astronomy. But 8-bit floats (4exponent + 4mantissa) might be precise enough for memory management. Think about it...

You want to "round" similar sizes together for the sake of fragmentation. If there's a 20480 byte chunk available, but the user asks for 19000 bytes, you give them the 20480-byte chunk.
 
Last edited:
Joined
Aug 20, 2007
Messages
21,624 (3.40/day)
Location
Olympia, WA
System Name Pioneer
Processor Ryzen R9 9950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon, Phanteks and Corsair Maglev blower fans...
Memory 64GB (2x 32GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel 5800X Optane 800GB boot, +2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11 Enterprise IoT 2024
Soooooo.... what do you think of using FP16 or BFloat16 for the size-fields of my memory manager (with garbage collection)?
I'm not the programmer I used to be. I use what C# or java throws at me and call it a day, lol
 

Aquinus

Resident Wat-man
Joined
Jan 28, 2012
Messages
13,173 (2.78/day)
Location
Concord, NH, USA
System Name Apollo
Processor Intel Core i9 9880H
Motherboard Some proprietary Apple thing.
Memory 64GB DDR4-2667
Video Card(s) AMD Radeon Pro 5600M, 8GB HBM2
Storage 1TB Apple NVMe, 4TB External
Display(s) Laptop @ 3072x1920 + 2x LG 5k Ultrafine TB3 displays
Case MacBook Pro (16", 2019)
Audio Device(s) AirPods Pro, Sennheiser HD 380s w/ FIIO Alpen 2, or Logitech 2.1 Speakers
Power Supply 96w Power Adapter
Mouse Logitech MX Master 3
Keyboard Logitech G915, GL Clicky
Software MacOS 12.1
I like not thinking about these things if they can be avoided so I can focus on solving the business domain problem as opposed to the comp. sci. problem. I use Clojure because it's easier to implement solutions and check correctness, not because it's necessarily the most computationally efficient option. For example, it's a lot easier for me to reason about business logic with a language that's referentially transparent, because I care about checking equity of values, not references to memory locations. Same deal with number precision. If an integer overflows, it's nice to have a language that will move it into being a long, then a BigInteger (in JVM land.) Not to mention that the JVM runs practically everywhere.
 
Joined
Apr 24, 2020
Messages
2,781 (1.61/day)
A few things.

1. GPUs do have a "GPU Malloc" call (erm... cudaMalloc), but it comes with a number of disadvantages. Most importantly: its globally synchronous. So your 4000+ cuda-threads on the GPU grind to a halt whenever they come across a default cudaMalloc call.

2. #1 can be avoided by simply issuing cudaMalloc at the beginning of your code's execution, and then manually splitting up memory inside of your kernels (a kernel is basically a function-call that runs in your GPU instead of the CPU). However, this requires a memory-management layer inside of your kernels.

3. GPUs are extremely parallel. The smallest unit on NVidia is a 32-thread warp, the GPU itself always executes 32-threads at a time. (Or: think of it as a 32-wide SIMD, like how AVX is 8 x 32-bits, GPUs are 32x32-bits). This actually makes GPUs extremely efficient in a variety of algorithms... most of which haven't been discussed in the mainstream programming journals (!!).

4. For example: I believe that GPUs can implement Cheney's semispace collector algorithm very efficiently. All 32-threads get up to 32x objects, and performs a breadth-first-search across them. All 32-breadth-first-searches can be accomplished in parallel (subject to a few obvious corner cases: such as when one object is visited by two or more threads at the same time). These corner cases can be resolved in __shared__ memory extremely efficiently: using a GPU-sorting network (https://nvlabs.github.io/cub/structcub_1_1_device_radix_sort.html). Heck, it probably can be solved with a semi-sort even more efficiently (https://dl.acm.org/doi/10.1145/2755573.2755597).

Its clear that Cheney's semispace allocator / collector is a clear candidate for GPU-parallelism. (https://en.wikipedia.org/wiki/Cheney's_algorithm). Aka: a form of garbage collection ("true" garbage collection to boot). It requires knowing the latest-and-greatest GPU algorithms to implement efficiently, but those algorithms are already in libraries like CUB.
 
Last edited:
Joined
Apr 24, 2020
Messages
2,781 (1.61/day)
So Knuth's "Volume 4" (which is now something like 7 books long lol) of "Art of Computer Programming" is extremely relevant to the code I'm trying to write here.

I had Fascicle 6 (SAT Solvers), but its really specific to the SAT problem. Which... is pretty general but its not how I want to program things. Volume 4A is an excellent introduction to combinatorics: exhaustively giving many ideas into n-tuples, permutations, combinations, partitions, and trees.

However, Volume 4 Fascicle 5: "Backtrack Programming / Dancing Links" has been the most relevant piece of work from Knuth with regards to my personal project here. I know this sounds stupid simple: but the gist is to have a "state" (where the "state" is some kind of advanced data-structure that represents the search space + the associated data structures needed to quickly perform operations in this search space), and then a "do" and "undo" actions.

When you move to explore the space, you push "undo" onto a stack. When you backtrack, you pop from the stack and carry out the undo action.

Here's the key: memory operations on modern computers are extremely expensive. (If you didn't know: reading/writing to DDR4 on a modern computer is 50 nanoseconds, or 200 cycles). So you want to do these operations with as few "mems" as possible (where a "mem" is a main-memory read or main-memory write). It is obvious that you can push your entire data-structure onto the stack as you explore the space... but if your data-structure of state is even just a few hundred bytes long, you're unnecessarily writing (and later: reading back) memory.

In practice, its far cheaper to push / pop representative actions. Knuth early in the book pushes/pops memory-reads and writes.

For example: if mem[5000] = 20, Knuth writes into the stack push(mem[5000]); push(5000) and then mem[5000] = 20. Later, mem(pop()) = pop() to undo the action.

If your data-structure is 5000-bytes long, then you'd originally have to push 5000 bytes onto the stack. But by only pushing the "changes" of the datastructure onto the stack, you can often save on mems (especially because of how much compute you can do today per mem: over 200 cycles... its probably worth the tradeoff).

-------------

"Dancing Links" is an even more advanced methodology where a linked list is used for this backtracking. For the problem of the exact cover problem (a famous NP-complete problem), it turns out you can perform the do and undo operations on a linked list without a stack and without any dynamic memory at all (!!!!). Yes, its a statically allocated linked list. Its... difficult for me to explain, really just best to read the book if you're interested. But its brilliant IMO.

Even if I don't plan on using the "Dancing Links" technique, I'm glad to have learned it.

------------

The rest of the book is about using Dancing Links in a SAT-solver like way. Knuth generalizes the exact cover problem into "exact cover with colors", and then shows how many puzzle / NP-complete toy problems are easily translated into the "exact cover with colors" problem. Because dancing links is so efficient and general purpose... and because of the nature of NP complete problems (any NP complete problem is provably "convertable" to other NP complete problems in Polynomial time), the dancing links methodology can be used in the "general case" to solve many puzzles. At least half of the text is on techniques and examples for how to "shove" various problems into "exact cover with colors".
 
Joined
Apr 24, 2020
Messages
2,781 (1.61/day)
I'm reading Volume 4 backwards, lol.

Volume 4 Fascicle 5 "Backtracking" was very inspiring, but a lot of concepts were built up from stuff earlier in the "volume". So I went back to reading through Volume 4A. The section on BDDs (binary decision diagrams) was incredible. Its an amazing data-structure, and Knuth is right to hype it. Doing more research on the subject led me to MDDs (multi-valued decision diagrams), the relational-equivalent to BDDs. Without getting too technical here, a BDD is a highly compressed representation of a binary-truth table. An MDD is the same format applied to a relational-table: that is, extending the concept out from 0s and 1s into 0s, 1s, 2s, 3s... etc. etc. (for as many values as you need).

All MDDs could be represented as a BDD, but MDDs are more natural in some problems. Kinda like 3SAT vs Constraint Programming: 3SAT is provably equivalent but some things are just easier to understand in the constraint-programming model instead. Or maybe Binary Tree vs B-Tree: the Binary Tree is basically the same as a B-Tree but in practice, the fewer number of "pointer" links in the B-Tree can lead to more efficiency for some problems.

Alas: Knuth does not talk about MDDs. I have a feeling that MDDs are a "simple extension", but... rather than risking it, I'm buying another textbook on the subject of BDDs specifically.

--------

If you ever need to represent boolean truth tables with a moderate number of variables (ex: 128-bit truth table, such as the "35th bit of a 64x64 bit multiplication circuit"), BDDs seem to be the answer. (And in the case of multiplication, with 64-bits worth of outputs, you'd use 64x BDDs as a collective). In practice, BDDs are the data-structure used for CPU-verification algorithms, proving that multiplication circuits are equivalent to what's expected. There's many, many BDDs-like data structures: Knuth also has a section on ZDDs (Zero-suppressed Decision Diagrams), a more recent invention in 1993, but otherwise similar in concept to classic BDDs. Many researchers are putting their own spin on things.
 
Top