# Programmer thoughts



## dragontamer5788 (Jul 9, 2020)

I'd like a thread for low-effort to medium-effort programming thoughts. If your thought doesn't deserve a topic, feel free to post it here! Any subject, any language, any skill level.

---------

To start off the topic, I'm studying "constraint programming" for a Puyo Puyo AI. Its an obscure match-4 video game with a deeply dedicated community. I have a feeling that I can represent Puyo Puyo boards as a constraint solving problem (similar to 3-SAT). As such, I've been kinda playing around with this (assumed) NP-complete, or harder (#P-complete problems).

Anyone who has completed Comp. Sci. algorithms courses knows that NP Complete (or harder classes: such as #P Complete) are "unscalable", its assumed impossible to invent an algorithm that can solve this problem in a scalable manner. And as an undergraduate, I *mistakenly* thought that this means that if you ever proven (or assumed) a problem to be NP-complete or harder, that you should "give up" and find another problem.

This was a mistaken thought. NP Complete (and harder) problems *scale* very poorly. But that means that *all problems below a certain size complete almost instantly*.

Just because you have an NP-complete problem (or harder) doesn't mean you're hopeless. If you can prove that your problem's scope is limited, then the problem can suddenly become solvable.

Consider Chess exploration or Chess AIs. Trying to solve Chess is infeasible. But if you limit the problem's scope to a certain size, such as "All Board positions with 7 or fewer pieces", it suddenly becomes possible to solve. The 7-man Sygzy Tablebase has been completed, exhaustively proving and solving all possible chess positions of a certain size (7-pieces): https://syzygy-tables.info/

Another chess-example is the Perft problem: calculating and enumerating all possible openings. Perft(11) has been solved in less than an hour on modern hardware, and perft(15) has been claimed to be solved by some people. This problem scales poorly (exponentially more moves every perft(n+1)), but drop down just a few numbers to Perft(10, and you pretty much solve the problem instantly on modern hardware.

In the case of Puyo-puyo, I'm not necessarily trying to fully solve PuyoPuyo, but instead trying to make a super-human AI that can chain better than any human. (Which is rather difficult, humans are really, really good at making chains in this game).

I guess: my initial post is... "NP Complete" or harder (#P complete) problems are not hopeless! You can still make useful programs in the face of these unscalable problems. Learning how to write code that tackles this extremely hard class of problems can be a joy. Your goal is to solve as large of a problem as possible.


----------



## W1zzard (Jul 9, 2020)

dragontamer5788 said:


> Your goal is to solve as large of a problem as possible.


Another goal can be to be "good enough", or even "better than nothing", also AI


----------



## dragontamer5788 (Jul 9, 2020)

W1zzard said:


> Another goal can be to be "good enough", or even "better than nothing", also AI



Good point.

A good demonstration of "good enough" is the following piece of code:


```
$ cat test.c
int main(){
    for(int i=1; i!=0; i++);
    // Time how long it takes to exhaustively explore a 32-bit number
}

$ gcc -O1 test.c; time ./a.out

real    0m2.107s
user    0m2.107s
sys    0m0.000s
```

That's right, you can exhaustively run through all 32-bit numbers in less than 5-seconds these days. It shouldn't be a surprise: the 32-bit space is ~4-billion, and modern CPUs have a heartbeat of 4GHz (4-billion per second).

If you're unable to exhaustively search the search-space entirely... randomly searching 4-billion spots and then picking "Best of 4-billion" is often "good enough" in these kinds of problems. Ex: Explore 4-billion random solutions to "Traveling Salesman", and pick the best solution you found. Its not likely to be the true optimal solution, but its probably good enough for most purposes.

---------

If you use multiple threads, or GPUs to explore the state-space, you might be surprised how good modern computers are at brute force.


----------



## R-T-B (Jul 21, 2020)

I guess I should dump this here, this is more programmers humor than anything:

I was trying to correct a piece of code in a KSP related mod.  The purpose of this code is to walk up a list/tree of bodies orbits to find and return a root star.  I submitted the following in a stroke of latenight, braindead genius:


        /// <summary>
        /// Returns the host star directly from the given body.
        /// </summary>
        public static CelestialBody GetLocalStar(CelestialBody body)
        {
            while (body?.orbit?.referenceBody != null)
            {
                if (body.orbit.referenceBody.isStar)
                {
                    body = body.orbit.referenceBody;
                }
            }

            return body;
        }

Yeah, that's pretty much the most messed up loop I've ever submitted.  I may as well be drunk.  First it checks if the reference body (the thing it's orbiting ) is not null (meaning body is not a star, not what we want).  Then it enters a while loop that only advances under a select set of circumstances.  In short, if you feed a moon to this code, it will infinite loop that while and crash the ENTIRE game.

And yes, I submitted it as an official pull commit to my hobby project, to the mocking of my peers:









						Fix the GetBodyReferencing override mess.  Impelement new GetLocalSta… · R-T-B/Kopernicus@6a6d9a5
					

…r function for use for atmospheric temp calc.




					github.com
				




Never code at work while sleep deprived.  It's bad kids.


----------



## dragontamer5788 (Aug 7, 2020)

John Lakos's "Large Scale C++" was one of the best written 90s books on software engineering.

Hearing that Lakos has made a 2nd edition, updated to year 2020 sensibilities, made me instantly buy the new book. However, Lakos is going to split the new 2nd edition into two books, only the first book has been written so far. My copy arrived today, so I've been giving it a skim through.

Great book, would recommend to any C++ programmer. I eagerly await for the 2nd half for next year.


----------



## kiriakost (Aug 7, 2020)

dragontamer5788 said:


> Consider Chess exploration or Chess AIs. Trying to solve Chess is infeasible. But if you limit the problem's scope to a certain size, such as "All Board positions with 7 or fewer pieces", it suddenly becomes possible to solve.



This is much close to the concept of proper thinking when repairing electronics,  but still you need to be a master in awareness of its piece properties. 
And the most important lesson that no school delivers, this is of how you to control the range of your own focus over a problem.


----------



## dragontamer5788 (Sep 15, 2020)

If you haven't programmed a GPU yet, its a lot of fun and I suggest everyone to try it out. Its actually very intimidating to try to reach the base level of utilization of a GPU. GPUs have a large number of "threads", for example: AMD's Vega64 has 64-compute units, and each compute unit has 64-vALUs in it, and each vALU requires 4-threads before its fully utilized.

That's right: you need 16384 Threads to fill up a Vega64 to occupancy 1. 

But that's not all: Much like "Hyperthreading" on Intel Skylake or SMT on AMD Zen, GPUs (both NVidia and AMD) support multiple "software threads" per "core". It is recommended to reach at least occupancy 4 (that's 4-threads per core), meaning you really should be aiming at 65536 threads running on a Vega64.

Vega64 supports up to 10-threads per hardware resource, meaning a total of 163840 threads are possible. This is significantly oversubscribed however, and it seems unlikely to me that 40-threads per vALU would be efficient (even if the system supports it, it doesn't make it a good idea). Occupancy 4 seems to be the general number to aim at.

To emphasize how intimidating Occupancy 4 is on Vega64, there is 8GBs of VRAM. Occupancy 4 provides the programmer a total of 128kB of VRAM per thread. You're far below the "640kB is enough for everybody" line. Now of course, GPUs aren't designed to really have "independent" threads working on independent memory by themselves. Instead, GPUs are about SIMD-collaboration on the same problem sets (ex: pixel shaders working on the same 1080x1920 screenbuffer to display a video game).

As such: SIMD programming is a very different feel than normal programming. NVidia may market that "SIMT" allows you to think like "a bunch of threads", but the whole GPU methodology couldn't be more different! SIMD is strange, but wonderful. You have HUGE compute potential at your disposal, especially good for these "brute force" problems I've been talking about in this thread so far.


----------



## SpectateSwampBANNED (Jan 4, 2021)

Don't read the manuals; front to back.
My VB5 search engine uses maybe 10 or 15% of the instruction set.
The other 85 or 90 percent are just a confusion.. With no need to study or remember.
Programming is about Extract, Sort and Report.
and keeping it simple stupid.
Flowcharting would have solved the Max 8 software bug. (By playing computer)
Don't think do..


----------



## R-T-B (Jan 19, 2021)

So my mod got optimized a little:
BEFORE:





AFTER:




Yes, those are FPS counters...  oof.  What did I learn today?

It turns out even draw distance cullers should be rate limited, or your entire CPU will be dedicated to deciding what to cull.  That is all.


----------



## dragontamer5788 (Jan 19, 2021)

R-T-B said:


> It turns out even draw distance cullers should be rate limited, or your entire CPU will be dedicated to deciding what to cull. That is all.



Code too busy optimizing... so you can't run faster at all.

What does your mod do exactly? I don't play KSP, but I'd be interested in hearing more about your project. Despite talking about GPUs all the time, I don't actually do graphics programming.


----------



## R-T-B (Jan 19, 2021)

dragontamer5788 said:


> Code too busy optimizing... so you can't run faster at all.
> 
> What does your mod do exactly? I don't play KSP, but I'd be interested in hearing more about your project. Despite talking about GPUs all the time, I don't actually do graphics programming.


It's a solar system replacer plugin.  Doesn't do anything on it's own but allows config based solar system replacement in Kerbal Space Program.

I'm not the creator mind, just the only remaining maintainer.  It's pretty big in scope.  It pretty much rejects the stock solar system, and then proceeds to rewrite everything relating to it in a dll file that is used instead, allowing us to mod it using text config files.  We also implement support for other things stock KSP doesn't have such as multistar (binary, trinary, etc) systems etc and the related solar flux logic.

The mod is Kopernicus:









						GitHub - Kopernicus/Kopernicus: Kopernicus is a mod for Kerbal Space Program which allows users to replace the planetary system used by the game.
					

Kopernicus is a mod for Kerbal Space Program which allows users to replace the planetary system used by the game. - GitHub - Kopernicus/Kopernicus: Kopernicus is a mod for Kerbal Space Program whic...




					github.com
				




Beta branch supports more versions:









						GitHub - R-T-B/Kopernicus: Kopernicus is a mod for Kerbal Space Program which allows users to replace the planetary system used by the game.
					

Kopernicus is a mod for Kerbal Space Program which allows users to replace the planetary system used by the game. - GitHub - R-T-B/Kopernicus: Kopernicus is a mod for Kerbal Space Program which all...




					github.com
				




This amounts to rewriting about 25% of the game, lol.


----------



## dragontamer5788 (Jan 25, 2021)

__





						Tri-Color Garbage Collector [sean.cm]
					






					sean.cm
				




With my day-job finally slowing down, I'm able to start playing with code again at home. My last problem was GPU-memory management, so I've begun to research memory management algorithms. I'm sure people have already studied malloc / new, but... I actually think my use-case is a potentially useful situation for Mark-and-Sweep "Stop the World" garbage collectors.

In any case, the above garbage-collector blogpost was very useful at explaining mark-and-sweep and the tri-color invariant.


----------



## dragontamer5788 (Jan 31, 2021)

Uniprocessor garbage collection techniques
					

We survey basic garbage collection algorithms, and variations such as incremental and generational collection. The basic algorithms include reference counting, mark-sweep, mark-compact, copying, and treadmill collection. Incremental techniques can keep garbage...




					link.springer.com
				




There are PDFs of this online, but I'm not sure about their legality. I have an ACM subscription and can easily get the legal version. If you really wanted to, you'd find the PDFs rather quickly 

So there are about 4 styles of garbage collection common in 1992: Reference Counting, Mark-Sweep, Mark-Compact, and Copying. Reference Counting wasn't popular in `92, but is now popular in Rust, C++ (shared_ptr<>) and Python. Copying collectors seem to be the most popular in Java / C# these days (but with many intricate optimizations). Mark-Sweep is popular in Go.

Most of the complexities in GC seem to be in concurrent / realtime implementations. "Stop the World" is bad for most applications, so most modern algorithms hide the "Stop the World" in a concurrent thread, requiring multiple-read / single writer or multiple-read / multiple-writer concurrency (depending on the specific style chosen). As noted earlier: I plan to avoid these complications by simply embracing "Stop the World".

----------

In the GPU-world, you execute a kernel: spawning many blocks (composed of dozens of SIMD-threads). All blocks then finish execution as the kernel cleans up. That shares many similarities to 'Stop the World', once the kernel is finished, you know that no GPU-thread is executing on the data anymore, and that its safe to garbage collect.

However, I now have an issue: "Stop the World" collectors stop the world when they run out of memory on a "malloc" call. Since no memory is available, "all threads stop", and garbage collection begins. This is very useful in VMs where you can just "save state" (Java-registers or C# Bytecode), but GPU Kernels can't be interrupted like that. That leaves me with two choices:

1. Create a VM-like "save state" for all GPU threads: allow a kernel to "exit early" on any malloc() call as the malloc may not find enough memory (without a garbage collect anyway). In effect: GPU kernels will have 2 return codes: #1: malloc() failed, so please run garbage collect and rerun the kernel afterwards. Or #2: computation is finished, data is in the final state.

2. Predict the number of mallocs and garbage collect "earlier".

Neither seems easy. But both approaches seem efficient if I implement it correctly. #1 has similarities to continuations and other low-level details. The split between "VM" and 'Physical SIMD Thread' could be useful for other reasons. #2 is a more direct way to solve the problem, but I'm not entirely sure how to "predict the future". If I knew how much the threads were going to malloc, I wouldn't need a dynamic malloc call to begin with!

I guess that forces me to use #1, unless there's another approach...

EDIT: I just thought of #2: the AMB operator (http://www.randomhacks.net/2005/10/11/amb-operator/). You don't "guess", you actually execute. When a malloc() fails, you backtrack (just like the AMB operator) to the nearest savestate. Then, garbage collect, and then resume. That might actually be easier than doing #1 (which is basically creating savestates)... #2 you can have a simple checkpointing system at easy-to-define points.

----------

Section 6 was the most informative: discussing low-level details like how to lay out the headers and some CPU-time / Space tradeoffs. Because I plan to use a GPU: I'll have *tons* of CPU-time but very little memory-bandwidth available. More CPU-heavy approaches that use less space probably would be beneficial.

-----------

Why am I interested in GC on a GPU? Well, simple really. The malloc() and free() calls are going to be called in *severely* multithreaded situations. Dozens of concurrent mallocs() and/or frees() going on simultaneously. GC solves this concurrency problem by making a separate stage where the garbage is collected... this separate stage has better concurrency guarantees than when the code is running on its own.


----------



## R-T-B (Feb 7, 2021)

I'll take "stuff the decompiler spit out" for $250, Trebek.




See all those reciprocals?  Yeah, they simplify out to basically x * y, but it thought it was better to state it as 1 / ((1 / x) * (1 / y)).  God that's weird.


----------



## dragontamer5788 (Feb 8, 2021)

R-T-B said:


> I'll take "stuff the decompiler spit out" for $250, Trebek.
> 
> View attachment 187366
> See all those reciprocals?  Yeah, they simplify out to basically x * y, but it thought it was better to state it as 1 / ((1 / x) * (1 / y)).  God that's weird.



I'm going to bet that SP.flowRate is often used as a rate, and therefore is probably being stored in reciprocal form.

You have a ton of "foo/flowRate" all over the code, to the point where the compiler has chosen to store 1/flowrate instead of flowrate.


----------



## R-T-B (Feb 8, 2021)

dragontamer5788 said:


> I'm going to bet that SP.flowRate is often used as a rate, and therefore is probably being stored in reciprocal form.
> 
> You have a ton of "foo/flowRate" all over the code, to the point where the compiler has chosen to store 1/flowrate instead of flowrate.


Probably.  It's not my code, but that is the likely origin.


----------



## dragontamer5788 (Feb 8, 2021)

R-T-B said:


> Probably.  It's not my code, but that is the likely origin.



I'm not an expert on floating-point arithmetic. But 1/sqrt(x) might be faster than sqrt(x). (And a bunch of other operations in the reciprocal). Something to do with speed of convergence, Newton's Method, error bounds and the like.

So my alternative answer is... "sqrt()" is somewhere in the code, and the compiler has decided that keeping things in reciprocal form for a slightly faster 1/sqrt() is more efficient than doing the normal sqrt() operations. Reciprocals aren't fast, but they aren't slow either. So its possible that 1/blah all over the place (and then doing 1/sqrt() in various locations) is superior over using sqrt(). If not faster, then possibly a lower error bound or something along those lines.

Yeah, floating point math is complicated and sucks.


----------



## R-T-B (Feb 8, 2021)

boy, it's part of an astronomy sim, so you don't need to tell me floating point sucks.  Single precision losses are unbearable there, doubles are even close to their limit at times...


----------



## dragontamer5788 (Feb 8, 2021)

R-T-B said:


> boy, it's part of an astronomy sim, so you don't need to tell me floating point sucks.  Single precision losses are unbearable there, doubles are even close to their limit at times...



Soooooo.... what do you think of using FP16 or BFloat16 for the size-fields of my memory manager (with garbage collection)?



Seriously though: size-classes in memory management is pretty arbitrary. Google's tcmalloc implementation has a very arbitrary list:









						google/tcmalloc
					

Contribute to google/tcmalloc development by creating an account on GitHub.




					github.com
				





```
const SizeClassInfo SizeMap::kExperimentalSizeClasses[SizeMap::kExperimentalSizeClassesCount] = {
    // <bytes>, <pages>, <batch size>    <fixed>
    {        0,       0,           0},  // +Inf%
    {        8,       1,          32},  // 0.59%
    {       16,       1,          32},  // 0.59%
    {       24,       1,          32},  // 0.68%
    {       32,       1,          32},  // 0.59%
    {       40,       1,          32},  // 0.98%
    {       48,       1,          32},  // 0.98%
    {       64,       1,          32},  // 0.59%
    {       72,       1,          32},  // 1.28%
    {       80,       1,          32},  // 0.98%
    {       88,       1,          32},  // 0.68%
    {       96,       1,          32},  // 0.98%
    {      104,       1,          32},  // 1.58%
    {      112,       1,          32},  // 0.78%
    {      120,       1,          32},  // 0.98%
    {      128,       1,          32},  // 0.59%
    {      136,       1,          32},  // 0.98%

SNIP


    {    13568,       5,           4},  // 0.75%
    {    14336,       7,           4},  // 0.08%
    {    16384,       2,           4},  // 0.29%
    {    20480,       5,           3},  // 0.12%
    {    24576,       3,           2},  // 0.20%
    {    28672,       7,           2},  // 0.08%
    {    32768,       4,           2},  // 0.15%
    {    40960,       5,           2},  // 0.12%
    {    49152,       6,           2},  // 0.10%
    {    57344,       7,           2},  // 0.08%
    {    65536,       8,           2},  // 0.07%
    {    73728,       9,           2},  // 0.07%
    {    81920,      10,           2},  // 0.06%
    {    90112,      11,           2},  // 0.05%
    {    98304,      12,           2},  // 0.05%
    {   114688,      14,           2},  // 0.04%
    {   131072,      16,           2},  // 0.04%
    {   147456,      18,           2},  // 0.03%
    {   163840,      20,           2},  // 0.03%
    {   180224,      22,           2},  // 0.03%
    {   204800,      25,           2},  // 0.02%
    {   229376,      28,           2},  // 0.02%
    {   262144,      32,           2},  // 0.02%
```

Etc. etc. You get the gist. 86 size classes, with a non-obvious difference between them (kinda-sorta exponential... but mostly +8 early on)

Anyway, why not formalize this into exactly 256-size classes by making an 8-bit floating point number that traverses the necessary size classes? 256kB is the biggest size class, so really a 4-bit exponent + 4-bit mantissa could represent all the numbers between 8 to 256kB in just a single 8-bit number.

Standard "2^exponent * 1.mantissa" that half/float/double all do, except fractions are not needed (so 2^exponent is literal. No need to do a "minus 15" or "minus 2048" exponent bias or whatever crazy stuff happens in halfs / float / double). 2^exponent literally is all you need (though exponent = 0 / mantissa = 0 results in a number of 1.0... ignore that little qualm, lol).

So yeah, 32-bit floats and 64-bit floats are not precise enough for astronomy. But 8-bit floats (4exponent + 4mantissa) might be precise enough for memory management. Think about it...

You want to "round" similar sizes together for the sake of fragmentation. If there's a 20480 byte chunk available, but the user asks for 19000 bytes, you give them the 20480-byte chunk.


----------



## R-T-B (Feb 9, 2021)

dragontamer5788 said:


> Soooooo.... what do you think of using FP16 or BFloat16 for the size-fields of my memory manager (with garbage collection)?


I'm not the programmer I used to be.  I use what C# or java throws at me and call it a day, lol


----------



## Aquinus (Feb 9, 2021)

I like not thinking about these things if they can be avoided so I can focus on solving the business domain problem as opposed to the comp. sci. problem. I use Clojure because it's easier to implement solutions and check correctness, not because it's necessarily the most computationally efficient option. For example, it's a lot easier for me to reason about business logic with a language that's referentially transparent, because I care about checking equity of values, not references to memory locations. Same deal with number precision. If an integer overflows, it's nice to have a language that will move it into being a long, then a BigInteger (in JVM land.) Not to mention that the JVM runs practically everywhere.


----------



## dragontamer5788 (Feb 9, 2021)

A few things.

1. GPUs do have a "GPU Malloc" call (erm... cudaMalloc), but it comes with a number of disadvantages. Most importantly: its globally synchronous. So your 4000+ cuda-threads on the GPU grind to a halt whenever they come across a default cudaMalloc call.

2. #1 can be avoided by simply issuing cudaMalloc at the beginning of your code's execution, and then manually splitting up memory inside of your kernels (a kernel is basically a function-call that runs in your GPU instead of the CPU). However, this requires a memory-management layer inside of your kernels.

3. GPUs are extremely parallel. The smallest unit on  NVidia is a 32-thread *warp*, the GPU itself always executes 32-threads at a time. (Or: think of it as a 32-wide SIMD, like how AVX is 8 x 32-bits, GPUs are 32x32-bits). This actually makes GPUs extremely efficient in a variety of algorithms... most of which haven't been discussed in the mainstream programming journals (!!).

4. For example: I believe that GPUs can implement Cheney's semispace collector algorithm very efficiently. All 32-threads get up to 32x objects, and performs a breadth-first-search across them. All 32-breadth-first-searches can be accomplished in parallel (subject to a few obvious corner cases: such as when one object is visited by two or more threads at the same time). These corner cases can be resolved in __shared__ memory extremely efficiently: using a GPU-sorting network (https://nvlabs.github.io/cub/structcub_1_1_device_radix_sort.html). Heck, it probably can be solved with a semi-sort even more efficiently (https://dl.acm.org/doi/10.1145/2755573.2755597).

Its clear that Cheney's semispace allocator / collector is a clear candidate for GPU-parallelism. (https://en.wikipedia.org/wiki/Cheney's_algorithm). Aka: a form of garbage collection ("true" garbage collection to boot). It requires knowing the latest-and-greatest GPU algorithms to implement efficiently, but those algorithms are already in libraries like CUB.


----------



## dragontamer5788 (Jul 2, 2021)

So Knuth's "Volume 4" (which is now something like 7 books long lol) of "Art of Computer Programming" is extremely relevant to the code I'm trying to write here.

I had Fascicle 6 (SAT Solvers), but its really specific to the SAT problem. Which... is pretty general but its not how I want to program things. Volume 4A is an excellent introduction to combinatorics: exhaustively giving many ideas into n-tuples, permutations, combinations, partitions, and trees.

However, Volume 4 Fascicle 5: "Backtrack Programming / Dancing Links" has been the most relevant piece of work from Knuth with regards to my personal project here. I know this sounds stupid simple: but the gist is to have a "state" (where the "state" is some kind of advanced data-structure that represents the search space + the associated data structures needed to quickly perform operations in this search space), and then a "do" and "undo" actions.

When you move to explore the space, you push "undo" onto a stack. When you backtrack, you pop from the stack and carry out the undo action.

Here's the key: memory operations on modern computers are extremely expensive. (If you didn't know: reading/writing to DDR4 on a modern computer is 50 nanoseconds, or 200 cycles). So you want to do these operations with as few "mems" as possible (where a "mem" is a main-memory read or main-memory write). It is obvious that you can push your entire data-structure onto the stack as you explore the space... but if your data-structure of state is even just a few hundred bytes long, you're unnecessarily writing (and later: reading back) memory.

In practice, its far cheaper to push / pop representative actions. Knuth early in the book pushes/pops memory-reads and writes.

For example: if mem[5000] = 20, Knuth writes into the stack push(mem[5000]); push(5000) and then mem[5000] = 20. Later, mem(pop()) = pop() to undo the action.

If your data-structure is 5000-bytes long, then you'd originally have to push 5000 bytes onto the stack. But by only pushing the "changes" of the datastructure onto the stack, you can often save on mems (especially because of how much compute you can do today per mem: over 200 cycles... its probably worth the tradeoff).

-------------

"Dancing Links" is an even more advanced methodology where a linked list is used for this backtracking. For the problem of the exact cover problem (a famous NP-complete problem), it turns out you can perform the do and undo operations on a linked list without a stack and without any dynamic memory at all (!!!!). Yes, its a statically allocated linked list. Its... difficult for me to explain, really just best to read the book if you're interested. But its brilliant IMO.

Even if I don't plan on using the "Dancing Links" technique, I'm glad to have learned it.

------------

The rest of the book is about using Dancing Links in a SAT-solver like way. Knuth generalizes the exact cover problem into "exact cover with colors", and then shows how many puzzle / NP-complete toy problems are easily translated into the "exact cover with colors" problem. Because dancing links is so efficient and general purpose... and because of the nature of NP complete problems (any NP complete problem is provably "convertable" to other NP complete problems in Polynomial time), the dancing links methodology can be used in the "general case" to solve many puzzles. At least half of the text is on techniques and examples for how to "shove" various problems into "exact cover with colors".


----------



## dragontamer5788 (Sep 24, 2021)

I'm reading Volume 4 backwards, lol.

Volume 4 Fascicle 5 "Backtracking" was very inspiring, but a lot of concepts were built up from stuff earlier in the "volume". So I went back to reading through Volume 4A. The section on BDDs (binary decision diagrams) was incredible. Its an amazing data-structure, and Knuth is right to hype it. Doing more research on the subject led me to MDDs (multi-valued decision diagrams), the relational-equivalent to BDDs. Without getting too technical here, a BDD is a highly compressed representation of a binary-truth table. An MDD is the same format applied to a relational-table: that is, extending the concept out from 0s and 1s into 0s, 1s, 2s, 3s... etc. etc. (for as many values as you need).

All MDDs could be represented as a BDD, but MDDs are more natural in some problems. Kinda like 3SAT vs Constraint Programming: 3SAT is provably equivalent but some things are just easier to understand in the constraint-programming model instead. Or maybe Binary Tree vs B-Tree: the Binary Tree is basically the same as a B-Tree but in practice, the fewer number of "pointer" links in the B-Tree can lead to more efficiency for some problems.

Alas: Knuth does not talk about MDDs. I have a feeling that MDDs are a "simple extension", but... rather than risking it, I'm buying another textbook on the subject of BDDs specifically.

--------

If you ever need to represent boolean truth tables with a moderate number of variables (ex: 128-bit truth table, such as the "35th bit of a 64x64 bit multiplication circuit"), BDDs seem to be the answer. (And in the case of multiplication, with 64-bits worth of outputs, you'd use 64x BDDs as a collective). In practice, BDDs are the data-structure used for CPU-verification algorithms, proving that multiplication circuits are equivalent to what's expected. There's many, many BDDs-like data structures: Knuth also has a section on ZDDs (Zero-suppressed Decision Diagrams), a more recent invention in 1993, but otherwise similar in concept to classic BDDs. Many researchers are putting their own spin on things.


----------

