…
You're probably asking now, "what if the event never gets created?" Exactly: your program will hang forever, caught in an infinite loop. The correct way to implement this code is to either set a time limit on how long the loop may run, or count the iterations and give up after 100, 1,000, a million, whatever you pick; the important thing is to set a reasonable limit.
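A minimal sketch of the bounded-loop approach in C; `event_exists()` and the timings are hypothetical stand-ins (stubbed out so the example compiles), not the driver's actual API:

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical stand-in for whatever call the driver uses to check
 * whether the event has been created; stubbed here so the example
 * compiles and runs. */
static bool event_exists(void)
{
    return false; /* simulate the event never being created */
}

/* Poll for the event, but give up after a fixed number of attempts
 * instead of spinning forever. */
static bool wait_for_event(void)
{
    const int max_attempts = 1000; /* pick a reasonable limit */

    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (event_exists())
            return true;   /* event showed up in time */
        usleep(1000);      /* back off for 1 ms between polls */
    }
    return false;          /* caller must handle the timeout */
}

int main(void)
{
    if (!wait_for_event())
        fprintf(stderr, "gave up waiting for the event\n");
    return 0;
}
```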
While I do commend you for taking the time to debug this, I have to point out that the real problem is whatever is causing the driver to end up in an invalid state. The problem you have spotted is really only the symptom, not the cause. While failed system calls do need some handling, it's much more important to catch the invalid state before this system call, probably with an assertion, so the developers would actually catch this bug during QA (assuming they do QA before shipping).
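A rough sketch of what I mean; the `driver_state` struct and `wait_on_event()` are made up for illustration, the point is just that an assertion trips loudly in a debug/QA build instead of spinning silently in production:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical driver state, for illustration only. */
struct driver_state {
    void *event_handle; /* must be created before anyone waits on it */
};

static void wait_on_event(struct driver_state *state)
{
    /* Catch the invalid state up front: in a debug/QA build this
     * aborts with a file/line message, pointing straight at the
     * cause instead of hanging in a wait loop later. */
    assert(state != NULL && state->event_handle != NULL);

    /* ... the actual wait on state->event_handle would go here ... */
}

int main(void)
{
    struct driver_state state = { .event_handle = NULL };
    wait_on_event(&state); /* trips the assertion in a debug build */
    return 0;
}
```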
Making the code more "forgiving" would only suppress the underlying problem, and depending on the surrounding code it may sometimes actually be a bad idea. I've seen many developers chase "endless" streams of bugs because they keep suppressing them instead of fixing the root cause.
Waiting on synchronization signals is a very basic programming skill; most mid-term students would be able to implement it correctly. That's why I'm so surprised to see such low-quality code in a graphics driver component that gets installed on hundreds of millions of computers.
Respectfully disagree.
While 100-line textbook examples are easy to manage, working on synchronization and threading in larger code bases is an expert-level skill. Synchronization is also very tough to validate, and bugs may be hard to reproduce, especially when dealing with issues on the scale of microseconds or nanoseconds.
I'm not surprised these drivers are full of "glaring mistakes". In real life, code is often stitched together under tight deadlines, bugs are swept under the rug, and workarounds are favored over proper rewrites, often by management who think "it can be fixed after the deadline". I've witnessed some major screw-ups, such as relying on a completely defective mutex implementation in some military stuff for >20 years… I just hope my loved ones are nowhere near when s*** hits the fan…
Modern software development practices catch these mistakes through code reviews: one or more colleagues read your source code and point out potential issues.
They certainly should, but in reality it's fairly uncommon for someone doing QA to read every line and test every possible outcome.
But I believe critical code, such as kernels, drivers, and firmware, should strive for this.
There's also "unit testing", which requires developers to write testing code that's separate from the main code. These unit tests can then be executed automatically, and "code coverage" measures what percentage of the program code the tests actually exercise (see the toy example below). Let's just hope AMD fixes this bug; it should be trivial.
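A toy illustration of the idea in plain C; real projects would use a test framework and a coverage tool, and `clamp_int` is just a made-up function under test:

```c
#include <assert.h>
#include <stdio.h>

/* Made-up function under test: clamp a value into [lo, hi]. */
static int clamp_int(int value, int lo, int hi)
{
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}

/* A minimal unit test: exercise the unit in isolation and check the
 * expected outputs. A test runner would execute many of these
 * automatically and report coverage afterwards. */
int main(void)
{
    assert(clamp_int(5, 0, 10) == 5);   /* in range, unchanged */
    assert(clamp_int(-3, 0, 10) == 0);  /* below range, clamped up */
    assert(clamp_int(42, 0, 10) == 10); /* above range, clamped down */
    puts("all clamp_int tests passed");
    return 0;
}
```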
Unit tests should be a part of many projects' toolchains, but only a small part, as they can only cover a tiny portion of potential problems. A unit test tests a unit (a function, a class, etc.) in a vacuum; problems such as the one described above are outside the scope of a unit test.
But don't mention code coverage; that gives me chills. It's a completely useless metric which misleads developers into thinking their code is actually tested.
-----
I still enjoyed the article though, Wizz; tech websites need more of this. Even deeper stuff if possible; the deeper the better.
Have a nice evening.
Maybe the compiler or something should be stricter and spit the code back if it's 'lazy'.
If we ever get compilers able to detect stupidity, we probably won't need programmers anymore.