Thursday, November 28th 2019
AMD Radeon "Navi" OpenCL Bug Makes it Unfit for SETI@Home
A bug in the OpenCL compute API ICD (installable client driver) for the Radeon RX 5700-series "Navi" is causing the GPUs to crunch incorrect results for the distributed compute project SETI@Home. Since there are "many" Navi GPUs crunching the project and cross-validating each other's incorrect results, the large volume of incorrect results is able to beat the platform's algorithm and pass statistical validation, "polluting" the SETI@Home database. Some volunteers at the SETI@Home forums, where the issue is being discussed, advocate banning or limiting results from contributors using these GPUs until AMD comes out with a fix for its OpenCL driver. SETI@Home is a distributed computing project run by SETI (Search for Extraterrestrial Intelligence), tapping into volunteers' compute power to make sense of radio waves from space.
Sources:
SETI@Home Forums, AMD Community, TH1813254617 (Reddit)
32 Comments on AMD Radeon "Navi" OpenCL Bug Makes it Unfit for SETI@Home
SETI@Home is fun and all, but this is a general problem in OpenCL. There's a suggestion that Navi has a bad FFT implementation.
So as of this moment Navi cards are unfit for almost all computing production systems... and rather pointless for development (even students).
And this shows up basically a week after W5700 launch.
Fun stuff.
Wow
Fourier Transform is one of the fundamentals for compute work. If AMD indeed screwed up its implementation at the hardware level, they would need a recall.
2080TIs already found space invaders.
That's how science works. If most people on Earth do an experiment incorrectly, the bad result becomes statistically relevant (as in: not an obvious outlier).
There's no way to test this other than performing a different experiment on the same phenomenon.
In fact, that's why we're able to notice these issues in computational science.
There are different libraries that do equivalent math. And there are different CPUs and GPUs that we can compare.
If Navi was doing some computation incorrectly, but no other hardware was used, there would be no way to test for this error.
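To make the "compare equivalent math" point concrete, here is a minimal sketch in plain Python. It is not SETI@Home's code and not AMD's FFT; the function names (`naive_dft`, `fft`, `agrees`) and the tolerance are assumptions chosen for illustration. It computes the same transform two independent ways and flags a mismatch, which is the only way an error in one implementation becomes visible.

```python
import cmath

def naive_dft(x):
    """Reference result: direct O(n^2) evaluation of the DFT definition."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def fft(x):
    """Independent implementation: recursive radix-2 FFT (len(x) must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out

def agrees(a, b, tol=1e-9):
    """True when the two spectra match within an (assumed) absolute tolerance."""
    return max(abs(p - q) for p, q in zip(a, b)) <= tol

signal = [complex(i % 5, 0) for i in range(16)]
reference = naive_dft(signal)
candidate = fft(signal)
print("implementations agree:", agrees(candidate, reference))
```

If both paths ran on the same faulty hardware or library, they could agree with each other and still be wrong, which is the situation described above.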
The only way to test correctly is to use other hardware. IIRC they try not to send the validation to a similar system. They won't just discard the data; they'll save it to send it out again. I do agree they should suspend the 5700s for the time being.
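As a rough illustration of "don't validate against a similar system", the sketch below accepts a work unit only when two results from different platforms agree. This is a hypothetical validator, not the project's actual one; the `validate` function, the platform labels, and the tolerance are all assumptions.

```python
from itertools import combinations

def validate(results, tol=1e-6):
    """results: list of (platform, value) pairs for one work unit.

    Returns an accepted value only when two results from *different*
    platforms agree within tol; otherwise returns None, meaning the
    work unit should be kept and sent out again.
    """
    for (plat_a, val_a), (plat_b, val_b) in combinations(results, 2):
        if plat_a != plat_b and abs(val_a - val_b) <= tol:
            return val_a
    return None

# Two identical wrong answers from the same platform do not validate each other.
print(validate([("navi", 1.5), ("navi", 1.5)]))
# A CPU result and a result from another GPU family agree, so that value is accepted.
print(validate([("navi", 1.5), ("cpu", 2.0), ("other_gpu", 2.0)]))
```

The design choice here is that agreement alone is not enough; the agreeing results must come from independent hardware, so a systematic error on one platform cannot confirm itself.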
This is the case for almost all “Open Standard” computation acceleration frameworks. Not a lot of researchers like to invest their money and human resources into such things, for fear of being ripped off by bigger fish, since everything published will be fair game to use. It is a damn shame though. OpenCL would have been a great alternative to CUDA.
They probably prioritized fixing the random crashes in the drivers first before concentrating on GPGPU stuff.
Of course this can be fixed in software. Let's hope there will not be any performance penalty, because what would that mean for all the Navi supercomputers ordered? :D
Vega is AMD's compute card atm. Arcturus is the upcoming compute card, which is more similar to Vega than to Navi.
This problem was noticed in one of them because gamers already started using Navi (card for scientists/engineers was just announced and isn't used yet).
A GPU doesn't have a "calculate Seti@home" instruction that fails (while "calculate Einstein@home" works).
It makes errors in some math instruction that Einstein@home may not use. That's it.
As mentioned earlier: there's a possibility that FFT results are incorrect. FFT (Fast Fourier Transform) is a fundamental algorithm used for many problems. So the card is already almost useless for computing.
And another thing is about being reliable. It's obvious that AMD haven't properly tested this card, so there's really no reason to believe in other results. Everything will have to be tested by the clients... and there goes the "value".
I doubt there is anything wrong with the card or GPU. It's probably a simple issue where the driver applies compression to data that shouldn't be compressed, using the same algorithm that compresses color data.