Tuesday, August 8th 2017
AMD Confirms Ryzen Marginality Performance Issue Under Linux, TR and EPYC Clear
AMD has confirmed an issue affecting Ryzen processors under certain Linux workloads, which causes segmentation faults under very heavy, continuous load on the Ryzen silicon (parallel compilation workloads in particular). Stress runs such as those in the Phoronix Test Suite quickly bring Ryzen processors to their knees with multiple segmentation faults. While the problem is easy to trigger under very heavy workloads, it is virtually absent under normal Linux desktop workloads and benchmarking.
AMD also confirmed that this issue is not present in EPYC or Threadripper processors, but is isolated to early Ryzen samples under Linux (AMD's testing under Windows has found no such behavior). AMD's analysis has also found that these Ryzen segmentation faults aren't isolated to a particular motherboard vendor, but are problems with the processors themselves. AMD encourages Ryzen customers who believe they are affected by the problem to contact AMD Customer Care. Some of those who have contacted customer care about the segmentation faults have instead turned out to be affected by thermal, power, or other problems, but AMD says it is committed to working with those encountering this performance marginality issue under Linux. AMD will also be stepping up its Linux testing/QA for future consumer products.
Sources:
Phoronix, AMD Confirms Ryzen Issue - Phoronix
45 Comments on AMD Confirms Ryzen Marginality Performance Issue Under Linux, TR and EPYC Clear
hehe
just for this specific test or if Ryzen also gets loaded with the same amount of workload?
It's a huge problem - just like with the FMA before.
And as people are pointing out on CPU-specific forums - it most likely can be fixed, but each fix of this sort takes a bit of performance. So will AMD address this? Especially after they called it "marginal"?
BTW: this is not specific to Linux. The same problem should happen under Windows at this kind of load. It's just that people normally don't use Windows for such tasks.
Good news: no one has succeeded in replicating this issue on an EPYC system, yet. It could happen in a similar workload as well, not "same amount".
source
I'm waiting for 3rd party tests. It would be great if Phoronix did it (one of the last proper CPU-testing websites), but I doubt they would do a Windows-based review. :-(
I have to say... I might have got a Ryzen if AMD had a bug bounty programme. This could be more profitable than mining crypto. :-P
BTW: Intel and Apple started bug bounty programs... maybe they feel more secure with their products...
This is more likely due to the fact linux tends to be more "low level" in hardware init than windows. Less is left to the bios. That said, it does explain why my gentoo install constantly segfaults on compile. I thought a higher SOC voltage alleviated this, but we shall see.
you are free to assume anything.
The only known way to reliably reproduce this is on Linux (github.com/suaefar/ryzen-test). Since AMD has not said they identified the issue, I'm having a hard time believing they know for certain that Windows, Epyc or Threadripper are not affected. Their internal testing has come up empty so far, but as pointed out above, their internal testing failed to spot the problem on Linux as well until someone from outside stepped in and pinpointed it for them.
Keep in mind there's still the possibility this is not widespread and can be fixed with a firmware update. But until AMD identifies the issue, we just don't know.
Also, "AMD also confirmed this issue is not present in EPYC or ThreadRipper processors, but are isolated to early Ryzen processors under Linux" seems to signify it is an issue with early batches of silicon. Worst case, AMD could simply replace those chips, as it sounds like the issue was already fixed on newer ryzens.
Also, when they tell me about "performance marginality", it seems more like a big FU to users than AMD "already in the process of taking care of it". Wth is "performance marginality"?
It doesn't even say "help" like in "update firmware" or "replace CPU".
It says "work with". That's so cute of them!
What I'd expect is a proof that it doesn't affect EPYC, not confirmation. Most likely "fixing this will eat 5% of Ryzen performance, so no way".
from the same source
if TR & Epyc were also affected, why would AMD send hardware for them to test, eh?
AMD stated TR / Epyc are not affected. Do you have any source that claims otherwise? Or should we all just go along with your assumption? :banghead:
The issue was reported over at AMD's forums 3 months ago, the BSD and Gentoo communities, and of course lately Phoronix. Our friends over there have done some extensive debugging.
There are at least two distinct symptoms:
1 "Segfaults" - Under load pointers may get corrupted, which results in undefined behavior. This is why compilation fails "randomly".
2 uOP cache errors
The problems have nothing to do with Linux. Linux is a kernel, but the problems have been reproduced on Linux, BSD and Windows Subsystem for Linux (WSL) (which runs on the Windows kernel). Both gcc and llvm have been tested; the problems have been reproduced during compilation of gcc, mesa, chromium, thunderbird, libreoffice, ffmpeg, the Linux kernel, the BSD kernel and more. Memory configurations and timings have been eliminated as a cause.
Corruption of pointers and instructions is a hardware defect. It remains unknown whether the two distinct symptoms are caused by a single bug or by two unrelated bugs. The bug is present in all Ryzen chips of the B1 stepping (even brand new ones). The bug(s) are not specific to compiler workloads; the symptoms occur under stress, which is most easily reproduced with heavy compilation tasks. The bug(s) will cause "random" application and system instability. Other tasks, such as Prime95 or Cinebench, do not run into these problems, since they stress different parts of the CPU. It seems like an internal synchronization issue in the prefetcher, resulting in undefined behavior when certain conditions apply.
A proper solution would require a new stepping. Hopefully the problems on existing Ryzen parts can be eliminated through a firmware update, which of course might cause a performance penalty. Users have tried disabling the uOP cache and/or SMT, etc., which seems to reduce the symptoms but not eliminate them. As long as this remains unsolved, people should postpone buying these chips for workstation/productive workloads. Note that there is so far no clear indication that this bug poses problems for games, so it's quite possible that the chips are "stable enough" for certain workloads.
While AMD has had reports of this since early May (and some indications in April), it's possible that they've thought of this as an obscure Linux bug and not prioritized it during the summer vacation. But the evidence is now mounting: some users have gotten several new chips through RMA and the problem is still there.
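For anyone who wants to check their own chip, the reproduction approach described above (many heavy compile jobs running in parallel until something crashes) can be sketched roughly as follows. This is only a minimal illustration, not the ryzen-test script itself: the function names `classify` and `stress` are made up for this example, and `cmd` stands in for whatever build command you choose, with each run ideally working in its own directory.

```python
import signal
import subprocess
from concurrent.futures import ThreadPoolExecutor


def classify(returncode):
    """Map a child's return code to 'ok', 'segv' or 'failed'."""
    if returncode == 0:
        return "ok"
    # On POSIX, a return code of -N means the child was killed by signal N.
    if returncode == -signal.SIGSEGV:
        return "segv"
    return "failed"


def stress(cmd, workers, iterations):
    """Run cmd `iterations` times, `workers` at a time, and tally the outcomes.

    cmd is a placeholder argument list (hypothetical: any heavy build command).
    """
    def run_once(_):
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        return classify(proc.returncode)

    counts = {"ok": 0, "segv": 0, "failed": 0}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for outcome in pool.map(run_once, range(iterations)):
            counts[outcome] += 1
    return counts
```

Note that a build driver may report a crashed child process as an ordinary non-zero exit, so the "failed" count is worth checking against the build log and not just the "segv" count.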
Maybe a V bump can fix it like the last bug fixed with microcode.
It just seems to me the community likes to watch fires and only call the fireman for some of them, blaming their lack of response on the rest.
AMD is no saint. Lots of chip manufacturers fix errata silently. That's a big part of what BIOS updates are. But BIOS updates that contain fixes aren't headlined on CNN or tech forums around the globe.
Just wondering why it's so easy for you to hate the way a company fixed an issue they never got an official report on, and later fixed, only to be chastised for not paying "attention". But AMD, on the other hand, gets an official complaint, looks into it and reports on it, and now they are saints. I'm confused. AMD has fixed errata of their own accord without publicity. Shouldn't they also be a target for your wrath?
There's a double standard here alright and it's not from me or to AMD's benefit.
www.google.com/search?q=Intel+errata&oq=Intel+errata&aqs=chrome..69i57j0l5.1823j0j7&sourceid=chrome&ie=UTF-8
They post their errata PDFs publicly.
www.intel.com/content/www/us/en/search.html?toplevelcategory=none&query=errata&keyword=errata&:cq_csrf_token=undefined
Edited b/c I can't English properly, murrica.
I thought you were slipping, stud. With a reg date of '07, I just couldn't grasp that you may have lost your touch for facts like the new users have.