Thursday, March 16th 2017

AMD Ryzen Machine Crashes on a Sequence of FMA3 Instructions

An AMD Ryzen 7 1800X-powered machine was found to crash when Flops version 2, a simple open-source CPU benchmark by Alexander "Mystical" Yee, executed a very specific sequence of FMA3 instructions. An important point to note is that this little-known benchmark is tailored by its developer to specific CPU micro-architectures, with separate binaries for each major x64 architecture (e.g., Bulldozer, Sandy Bridge, Haswell, Skylake). As a result, its GitHub repository does not offer a "Zen"-specific binary.

Members of the HWBot forums found that Ryzen-powered machines crash when running the Haswell-specific binary, at the "Single-Precision - 128-bit FMA3 - Fused Multiply Add" stage. The Haswell-specific binary (along with, we imagine, the Skylake one) adds support for the FMA3 instruction set, which Ryzen also supports; this is what lends the discovery some importance. It also matters because a simple application running with user privileges (i.e., without special super-user/admin privileges) is able to crash the machine. Such code could even be executed from within virtual machines, which poses a security issue with implications for AMD's upcoming "Naples" enterprise processor launch.
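For readers unfamiliar with the operation being tested, here is a minimal C sketch, assuming GCC or Clang on x86-64, of what one single-precision, 128-bit FMA3 operation computes: four multiply-adds (a * b + c) in a single instruction, with one rounding step. This is not code from Flops, and it does not reproduce the crash. The function names are hypothetical. The runtime check through GCC/Clang's __builtin_cpu_supports("fma") illustrates why an FMA3 code path built for Haswell also runs on any CPU that reports FMA3 support, Ryzen included.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Hypothetical helper: out[i] = a[i] * b[i] + c[i] on four float lanes.
   The target attribute enables FMA3 for this function only, so no global -mfma flag is needed. */
__attribute__((target("fma")))
static void fma3_mul_add4(const float *a, const float *b, const float *c, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);
    /* One fused multiply-add on a 128-bit XMM register: a*b+c, rounded once. */
    _mm_storeu_ps(out, _mm_fmadd_ps(va, vb, vc));
}
#endif

/* Portable fallback: separate multiply and add (two roundings in general). */
static void scalar_mul_add4(const float *a, const float *b, const float *c, float *out)
{
    for (size_t i = 0; i < 4; i++)
        out[i] = a[i] * b[i] + c[i];
}

/* Hypothetical dispatcher: takes the FMA3 path only if the running CPU reports FMA support. */
void mul_add4(const float *a, const float *b, const float *c, float *out)
{
    assert(a && b && c && out);
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("fma")) {
        fma3_mul_add4(a, b, c, out);
        return;
    }
#endif
    scalar_mul_add4(a, b, c, out);
}
```

A benchmark like the one described would issue long chains of such instructions in a tight loop to measure throughput; this sketch only shows the arithmetic of a single one.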

62 Comments on AMD Ryzen Machine Crashes on a Sequence of FMA3 Instructions

#26
silentbogo
phanbuey wrote: "sounds patcheable"
FMA4 for Zen has long been disabled in binutils, and probably in MS Visual Studio as well.
Also, it has been confirmed that the benchmark runs fine with SMT disabled, so the problem is not FMA but, once again, SMT conflicts.
...khm-khm... OpenMP... khm-khm....
#27
BiggieShady
R0H1T wrote: "Someone running Naples will likely have their own application coded to run on the Ryzen server, they don't just copy/paste the aforementioned code to run on their application & crash (test) a server."
Nope, today it's all about virtualization and the cloud. In that case, a single Naples server in the server farm hosts multiple VMs that different businesses use for various public online services. It's enough for just one of these businesses to allow its users to store something executable on the host, and after one malicious (or unlucky) user, bam, all VMs on the node are down.

Amazon Cloud Node -> N x Business -> N x M x EndUsers
#28
Imsochobo
btarunr wrote: "No, my point is the disgruntled IT guy Barclay's just fired could crash a 'Naples' powered server with just this 'little known program.'"
I can crash our Intel servers easily with some code.
Nothing new here, move along...
#30
OneCool
We support it!!!

Oh..But don't use it lol
#31
dorsetknob
"YOUR RMA REQUEST IS CON-REFUSED"
silentbogo wrote: "Tesla Model S won't run on Diesel!"
They slide on diesel tho :)
#32
Gasaraki
RejZoR wrote: "If this benchmark things are tailored to such specific level that they differentiate even SERIES within SAME VENDOR, why the hell is this a news?"
"this important is because a simple application, running at user privileges (i.e. lacking special super-user/admin privileges), has the ability to crash the machine."
#33
XiGMAKiD
dorsetknob wrote: "They slide on diesel tho :)"
The road ahead looks shiny, I wonder wh.. GAAAAAH!!!
#34
erek
Jack1n wrote: "It's funny how people seem to be missing the point in this article, anyway, I hope AMD is able to fix this."
Agreed, the whole notion of it being exploitable seems to be missing from their thinking, at least.
#35
OSdevr
I've been lurking here for years but I feel the need to say something.

Guys, this IS a big deal. As others have noted, an unknown instruction is supposed to raise an "Undefined Opcode" exception, something that predates even 16-bit protected mode. On CPUs that offer 'User' and 'Kernel' modes (i.e., everything since the mid-'80s), the exception is handled by the operating system, which usually just kills off the process. The whole idea of User mode is that no User mode program can screw with the system without 'permission' from the OS.

This is similar to the Cyrix coma bug or the Pentium F00F bug. However, I agree that this can probably be fixed in microcode.
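To illustrate the user-mode point above, here is a minimal POSIX C sketch, assuming Linux or another Unix-like OS and GCC or Clang; the helper name is hypothetical. A child process executes an invalid instruction via __builtin_trap(), which GCC and Clang emit as UD2 on x86. The CPU raises the undefined-opcode exception, the kernel turns it into SIGILL for that one process, and the parent keeps running. A CPU bug that takes the whole machine down from user mode is notable precisely because it escapes this kind of containment.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: executes an invalid instruction in a child process and
   returns the signal that killed the child (0 if it exited normally, -1 on error). */
int run_invalid_opcode_in_child(void)
{
    pid_t pid = fork();
    assert(pid >= 0 && "fork failed");
    if (pid == 0) {
        /* On x86, GCC and Clang emit UD2 here. The CPU raises #UD, and the
           kernel converts it into SIGILL for this process only. */
        __builtin_trap();
        _exit(0); /* not reached */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    /* The parent is unaffected: it only observes how the child died. */
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On x86 Linux the function should return SIGILL. On other architectures __builtin_trap() may compile to a different trapping instruction (for example, one that raises SIGTRAP on AArch64), but the process is still killed in isolation.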
#36
Steevo
Really guys?

Let's look at the Intel 7700K errata list.

"Revision 001 (August 2016)
• Initial release
Revision 002 (November 2016)
• Errata
 - Added errata KBL068-078
 - Updated erratum KBL062
 - Fixed erratum KBL063
Revision 003 (January 2017)
• Added SKUs Y/U w/iHDCP2.2, S/H-Processor lines
• Added Table 2, S/H-Processor Lines Component Identification
• Identification Information
 - Added Table 4, Y-Processor Line With iHDCP2.2
 - Added Table 6, U-Processor Line With iHDCP2.2
 - Added Figure 3, S-Processor Line LGA Top-Side Markings
 - Added Table 7, S-Processor Line
 - Added Figure 4, H-Processor Line BGA Top-Side Markings
 - Added Table 8, H-Processor Line
• Errata
 - Updated Table 13, Errata Summary Table
 - Added errata KBL079-083
Revision 004 (February 2017)
• Identification Information
 - Updated Table 4, Y-Processor Line With iHDCP2.2
• Errata
 - Updated Table 13, Errata Summary Table. Added J-1 stepping
 - Updated KBL080
 - Added errata KBL084-091"

All processors have flaws, and they get fixed in a future stepping, or even in the current stepping with a microcode update.

Big deal if left unpatched or unfixed? Yep. Will it be fixed? Yep.
#37
silentbogo
OSdevr wrote: "The whole idea of User mode is that no User mode program can screw with the system without 'permission' from the OS."
Regardless of your suggestive nickname, I assume you've never played pranks on your co-workers with NtRaiseHardError, or dumb overflow vulnerabilities.
There are dozens of ways you can hang, BSOD, mess up your machine from userspace.

TO EVERYONE:
It's not even clear whether the bug pertains to FMA instructions at all. That was only assumed because the benchmark BSODed at the FMA3 256-bit benchmark stage, and only with SMT enabled.
The cause could be anything: a Windows bug, a libgomp bug, SMT on Zen itself, or some other unknown factor.
Let's not jump to any conclusions before even knowing what the problem is.
#38
OSdevr
silentbogo wrote: "Regardless of your suggestive nickname, I assume you've never played pranks on your co-workers with NtRaiseHardError, or dumb overflow vulnerabilities. There are dozens of ways you can hang, BSOD, mess up your machine from userspace."
That is why I put the word "permission" in quote marks :) I consider those methods to be software bugs; the CPU itself isn't to blame (minus errata problems, of course).

BTW, I just now read the HWBot post. For some reason I thought it was a reset, like a triple fault. The Coma and F00F bugs were a better analogy than I realized.

I actually have written a simple operating system, though I wouldn't recommend designing as you go like I did.
#40
Casecutter
btarunr wrote: "Would a company like Barclay's put its client live database on a 'Naples' machine now?"
Are there any Naples servers running now with a "live" client database? When there are, it will be a problem; for now these enthusiast CPUs just shut down ("crash") the system. Not a great outcome, but better than the data being compromised. I'm sure this will be fixed, especially by the time "Naples" server equipment actually goes live.
#41
laszlo
No CPU is perfect, as those who designed & produced them aren't either... neither is the universe, and nobody can understand or patch it...
#42
ensabrenoir
dorsetknob wrote: "They slide on diesel tho :)"
..........absolute Genius!!!!!!!!! A hemi powered Tesla that runs on diesel must be created!!!!!!!!!!!!!!!
#43
dorsetknob
ensabrenoir wrote: "A hemi powered Tesla that runs on diesel must be created"
A Welderup Rat Rod from the Vegas builder :) Twin turbo 1200hp Smoker
#44
OneCool
dorsetknob wrote: "A Welderup Rat Rod from the Vegas builder :) Twin turbo 1200hp Smoker"
And it still couldn't outrun a stock Tesla :roll::laugh: :nutkick:
#45
kid41212003
Steevo wrote: "Really guys? Lets look at the Intel 7700K errata list. [...] Big deal if left unpatched or unfixed? Yep. Will it be fixed? Yep."
What's your point? What are you trying to say? TPU is simply reporting the news. Is this serious if left unfixed? Yes. Should TPU just stop reporting stuff? No.
#46
notb
I'm pretty amazed by the comments...

It seems most people really don't understand how this problem works - looking at all the comments saying that you can crash any system with some code (and the Tesla on diesel stuff as well...)

And because many of you have already said that this can PROBABLY be fixed by microcode, it's only natural to ask: what if it can't be fixed? :) Any bets?

Either way, IMO this is another sign that there's something deeply wrong with Ryzen architecture (most likely the SMT implementation). It's all very worrying. :/
#47
BiggieShady
As predicted by many ...
This issue will be fixed in a new AGESA [AMD Generic Encapsulated Software Architecture] microcode update.
#48
Fluffmeister
kid41212003 wrote: "What's your point? What are you trying to say? TPU is simply reporting the news. Is this serious if left unfixed? Yes. Should TPU just stop reporting stuffs? No."
Plenty of damage control going on at the moment.
#49
Nkd
notb wrote: "I'm pretty amazed by the comments... [...] Either way, IMO this is another sign that there's something deeply wrong with Ryzen architecture (most likely the SMT implementation). It's all very worrying. :/"
So Prime, RealBench for days, and then games, all of which use SMT, didn't crash once. The program that crashed is one they admit does not currently support Zen. So what is so deeeeeply wrong with Zen? Sounds like you are more interested in exaggerating the problem. Your comment was fine until the last sentence, where you made it a major flaw. This will likely be fixed with a microcode update, if anything.
#50
TheGuruStud
kid41212003 wrote: "What's your point? What are you trying to say? TPU is simply reporting the news. Is this serious if left unfixed? Yes. Should TPU just stop reporting stuffs? No."
I guess, if news is synonymous with tabloid material, because that's how these posts appear. While the story is real, it's like CNN's breaking news: "Trump didn't tell press he went to dinner!"