Thursday, March 16th 2017
AMD Ryzen Machine Crashes to a Sequence of FMA3 Instructions
An AMD Ryzen 7 1800X-powered machine was found to crash upon execution of a very specific sequence of FMA3 instructions by Flops version 2, a simple open-source CPU benchmark by Alexander "Mystical" Yee. An important point to note here is that this little-known benchmark has been tailored by its developer to be highly specific to the CPU micro-architecture, with separate binaries for each major x64 architecture (e.g. Bulldozer, Sandy Bridge, Haswell, Skylake, etc.); the GitHub repository does not have a "Zen"-specific binary.
Members of the HWBot forums found that Ryzen-powered machines crash when running the Haswell-specific binary, at the "Single-Precision - 128-bit FMA3 - Fused Multiply Add" stage. The Haswell-specific binary (along with, we imagine, the Skylake one) adds support for the FMA3 instruction set, which Ryzen supports, and which lends some importance to the discovery of this bug. What also makes it important is that a simple application running at user privileges (i.e. lacking special super-user/admin privileges) has the ability to crash the machine. Such code could even be executed inside virtual machines, which poses a security issue, with implications for AMD's upcoming "Naples" enterprise processor launch.
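For readers unfamiliar with the operation being benchmarked, here is a minimal sketch of what "fused multiply add" means numerically: a*b + c is computed with a single rounding instead of two. This is not the benchmark's code and does not execute any FMA3 instruction. The helper names are made up for illustration, exact rational arithmetic stands in for the hardware, and it uses Python's double-precision floats rather than the single-precision values in the crashing test.

```python
from fractions import Fraction


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    # Two roundings: a*b is rounded to a float, then the sum is rounded again.
    return a * b + c


def fused_multiply_add(a: float, b: float, c: float) -> float:
    # One rounding: a*b + c is computed exactly with rationals, then
    # converted to the nearest float. (Hypothetical helper emulating the
    # arithmetic result of an FMA; it does not use the CPU's FMA3 unit.)
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
# so the unfused version loses the small term entirely.
unfused = unfused_multiply_add(a, b, c)
fused = fused_multiply_add(a, b, c)
print("unfused:", unfused)
print("fused:  ", fused)
```

The unfused result is zero because the tiny term is rounded away before the addition, while the fused result keeps it.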
62 Comments on AMD Ryzen Machine Crashes to a Sequence of FMA3 Instructions
Also, it is confirmed that w/o SMT the benchmark is running fine, so the problem is not FMA, but once again - SMT conflicts.
...khm-khm... OpenMP... khm-khm....
Amazon Cloud Node -> N x Business -> N x M x EndUsers
Nothing new here, move along...
Oh..But don't use it lol
Guys, this IS a big deal. As others have noted, an unknown instruction is supposed to raise an "Undefined Opcode" exception, something that predates even 16-bit protected mode. On CPUs which offer 'User' and 'Kernel' mode (i.e. everything since the mid 80s) the exception is handled by the operating system, which usually just kills off the process. The whole idea of User mode is that no User mode program can screw with the system without 'permission' from the OS.
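As a rough illustration of the expected behaviour described above (the OS kills the offending process and everything else keeps running), here is a minimal POSIX-only sketch. It is an assumption-laden stand-in: the child process sends itself SIGILL, the signal a POSIX kernel delivers on an undefined opcode, rather than executing a real invalid instruction. The names CHILD_CODE and run_faulting_child are invented for this example. The crash in the article was reported on Windows, which reports such faults differently.

```python
import signal
import subprocess
import sys

# Child program: simulate an illegal-instruction fault by raising SIGILL
# on itself (assumption: this stands in for a real undefined opcode).
CHILD_CODE = "import os, signal; os.kill(os.getpid(), signal.SIGILL)"


def run_faulting_child() -> int:
    # Run the child in a separate user-mode process and return its exit
    # status. On POSIX, a negative value -N means "killed by signal N".
    result = subprocess.run([sys.executable, "-c", CHILD_CODE])
    return result.returncode


returncode = run_faulting_child()
print("child killed by SIGILL:", returncode == -signal.SIGILL)
print("parent still running")
```

Only the child dies; the parent observes the signal in the exit status and carries on, which is the isolation a user-mode crash is supposed to have.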
This is similar to the Cyrix coma bug or the Pentium F00F bug. However I agree that this can probably be fixed in microcode.
Let's look at the Intel 7700K errata list.
"
Revision 001 (August 2016)
• Initial release

Revision 002 (November 2016)
• Errata
    Added errata KBL068-078
    Updated erratum KBL062
    Fixed erratum KBL063

Revision 003 (January 2017)
• Added SKUs Y/U w/iHDCP2.2, S/H-Processor lines
• Added Table 2, S/H-Processor Lines Component Identification
• Identification Information
    Added Table 4, Y-Processor Line With iHDCP2.2
    Added Table 6, U-Processor Line With iHDCP2.2
    Added Figure 3, S-Processor Line LGA Top-Side Markings
    Added Table 7, S-Processor Line
    Added Figure 4, H-Processor Line BGA Top-Side Markings
    Added Table 8, H-Processor Line
• Errata
    Updated Table 13, Errata Summary Table
    Added errata KBL079-083

Revision 004 (February 2017)
• Identification Information
    Updated Table 4, Y-Processor Line With iHDCP2.2
• Errata
    Updated Table 13, Errata Summary Table. Added J-1 stepping
    Updated KBL080
    Added errata KBL084-091
"
All processors have flaws, and they get fixed in a future stepping, or even in the current stepping with a microcode update.
Big deal if left unpatched or unfixed? Yep. Will it be fixed? Yep.
There are dozens of ways you can hang, BSOD, mess up your machine from userspace.
TO EVERYONE:
It's not even known or clear whether the bug pertains to FMA instructions at all. It was only assumed because the benchmark BSODed on the FMA3 256bit benchmark stage, and only with SMT enabled.
The reason could be anything: a Windows bug, a libgomp bug, SMT on Zen itself, or some other unknown factor.
Let's not jump to any conclusions before even knowing what the problem is.
BTW I just now read the HWbot post. For some reason I thought it was a reset like a triple fault. The Coma and F00F bugs were a better analogy than I realized.
I actually have written a simple operating system, though I wouldn't recommend designing as you go like I did.
(random i know; but he/she have their profile private)
It seems most people really don't understand how this problem works - looking at all the comments saying that you can crash any system with some code (and the Tesla on diesel stuff as well...)
And because many of you have already said that this can PROBABLY be fixed by microcode, it's almost natural to ask a question: what if it can't be fixed? :) Any bets?
Either way, IMO this is another sign that there's something deeply wrong with the Ryzen architecture (most likely the SMT implementation). It's all very worrying. :/