Thursday, March 16th 2017

AMD Ryzen Machine Crashes to a Sequence of FMA3 Instructions

An AMD Ryzen 7 1800X powered machine was found to crash upon executing a very specific sequence of FMA3 instructions in Flops version 2, a simple open-source CPU benchmark by Alexander "Mystical" Yee. An important point to note here is that this little-known benchmark has been tailored by its developer to be highly specific to the CPU micro-architecture, with separate binaries for each major x64 architecture (e.g. Bulldozer, Sandy Bridge, Haswell, Skylake), and the GitHub repository does not have a "Zen"-specific binary.

Members of the HWBot forums found that Ryzen powered machines crash when running the Haswell-specific binary, at "Single-Precision - 128-bit FMA3 - Fused Multiply Add." The Haswell-specific binary (along with, we imagine, the Skylake one) adds support for the FMA3 instruction set, which Ryzen also supports, and this lends some importance to the discovery of the bug. What also makes this important is that a simple application running at user privileges (i.e. lacking special super-user/admin privileges) has the ability to crash the machine. Such code could even be executed inside virtual machines, and so poses a security issue, with implications for AMD's upcoming "Naples" enterprise processor launch.
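For readers unfamiliar with the operation involved: a fused multiply-add computes a*b + c with a single rounding at the end, whereas separate multiply and add instructions round twice. The Python sketch below only illustrates that arithmetic difference. It does not reproduce the crash, the function names are made up for this example, and the fused result is emulated with exact rational arithmetic rather than a real FMA3 instruction.

```python
from fractions import Fraction


def unfused_multiply_add(a: float, b: float, c: float) -> float:
    """Multiply, round to double, then add and round again (two roundings)."""
    return a * b + c


def fused_multiply_add_reference(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: compute a*b + c exactly, round once.

    Assumption: exact rationals stand in for the hardware's wide
    intermediate result; this is not how an FMA unit is implemented.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # single, correctly rounded conversion


if __name__ == "__main__":
    a = 1.0 + 2.0 ** -30
    b = 1.0 - 2.0 ** -30
    c = -1.0
    # The exact product is 1 - 2**-60, which rounds to 1.0 as a double,
    # so the unfused version loses the small term entirely.
    print(unfused_multiply_add(a, b, c))
    print(fused_multiply_add_reference(a, b, c))
```

With these inputs the two functions disagree, which is the kind of precision difference that makes FMA attractive for numeric code and benchmarks.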

62 Comments on AMD Ryzen Machine Crashes to a Sequence of FMA3 Instructions

#51
notb
Nkd: So prime, realbench for days, and then games all that use SMT didn't crash once. This program crashed that they admit does not currently support Zen. So what is so deeeeeply wrong with zen? Sound like you are more interested in exaggerating the problem. Your comment was fine until the last sentence where you made it a major flaw. This will likely be fixed with micro code update if anything.
As I've said: it seems people don't understand the issue.

It's not about compatibility or how rarely the problematic instruction is used in software.
It's about the fact that this architecture can be crashed with a single line of code, which should not happen, ever. If a CPU can't execute some code, it should handle this exception in a safe way. Ryzen simply dies.
This is a big stability risk and, as far as the enterprise segment is concerned, a threat that would make Ryzen unacceptable in commercial applications.

Moreover, while it is rumored that AMD knows how to fix this and a microcode update is being developed, AMD has given no official statement or deadline. It's already been a few days since the issue was revealed.
#52
ty_ger
notb: As I've said: it seems people don't understand the issue.

It's not about compatibility or how rarely the problematic instruction is used in software.
It's about the fact that this architecture can be crashed with a single line of code, which should not happen, ever. If a CPU can't execute some code, it should handle this exception in a safe way. Ryzen simply dies.
This is a big stability risk and, as far as the enterprise segment is concerned, a threat that would make Ryzen unacceptable in commercial applications.

Moreover, while it is rumored that AMD knows how to fix this and a microcode update is being developed, AMD has given no official statement or deadline. It's already been a few days since the issue was revealed.
This isn't the first CPU to exhibit this sort of behavior. As mentioned: flawed humans, with flawed understanding, making flawed products, in their flawed universe.

I do understand why it is news and I do understand why it is important. But, 3 pages later, it is getting stretched pretty thin.

Future you says: Oh, I guess they fixed it. It wasn't such a big deal after all. Time to move on.
#53
notb
ty_ger: This isn't the first CPU to exhibit this sort of behavior.
But this is the first CPU to do this in a very long time.
Sorry, but an argument that something happened years ago (Coppermine in 2001?) is by no means helping AMD.
Seriously, we became so spoiled by CPUs that just work - having close to no compatibility conflicts, setting themselves up, overclocking automatically etc.
AMD gave us a CPU which once again makes you spend weeks reading about issues, finding rare RAM that works etc. We're once again waiting for some patches to fix crucial issues...

I totally understand they were committed to maximizing performance and this CPU is really squeezed to the limits, but haven't they gone too far?
Quite a few people have reported that this FMA3 issue can be fixed (or greatly limited) by upping voltage. Oh come on... do we deserve to be treated like that? :/
ty_ger: Future you says: Oh, I guess they fixed it. It wasn't such a big deal after all. Time to move on.
It's a huge deal and will not be forgotten by reviewers and enthusiasts. I would compare it to Samsung's latest battery failure. What saves AMD is that, apart from some gamers and geeks, no one really cares (generally speaking, not that many people know what AMD is).
#54
xorbe
notb: It's about the fact that this architecture can be crashed with a single line of code
Please post the line of code. You don't really have any clue what you are babbling about. So why are you interested in making a big stink (this isn't your first rodeo either, we all know that)? One would assume enterprise chips will fix whatever was found on the first round of PC parts.
#55
OSdevr
xorbe: Please post the line of code. You don't really have any clue what you are babbling about.
Said "line of code" would compile differently depending on your choice of compiler and its settings. BTW how the hell are people NOT GETTING THIS? The problem (now fixed as expected) was that a certain instruction or stream of instructions could hang the CPU. The program that did this, obscure or not, didn't matter. Even if Ryzen didn't know how to process the FMA3 instructions it should have just issued an exception and the OS would have handled it (haha no pun intended!).

This problem wasn't like a game crashing to the desktop, it wasn't even like a BSOD where you had to reboot. Depending on your system, you may or may not have even been able to turn off your computer by holding the on/off button! This happened to me once, guess how.

As I said earlier, this is similar to the Cyrix coma bug or the Pentium F00F bug.
#56
notb
xorbe: Please post the line of code. You don't really have any clue what you are babbling about. So why are you interested in making a big stink (this isn't your first rodeo either, we all know that)? One would assume enterprise chips will fix whatever was found on the first round of PC parts.
Honestly, I'm not sure, but I guess it would look something like this:
VFMADD132PDx %a, %b, %c

The great thing is that the benchmark used to reveal this bug is open source. Everyone willing to hang their (or - for that matter - someone else's) Ryzen can check how the code forces FMA3 usage. :) Basically, you can force that while compiling (even when coding in a high-level language).
#57
xorbe
notb: Honestly, I'm not sure, but I guess it would look something like this:
VFMADD132PDx %a, %b, %c
You're missing the point: you think it's a single line of code.
Raevenlord:
  • Resolved a condition where an unusual FMA3 code sequence could cause a system hang.
#58
OSdevr
xorbe: You're missing the point that you think it's a single line of code.
Umm, you do know how a compiler works, right?

And so what if it takes more than one line of code in your language of choice? Sorry, but you're being rather pedantic about this.
#59
xorbe
OSdevr: Sorry, but you're being rather pedantic about this.
Sure, to counter some grand sweeping mud being thrown at the wall. Get it right if you're gonna do that.
#60
R-T-B
RejZoR: I'm pretty sure you can crash ANY system by feeding it with instructions that are not meant for it.
Actually, it's not really that simple at all. You aren't supposed to be able to crash a complete system. A process, sure. A system? No. That's bad.
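The distinction between crashing a process and crashing a system can be sketched as follows. This is a minimal Python example, assuming a POSIX system: a child process is killed by SIGILL (the signal an OS normally delivers for an illegal instruction), and the parent simply observes the failure and carries on. The signal is sent artificially here rather than by executing a bad instruction, and the function name is hypothetical.

```python
import signal
import subprocess
import sys

# Child program: terminate itself with SIGILL to imitate what the OS
# does to a process after the CPU reports an illegal instruction.
CHILD_SOURCE = "import os, signal; os.kill(os.getpid(), signal.SIGILL)"


def run_faulting_child() -> int:
    """Run the child and return its exit status as seen by the parent.

    On POSIX, subprocess reports death by signal N as return code -N.
    """
    completed = subprocess.run([sys.executable, "-c", CHILD_SOURCE])
    return completed.returncode


if __name__ == "__main__":
    status = run_faulting_child()
    print("child status:", status)
    print("parent is still running")
```

The point of the sketch is that the fault stays contained: only the offending process dies, which is the behavior the Ryzen hang did not show.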
#61
Aquinus
Resident Wat-man
RejZoR: I'm pretty sure you can crash ANY system by feeding it with instructions that are not meant for it. And we know how "standards" work with instructions. If they really were 100% standard, then they'd exhibit IDENTICAL performance gains on ALL CPUs. Which we know for a fact is not true...
CPUs have a register that stores flags indicating when something goes wrong and have their own form of exception handling, for cases such as division by zero, overflow, etc. The problem is that if the machine is unstable while handling exceptions, how do you recover when exception processing itself fails? It's not like there is an "Exception, Exception" flag. Since it sounds like this isn't an issue when overclocking, it's possible that bumping the voltage is making the CPU stable enough to not cause a problem, indicating that as far as FMA3 is concerned, the CPU might be running a little lean with respect to voltage to keep it stable.
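The idea of status flags that record a division by zero or an overflow without stopping execution can be illustrated in software. The sketch below uses Python's decimal module, whose arithmetic context keeps sticky flags in a similar spirit. It is only an analogy, not a view into the CPU's actual flag registers, and the helper name is invented for this example.

```python
from decimal import Context, Decimal, DivisionByZero, Overflow


def run_with_flags():
    """Perform two faulty operations and report which flags were raised."""
    # traps=[] means no condition raises a Python exception; the context
    # only records what happened, loosely like a status register.
    ctx = Context(traps=[])
    quotient = ctx.divide(Decimal(1), Decimal(0))
    # The default context allows exponents up to 999999, so this overflows.
    product = ctx.multiply(Decimal("9e999999"), Decimal("9e999999"))
    return quotient, product, ctx.flags[DivisionByZero], ctx.flags[Overflow]


if __name__ == "__main__":
    quotient, product, div_flag, overflow_flag = run_with_flags()
    print(quotient, product)
    print("division by zero flagged:", div_flag)
    print("overflow flagged:", overflow_flag)
```

Execution continues after both operations and the flags can be inspected afterwards, which is the recoverable path that a hard hang never reaches.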
#62
OSdevr
Aquinus: It's not like there is an "Exception, Exception" flag.
Actually there is. It's called a Double Fault, but it is indeed unrecoverable.