Saturday, June 3rd 2023

Bug in AMD EPYC "Rome" Processors Puts Them to Sleep After 34 Months of Uptime

AMD recently published an erratum for its second-generation EPYC processors based on Zen 2, which states: "A core will fail to exit CC6 after about 1044 days after the last system reset." 1044 days is roughly 34 months, or just shy of three years of total uptime, and is actually an overestimate according to some sysadmin sleuths on Reddit and Twitter who did the math and found the actual time to be about 1042 days and 12 hours. The problem occurs because the CPU REFCLK counts 10 ns ticks in a 54-bit signed integer, which overflows after 2^53 ticks (just over 9 quadrillion), or 1042.4999 days. Once this overflow occurs, the cores are stuck forever in a zombie state and will not take any external interrupt requests. Well, forever until you flip the power switch off and back on again, which will reset the counter.
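The arithmetic behind the 1042.5-day figure can be checked in a few lines. This is a minimal sketch that takes the article's description at face value (a 10 ns tick and a 54-bit signed counter); the constant names are illustrative and do not come from AMD's documentation.

```python
# Assumption (from the article): REFCLK ticks every 10 ns and is held in a
# 54-bit signed integer, so the positive range runs out after 2**53 ticks.
TICK_NS = 10
COUNTER_BITS = 54

ticks_to_overflow = 2 ** (COUNTER_BITS - 1)
seconds_to_overflow = ticks_to_overflow * TICK_NS / 1e9
days_to_overflow = seconds_to_overflow / 86_400  # seconds per day

print(f"{ticks_to_overflow:,} ticks")   # 9,007,199,254,740,992 ticks
print(f"{days_to_overflow:.4f} days")   # 1042.4999 days
```

The result is about a day and a half short of the 1044 days quoted in the erratum, which matches the Reddit and Twitter estimate.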

It's certainly impressive that this problem was discovered at all, as it suggests that more than a single system has been running for almost three years straight without a single restart. Though this does put EPYC "Rome" out of the running for any possible awards for longest-running systems, it may serve as a reminder to apply system updates or patches for other vulnerabilities that have been discovered in the four years since that generation of processors was first launched. AMD does not plan to issue any fix for the CC6 bug, instead recommending that administrators disable CC6 to keep the cores from entering the zombified state, or simply initiate a restart every once in a while before the time limit expires.
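For the "restart before the limit expires" option, the remaining headroom can be estimated from system uptime. The sketch below assumes the 2^53-tick, 10 ns figure discussed above and assumes that OS-reported uptime tracks the time since the counter was last reset, which may not hold on every platform. The function names are hypothetical, and the `/proc/uptime` helper is Linux-only.

```python
# Assumption: the counter overflows after 2**53 ticks of 10 ns each.
OVERFLOW_SECONDS = (2 ** 53) * 10 / 1e9
SECONDS_PER_DAY = 86_400


def days_until_overflow(uptime_seconds: float) -> float:
    """Days left before the assumed overflow; negative if already past it."""
    return (OVERFLOW_SECONDS - uptime_seconds) / SECONDS_PER_DAY


def read_linux_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read uptime in seconds from the first field of /proc/uptime (Linux)."""
    with open(path) as f:
        return float(f.read().split()[0])
```

On a Linux host, `days_until_overflow(read_linux_uptime_seconds())` gives a rough number of days left, which could feed a monitoring alert well ahead of the deadline.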
Sources: AMD, Tom's Hardware

58 Comments on Bug in AMD EPYC "Rome" Processors Puts Them to Sleep After 34 Months of Uptime

#51
Mussels
Freshwater Moderator
claes: Right… the news piece refers to an error in documentation, not a processing error in and of itself, but you bothered to look up the Latin so I’m sure you know :)

en.m.wikipedia.org/wiki/Erratum
You're arguing about quite an odd thing.
It's used correctly in the news article, because that's what it is.
#52
claes
Dr. Dro: It's semantics :)

A flaw in a processor design is also designated an erratum in tech jargon. It may or may not be correctable, and its severity can range from low to extreme. Usually, low-severity errata such as this one are documented but no fix is issued or planned. In these cases, the chipmaker documents the problem and suggests a workaround.

Occasionally higher-severity issues are also not fixed (such as Milan's USB stack reset problem), or the fixes come at a performance penalty (e.g. Intel's fixes for speculative execution exploits). Very rarely, errata of extreme severity result in designs being cancelled or permanently recalled. These rarely get past ES stage, but it can occur.
You’re still referring to documentation, or an appendix, not the processor’s error itself :)
Mussels: You're arguing about quite an odd thing.
It's used correctly in the news article, because that's what it is.
Like I said, the article used the term correctly (so does Intel) :)

I don’t really want to do this again (see our debate about what a router is) — I agree to disagree, because who cares? Have a good night!
#53
Mussels
Freshwater Moderator
claes: I don’t really want to do this again (see our debate about what a router is) — I agree to disagree, because who cares? Have a good night!
Oh of course. Sorry, forgot that if you get the last word in and say goodnight it means you're correct. Of course.
That thread became a dumpster fire and useless with all the useless arguing over what you call something, vs what it's actually called.

Maybe work on that.
#54
claes
No I’ll play at any hour, I was just offering you some outs and reminding you what happened the last time we got into “semantics” :)

Errata/erratum are not a “bug” in an electronic product, they are/it is an appendix to the documentation of said product.
#55
A Computer Guy
claes: No I’ll play at any hour, I was just offering you some outs and reminding you what happened the last time we got into “semantics” :)

Errata/erratum are not a “bug” in an electronic product, they are/it is an appendix to the documentation of said product.
Hopefully not adding fuel to the fire but to my understanding, in the specific context of technology, hardware/software erratum/errata is accepted lexicon for published acknowledgment of having or finding a "bug" or "error".
#56
Mussels
Freshwater Moderator
claes: No I’ll play at any hour, I was just offering you some outs and reminding you what happened the last time we got into “semantics” :)

Errata/erratum are not a “bug” in an electronic product, they are/it is an appendix to the documentation of said product.
It's not 'playing' when you're told to stop derailing threads by a moderator and you think that's entertaining.
Enough's enough.