Saturday, June 3rd 2023

Bug in AMD EPYC "Rome" Processors Puts Them to Sleep After 34 Months of Uptime

AMD recently published an erratum for its second-generation EPYC processors based on Zen 2, which states, "A core will fail to exit CC6 after about 1044 days after the last system reset." 1044 days is roughly 34 months, or just shy of 3 years of total uptime, and is actually an overestimate according to some sysadmin sleuths on Reddit and Twitter, who did the math and found the actual time to be about 1042 days and 12 hours. The problem occurs because the CPU REFCLK counts 10 ns ticks in a 54-bit signed integer, and counting just over 9 quadrillion of these ticks overflows the counter at 1042.4999 days. Once this overflow occurs, the cores are stuck in a zombie state and will not take any external interrupt requests. Well, stuck until you flip the power switch off and back on again, which resets the counter.
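The arithmetic behind that figure is easy to reproduce. The snippet below is only a sketch of the community calculation, assuming (as the Reddit and Twitter posts do) a 54-bit signed counter incremented once every 10 ns; the function name and parameters are made up for illustration and do not come from AMD's documentation.

```python
SECONDS_PER_DAY = 86_400


def days_until_overflow(counter_bits: int = 54, tick_ns: int = 10) -> float:
    """Days until a signed tick counter of the given width overflows.

    Assumption: one bit is the sign bit, so the counter can hold
    2**(counter_bits - 1) ticks before it wraps.
    """
    max_ticks = 2 ** (counter_bits - 1)   # just over 9 quadrillion for 54 bits
    total_seconds = max_ticks * tick_ns / 1e9
    return total_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    days = days_until_overflow()
    # Prints roughly 1042.4999 days, i.e. a few seconds short of 1042 days 12 hours
    print(f"Overflow after {days:.4f} days (~{days / 30.44:.1f} months)")
```

Under those assumptions the result lands a few seconds short of 1042 days and 12 hours, which is why AMD's "about 1044 days" reads as a slightly rounded-up figure.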

It's certainly impressive that this problem was discovered at all, as it suggests that more than a single system has been running for almost three years straight without a single restart. Though this does put EPYC "Rome" out of the running for any possible awards for longest-running systems, it may serve as a reminder to apply system updates or patches for other vulnerabilities that have been discovered in the four years since that generation of processors first launched. AMD does not plan to issue any fix for the CC6 bug, instead recommending that administrators either disable CC6 to keep the cores from entering the zombified state, or simply initiate a restart every once in a while before the time limit expires.
Sources: AMD, Tom's Hardware

58 Comments on Bug in AMD EPYC "Rome" Processors Puts Them to Sleep After 34 Months of Uptime

#1
natr0n
reminds me of y2k times
#2
Dirt Chip

[S]Bug[/S] Feature in AMD EPYC "Rome" Processors Puts Them to Sleep After 34 Months of Uptime

#3
Easo
While the bug is stupid and should not have happened, 3 years of uptime gives me shivers.
Patch and reboot your damn stuff, FFS.
#4
john_
I read somewhere some comments about Linux being able to update itself without rebooting, so there it does present a kind of problem if someone needs to have a non-stop server running until the Apocalypse.
#5
zmeul
Easo: While the bug is stupid and should not have happened, 3 years of uptime gives me shivers.
Patch and reboot your damn stuff, FFS.
live patching in GNU Linux exists, the servers don't need to be rebooted
#6
R0H1T
Yeah well it's a good idea rebooting your damn systems at least once in 3 years. Though depending on what they're used for, or shared by, it won't be easy.
natr0n: reminds me of y2k times
You mean the end times? Or 2012 :laugh:

Nostra, Incans(?) & Mayans sure did a number on millions at the time :slap:
#7
natr0n
R0H1T: Yeah well it's a good idea rebooting your damn systems at least once in 3 years. Though depending on what they're used for, or shared by, it won't be easy.


You mean the end times? Or 2012 :laugh:

Nostra, Incans(?) & Mayans sure did a number on millions at the time :slap:
year 2000 bios bug/limitation with older pc
#8
R0H1T
Yes I meant there were also lots of weird predictions around those years, like I mentioned 99/2k & 2012 recently. I wonder when's the next world ending event supposed to come!
#9
natr0n
R0H1T: Yes I meant there were also lots of weird predictions around those years, like I mentioned 99/2k & 2012 recently. I wonder when's the next world ending event supposed to come!
Well no more world actually ending stuff. Lots of plagues and shit according to prophecy and what a person's beliefs are. If waters turn to blood I would panic if I didn't know about it.
#10
stanleyipkiss
...or you could just disable sleep state and go about your day.
#11
Wirko
R0H1T: Yes I meant there were also lots of weird predictions around those years, like I mentioned 99/2k & 2012 recently. I wonder when's the next world ending event supposed to come!
That's easy to answer, unless IBM, Intel and everyone else rip the remaining 32-bit abilities out of their processors by then.
en.m.wikipedia.org/wiki/Year_2038_problem
john_: I read somewhere some comments about Linux being able to update itself without rebooting, so there it does present a kind of a problem if someone needs to have a non stop server running until the Apocalypse.
To reach that goal, you better have servers with hot-swappable CPUs too. I don't know much about that ability but Wikipedia says it's "common". So you pull out a CPU and put it back in, and hopefully that timer will be reset.
en.m.wikipedia.org/wiki/Hot_swapping
#12
Zubasa
stanleyipkiss: ...or you could just disable sleep state and go about your day.
In fact many server boards have it off by default.
#13
Nostras
Zubasa: In fact many server boards have it off by default.
Why this isn't such a widespread problem... Because of this.
Restarting your server every once in a while is dumb, there are many scenarios where rebooting is undesired.
Why CC6 isn't off by default on servers however...
#14
pavle
AMD sleeping on the job again...
#15
lemonadesoda
Fascinating read at en.m.wikipedia.org/wiki/Leap_second

I’m amazed that such a modern processor as EPYC has an n-bit limit clock problem. Were the design teams asleep?

Since EPYC is a server processor, rebooting should not have to be part of the standard operating procedures, and I can imagine many use-case scenarios where this clock problem can cause chaos, esp. futures markets.
#16
Dr. Dro
lemonadesoda: I’m amazed that such a modern processor as EPYC has an n-bit limit clock problem. Were the design teams asleep?
I wouldn't be surprised if Xeon had a similar issue, just on a larger (much larger?) time scale.

I bet that the exact same issue would manifest on a consumer-grade Ryzen Threadripper processor, too, maybe even the socket AM4 counterparts.

On the other hand, this is some proper uptime...

redd.it/11g39zj/
#17
TumbleGeorge
Nothing, just planned problems that, when accumulated over time, will force you to buy new hardware. Not a total wreck where you can sue them for compensation even after the product's warranty is over, but annoying.
#18
eidairaman1
The Exiled Airman
Engineer
lemonadesoda: Fascinating read at en.m.wikipedia.org/wiki/Leap_second

I’m amazed that such a modern processor as EPYC has an n-bit limit clock problem. Were the design teams asleep?

Since EPYC is a server processor, rebooting should not have to be part of the standard operating procedures, and I can imagine many use-case scenarios where this clock problem can cause chaos, esp. futures markets.
Engineers aren't perfect
#19
JAKra
"It's Not a Bug, It's a Feature."
In case of an AI uprising, all machines based on EPYC would automatically go to sleep after 34 months of uptime.
Humanity saved thanks to this "feature"! :D
#20
TheoneandonlyMrK
Nostras: Why this isn't such a widespread problem... Because of this.
Restarting your server every once in a while is dumb, there are many scenarios where rebooting is undesired.
Why CC6 isn't off by default on servers however...
While I agree, I would be surprised if you're not performing some sort of upgrade, maintenance or check in three years on a server part in a normal IT department. Filters need cleaning, and three years is quite long to use the same server hardware (considers own IT dept), maybe though. :D :)

It's a surprising oversight for a server part.

@JAKra get a job at Cyberdyne, these firms need this kind of thought :)
#21
thesmokingman
Yea, this is more about lazy admins. If you hit the bug you get fired lol.
#22
AusWolf
Even computers get tired sometimes.
#23
Easo
zmeul: live patching in GNU Linux exists, the servers don't need to be rebooted
No - because you still need to update firmware for hardware, chiefly BIOS. You ARE updating them, yes...?
#24
OneMoar
There is Always Moar
no fix planned?
it's like AMD doesn't want to sell datacenter chips
plenty of servers don't get rebooted for a year
Easo: No - because you still need to update firmware for hardware, chiefly BIOS. You ARE updating them, yes...?
brauh no
ipmi and idrac exist