Saturday, June 3rd 2023
Bug in AMD EPYC "Rome" Processors Puts Them to Sleep After 34 Months of Uptime
AMD recently published an erratum for its second-generation EPYC processors based on Zen 2, which states that "A core will fail to exit CC6 after about 1044 days after the last system reset." 1044 days is roughly 34 months, or just shy of 3 years of total uptime, and is actually an overestimate according to some sysadmin sleuths on Reddit and Twitter who did the math and found the actual time to be about 1042 days and 12 hours. The problem occurs because the CPU REFCLK counts 10 ns ticks in a 54-bit signed integer, and once you count just over 9 quadrillion of these ticks (2^53) the counter overflows, at 1042.4999 days. Once this overflow occurs, the cores are stuck forever in a zombie state and will not take any external interrupt requests. Well, forever until you flip the power switch off and back on again, which will reset the counter.
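The overflow arithmetic is easy to check. The sketch below assumes the figures reported in the article (a 10 ns tick and a 54-bit signed counter, so the largest positive value is 2^53 - 1); these come from the community analysis, not from the text of AMD's erratum, and the variable names are purely illustrative.

```python
# Assumptions taken from the article: 10 ns per REFCLK tick, 54-bit signed counter.
TICK_NS = 10
COUNTER_BITS = 54
NS_PER_DAY = 86_400 * 1_000_000_000

# A signed N-bit counter overflows once it has counted 2**(N-1) ticks.
ticks_to_overflow = 2 ** (COUNTER_BITS - 1)

# Integer nanoseconds first, then a single division to get days.
ns_to_overflow = ticks_to_overflow * TICK_NS
days_to_overflow = ns_to_overflow / NS_PER_DAY

print(f"{ticks_to_overflow:,} ticks")       # 9,007,199,254,740,992 ticks
print(f"{days_to_overflow:.4f} days")       # 1042.4999 days
```

Under those assumptions the result lands about a day and a half short of the "about 1044 days" quoted in the erratum.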
It's certainly impressive that this problem was discovered at all, as it suggests that more than a single system has been running for almost three years straight without a single restart. Though this does put EPYC "Rome" out of the running for any possible awards for longest-running systems, it may serve as a reminder to apply system updates or patches for other vulnerabilities that have been discovered in the four years since that generation of processors was first launched. AMD does not plan to issue any fix for the CC6 bug, instead recommending that administrators disable CC6 to avoid the cores entering the zombified state, or simply initiate a restart every once in a while before the time limit expires.
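For administrators who go the "restart before the limit" route, a small check along these lines could flag machines that are getting close. This is only a sketch: the 1042.5-day limit is the approximate figure from the article, the 30-day safety margin and all function names are hypothetical, and it uses Linux's `/proc/uptime` as a stand-in for "time since the last system reset", which may not match the hardware counter exactly.

```python
from pathlib import Path

LIMIT_DAYS = 1042.5        # approximate overflow point reported in the article
SECONDS_PER_DAY = 86_400


def days_until_limit(uptime_seconds: float, limit_days: float = LIMIT_DAYS) -> float:
    """Days remaining before the uptime reaches the assumed limit."""
    return limit_days - uptime_seconds / SECONDS_PER_DAY


def needs_reboot(uptime_seconds: float, margin_days: float = 30.0,
                 limit_days: float = LIMIT_DAYS) -> bool:
    """True once the remaining time falls within the (hypothetical) safety margin."""
    return days_until_limit(uptime_seconds, limit_days) <= margin_days


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Linux only: the first field of /proc/uptime is seconds since boot."""
    return float(Path(path).read_text().split()[0])


if __name__ == "__main__" and Path("/proc/uptime").exists():
    uptime = read_uptime_seconds()
    remaining = days_until_limit(uptime)
    status = "schedule a reboot" if needs_reboot(uptime) else "ok"
    print(f"{remaining:.1f} days until the assumed limit: {status}")
```

Run from cron or a monitoring agent, a check like this would give plenty of warning to schedule a maintenance window; disabling CC6, the other workaround AMD recommends, avoids the need for it altogether.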
Sources:
AMD, Tom's Hardware
58 Comments on Bug in AMD EPYC "Rome" Processors Puts Them to Sleep After 34 Months of Uptime
[S]Bug[/S] Feature in AMD EPYC "Rome" Processors Puts Them to Sleep After 34 Months of Uptime
Patch and reboot your damn stuff, FFS.
Nostra, Incans(?) & Mayans sure did a number on millions at the time :slap:
en.m.wikipedia.org/wiki/Year_2038_problem To reach that goal, you better have servers with hot-swappable CPUs too. I don't know much about that ability but Wikipedia says it's "common". So you pull out a CPU and put it back in, and hopefully that timer will be reset.
en.m.wikipedia.org/wiki/Hot_swapping
Restarting your server every once in a while is dumb, there are many scenarios where rebooting is undesired.
Why CC6 isn't off by default on servers however...
I’m amazed that such a modern processor as EPYC has an n-bit limit clock problem. Were the design teams asleep?
since EPYC is a server processor, rebooting should not have to be part of the standard operating procedures, and i can imagine many use case scenarios where this clock problem can cause chaos, esp. futures markets.
I bet that the exact same issue would manifest on a consumer-grade Ryzen Threadripper processor, too, maybe even the socket AM4 counterparts.
Other hand, this is some proper uptime...
redd.it/11g39zj/
In case of AI uprising, all machines based on EPYC would automatically go to sleep after 34 months of uptime.
Humanity saved thanks to this "feature"! :D
It's a surprising oversight for a server part.
@JAKra get a job at Cyberdyne, these firms need this kind of thought :)
its like amd doesn't want to sell datacenter chips
plenty of servers don't get rebooted for a year brauh no
ipmi and idrac exist