Saturday, June 3rd 2023

Bug in AMD EPYC "Rome" Processors Puts Them to Sleep After 34 Months of Uptime

AMD recently published an erratum for its second-generation EPYC processors based on Zen 2, which states that, "A core will fail to exit CC6 after about 1044 days after the last system reset." 1044 days is roughly 34 months, or just shy of 3 years of total uptime, and is actually an overestimate according to some sysadmin sleuths on Reddit and Twitter who did the math and found that the actual time is 1042 days and 12 hours. The problem occurs because the CPU REFCLK counts 10 ns ticks in a 54-bit signed integer, and if you count just over 9 quadrillion of these ticks (2^53, the first value that no longer fits) you get the resulting overflow at 1042.4999 days. Once this overflow occurs, the cores are stuck forever in a zombie state and will not take any external interrupt requests. Well, forever until you flip the power switch off and back on again, which will reset the counter.
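The arithmetic behind the 1042.4999-day figure can be checked in a few lines. This is only a sketch of the community calculation described above: the 10 ns tick and the 54-bit signed width are the numbers quoted in this article, and the function and constant names are made up for illustration.

```python
from fractions import Fraction

TICK_NS = 10        # REFCLK tick length quoted in the article (10 ns)
COUNTER_BITS = 54   # signed counter width quoted in the article
SECONDS_PER_DAY = 86_400


def overflow_days(bits: int = COUNTER_BITS, tick_ns: int = TICK_NS) -> Fraction:
    """Days until a signed counter of `bits` bits runs out of positive range."""
    # A signed n-bit integer tops out at 2**(n-1) - 1, so the tick that
    # would make the count 2**(n-1) is the one that overflows.
    ticks = 2 ** (bits - 1)
    seconds = Fraction(ticks * tick_ns, 10**9)  # exact, no float rounding
    return seconds / SECONDS_PER_DAY


days = overflow_days()
print(f"{float(days):.4f} days")  # prints "1042.4999 days"
```

Exact fractions are used so the result does not depend on floating-point rounding; the value lands about 7.5 seconds short of 1042 days and 12 hours.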

It's certainly impressive that this problem was discovered at all, as it suggests that more than a single system has been running for almost three years straight without a single restart. Though this does put EPYC "Rome" out of the running for any possible awards for longest-running systems, it may serve as a reminder to apply system updates or patches for other vulnerabilities that have been discovered in the four years since that generation of processors was first launched. AMD does not plan to issue any fix for the CC6 bug, instead recommending that administrators disable CC6 to avoid the cores entering the zombified state, or simply initiate a restart every once in a while before the time limit expires.
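For the "restart before the limit expires" option, a monitoring check could look roughly like the sketch below. It is an illustration, not an AMD-provided tool: the 1042.5-day limit is the community figure from above, the 30-day margin and all function names are arbitrary assumptions, and the OS-reported uptime from Linux's /proc/uptime is used as a stand-in for "time since last system reset", which may not match exactly on every system.

```python
from pathlib import Path

LIMIT_DAYS = 1042.5      # community-derived figure; AMD's erratum says "about 1044"
SECONDS_PER_DAY = 86_400


def days_remaining(uptime_seconds: float, limit_days: float = LIMIT_DAYS) -> float:
    """Days left before the assumed limit, given uptime in seconds."""
    return limit_days - uptime_seconds / SECONDS_PER_DAY


def needs_reboot(uptime_seconds: float, margin_days: float = 30.0) -> bool:
    """True once uptime is within `margin_days` of the assumed limit."""
    return days_remaining(uptime_seconds) <= margin_days


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read uptime on Linux; the first field of /proc/uptime is seconds since boot."""
    return float(Path(path).read_text().split()[0])


# Example with a made-up uptime of 1020 days:
print(needs_reboot(1020 * SECONDS_PER_DAY))  # prints "True"
```

On a live Linux host, `needs_reboot(read_uptime_seconds())` would feed the real uptime into the same check, and the result could be wired into whatever alerting is already in place.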
Sources: AMD, Tom's Hardware

58 Comments on Bug in AMD EPYC "Rome" Processors Puts Them to Sleep After 34 Months of Uptime

#26
chrcoluk
TheoneandonlyMrK: While I agree, if you're not performing some sort of upgrade, maintenance or check in three years on a server part in a normal IT department, I would be surprised. Filters need cleaning, and three years is quite long to use the same server hardware (considers own IT dept), maybe though. :D :)

It's a surprising oversight for a server part.

@JAKra get a job at Cyberdyne, these firms need this kind of thought :)
You'd be surprised; big projects I have worked on basically just add new hardware to the pool of servers, with the oldest stuff staying in place for long periods of time.

One year+ uptime is not unusual either.
#27
Jeager
JAKra: "It's Not a Bug, It's a Feature."
In case of AI uprising, all machines based on EPYC would automatically go to sleep after 34 months of uptime.
Humanity saved thanks to this "feature"! :D
Exactly + saving the planet from global warming, thank you AMD!
#28
Wirko
We will soon see, I mean in 7 months, if Zen 3 Epycs inherited the exact same bug.
#29
Easo
OneMoar: brauh no
ipmi and idrac exist
Sure they do - and what is the message when you update BIOS in iLO? Reset power to the system. For HPE I can tell you that the same happens with things like RAID controller, network card, FC card and disk firmware (and all those are not iLO updatable).
You seem to be mistaking remote management chips for the actual BIOS.
#30
HD64G
One of the most harmless bugs ever. Servers don't go into sleep mode and no PC stays on for more than 1 year (or it gets neglected and gets damaged without proper maintenance).
#31
Wirko
HD64G: One of the most harmless bugs ever. Servers don't go into sleep mode and no PC stays on for more than 1 year (or it gets neglected and gets damaged without proper maintenance).
On the other hand, an entire datacenter or a large part of it may go out in a timespan of a few minutes because all nodes were turned on at the same time. And it might be something mission-critical that never connects to the internet and does not need updates.
#32
HD64G
Wirko: On the other hand, an entire datacenter or a large part of it may go out in a timespan of a few minutes because all nodes were turned on at the same time. And it might be something mission-critical that never connects to the internet and does not need updates.
I would like to learn from an official source if that bug happened to any of the installed servers. I guess not.
#33
Dr. Dro
HD64G: I would like to learn from an official source if that bug happened to any of the installed servers. I guess not.
A late bug such as this was likely discovered and learned about in a production environment ;)
#34
HD64G
Dr. Dro: A late bug such as this was likely discovered and learned about in a production environment ;)
Not going to argue on that but not sure about that. Are you?
#35
Dr. Dro
HD64G: Not going to argue on that but not sure about that. Are you?
I can't prove it if that's what you're implying, but given it's a bug that takes literally 3 years to manifest, it's not so hard to believe that was the case. Or at a bare minimum, AMD must have had one running 24/7 in a QA lab and run into it ;)
#36
Scrizz
Dr. Dro: A late bug such as this was likely discovered and learned about in a production environment ;)
yeah most likely a customer (enterprise) hit it and complained to AMD
#37
AusWolf
Scrizz: yeah most likely a customer (enterprise) hit it and complained to AMD
I can imagine that.

"Hey, I haven't restarted our server for 3 years and now it's acting weird."
"Huh? You what now? 3 years?"
:roll:
#39
zlobby
Dr. Dro: I can't prove it if that's what you're implying, but given it's a bug that takes literally 3 years to manifest, it's not so hard to believe that was the case. Or at a bare minimum, AMD must have had one running 24/7 in a QA lab and run into it ;)
As a bare minimum, AMD (and everyone else for that matter) should at least evaluate what happens with counters and timers in border cases, and when.
#40
Dr. Dro
zlobby: As a bare minimum, AMD (and everyone else for that matter) should at least evaluate what happens with counters and timers in border cases, and when.
Agreed. Edge cases indeed must be accounted for, and this one seems like it's a bug related to the CPU's low power mode functionality. Disabling CC6/arguing that many boards have CC6 off by default is one thing but, let's be honest, we live in the age where people make a big deal out of a few watts in their tree-hugging "save the planet" craze, and such functionality would likely be enabled by company directive if it's available, so it should, to the extent that it's offered, work.
#41
Mussels
Freshwater Moderator
Uptime?
Naptime!

Definitely Errata, but also... highly unlikely to be a major issue for anyone.
#42
claes
Not sure you mean “errata” here, but this editor is wonky as hell
#43
Zareek
Running for a year without a reboot isn't abnormal for some servers. I should restate that. Running a year without rebooting isn't abnormal for some Linux servers. That being said, I have seen a server that ran for over 5 years without a reboot. It was not on purpose, just more like it got forgotten about and kept doing its thing without issues. I'm on the fence about this. Almost three years is a long time. Rome launched almost 4 years ago, I'm guessing that is why it isn't going to be fixed. I assume, since we are just hearing about it now, it wasn't in the first gen of EPYC. Is it still an issue? Is Milan or Genoa affected? Nice can of worms...
#44
claes
Hell, I once ran a Mac mini server for five years without a reboot (only rebooted to update to APFS, probably unneeded). Have had a Pi running for six now. Lucky not to have any power outages; only my NAS and main are attached to UPSs. Idk why I am sharing this, carry on
#45
A Computer Guy
claes: Hell, I once ran a Mac mini server for five years without a reboot (only rebooted to update to APFS, probably unneeded). Have had a Pi running for six now. Lucky not to have any power outages; only my NAS and main are attached to UPSs. Idk why I am sharing this, carry on
Don't forget to test your UPS batteries!
#46
Wirko
Dr. Dro: Agreed. Edge cases indeed must be accounted for, and this one seems like it's a bug related to the CPU's low power mode functionality. Disabling CC6/arguing that many boards have CC6 off by default is one thing but, let's be honest, we live in the age where people make a big deal out of a few watts in their tree-hugging "save the planet" craze, and such functionality would likely be enabled by company directive if it's available, so it should, to the extent that it's offered, work.
Think about a large server installation or an HPC/supercomputing cluster. Each 2-processor node consumes a couple hundred watts at idle. It makes a lot of sense to put some nodes to sleep when the system is not fully loaded. I don't know if C6 (deep power down state) is used for that purpose or not, but it seems to be the appropriate mechanism.
#47
Aquinus
Resident Wat-man
Fouquin: It's certainly impressive that this problem was discovered at all, as it suggests that more than a single system has been running for almost three years straight without a single restart.
Ehhhhh, when I first started my professional career I worked with a colo server with over 1000 days of uptime on it. It was a database server. That streak was broken when we switched from Rackspace to Google. Believe it or not, this is very normal for Linux, in particular with builds that can do live kernel patching while the system is hot.
#48
Mussels
Freshwater Moderator
claes: Not sure you mean “errata” here, but this editor is wonky as hell
A bug in CPUs is known as errata, from the Latin erratum, meaning a printing error

It's in the first line of the news



It's an unexpected bug but also non-critical, since it's very very easy to plan something as simple as yearly maintenance, let alone 3 yearly
#49
claes
Right… the news piece refers to an error in documentation, not a processing error in and of itself, but you bothered to look up the Latin so I’m sure you know :)

en.m.wikipedia.org/wiki/Erratum
#50
Dr. Dro
It's semantics :)

A flaw in a processor design is also designated an erratum in tech jargon. It may or may not be correctable, and severity can range from low to extreme. Usually, errata of low severity such as this one are documented but no fix is issued or planned. In these cases, the chipmaker documents the problem and suggests a workaround.

Occasionally higher severity issues are also not fixed (such as Milan's USB stack reset problem), or the fixes come at a performance penalty (e.g. Intel's fixes for speculative execution exploits). Very rarely, errata of extreme severity result in designs being cancelled or permanently recalled. These rarely get past ES stage, but it can occur.