Monday, April 24th 2023
AMD Ryzen 7000X3D Processors Prone to Physical Damage with Voltage-assisted Overclocking, Motherboard Vendors Rush BIOS Updates with Voltage Limiters
AMD Ryzen 7000X3D processors are prone to irreversible physical damage if CPU overclocking is attempted at some of the higher VDDCR voltages (VDDCR being the main power domain for the CPU cores). A Redditor who goes by Speedrookie attempted to overclock their Ryzen 7 7800X3D, leading to an irreversible failure. The motherboard socket and the processor's land-grid contacts show signs of overheating damage caused by the contacts melting from too much current draw.
A Ryzen 7000X3D processor features a special CPU complex die (CCD) with stacked 3D Vertical Cache memory. This cache die (the L3D) sits over the central region of the CCD, above its 32 MB on-die L3 cache, while the difference in Z-height of the stacked die is filled up by structural silicon, which sits over the regions of the CCD with the 8 "Zen 4" CPU cores. It stands to reason that besides having an inferior thermal transfer setup to conventional "Zen 4" CCDs (without the 3DV cache), the CCD itself has a higher power draw at any given clock speed than a conventional CCD (since it's also powering the L3D). This is the main reason why overclocking capabilities on the 7000X3D processors are almost non-existent, and the processors' power limits are generally lower than those of their regular Ryzen 7000X counterparts. Attempting to dial up voltage creates a perfect storm for these processors.
Igor's Lab posted a detailed analysis of the region of the Socket AM5 land-grid most susceptible to a burn-out in the above scenario. The central region of the LGA has 93 pins dedicated to the VDDCR power domain, dispersed in a mostly checkered pattern toward the center of the land-grid. Igor isolated 6 of these VDDCR pins in particular, which are most prone to physical damage, as they are located in a region below the CCD that sees it sandwiched between the L3D (stacked 3D Vertical Cache die) and the fiberglass substrate below. Apparently, AMD's thermal and electrical protection mechanisms aren't able to prevent a runaway overheating of the pins that causes the substrate to melt, deform, and bulge outward, resulting in irreversible damage to both the processor and the socket.
Meanwhile, AMD's motherboard partners are rushing to release UEFI BIOS updates for their entire lineups of motherboards, which enforce tighter limits on the VDDCR voltage. MSI is the first motherboard manufacturer with such updates. MSI, in a press statement, stated that it has redesigned automated overclocking for 7000X3D processors. "The BIOS now only supports negative offset voltage settings, which can reduce the CPU voltage only," the MSI statement to Tom's Hardware reads. "MSI Center also restricts any direct voltage and frequency adjustments, ensuring that the CPU won't be damaged due to over-voltage." On the other hand, the update introduces an automated overclocking feature called Enhanced Mode Boost, which optimizes PBO settings to improve boost frequency residency, without any manual voltage adjustments.
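As a rough illustration of what a "negative offset only" rule means in practice, the sketch below clamps a requested core-voltage offset so that it can lower the voltage but never raise it. The function name and the millivolt unit are assumptions made for this example; it does not represent MSI's actual firmware logic.

```python
def apply_offset_policy(requested_offset_mv: float) -> float:
    """Return the core-voltage offset (in mV) a negative-only policy would accept.

    Hypothetical helper: any request to raise the voltage is clamped to 0 mV,
    while requests to lower it pass through unchanged.
    """
    return min(0.0, requested_offset_mv)


if __name__ == "__main__":
    # A positive request is rejected (clamped to 0); a negative one is kept.
    for requested in (50.0, 0.0, -30.0):
        applied = apply_offset_policy(requested)
        print(f"requested {requested:+.0f} mV -> applied {applied:+.0f} mV")
```

Under this policy an undervolt request is still honored, which matches the statement that the BIOS can "reduce the CPU voltage only".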
Sources:
Tom's Hardware 1, 2, Igor's Lab, Speedrookie (Reddit)
258 Comments on AMD Ryzen 7000X3D Processors Prone to Physical Damage with Voltage-assisted Overclocking, Motherboard Vendors Rush BIOS Updates with Voltage Limiters
By constantly trying to one-up each other, it was only a matter of time until one of them went too far.
The CPU by itself is just an inert object. It cannot generate any of the voltages it requires to function.
I have never heard of CPUs killing motherboards, only the other way around.
Unless all vendors bypass those protections (and then it will be an amusing shit-show to watch), the CPU protection design is responsible as well.
The bus is long, plenty of room for all under it..
Also, it will be an interesting debate about whose fault it is when you get out-of-the-box OC from the mobo settings that practically cancels your warranty before you even use the product.
For example, on Intel's page the TjMax of my 12700K is 100C. Yet my motherboard has the option to increase that to 118C, and I did test it with Prime95 smallFFT: the CPU stopped throttling at 100C.
It shows that the mobo at least can override the protection to a certain degree.
Also, the design of the CPU sensors is solely AMD's, so if they overlooked a (rare) case that enables the CPU to reach extreme temps, then it's their head on the spike (read as: just RMA and move on).
I guess that AMD makes a basic BIOS and then each vendor modifies it to suit them, so it may be that the CPU hardware/design/sensors are spotless and the problem is the basic BIOS from AMD that doesn't use the data in the right way (and then it's an easy fix).
It is one thing for the CPU to signal the mobo to shut down; it is another thing whether the board keeps going or not.
It can put off many mainstream consumers who just want a solid, reliable build. Hope it will never come to that.
Those are the machines from the major OEMs like Dell / HP / Lenovo / Acer.
These machines tend to have almost no options in their BIOS and thus very few points of failure. No XMP / EXPO / MCE / PBO whatsoever.
Also no finger-pointing over who is at fault when things go south; you just call the brand you bought it from.
There are many reasons why these are the machines you see in businesses.
I would assume even Asus' own non-ROG OEM machines are much more failsafe than their gaming stuff.
Better to go with an Intel hand-build instead, isn't it?
You seem to be under the impression that this can only happen on AMD machines.
There is one difference on the Intel side: there are the non-K CPUs and B / H series boards that are much more locked down.
Until 12th gen you could not even do memory OC on B series boards; those effectively perform like OEM machines.
So all of them, separately, are doing the same hidden off-spec overvolting and/or bypassing fundamental CPU protection?
Because all of them removed all their old BIOSes, I suspect that AMD contacted them and instructed them what to do (a new restrictive BIOS) until further info emerges.
There is specifically a "normal" setting instead of "auto" in each voltage setting on Gigabyte boards, which was implemented upon reviewers' request.
That setting means the board actually respects the Intel / AMD voltage spec instead of the presets "optimized" by the board vendor.
Additionally there is an "Intel POR" setting that sets both PL1 and PL2 to the Intel spec (on the 12700K @190W) instead of auto which is 4095W / unlimited.
This was added so that reviewers have an easier time testing the CPU actually at stock.
Just in case you missed this post from another user.
That is not a far-fetched scenario on the Intel side either.
Shit can happen and has happened on the Intel side more than once, but now it is the AMD side that takes the heat.
If you see a fight, you probably go elsewhere to buy ice cream (except us tech nerds, who prey on each yellowish detail and debate it to death).
And on top of all that, just as you said, going non-K Intel is pretty safe. And just like that, the restrictiveness of the platform ('safer' non-K) became its best merit...
On the older gens, as soon as you increase BCLK over 102 the system will refuse to boot; on 12th gen, somehow, that is no longer the case.
Intel Alder Lake non-K Overclocking via BCLK – der8auer
That Vcore though....
As if somehow the hierarchy of protection changed (or was just plain cancelled) and the CPU was allowed to first go to through-the-roof voltage (as a 4095 W analog) and only then apply the temperature control (95-105 degrees), as opposed to "go to unlimited W as long as it stays under x temp". A nice fun fact (and limited compared to real OC, I might add) that I know about, thank you. And so, it just strengthens your own argument that non-K is 'safer'. You need not-so-small manual BIOS changes to do that BCLK OC. No ordinary consumer will do that, plus no vendor will ship a non-K BCLK-OC PC out of the box (if they have any sense in them).
It's a bit like the Nvidia cables: if they had been better designed, with clear rules on how to use them, systems in place to prevent misuse, and QA, this would not have happened.
Now it's blame shifting, but I blame AMD and the partners. AMD surely tests their CPUs on the actual mobos they will be working on. Come on. This is like the most basic QA. But just like in games these days, we do the QA and beta testing, and engineering control.
Cases and cases keep piling up; these companies, software and hardware, just don't do QA, PERIOD.
We don't need more restrictions on every level of OC, just for the very basics to work rock solid (that is, a shutdown before any permanent and/or physical damage to the CPU/mobo).
My guess: a mixed responsibility that starts with AMD overlooking a rare case (as those incidents are very few so far), triggered by mobo vendors that changed the optimal settings (probably EXPO related) in order to get better out-of-the-box perf for reviews, without telling anyone that it cancels the warranty.
Zen 4 is a new platform, so despite all the QA, a glitch found its way out and made a mess.
Agree, the only time I tinker now would be undervolting to reduce power; modern CPUs and GPUs are basically pre-binned to run almost as fast as they can go out of the box. I remember trying on my 2700X: PBO was actually slower than stock as it hit temp throttling earlier, and when I tried manual OC, even an extra 100 MHz would freeze during Cinebench.
The 5600G is so fast out of the box, I didn't even bother trying; I just disabled XFR for lower voltage and power consumption and have run it that way since. (It's been used mostly for server-type workloads, so XFR is not important.) An eye opener: on my ASRock boards I observe that enabling XMP pumps excessive voltage to the System Agent. Also, voltage to the CPU at default was above Intel stock Vcore.
The bare minimum is to display a warning message at POST/in the BIOS that says a BIOS update is required and warns that the warranty is cancelled if it is not done.
And even if disregarding all of that by the user, the CPU shouldn't kill itself and take the mobo with it.
The time and effort spent to achieve a stable OC is not worth it, especially in the first 3 years of having a new-gen GPU or CPU. Only after that might it make sense to extract that extra 5 to 15% at best. By that time the product is also worth less than it was on day 1 anyway.