Monday, April 24th 2023
AMD Ryzen 7000X3D Processors Prone to Physical Damage with Voltage-assisted Overclocking, Motherboard Vendors Rush BIOS Updates with Voltage Limiters
AMD Ryzen 7000X3D processors are prone to irreversible physical damage if CPU overclocking is attempted at some of the higher VDDCR voltages (VDDCR is the main power domain for the CPU cores). A Redditor who goes by Speedrookie attempted to overclock their Ryzen 7 7800X3D, leading to an irreversible failure. The motherboard socket and the processor's land-grid contacts show signs of overheating damage, caused by the contacts melting from too much current draw.
A Ryzen 7000X3D processor features a special CPU complex die (CCD) with stacked 3D Vertical Cache memory. This cache die (the L3D) sits over the central region of the CCD, where its 32 MB on-die L3 cache is located, while the difference in Z-height of the stacked die is filled up by structural silicon, which sits over the regions of the CCD with the 8 "Zen 4" CPU cores. It stands to reason that besides having an inferior thermal transfer setup to conventional "Zen 4" CCDs (without the 3DV cache), the CCD itself has a higher power draw at any given clock speed than a conventional CCD (since it's also powering the L3D). This is the main reason why overclocking capabilities on the 7000X3D processors are almost non-existent, and why their power limits are generally lower than those of their regular Ryzen 7000X counterparts. Attempting to dial up voltage creates a perfect storm for these processors.
Igor's Lab posted a detailed analysis of the region of the Socket AM5 land-grid most susceptible to a burn-out in the above scenario. The central region of the LGA has 93 pins dedicated to the VDDCR power domain, dispersed in a mostly checkered pattern toward the center of the land-grid. Igor isolated 6 of these VDDCR pins in particular as the most prone to physical damage, as they are located in a region below the CCD, where the CCD is sandwiched between the L3D (stacked 3D Vertical Cache die) and the fiberglass substrate below. Apparently, AMD's thermal and electrical protection mechanisms aren't able to prevent a runaway overheating of the pins that causes the substrate to melt, deform, and bulge outward, resulting in irreversible damage to both the processor and the socket.
Meanwhile, AMD's motherboard partners are rushing to release UEFI BIOS updates for their entire motherboard lineups that enforce tighter limits on the VDDCR voltage. MSI is the first motherboard manufacturer out with such updates, and said in a press statement that it has redesigned automated overclocking for 7000X3D processors. "The BIOS now only supports negative offset voltage settings, which can reduce the CPU voltage only," the MSI statement to Tom's Hardware reads. "MSI Center also restricts any direct voltage and frequency adjustments, ensuring that the CPU won't be damaged due to over-voltage." At the same time, the update introduces an automated overclocking feature called Enhanced Mode Boost, which optimizes PBO settings to improve boost frequency residency without any manual voltage adjustments.
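To illustrate what a "negative offset only" policy amounts to, here is a minimal Python sketch. It is not MSI's firmware logic: the function name, the millivolt units, and the choice to clamp positive requests to zero rather than reject them outright are all assumptions made for the example.

```python
def apply_offset_policy(requested_offset_mv: int) -> int:
    """Hypothetical negative-only voltage offset policy (illustration only).

    Offsets at or below zero (undervolts) pass through unchanged.
    Positive offsets, which would raise the CPU voltage, are clamped to 0.
    """
    return min(requested_offset_mv, 0)


if __name__ == "__main__":
    # Try an undervolt, no change, and an attempted overvolt.
    for requested in (-30, 0, 50):
        print(f"requested {requested:+d} mV, applied {apply_offset_policy(requested):+d} mV")
```

Under a policy like this, the user can only ever lower the voltage relative to the firmware's default, which is the behavior the MSI statement describes.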
Sources:
Tom's Hardware 1, 2, Igor's Lab, Speedrookie (Reddit)
258 Comments on AMD Ryzen 7000X3D Processors Prone to Physical Damage with Voltage-assisted Overclocking, Motherboard Vendors Rush BIOS Updates with Voltage Limiters
der8auer's CPU has similar "marks" on it too, but it still works (for now)
AMD is notorious for half-assing AGESA firmware, and board vendors are notorious for half-assing implementing the firmware. Probably some bug in some combination of those two, or someone left out the safeguards.
Anyways, according to Ryzen Master, even with EXPO mode enabled, VDDCR SOC is set at 1.25 on my system.
He did mention that "he thinks" this is an issue with a temperature sensor that's not working. The CPU he had in his hands (7900X) got so hot that the IHS, which is soldered to the CPU die, melted off. Jeez.
This was on a Gigabyte board too.
I really have no idea how widespread these issues are, but I don't see many TechPowerUp forum members reporting problems with their respective CPUs.
The board makers are low effort for a different reason - basically just developing one BIOS for a new AGESA release, making minor changes here and there, and then just Ctrl+V it across their entire lineup. Problems with a specific board? Cross that bridge when you get to it.
Hence why I keep telling people to stop salivating after the newest AGESA like it's the year's new iPhone. If it ain't broke and doesn't have something new you absolutely need, don't fix it.
I think he did say he didn't want to kill the CPU lol. But he could've run some Prime95 large FFT, a memory-heavy y-cruncher stress, or even some TM5 if he wanted to see more VDDCR_SOC power draw. For AM5 I feel like it's more obvious, since VSOC is specifically focused on the iGPU and UMC now, with most of Fabric being spun off into Vmisc (where it was previously all under SOC on AM4). It was just weird to me: manually set VSOC... in order to stress Vcore? lol
AMD prides themselves on extensive and comprehensive temperature monitoring with a lot of sensors (hundreds?) scattered throughout their CPUs and GPUs, so I feel like it would have to be a wider problem than just 1 malfunctioning temp sensor that Tctl/Tdie forgot to pick up.
Granted, to see the darkened pads, TPU AM5 users would all have to physically remove their CPUs for inspection on systems that worked just fine, but agreed, more investigation and less hysteria.
So what we know: all motherboard manufacturers are affected, and you have to have EXPO enabled. Voltage and thermal protection also don't kick in to protect the CPU from frying. Neither AMD nor the motherboard manufacturers picked up the issue prior to release, and you have a limited number of affected folks. So it's either some unusual circumstances coming together (with EXPO enabled), or AMD overguesstimated "safe voltages" and/or the voltage/thermal protection is bugged in every BIOS, which is what I take from ASUS's statement.
Wondering: if it's unsafe to run EXPO 6000 with the pre-installed BIOSes and you're likely to run into the issue someday, how do you sort this out? :confused: There are tons of people out there who do not monitor tech websites and never do BIOS updates. It's not like you have auto BIOS updates or a popup in Windows reminding you to update if you don't want to end up with a fried CPU.
These are my first non-Intel chips since a Cyrix chip way back in the day. I was traditionally never a fan of AMD, but I have been so impressed with their Zen cores that I'd strongly consider staying with AMD for my next build.
Aside from maybe undervolting, I don't see much reason to fiddle with components like CPUs and GPUs these days. From my experience, the binning and algorithms they use to reach boost clocks are generally good enough that tweaks bring very little to the table.
As for the people complaining that the manufacturer is to blame for this - if you play outside the manufacturer's specs, you're playing with fire. Only have yourself to blame. If the motherboard manufacturer encourages this without proper fail-safes, that's on them.
So if they didn't sell any ASUS product since then there must have been big trouble in little China, lol.
Highly speculative rambling about why Ryzen 7000 CPUs are dying. - YouTube
And knowing the kind of content that Buildzoid puts out, I'd have to say that there's a good chance that he knows what he's talking about.
Suffice it to say... if your processor isn't dead by now, you're in the clear.
This and Nvidia's fire-bursting donglegate are the two most interesting hardware issues this year.
Back in the LGA 1366 days, when the first serious issues with RAM voltages and limitations showed up, there was a general rule not to let the delta between uncore and core voltage exceed 0.5 V.
Basically, when you undervolt Vcore you violate a similar rule and simply fry the gates, because of a large voltage difference they are not designed to withstand. It's the same problem in reverse.
The discussion about motherboard makers being retards, idiots, scammers is totally grounded here. I haven't had a board for years that did not violate some voltages when set to AUTO. Everything is excessively high, to keep down RMA rates from instability at stock. The same applies to XMP and EXPO. They should never have had the ability to toggle voltage settings, NEVER.
As for the BS regarding temperatures, etc.: once it shorts, it is effectively a nail, and the wires then act like a tungsten bulb filament and simply fry the substrate. It is also a common sight on GPUs when MOSFETs short out.