Monday, August 5th 2024
Puget Systems Releases CPU Failure Report: AMD CPUs Achieve Higher Failure Rate Than Intel 13th and 14th Generation
A series of recent reports has highlighted stability issues affecting Intel's 13th and 14th-generation desktop processors, raising concerns among consumers and industry professionals. The problem, which has gained significant attention over the past few months, is related to the processors' physical degradation over time. Custom PC builder Puget Systems has shared insights from its experience with these processors, revealing a nuanced perspective on the issue. While it has observed an increase in CPU failures, particularly with the 14th-generation chips, its failure rates remain notably lower than those reported by some game development studios and cloud gaming providers, who have cited failure rates as high as 50%. An interesting observation is that Puget Systems recorded a higher failure rate with AMD Ryzen 5000 and Ryzen 7000 series than with Intel's 13th/14th generation, with most of those failures happening at Puget's shop rather than in the "field" in customers' hands.
Puget Systems attributes its more modest failure rates for Intel processors to its conservative approach to power management settings. By adhering strictly to Intel's specifications and developing its own power settings that don't hurt performance, it has managed to mitigate some of the stability issues plaguing other users. Intel has acknowledged the problem and announced plans to release a microcode patch by mid-August, along with an extended warranty program. This update is expected to prevent further degradation but may not reverse existing damage. Despite the elevated failure rates, Puget Systems' data shows that the issue, while concerning, is not yet at critical levels for its operations. The company reports that failure rates for 13th and 14th gen Intel processors, while higher than ideal, are still lower than those it experienced with Intel's 11th gen chips and some AMD Ryzen processors. In response to the situation, Puget Systems is taking several steps, including maintaining its current power management practices, promptly validating Intel's upcoming microcode update, and extending warranties for affected customers. Below, you can see failure rates by month, by Intel's Core generation, as well as by "shop" vs "field" testing.
Source:
Puget Systems
127 Comments on Puget Systems Releases CPU Failure Report: AMD CPUs Achieve Higher Failure Rate Than Intel 13th and 14th Generation
It's very easy to cry bias, and the irony is that a lot of the blame should fall on the news sites that are posting misleading clickbait headlines. This is really shoddy reporting on TPU's part and they should be ashamed of themselves. Because here's the part that's missing from the headline
Puget Systems Releases CPU Failure Report: AMD CPUs Achieve Higher Failure Rate Than Intel 13th and 14th Generation* (https://www.techpowerup.com/325250/puget-systems-releases-cpu-failure-report-amd-cpus-achieve-higher-failure-rate-than-intel-13th-and-14th-generation)
*When the Intel chips are using Puget's extremely conservative power and voltage settings instead of the stock spec Intel released. And also the Intel chips fail more frequently in the field. And Puget expects field failures to start piling up.
Several years ago, when I was still press, I reviewed boxes from Puget Systems, I was privy to some of their failure tracking data, and I had multiple conversations with people there. There's a reason Puget's blog is as good as it is, that they basically invented the standardized Adobe benchmark (among others), etc. Puget Systems is extremely data driven. If you look at the hardware configurations they offer vs. almost any other US-based integrator, you'll find very few options for brand or model on non-CPU components. Their validation processes are the strictest I've seen.
And then having jumped the fence and product managed PC components and systems, I'm willing to bet Puget was doing their own stringent testing and validation, probably alarmed at some of the things they saw, and having closed door conversations with Intel. Because that's what typically happens: if you see a problem, you work behind the scenes with the vendor to try and fix the problem rather than putting them on blast. Jon Bach being on Intel's advisory board means he gets to be in the room and say "hey, this is stupid, you guys shouldn't do this and here's why." Giving up that seat would achieve nothing.
Intel's been on the back foot for a few generations now, throwing power specs out the window in order to get the coveted top performance spots in reviews. AMD did this to a lesser extent with Ryzen 7000; almost doubling 7950X's stock power consumption got them enough extra percentage points in performance to eke out wins. And we've seen that if you cap AMD and Intel TDP to a more reasonable power limit like 125W, AMD barely sheds any performance while Intel takes a sizable hit.
Everyone wants to cry nefarious and point an accusatory finger at Puget, but to me the most disappointing thing has been the clickbait headlines from otherwise reputable news sources and rampant conspiracy-mongering because most people don't have visibility into how this stuff actually works. Puget's doing their due diligence here; they obviously don't want sales harmed, and they want to take care of their customers, so they're releasing a statement and releasing data. FFS, they don't even mention AMD in their headline or in the first few paragraphs.
Instead of fixing its own mistakes, Intel is pointing out AMD's false mistakes.
13th and 14th gen are different because we expect failure rates in the field to skyrocket.
The affected Intel CPUs become unstable at those high frequencies, frequencies that they boost to automatically, so the base frequency is irrelevant. Anyway, from my RMA experience for my 13900KF, Intel accepts the RMA if you experience instability with a recent MB BIOS and the Intel defaults enabled in the BIOS.
As for people talking about CPUs being stable in stress tests at low voltages, that's normal. In a stress test usually all cores are busy, so you hit the power limit, that limits the maximum frequency, and lower frequencies don't need high voltages. The best way to test for stability issues at high frequencies is to run single core benchmarks/tests, that are not limited by power limits.
Later edit: I would refrain from running any single core tests, including RAM tests, until the new microcode is released, as running such tests might increase the risk of damage to your CPU.
We’ve had several claims of >20%, 50% or even close to 100% failure rates without any real data to back it up. This is at least one data point, although it’s not enough to draw conclusions. What we need are data from the large systems integrators like Dell, HP, Lenovo, etc. or at the very least the medium sized ones. Also keep in mind that systems integrators typically get their hardware in large batches, so if there were quality issues related to specific production batches we would see a pattern with multiple data points.
Additionally, regarding the point above about "shop failures", those failures should be unrelated to the abnormal wear issue, whether it’s a production issue or user error.
While it's not a lot to go on, what this graph tells me is that Puget Systems had no problems with Comet Lake. They then saw a statistically significant jump (from under 1% to 7.5%) around the time of Rocket Lake, which was really the first Intel generation to go bananas with power to the CPU. In response, Puget Systems probably learned a lesson and altered their power management strategy at the launch of Alder Lake, which appears to have drastically reduced failure rates (7.5% down to 1%). Then, all of a sudden, they saw a noticeable jump in failures with Raptor Lake (1% to 2.5%), despite their power management efforts. So if we're really looking at things, even a "properly tuned" Raptor Lake system is failing at a higher rate than Alder Lake, and these systems are much newer and we're just now starting to see the failures present themselves. The reports are that Raptor Lake starts failing after around 6 months of active service. Alder Lake is all in the field by now and has been in operation far longer than Raptor Lake. That makes the jumps concerning and it is apparently just the beginning. It also means the microcode patch may not mitigate the failures at all since Puget Systems is properly configuring their rigs.
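To make the time-in-service point concrete, here is a minimal sketch that normalizes a cumulative failure rate by how long the systems have been running. Only the 1% and 2.5% rates come from the comment above; the unit counts and average months in service are invented for illustration, and `failures_per_1000_unit_months` is a hypothetical helper, not anything Puget publishes.

```python
# Hypothetical illustration: unit counts and months in service are made up.
# Only the 1% and 2.5% cumulative failure rates come from the comment above.

def failures_per_1000_unit_months(units, cumulative_failure_rate, avg_months_in_service):
    """Normalize a cumulative failure rate by how long the units have been running."""
    failures = units * cumulative_failure_rate
    unit_months = units * avg_months_in_service
    return 1000 * failures / unit_months

# Assumed: Alder Lake systems average 24 months in service, Raptor Lake 9 months.
alder = failures_per_1000_unit_months(
    units=1000, cumulative_failure_rate=0.01, avg_months_in_service=24
)
raptor = failures_per_1000_unit_months(
    units=1000, cumulative_failure_rate=0.025, avg_months_in_service=9
)

print(f"Alder Lake:  {alder:.2f} failures per 1000 unit-months")
print(f"Raptor Lake: {raptor:.2f} failures per 1000 unit-months")
```

Under these assumed service times the gap per unit-month is much wider than the raw 1% vs 2.5% suggests, which is the reason a newer generation's cumulative rate can understate the problem.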
I can't make an assessment on their Ryzen rates because they don't talk at all about volume or how they set them up. I'm already assuming that their Intel volume remains constant from gen 10 to 14.
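To illustrate why the missing volumes matter, here is a minimal sketch using a Wilson score interval for a failure proportion. The unit counts are hypothetical and are not Puget's numbers; the point is only that the same observed 2.5% rate is far less certain at low volume than at high volume.

```python
import math

def wilson_interval(failures, units, z=1.96):
    """Approximate 95% Wilson score interval for a failure proportion."""
    if units <= 0:
        raise ValueError("units must be positive")
    p = failures / units
    denom = 1 + z**2 / units
    center = (p + z**2 / (2 * units)) / denom
    half = z * math.sqrt(p * (1 - p) / units + z**2 / (4 * units**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# Same 2.5% observed rate at two hypothetical volumes (not Puget's real numbers).
for failures, units in [(5, 200), (500, 20000)]:
    low, high = wilson_interval(failures, units)
    print(f"{failures}/{units}: plausible failure rate {low:.1%} to {high:.1%}")
```

With only a couple of hundred systems the plausible range spans several percentage points, so comparing generations or vendors without knowing the volumes behind each bar says very little.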
Ryzen 7000 field failures are super rare compared to 13th Gen, and 14th Gen is just too new to show them yet. Just wait 2-3 more years.
But when I buy Ryzen 7000, I prefer to select PGY origin, if I am able to.
1. What was so wrong with 11th gen? I can't recall anything serious but I might be wrong.
2. What is considered a failure? Unstable chip while undervolting? Computational error?
3. What is considered a conservative power setup for their chips? Which settings are touched?
4. Why are there no volumes provided?
5. Why is there such a high "shop" failure rate for AMD CPUs? Do they not know how to install the CPU, or do they consider it a failure when it does not support a particular RAM speed that was supported on the Intel platform?
6. Why is a chronological and per-SKU overview of AMD CPU failure rates missing?
Ultimately, their statistics (as presented in graphs) are not telling us much. Volumes are missing, and since they're altering default settings, those CPUs cannot be put in the same consideration pool as those which are run using Intel recommended settings. Same applies to AMD. If you alter settings such as voltages, TDP or frequencies, you are no longer using that chip within recommended/default settings, meaning the chip might not work properly that way because it was not designed and tested to work that way. If a chip is unstable in such a case, it must not be considered a failure, as it is not being run within recommended/default settings.
So you've experienced a Ryzen 5000 failure in which the chip was unstable using default settings?
I built several workstations using Zen 3 chips and they were all using high voltages. I remember one machine with 5600X OCed to 4.8 GHz on all cores. Before I undervolted it was a disaster. Pushing 1.42+V was pretty common in heavy workload. The CPU was perfectly stable at 1.337V after I undervolted it and it still is.
AMD is pushing a lot of voltage to Zen 3 chips, that's why I think that instability on default settings is for Zen 3 quite rare.
The end result of the statement above is that Puget (and similar companies) are incentivized to do a lot of testing on the hardware they're sending out so they can do whatever possible to reduce failures. This is why a lot of CPUs are super power-limited right out of the box (reduce heat and long-term wear), whether AMD or Intel. I have a Lenovo workstation for example (at work) with a Threadripper in it that could most certainly run at a higher power level, but it's been locked and reduced so that it never will. Most high-threaded CAD workloads don't care too much about clock-speed anyway, it's just how many cores and generational IPC improvements that help.
Add to that, right now, when all this coverage is happening, this company in particular, that as others have stated sell a lot of Intel rigs, has undoubtedly received many questions from their customers and potential customers. Puget has more published test data and processes than most companies like this so it does not surprise me at all that they'd try and get out some data. That's "why now?" and if you are a customer asking about problems with Intel, your next question is obviously "what about AMD?!"...
Yes, whether AMD or Intel, they're going to set their own BIOS settings to minimize heat and long-term wear on all systems. These are not going to be overclocked gaming systems.
TLDR: This means that the data is not necessarily applicable to gamers, anybody running their motherboard's default settings, or anybody who's overclocking their PC.
For example, with DDR4 memory I had to add extra voltage even to generic 1.2V RAM modules (running them at 1.28V instead of 1.2V). My Threadripper 2990WX was freezing otherwise.
To detect bad cores I am using a simple "CoreCycler" tool, you can get it on GitHub. I am using it with the default settings, but change the test time from 6 min to 60 min.
And you test cores in "preferred-cores" order (HWiNFO can show you that order), but I think default random will do too.
Also this test can fail randomly if you have problems with memory (bad timings, or need more voltage, like 1.38 instead of basic 1.35), or when SoC/uncore voltage is "bad".
Yes, but it was on day 1 of using it, single-core PRIME95 did not pass on it, always failed on the first 2 "fast" cores. And I was too lazy to RMA it, because +5/+7 on 1 core and +1/+4 on the other core did not change performance noticeably. But this P95 fail can easily cause a blue screen in the future, so I classify it as a "field fail", not a "shop fail". And I think it has to do with a crappy SUS origin, where they mess up proper testing, while PGS origin has no issues for me.
This is because if Intel recommended the baseline profile as default they would lose a massive amount of performance. The result is that the baseline profile isn't actually the baseline profile and there really is no singular recommendation from Intel in regards to what's safe. (More info on this in GN's latest video on the topic).
Their original article details why they manually go through and validate their own power settings, it's just that it's been taken out of context by tech news websites. This data isn't likely to be corroborated because it also implicates the 11th gen as having a failure rate of nearly 2 gens combined, which I have seen no reports of. Puget's failure rates in general are twice the industry average. I said it before but these numbers are highly specific to Puget and they aren't intended to demonstrate high AMD failure rates. I very much doubt this holds true after this recent fiasco. 50% failure rate on your gaming servers will absolutely get you fired when the competition is 1.2%. Correct, and the original article wasn't intended to draw any conclusions on AMD rates. It's the tech media that is trying to do so for the clicks.
Oh that's right, you're incentivized lmao.