Monday, August 5th 2024

Puget Systems Releases CPU Failure Report: AMD CPUs Achieve Higher Failure Rate Than Intel 13th and 14th Generation

A flurry of recent reports has highlighted stability issues affecting Intel's 13th and 14th-generation desktop processors, raising concerns among consumers and industry professionals. The problem, which has gained significant attention over the past few months, is related to the processors' physical degradation over time. Custom PC builder Puget Systems has shared insights from its experience with these processors, revealing a nuanced perspective on the issue. While it has observed an increase in CPU failures, particularly with the 14th-generation chips, its failure rates remain notably lower than those reported by some game development studios and cloud gaming providers, who have cited failure rates as high as 50%. An interesting observation is that Puget Systems recorded a higher failure rate with AMD Ryzen 5000 and Ryzen 7000 series CPUs than with Intel's 13th/14th generation, with most of those failures happening at Puget's shop rather than in the "field" in customers' hands.

Puget Systems attributes its more modest Intel failure rates to a conservative approach to power management settings. By adhering strictly to Intel's specifications and developing its own power settings that don't hurt performance, it has managed to mitigate some of the stability issues plaguing other users. Intel has acknowledged the problem and announced plans to release a microcode patch by mid-August, along with an extended warranty program. This update is expected to prevent further degradation but may not reverse existing damage. Despite the elevated failure rates, Puget Systems' data shows that the issue, while concerning, has not reached critical levels for its operations. The company reports that failure rates for 13th and 14th-gen Intel processors, while higher than ideal, are still lower than those it experienced with Intel's 11th-gen chips and some AMD Ryzen processors. In response to the situation, Puget Systems is taking several steps, including maintaining its current power management practices, promptly validating Intel's upcoming microcode update, and extending warranties for affected customers. Below, you can see failure rates by month, by Intel Core generation, as well as by "shop" vs. "field" testing.
Source: Puget Systems

127 Comments on Puget Systems Releases CPU Failure Report: AMD CPUs Achieve Higher Failure Rate Than Intel 13th and 14th Generation

#51
DTheSleepless
I'm going to regret posting this, I just know it.

It's very easy to cry bias, and the irony is that a lot of the blame should fall on the news sites that are posting misleading clickbait headlines. This is really shoddy reporting on TPU's part and they should be ashamed of themselves. Because here's the part that's missing from the headline:

[URL='https://www.techpowerup.com/325250/puget-systems-releases-cpu-failure-report-amd-cpus-achieve-higher-failure-rate-than-intel-13th-and-14th-generation']Puget Systems Releases CPU Failure Report: AMD CPUs Achieve Higher Failure Rate Than Intel 13th and 14th Generation[/URL]*

*When the Intel chips are using Puget's extremely conservative power and voltage settings instead of the stock spec Intel released. And also the Intel chips fail more frequently in the field. And Puget expects field failures to start piling up.

Several years ago, when I was still press and reviewed boxes from Puget Systems, I was privy to some of their failure-tracking data, and I had multiple conversations with people there. There's a reason Puget's blog is as good as it is, and that they basically invented the standardized Adobe benchmark (among others). Puget Systems is extremely data-driven. If you look at the hardware configurations they offer vs. almost any other US-based integrator, you'll find very few options for brand or model on non-CPU components. Their validation processes are the strictest I've seen.

And then having jumped the fence and product managed PC components and systems, I'm willing to bet Puget was doing their own stringent testing and validation, probably alarmed at some of the things they saw, and having closed door conversations with Intel. Because that's what typically happens: if you see a problem, you work behind the scenes with the vendor to try and fix the problem rather than putting them on blast. Jon Bach being on Intel's advisory board means he gets to be in the room and say "hey, this is stupid, you guys shouldn't do this and here's why." Giving up that seat would achieve nothing.

Intel's been on the back foot for a few generations now, throwing power specs out the window in order to get the coveted top performance spots in reviews. AMD did this to a lesser extent with Ryzen 7000; almost doubling the 7950X's stock power consumption got them enough extra percentage points of performance to eke out wins. And we've seen that if you cap AMD and Intel TDPs to a more reasonable power limit like 125 W, AMD barely sheds any performance while Intel takes a sizable hit.

Everyone wants to cry nefarious and point an accusatory finger at Puget, but to me the most disappointing thing has been the clickbait headlines from otherwise reputable news sources and rampant conspiracy-mongering because most people don't have visibility into how this stuff actually works. Puget's doing their due diligence here; they obviously don't want sales harmed, and they want to take care of their customers, so they're releasing a statement and releasing data. FFS, they don't even mention AMD in their headline or in the first few paragraphs.
Posted on Reply
#52
64K
For me it's now a wait and see even with CPUs. Let the early adopters be the Beta testers. Might not do any good because it took a long time for the Intel issues to come to light but it couldn't hurt to be a little cautious anyway.
Posted on Reply
#53
Bwaze
WirkoProcessors are only guaranteed to run at their base clock, right?
That was the bottom line when Der8auer showed in his poll that many Ryzen 3000 CPUs don't achieve their rated boost clock - because it is "up to", "depending", etc. AMD then prepared a microcode update that somewhat remedied the situation, but nobody checked whether subsequent BIOS updates broke it, or whether later generations share the same problem...
Posted on Reply
#54
Wirko
Chrispy_I'm convinced newer CPUs are either just being pushed too close to their silicon limits or modern, smaller process nodes are causing problems on a scale that we never used to see with the old double-digit nanometre nodes.
You might be right here. And people keep inventing ways to compress stuff more. Just wait for QLC DRAM!
Posted on Reply
#55
Nhonho
What an ugly thing, how low.
Instead of fixing its own mistakes, Intel is pointing at nonexistent AMD mistakes.
Posted on Reply
#56
Chaitanya
napataThe graph uses percentages...

They follow Intel's guidelines as best they can. That's what Intel should have mandated from the start, instead of auto-OCing all their CPUs.
That reminds me of how the press and scientists used relative percentages to scaremonger about saturated fats while the absolute values were not that high (a 2% risk of developing high cholesterol on a low-fat diet vs. a 4% risk with saturated fats in the diet). So yeah, I want to see absolute figures.
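The relative-vs-absolute distinction drawn above can be made concrete in a couple of lines. This is a minimal sketch; the 2%/4% values are just the illustrative numbers from the comment, not real dietary or failure data:

```python
# Relative vs. absolute risk: the same numbers, framed two ways.
# The 2% / 4% figures are the illustrative ones from the comment above.
baseline_risk = 0.02  # risk with a low-fat diet
exposed_risk = 0.04   # risk with saturated fats in the diet

absolute_increase = exposed_risk - baseline_risk                    # 2 percentage points
relative_increase = (exposed_risk - baseline_risk) / baseline_risk  # "100% higher risk!"

print(f"Absolute increase: {absolute_increase:.0%}")
print(f"Relative increase: {relative_increase:.0%}")
```

Both headlines describe the same data; only the framing differs, which is exactly why absolute volumes matter when reading a failure-rate chart.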

Posted on Reply
#57
azrael
WirkoThey don't seem to be married to Intel. On all of their product pages, AMD comes first. But, as they aren't AMD exclusive, they certainly make most of their income by selling Intel. Nobody ever got fired for buying Intel - that still holds true, I believe.
I guess the letter 'A' comes before the letter 'I'. :D
WirkoIf anyone needs more horror stories: Puget vice president is an MBA!
He's Made By AMD? The shock! The horror! :p
Posted on Reply
#58
john_
NoyandAs a matter of fact, they did publish failure rates before. But whatever happened with 11th gen didn't blow up as much as the Raptor Lake stuff. Puget was probably forced to write an article because their customers who chose an Intel workstation crapped their pants (especially since RPL was the recommended CPU for video editing, motion design, and photo editing this time around).

AHA! Interesting. I guess in that older chart 11th gen was failing in the shop, so customers weren't really affected. I also see Ryzen 5000 going from 0.77% to close to 2% over 3 years, which is not bad; 1.3% failures in 2-3 years is not bad. The same can be said about the 11th gen, so even that series seems to be fine, with shop fails possibly meaning that Puget doesn't do everything right. Have we considered this? That Puget is not the best system builder, and that's why shop failures are high?
13th and 14th gen are different because we expect failure rates in the field to skyrocket.
Posted on Reply
#59
R0H1T
WirkoRight. Upget, or any other vendor, or Intel themself. Intel extended the warranty but didn't clarify how much degradation is necessary to consider a CPU bad, and eligible for an RMA. Processors are only guaranteed to run at their base clock, right?
You got got :pimp:
Posted on Reply
#60
Nater
Without even reading this thread I can tell you Puget are Intel homers.
Posted on Reply
#61
MikeSnow
WirkoRight. Upget, or any other vendor, or Intel themself. Intel extended the warranty but didn't clarify how much degradation is necessary to consider a CPU bad, and eligible for an RMA. Processors are only guaranteed to run at their base clock, right?
Wrong. In any case, the CPU chooses the clocks, not the user, depending on the load, and by default Intel CPUs go way higher than base (the base frequency is the average frequency the CPU can sustain in a specific Intel benchmark at TDP, and Intel doesn't recommend using the TDP as a power limit).

The affected Intel CPUs become unstable at those high frequencies, frequencies that they boost to automatically, so the base frequency is irrelevant. Anyway, from my RMA experience for my 13900KF, Intel accepts the RMA if you experience instability with a recent MB BIOS and the Intel defaults enabled in the BIOS.

As for people talking about CPUs being stable in stress tests at low voltages, that's normal. In a stress test, usually all cores are busy, so you hit the power limit, which caps the maximum frequency, and lower frequencies don't need high voltages. The best way to test for stability issues at high frequencies is to run single-core benchmarks/tests, which are not constrained by power limits.

Later edit: I would refrain from running any single core tests, including RAM tests, until the new microcode is released, as running such tests might increase the risk of damage to your CPU.
Posted on Reply
#62
efikkan
The graph of CPU failures per month - does that mean fewer than 15 total failures, or are we talking about thousands? Because if it's the former, it seems like they are building a few hundred systems per month, which isn't a whole lot, but still makes it a valid data point.
AsRockYes lacks context, for example might be using 10 times more AMD CPU's than Intels.
The first graph shows failure rate, which is a relative number ;)
ymdhisNote the "shop" vs "field" difference. AMD CPUs fail more often in their shop when they are trying to apply their own overclocking or whatever, Intel CPUs fail more in the field ie. when used by users. AMD has a lot lower failure rates in the field, where they fail more is when Puget Systems are setting them up.
What about the elephant in the room: high failure rates in the shop? If I were seeing such high failure rates, I would investigate whether these are DoA, unstable, or damaged during assembly.
ymdhisSo yeah, damage control.
Why are so many of you going down the conspiracy route? At the very least there is clear evidence of confirmation bias here.
We’ve had several claims of >20%, 50% or even close to 100% failure rates without any real data to back it up. This is at least one data point, although it’s not enough to draw conclusions. What we need are data from the large systems integrators like Dell, HP, Lenovo, etc. or at the very least the medium sized ones. Also keep in mind that systems integrators typically get their hardware in large batches, so if there were quality issues related to specific production batches we would see a pattern with multiple data points.

Additionally, regarding the point above about "shop failures", those failures should be unrelated to the abnormal wear issue, whether it’s a production issue or user error.
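The volume inference sketched earlier in this post (an absolute failure count plus a relative failure rate implies a build volume) works out like this; the inputs below are hypothetical chart readings, not figures Puget has published:

```python
# Back-of-the-envelope: units shipped per month ≈ failures / failure rate.
# Inputs are hypothetical readings from the charts, not published figures.
def implied_volume(failures_per_month: float, failure_rate: float) -> int:
    if failure_rate <= 0:
        raise ValueError("failure rate must be positive")
    return round(failures_per_month / failure_rate)

# e.g. ~15 failures a month at a ~2% failure rate implies ~750 systems built per month
print(implied_volume(15, 0.02))  # 750
```

This is why a failure-rate chart alone cannot settle the sample-size question: the same rate is consistent with fifteen failures or fifteen thousand, depending on volume.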
Posted on Reply
#63
Darmok N Jalad

While it's not a lot to go on, what this graph tells me is that Puget Systems had no problems with Comet Lake. They then saw a statistically significant jump (from under 1% to 7.5%) around the time of Rocket Lake, which was really the first Intel generation to go bananas with power to the CPU. In response, Puget Systems probably learned a lesson and altered their power management strategy at the launch of Alder Lake, which appears to have drastically reduced failure rates (7.5% down to 1%). Then, all of a sudden, they saw a noticeable jump in failures with Raptor Lake (1% to 2.5%), despite their power management efforts. So if we're really looking at things, even a "properly tuned" Raptor Lake system is failing at a higher rate than Alder Lake, and these systems are much newer, so we're just now starting to see the failures present themselves. The reports are that Raptor Lake starts failing after around 6 months of active service. Alder Lake is all in the field by now and has been in operation far longer than Raptor Lake. That makes the jumps concerning, and this is apparently just the beginning. It also means the microcode patch may not mitigate the failures at all, since Puget Systems is properly configuring their rigs.

I can't make an assessment on their Ryzen rates because they don't talk at all about volume or how they set them up. I'm already assuming that their Intel volume remains constant from gen 10 to 14.
Posted on Reply
#64
Pepamami
Ryzen 5000 field failures can usually be easily fixed by a positive Curve Optimizer setting. I had a few with a SUS origin; both got fixed with a positive curve on the failing core. And none of my CPUs of PGS origin have failed so far.

Ryzen 7000 field failures are super rare compared to 13th gen, and 14th gen is just too new to show them. Just wait 2-3 more years.

But when I buy Ryzen 7000, I prefer to select PGS origin, if I am able to.
Posted on Reply
#65
LittleBro
I guess I'm a bit late to the party, but anyway, here are a few questions to consider:

1. What was so wrong with 11th gen? I can't recall anything serious but I might be wrong.

2. What is considered a failure? Unstable chip while undervolting? Computational error?

3. What is considered a conservative power setup for their chips? Which settings are touched?

4. Why are no volumes provided?

5. Why is the "shop" failure rate so high for AMD CPUs? Do they not know how to install the CPU, or do they count it as a failure when a chip does not support a particular RAM speed that was supported on the Intel platform?

6. Why is a chronological and per-SKU overview of AMD CPU failure rates missing?

Ultimately, their statistics (as presented in the graphs) don't tell us much. Volumes are missing, and since they're altering default settings, those CPUs cannot be put in the same consideration pool as those run at Intel's recommended settings. The same applies to AMD. If you alter settings such as voltages, TDP, or frequencies, you are no longer using the chip within recommended/default settings, meaning the chip might not work properly that way because it was not designed and tested to work that way. If a chip is unstable in such a case, it must not be considered a failure, as it is not being run within recommended/default settings.
PepamamiRyzen 5000 field failures can usually be easily fixed by a positive Curve Optimizer setting. I had a few with a SUS origin; both got fixed with a positive curve on the failing core. And none of my CPUs of PGS origin have failed so far.

Ryzen 7000 field failures are super rare compared to 13th gen, and 14th gen is just too new to show them. Just wait 2-3 more years.

But when I buy Ryzen 7000, I prefer to select PGS origin, if I am able to.
So you've experienced a Ryzen 5000 failure in which the chip was unstable using default settings?

I built several workstations using Zen 3 chips and they were all using high voltages. I remember one machine with a 5600X OCed to 4.8 GHz on all cores. Before I undervolted it, it was a disaster: pushing 1.42+ V was pretty common under heavy workloads. The CPU was perfectly stable at 1.337 V after I undervolted it, and it still is.

AMD is pushing a lot of voltage into Zen 3 chips, which is why I think instability at default settings is quite rare for Zen 3.
Posted on Reply
#66
Nin
This many failures? It seems awfully high for all CPUs. For example, the Ryzen 7000 CPUs have a 4% shop failure rate according to the first picture. That would mean that out of every 25 customers who buy a Ryzen 7000 CPU, one buys a broken product. One out of every 25. Isn't that way too many?
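The "one in 25" arithmetic above checks out; as a trivial sketch of the conversion:

```python
# Convert a fractional failure rate into "one in N" odds.
def one_in_n(rate: float) -> float:
    if rate <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / rate

print(one_in_n(0.04))  # 25.0 -> a 4% rate means roughly one failure per 25 units
```

Note that a *shop* failure is caught before shipping, so "one in 25 customers buys a broken product" holds only if shop rates were field rates; the odds math itself is the same either way.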
Posted on Reply
#67
Frozoken
ZubasaBecause there are Raptor Lake CPUs that have degraded even when ran well within Intel's spec.
That's because "running within spec" was like worrying about a scrape on your knee when you've been shot. They were still running at over 1.6 V even with a 253 W power limit, which is a 1000x bigger problem than running high wattages.
Posted on Reply
#68
mechtech
I thought mobo makers were caught overvolting AMD CPUs as well?
Posted on Reply
#69
Chrispy_
PepamamiRyzen 5000 field fail usually can be ez fixed by a positive CurveOptimzier setting, I had few, with a SUS origin, both got fixed with a positive Curve on the failing core. And none of my cpus from PGS origin failed so far.
What's your preferred tool to identify the failing core? Most of our 5000-series workstations are out of warranty now, so being able to curve-offset just the problem cores rather than the quick-and-easy all-core +5 would be better, but it's always a time/effort trade-off.
mechtechI thought mobo makers were caught also overvolting AMD cpus as well?
That was an Asus AM5 thing, if you're referring to the burnt CPUs and melted sockets.
Posted on Reply
#70
Darmok N Jalad
Rocket Lake was a mess of a release. It was supposed to be 10 nm but had to be backported to 14 nm, and it went backwards in core count and up in power consumption compared to Comet Lake. I don't think it had that long of a shelf life; it was more of a "we had it in the roadmap so here it is" kind of launch. I think it even launched with an immediate price cut. We don't talk about Rocket Lake because it wasn't a great product to start with and everyone moved on from it quickly.
Posted on Reply
#71
Bobaganoosh
A lot of people here seem to be missing a huge factor with regards to Puget on "why release this statement now?" and "why include AMD?". They make professional workstations, mostly for heavy CAD users (3D mechanical design; physics simulations such as optical, thermal, and stress; rendering; and a lot more), not gamers. These machines are likely going to sit in some engineer's (or artist's, etc.) cubicle, and a lot of them will never be cleaned, updated, or monitored the way a lot of enthusiast gamers' machines are. (IT departments at many corporations are now limiting driver and Windows updates, especially on dedicated or shared CAD machines, since Windows updates tend to cause as many problems as they fix and drivers get stuck on whatever Solidworks, or insert CAD tool here, has certified for that version of Windows.) They have to just sit there and work...indefinitely.

The end result of the statement above is that Puget (and similar companies) are incentivized to do a lot of testing on the hardware they're sending out so they can do whatever is possible to reduce failures. This is why a lot of CPUs are super power-limited right out of the box (to reduce heat and long-term wear), whether AMD or Intel. I have a Lenovo workstation at work, for example, with a Threadripper in it that could most certainly run at a higher power level, but it's been locked and reduced so that it never will. Most heavily threaded CAD workloads don't care too much about clock speed anyway; it's core count and generational IPC improvements that help.

Add to that, right now, when all this coverage is happening, this company in particular, which as others have stated sells a lot of Intel rigs, has undoubtedly received many questions from customers and potential customers. Puget has more published test data and processes than most companies like this, so it does not surprise me at all that they'd try to get some data out. That's "why now?", and if you are a customer asking about problems with Intel, your next question is obviously "what about AMD?!"...

Yes, whether AMD or Intel, they're going to set their own BIOS settings to minimize heat and long-term wear on all systems. These are not going to be overclocked gaming systems.

TLDR: This means that the data is not necessarily applicable to gamers, anybody running their motherboard's default settings, or anybody who's overclocking their PC.
Posted on Reply
#72
Pepamami
Chrispy_What's your preferred tool to identify the failing core? Most of our 5000-series workstations are out of warranty now so being able to curve offset just the problem cores rather than the quick-and-easy all-core +5 would be better, but's it's always a time/effort trade-off.
First of all, this is my own experience; your experience can be different. That's why I say "usually", not "always" - I'm not guaranteeing anything. But in my cases, and my friends' cases, it was a silly voltage regulation issue on the cores, SoC/uncore, or memory.

For example, in DDR4 memory cases I had to add extra voltage even to generic 1.2 V RAM modules (running them at 1.28 V instead of 1.2 V). My Threadripper 2990WX was freezing otherwise.

To detect bad cores I use the simple "CoreCycler" tool; you can get it on GitHub. I use it with the default settings but change the test time from 6 minutes to 60 minutes.
You then test cores in "preferred cores" order (HWiNFO can show you that order), but I think the default random order will do too.
Also, this test can fail randomly if you have problems with memory (bad timings, or it needs more voltage, like 1.38 V instead of the basic 1.35 V), or when the SoC/uncore voltage is "bad".
LittleBroSo you've experienced Ryzen 5000 failure in which the chip was unstable using default settings?
Yes, but it was on day 1 of using it: single-core Prime95 did not pass, always failing on the first 2 "fast" cores. And I was too lazy to RMA it, because +5/+7 on one core and +1/+4 on the other did not change performance noticeably. But this P95 fail could easily cause a blue screen in the future, so I classify it as a "field fail", not a "shop fail". And I think it has to do with the crappy SUS origin, where they mess up proper testing, while PGS origin has given me no issues.
Posted on Reply
#73
evernessince
Outback BronzeSo, if 11th gen is so bad, how come I never heard about it like the so called famous 13/14th gen??
These numbers are likely derived from a small sample size, hence the inconsistencies. Even their "low" failure-rate CPU families are about twice the industry average. They aren't useful for anyone but Puget. You could have easily titled this article "Puget shows 11th gen failure issues" and it would have been just as misleading. The reason we got the headline we got is for the clicks, plain and simple.
BwazeI wonder what prompted Puget Systems to check whether the base motherboard settings adhere to Intel documented base CPU settings, and then to change the motherboard settings when they found out they didn't? Was the reason absolute stability, or did they suspect any increased power would shorten life expectancy of the CPU?

Intel apologists were quick to point out we should disregard the reported high failure rates from companies that used these consumer CPUs in render farms, servers - that this is just product misuse, there is a reason why companies sell server, workstation lines of CPUs. Puget Systems builds and tests workstations just from such products, consumer CPUs - isn't this info invalid too? Or is this now perfectly acceptable, because the end line is "AMD fails even more", especially when you bury the point that Intel CPUs are beginning to show elevated failure rates later in their life?
Intel's official documentation for the 13th and 14th gen actually recommends against the baseline profile.



This is because if Intel recommended the baseline profile as the default, they would lose a massive amount of performance. The result is that the baseline profile isn't actually the baseline profile, and there really is no singular recommendation from Intel as to what's safe. (More info on this in GN's latest video on the topic.)

Their original article details why they manually go through and validate their own power settings; it's just that it's been taken out of context by tech news websites.
JWNoctisOkay, story time: a quite similar A-versus-B comparison in another community was Airbus vs. Boeing, and the two have never quite, directly or indirectly, called each other's aircraft unsafe, even in the aftermath of tragedies like AF447 and the MAX accidents. Notably, Boeing's saga with the MAX bears some resemblance to Intel's current predicament, except for the actual loss of life.

This one may bear some comparison, depending on how much further (and/or lower) the recriminations go. And whether the story would be corroborated elsewhere.
This data isn't likely to be corroborated, because it also implicates the 11th gen as having a failure rate nearly as high as two generations combined, which I have seen no reports of. Puget's failure rates in general are twice the industry average. I said it before, but these numbers are highly specific to Puget, and they aren't intended to demonstrate high AMD failure rates.
WirkoNobody ever got fired for buying Intel - that still holds true, I believe.
I very much doubt this holds true after this recent fiasco. A 50% failure rate on your gaming servers will absolutely get you fired when the competition is at 1.2%.
Darmok N JaladI can't make an assessment on their Ryzen rates because they don't talk at all about volume or how they set them up. I'm already assuming that their Intel volume remains constant from gen 10 to 14.
Correct, and the original article wasn't intended to draw any conclusions about AMD rates. It's the tech media that is trying to do so for the clicks.
Posted on Reply
#74
thesmokingman
Puget is clearly full of it. Why even bother dropping your name into the conversation with 2% vs. 4% failure rates? You're not even in the same ballpark when others are seeing 50% to 100% failure rates. Quick, let's redirect to AMD - zomg, a 2% higher failure rate, no one will notice.

Oh that's right, you're incentivized lmao.
Posted on Reply
#75
AusWolf
Darmok N Jalad
While it's not a lot to go on, what this graph tells me is that Puget Systems had no problems with Comet Lake. They then saw a statistically significant jump (from under 1% to 7.5%) around the time of Rocket Lake, which was really the first Intel generation to go bananas with power to the CPU. In response, Puget Systems probably learned a lesson and altered their power management strategy at the launch of Alder Lake, which appears to have drastically reduced failure rates (7.5% down to 1%). Then, all of a sudden, they saw a noticeable jump in failures with Raptor Lake (1% to 2.5%), despite their power management efforts. So if we're really looking at things, even a "properly tuned" Raptor Lake system is failing at a higher rate than Alder Lake, and these systems are much newer, so we're just now starting to see the failures present themselves. The reports are that Raptor Lake starts failing after around 6 months of active service. Alder Lake is all in the field by now and has been in operation far longer than Raptor Lake. That makes the jumps concerning, and this is apparently just the beginning. It also means the microcode patch may not mitigate the failures at all, since Puget Systems is properly configuring their rigs.

I can't make an assessment on their Ryzen rates because they don't talk at all about volume or how they set them up. I'm already assuming that their Intel volume remains constant from gen 10 to 14.
If your theory is right, and they really started configuring their own power limits after 11th gen, then it makes the data even more skewed and irrelevant.
Posted on Reply