Tuesday, July 23rd 2024

Intel Statement on 13th and 14th Gen Core Instability: Faulty Microcode Causes Excessive Voltages, Fix Out Soon

Long-term reliability issues continue to plague Intel's 13th Gen and 14th Gen Core desktop processors based on the "Raptor Lake" microarchitecture, with users complaining that their processors have become unstable under heavy processing workloads, such as games. This includes chips with only minor levels of performance tuning or overclocking. Intel had earlier traced many of these stability issues to faulty CPU core frequency boosting algorithms, which it addressed with processor microcode updates that motherboard vendors and prebuilt PC manufacturers distributed as UEFI firmware updates. The company has now come out with new findings on what could be causing these issues.

In a statement Intel posted on its website on Monday (July 22), the company said that it has been investigating the processors returned to it by users under warranty claims (which it has been replacing under the terms of its warranty). It has found that faulty processor microcode has been causing the processors to operate at excessive core voltages, leading to their structural degradation over time. "We have determined that elevated operating voltage is causing instability issues in some 13th/14th Gen desktop processors. Our analysis of returned processors confirms that the elevated operating voltage is stemming from a microcode algorithm resulting in incorrect voltage requests to the processor."
Modern processor power management runs on an intricate clockwork of collaboration between software, firmware, and hardware: the software constantly tells the hardware what level of performance it wants, and the hardware manages its power and thermal budgets by rapidly altering the voltages and clock speeds of its various components, such as CPU cores, caches, fabric, and other on-die blocks. A fault in any one of these three layers can break this clockwork, as has happened in this case.

Intel is releasing yet another microcode update for its 13th and 14th Gen Core processors, which will address not just the faulty boosting algorithm issue the company unearthed in June, but also the faulty voltage management it has discovered now. This new microcode should be released sometime around mid-August to partners (motherboard manufacturers and PC OEMs), who will then need to validate it on their machines before passing it along to end-users as UEFI firmware updates.
"Intel is delivering a microcode patch which addresses the root cause of exposure to elevated voltages. We are continuing validation to ensure that scenarios of instability reported to Intel regarding its Core 13th/14th Gen desktop processors are addressed. Intel is currently targeting mid-August for patch release to partners following full validation. Intel is committed to making this right with our customers, and we continue asking any customers currently experiencing instability issues on their Intel Core 13th/14th Gen desktop processors reach out to Intel Customer Support for further assistance," the company stated.
It's important to note that the microcode update won't fix processors that are already unstable; it can only prevent instability on chips that aren't yet affected. The instability is caused by irreversible physical degradation of the chip. These chips will, of course, be covered under warranty.

Meanwhile, another interesting issue has come to light: some of Intel's processors built on the Intel 7 node are experiencing chemical oxidation within the die as they age. Intel responded, stating that it had discovered this oxidation manufacturing issue in 2023 and addressed it. The company also stated that the oxidation is connected to only a small number of the instability reports it is currently grappling with.
"We can confirm that the via Oxidation manufacturing issue affected some early Intel Core 13th Gen desktop processors. However, the issue was root caused and addressed with manufacturing improvements and screens in 2023. We have also looked at it from the instability reports on Intel Core 13th Gen desktop processors and the analysis to-date has determined that only a small number of instability reports can be connected to the manufacturing issue," the company stated.
If you feel your chip might be affected, you can file for an RMA.
Sources: Intel Community, Intel (Reddit)

387 Comments on Intel Statement on 13th and 14th Gen Core Instability: Faulty Microcode Causes Excessive Voltages, Fix Out Soon

#251
thesmokingman
TumbleGeorge wrote: "Modern farm" Unreal Engine division: 50% failed Intel 13/14 gen. Wow.
Soon all the devs are gonna be moving on to redder pastures. Good job Intel.
For those curious: at work, our failure rate for our 13900K and 14900K machines is about 50% so far. Any new machine builds are going to be 9950Xs; production environments need reliability.
#252
OneMoar
There is Always Moar
nice try on the explain-it-away there Intel
But nobody is that stupid: you pushed your silicon beyond its reasonable limits to compete with AMD and it came back to bite you
now you have sold a bunch of cpus to people that are suddenly going to get a lot slower

I hope you enjoy class actions because there is one headed your way
#253
Tomorrow
Vincero wrote: 5800X didn't really offer a lot over the 5800 (less than 2% difference in single-threaded performance, and nearly 8% in multi-threaded) but with a 60% higher TDP... and the same was true further down the product stack.
5800X: Nov 5th, 2020, Retail
5800: Jan 12th, 2021, OEM

The 5800 was released about two months later as a 65W OEM-exclusive model.
#254
phanbuey
OneMoar wrote: nice try on the explain-it-away there Intel
But nobody is that stupid you pushed your silicon beyond its reasonable limits to compete with amd and it came back to bite you
now you have sold a bunch of cpus to people that are suddenly going to get a lot slower

I hope you enjoy class actions because there is one headed your way
I'm surprised this isn't already happening.
#255
trparky
OneMoar wrote: But nobody is that stupid you pushed your silicon beyond its reasonable limits to compete with amd and it came back to bite you
That's how I feel as well. Intel is far beyond the point where they should have gone back to the drawing board to design a whole new microarchitecture from the ground up. If AMD can do it, why can't Intel? Seriously, Intel makes far more money than AMD makes.
#256
Nhonho
Intel CPU users: just buy a new PC with a 15th generation Intel CPU and the problem will be solved.

When problems arise with 15th generation Intel CPUs, just buy a new PC with a 16th or 17th generation Intel CPU and the problem will be solved.

Do it exactly this way and you won't have any problems.
#257
ratirt
mkppo wrote: GN just posted results from the analysis lab and denied RMAs. Edit: They're still awaiting results from the lab, but changes nothing.


So uh... sell dodgy chips to server farms, deny RMAs even though they knew about the oxidization issue at the time, then release a half assed statement two years after the chips' launch that the chips have an issue, sorry wait, multiple issues.

So if intel apparently 'fixed' this oxidization issue which apparently plagued early 13th gen batches, obviously they knew about it. And then they did....nothing? For years? They release a statement saying that right when third party analysts start mentioning it? Also, they conveniently fail to mention what batches were affected by the issue. Still trying to figure that out after years eh?

Something smells funny.

edit2: Just putting this out there as well, he rambles for a whole hour but the first two minutes are pretty informative lol

GN didn't leave anything for Intel to hold onto i suppose. Fed up so much it is hard to believe.
The worst part is, Intel knew about it and still went with it. I only feel sorry for those who have not received RMAs of the product even though they deserved it. Not informing partners about the issue is, literally speaking, disgusting.
I'm gonna go out on a limb and say "to be like Intel" means very dishonest. Not a shocker either.
#258
Zubasa
ratirt wrote: GN did leave anything for Intel to hold onto i suppose. Fed up so much it is hard to believe.
The worst part is, Intel knew about it and still went with it. I only feel sorry for those who have not received RMA's of the product even though they deserved it. Not informing partners about the issue is literally speaking disgusting.
I'm gonna go on a limb and say "to be like Intel" meaning very dishonest. Not a shocker either.
Worse, some users actually received RMA units that are also faulty and died/became unstable promptly.
I am not even sure if Intel truly knows which units are safe.
#259
ratirt
Zubasa wrote: Worse, some users actually received RMA units that are also faulty and died/became unstable promptly.
I am not even sure if Intel truly knows which units are safe.
Oh boy. That is below the belt to be honest.
I'm sure they know, but that's beside the point because maybe all of them are bad and it's just hard to admit it. Sometimes it is better to play dumb and look for a solution to the mess they have made. The microcode reasoning is just ridiculous here. They're just beating around the bush to buy more time. Or, I really don't know what they are doing, to be honest. I also don't care. Lost interest in Intel a long time ago.
#260
trparky
ratirt wrote: Lost interest in Intel long time ago.
Me too, at least... ever since Ryzen.
#261
Jism
trparky wrote: Me too, at least... ever since Ryzen.
There are some weird choices that Intel makes on their high end offerings.

Temperature limit of 110 degree hotspot or core.
Boosting beyond 6 GHz, right at the edge of what the silicon can take,
Nics, NUCS (Atom) failing - same degradation happening
Power targets of well over 244W to even 350W for the higher end models.

AMD on the other hand has a certain trust in the products they bring out and PBO for example is going to last you years.

I still have a 2700X at this point, PBO for years and slightly undervolted. No issues at all.
#262
LittleBro
Seriously ... what y'all have been expecting?
So, Intel ... eating insane amounts of Watts (300+) in benchmarks just to break every record and to dominate over AMD by a few digits of % ... Was it all worth it? Now you have the answer.

After years of Intel product paper releases and postponed product launches, they said they'll accelerate the arrival of upcoming generations. They said they'll deliver on time, maybe even 2 generations per year. Now you have it. Lack of time for proper product development and testing is what caused these desperate, voltage-hungry, benchmark-record-breaking, unreliable CPUs to exist in the first place. It's about goddamn time Intel realized that the proper way of progress (in terms of increasing IPC) is to modify (improve) the architecture. But you can't do that without enough time, right? Reworking the architecture was AMD's approach with Zen, Zen 2, Zen 3 and Zen 5 (excluding Zen 4). [Many don't see it, but Zen 5 is not a minor architectural improvement over Zen 4.]

This headless approach of dropping HT so that they can push the core clocks (& voltages of course) even higher than before was pointless in the first place. Now they're planning on dropping the e-cores after they had made claims that the current e-core has IPC comparable to the Raptor Lake P-core? Why the hell would you want to drop such a good core? Perhaps to free up resources for another round of brute-force freq/voltage pushing?

They invented HT, they invented the e-cores for the desktop. What amount of money was put into the development of these... They managed to win the battle with core scheduling problems and the battle with e-cores being much less powerful than P-cores. Now Arrow Lake will be stripped of HT, and Bartlett Lake will be stripped of e-cores. This kind of mess seems like a trial & error approach to me.

Admitting the oxidation issues more than a year after the 13th Gen products were released (without telling anything to anyone) really needs to see a court. They need to be fined an amount that is a considerable loss to them, like 10% of their yearly revenue or so.
#263
mkppo
phanbuey wrote: They put desktop CPUs in older blades that weren't designed for those chips -- it was a custom hack, with intel's datacenter team but a custom hack nonetheless. You're making it sound like this is a common thing -- it's not. We build our racks and servers as well, and "Blades of 14900Ks with 128gbs of ECC ram" are not a normal setup. Server chips are usually xeons or pentiums and they run at 2.1-3.2 ghz max - and they sit there for 20 years doing it.

Of course these are not designed to burn out in a few months - but if you have blades burning out your 14900Ks ... why... put... more... 14900Ks... in those blades? Not saying that the chip is good, but when you're putting a yolked 14900K into a blade to save money this is kind of exactly the downside.
I'm aware that it's not common; it probably represents 1% of total servers. But among game hosting servers it's not that rare to use desktop CPUs/sockets, because they want good single-threaded performance, not a lot of threads, and don't need to run the systems for 20 years or even 10. These systems also have a much smaller blast radius. These guys just want the CPU to be able to do its job until the next upgrade cycle, at which point they just upgrade the CPUs, all for a fraction of the cost of Xeon/EPYC.

It's precisely why AMD released their EPYC 4004 on AM5.

Also, it's not like they just kept putting more 14900K CPUs in new racks. Some devs initially built racks with 13900K/14900K CPUs, but many have failed twice, and the second time around was with an underclock, because they thought that stock clocks might be the issue for the first batch of CPUs to degrade. Makes sense to do that, because they're getting the CPUs through RMA, so it's 'free' other than the downtime and the time lost debugging. It's finally come to the point now that they're just replacing all the racks with AMD because, well, enough is enough.

Edit: autocorrect on mobile sucks sorry
#264
ratirt
Jism wrote: There's some weird choices that intel makes on their high end offerings.

Temperature limit of 110 degree hotspot or core.
Boosting beyond 6Ghz at the toes of the silicon,
Nics, NUCS (Atom) failing - same degradation happening
Power targets of well over 244W to even 350W for the higher end models.

AMD on the other hand has a certain trust in the products they bring out and PBO for example is going to last you years.

I still have a 2700X at this point, PBO for years and slightly undervolted. No issues at all.
I had a 2700x. Jumped on the 5800x I currently use due to 6900xt purchase.
No problems with my set up and the old one, 2700x serves my brother till this day with a 5600xt.
There are too many unknowns with Intel at this point for me, and my concerns extend to the current way things are going at the company.
#265
iameatingjam
Random thought... I wonder if 12th gen is going to start going up in price if this new fix isn't all it's cracked up to be. People will be looking for a way to have a working computer without having to swap out motherboards....
#266
Dorek
Jism wrote: There's some weird choices that intel makes on their high end offerings.

Temperature limit of 110 degree hotspot or core.
Boosting beyond 6Ghz at the toes of the silicon,
Nics, NUCS (Atom) failing - same degradation happening
Power targets of well over 244W to even 350W for the higher end models.

AMD on the other hand has a certain trust in the products they bring out and PBO for example is going to last you years.

I still have a 2700X at this point, PBO for years and slightly undervolted. No issues at all.
Also had a 2700X, but sadly it wasn't stable at all with PBO.
#267
BoggledBeagle
I wonder if all the different theories about what is going on are just superfluous and the real culprit is simply the frequency itself. If the real safe frequency ensuring long-term reliability for intensive 24/7 workloads on this manufacturing process is, say, 4.6 GHz, no wonder things are breaking when you run the CPUs 1 GHz faster.

Every technology has its limits and the problematic Intel 10nm process even after all the improvements may not be able to handle high frequencies, due to its higher tendency to degradation at high temperatures (or high current density or both).
#268
R0H1T
There's no fixed frequency or exact upper limit till which processors can operate, it's a range hence the ~50% failure rate not 100% or so. As for the safe frequency(range) Intel probably knew it & yet gambled on the chips surviving this under heavy loads! I don't buy any theory which doesn't have the buck at Intel's feet & it's their decisions not just a bad batch of chips :shadedshu:
#269
MikeSnow
phanbuey wrote: Of course these are not designed to burn out in a few months - but if you have blades burning out your 14900Ks ... why... put... more... 14900Ks... in those blades? Not saying that the chip is good, but when you're putting a yolked 14900K into a blade to save money this is kind of exactly the downside.
It's not the blades that are burning the CPUs. The CPUs are burning themselves. And the main reason they are doing it is not to save money, but to get high single thread performance.
#270
R0H1T
Regardless of how much I call Intel names this is on the dumb people putting OCed chips, way past their limits, in servers! The reason why server chips have conservative clocks should be fairly obvious & why running desktop chips at 6Ghz @24*7 is a bad idea.
#271
Rabit
BoggledBeagle wrote: I wonder if all the different theories about what is going on are not just superfluous and the real culprit is simply the frequency itself. If the real safe frequency ensuring long term reliability for intensive 24/7 workloads on this manufacturing process is say 4,6 GHz, no wonder that things are breaking when you run the CPUs 1GHz quicker.

Every technology has its limits and the problematic Intel 10nm process even after all the improvements may not be able to handle high frequencies, due to its higher tendency to degradation at high temperatures (or high current density or both).
The 12900K first started dying in game servers; it may simply be that the Intel node used for these CPUs gives them a short lifespan.
#272
BoggledBeagle
R0H1T wrote: There's no fixed frequency or exact upper limit till which processors can operate, it's a range hence the ~50% failure rate not 100% or so.
Yes, there is a particular fixed frequency, as the answer to the question: what is the maximum frequency at which you can run this load for this many hours a day at this temperature, if you want only this percentage of CPUs to fail in a given period of time?

It seems that Intel threw all caution and responsible behavior out the window and just cranked the frequency to the maximum that the CPUs would survive in the hands of the reviewers.
#273
MikeSnow
R0H1T wrote: Regardless of how much I call Intel names this is on the dumb people putting OCed chips, way past their limits, in servers! The reason why server chips have conservative clocks should be fairly obvious & why running desktop chips at 6Ghz @24*7 is a bad idea.
Why does Intel sell chips OCed way past their limits?
#274
R0H1T
Well I'm not Intel so I can't answer this on their behalf, although I also wouldn't mind them being sued for $100 billion for it. They got away pretty lightly for their 2004-06(?) OEM BS & ideally this time should be different.
#275
Sunny and 75
Intel needs to learn and improve, or there will be stagnation all over again.