Tuesday, July 23rd 2024

Intel Statement on 13th and 14th Gen Core Instability: Faulty Microcode Causes Excessive Voltages, Fix Out Soon

Long-term reliability issues continue to plague Intel's 13th Gen and 14th Gen Core desktop processors based on the "Raptor Lake" microarchitecture, with users complaining that their processors have become unstable with heavy processing workloads, such as games. This includes the chips that have minor levels of performance tuning or overclocking. Intel had earlier isolated many of these stability issues to faulty CPU core frequency boosting algorithms, which it addressed through updates to the processor microcode that it got motherboard- and prebuilt manufacturers to distribute as UEFI firmware updates. The company has now come out with new findings of what could be causing these issues.

In a statement Intel posted on its website on Monday (22/07), the company said that it has been investigating the processors returned to it by users under warranty claims (which it has been replacing under the terms of its warranty). It has found that faulty processor microcode has been causing the processors to operate under excessive core voltages, leading to their structural degradation over time. "We have determined that elevated operating voltage is causing instability issues in some 13th/14th Gen desktop processors. Our analysis of returned processors confirms that the elevated operating voltage is stemming from a microcode algorithm resulting in incorrect voltage requests to the processor."
Modern processor power management runs on an intricate clockwork of collaboration between software, firmware, and hardware, with the software constantly telling the hardware what levels of performance it wants, and the hardware managing its power- and thermal budgets by rapidly altering the power and clock speeds of the various components, such as CPU cores, caches, fabric, and other on-die components. A faulty collaboration between any of the three key components could break this clockwork, as has happened in this case.

Intel is releasing yet another microcode update to its 13th- and 14th Gen Core processors, which will address not just the faulty boosting algorithm issue the company unearthed in June, but also the faulty voltage management the company discovered now. This new microcode should be released some time around mid-August to partners (motherboard manufacturers and PC OEMs), who will then need to validate it on their machines, before passing it along to end-users as UEFI firmware updates.
"Intel is delivering a microcode patch which addresses the root cause of exposure to elevated voltages. We are continuing validation to ensure that scenarios of instability reported to Intel regarding its Core 13th/14th Gen desktop processors are addressed. Intel is currently targeting mid-August for patch release to partners following full validation. Intel is committed to making this right with our customers, and we continue asking any customers currently experiencing instability issues on their Intel Core 13th/14th Gen desktop processors reach out to Intel Customer Support for further assistance," the company stated.
It's important to note here that the microcode update won't fix processors that are already experiencing instability; it will only prevent the problem on chips that aren't yet affected. The instability is caused by irreversible physical degradation of the chip. These chips will, of course, be covered under warranty.
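
For readers on Linux who want to see which microcode revision their system is currently running, here is a minimal sketch. It assumes the kernel exposes a `microcode` field in `/proc/cpuinfo`, as x86 Linux kernels generally do. The function names and the `minimum_revision` threshold are illustrative placeholders: Intel's statement does not name the revision number of the upcoming fix, so none is hard-coded here.

```python
import re
from typing import Optional


def parse_microcode_revision(cpuinfo_text: str) -> Optional[int]:
    """Return the first 'microcode' revision found in /proc/cpuinfo-style text."""
    match = re.search(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$", cpuinfo_text, re.MULTILINE
    )
    return int(match.group(1), 16) if match else None


def is_at_least(cpuinfo_text: str, minimum_revision: int) -> bool:
    """True if a revision is reported and is >= minimum_revision.

    minimum_revision is a hypothetical value supplied by the caller;
    revisions are only comparable for the same CPU model.
    """
    revision = parse_microcode_revision(cpuinfo_text)
    return revision is not None and revision >= minimum_revision


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo", encoding="ascii", errors="replace") as handle:
            text = handle.read()
    except OSError:
        text = ""  # not Linux, or the file is unavailable
    revision = parse_microcode_revision(text)
    print("microcode revision:", hex(revision) if revision is not None else "not reported")
```

Once a motherboard vendor publishes the revision number contained in its UEFI update, that number could be passed to `is_at_least` to confirm the update actually took effect after flashing.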

Meanwhile, a separate issue has come to light: some of Intel's processors built on the Intel 7 node are experiencing chemical oxidation of the die as they age. Intel responded to this, stating that it had discovered the oxidation manufacturing issue in 2023 and addressed it. The company also stated that die oxidation accounts for only a small number of the instability reports it is embattled with.
"We can confirm that the via Oxidation manufacturing issue affected some early Intel Core 13th Gen desktop processors. However, the issue was root caused and addressed with manufacturing improvements and screens in 2023. We have also looked at it from the instability reports on Intel Core 13th Gen desktop processors and the analysis to-date has determined that only a small number of instability reports can be connected to the manufacturing issue," the company stated.
If you feel your chip might be affected, you can file for an RMA.
Sources: Intel Community, Intel (Reddit)

387 Comments on Intel Statement on 13th and 14th Gen Core Instability: Faulty Microcode Causes Excessive Voltages, Fix Out Soon

#301
trparky
I know that this question is going to go off on a bit of a tangent but... Is anyone going to really trust the next generation of Intel chips to have no faults? Has Intel's trust been tarnished?
Posted on Reply
#302
Dredi
phanbuey: They're not 'users' though -- they're my colleagues - I work with people that do this, and I do this. They absolutely know what they're doing. And when they put consumer 'overclocking' SKUs, whether intel or AMD, CPUs into their server farm and servers start crashing and they get all outraged and start tweeting about their farm, I have 0 sympathy.

100% sympathy to the end-users and the SIs that Intel did screw over. My point was that "SERVER" crash rates in a conversation about desktop overclocking cpus is less relevant since other sub-optimal decisions had to be made for that to happen.
Users in this example case would be the players playing the games on the servers that the game hosting entity manages. For that use, again, the consumer parts have been vastly superior to the xeon parts. Those are still headless servers and they are running production workloads. And those workloads do not need to be 100% stable, but 50% stable as with the 14th gen parts is VASTLY different to the one people have been used to in the last X years.

Your typical production system is probably not something that matches said use. Based on your description of colleagues using the systems, you are likely describing some internal environment, and not a production one.

Not every counter strike server needs to be hosted on a triplicated, geographically redundant xeon system with ecc memory. They are still servers and running production workloads. 50% failure rate. And then you comment ”they should have known”.. like how??
Posted on Reply
#303
MikeSnow
phanbuey: They're not 'users' though -- they're my colleagues - I work with people that do this, and I do this. They absolutely know what they're doing. And when they put consumer 'overclocking' SKUs, whether intel or AMD, CPUs into their server farm and servers start crashing and they get all outraged and start tweeting about their farm, I have 0 sympathy.

100% sympathy to the end-users and the SIs that Intel did screw over. My point was that "SERVER" crash rates in a conversation about desktop overclocking cpus is less relevant since other sub-optimal decisions had to be made for that to happen.
It has nothing to do with sympathy, or what the owners of those servers should have or should not have been using. Servers have been brought into these discussions mainly to prove that not only regular consumers are affected. Until recently, Intel suggested it's either the motherboard manufacturers pushing the CPUs too far, or the consumers themselves. The discussion about servers showed that there was more than consumers or MB manufacturers doing "the wrong thing", because for server/workstation class motherboards it is less likely that non-recommended defaults or overclocks have been used. And that proved to be true, the CPU itself was the main cause of these issues.
Posted on Reply
#304
64K
trparky: I know that this question is going to go off on a bit of a tangent but... Is anyone going to really trust the next generation of Intel chips to have no faults? Has Intel's trust been tarnished?
My following opinion has nothing to do with what I think Intel probably deserves. Just an observation of what people say on the internet and what they do being entirely different things sometimes and about how entrenched the Intel brand is with the vast majority of consumers.

With tech enthusiasts like on here maybe a few will not buy Intel for awhile. With the vast majority of consumers I don't think this is enough to switch. Probably what would have a greater effect on Intel going forward is an expensive class action suit. Shareholders don't like messy things like that.
Posted on Reply
#305
phanbuey
Dredi: Users in this example case would be the players playing the games on the servers that the game hosting entity manages. For that use, again, the consumer parts have been vastly superior to the xeon parts. Those are still headless servers and they are running production workloads. And those workloads do not need to be 100% stable, but 50% stable as with the 14th gen parts is VASTLY different to the one people have been used to in the last X years.

Your typical production system is probably not something that matches said use. Based on your description of colleagues using the systems, you are likely describing some internal environment, and not a production one.

Not every counter strike server needs to be hosted on a triplicated, geographically redundant xeon system with ecc memory. They are still servers and running production workloads. 50% failure rate. And then you comment ”they should have known”.. like how??
Because they are datacenter engineers and they know that a "production" counterstrike server that is making money <> your counterstrike server that you're running for funsies in your living room. And if you use ROG boards and K sku processors, yes your tic rate will be higher and your performance/$ higher, but your uptime will suffer. That is exactly why people that make money with their servers tend to use xeons. There are game servers using xeons and none of those people are crying on twitter rn.

That's the tradeoff -- they made that tradeoff purposely, and now that it's biting them and they're crying. This is like using walgreens power strips instead of PDUs and then crying about their uptime or when there's a defect in them. Yes 99% of the time it's fine, and yes if they're failing in people's homes its really bad... but you put one in your rack when everyone warns against that - so no it's not your fault that it melted while you were using it 'in spec' but the decision to use it there, if you require any kind of uptime, is extremely questionable.
MikeSnow: It has nothing to do with sympathy, or what the owners of those servers should have or should not have been using. Servers have been brought into these discussions mainly to prove that not only regular consumers are affected. Until recently, Intel suggested it's either the motherboard manufacturers pushing the CPUs too far, or the consumers themselves. The discussion about servers showed that there was more than consumers or MB manufacturers doing "the wrong thing", because for server/workstation class motherboards it is less likely that non-recommended defaults or overclocks have been used. And that proved to be true, the CPU itself was the main cause of these issues.
But these are not "Server" cpus... so yes your point stands the CPU is 100% at fault, and the desktops crashing is bad, and the laptop users is even worse (imo since it makes the reason of "voltages too high" make no sense). But the all of the sudden datacenter people coming out and being like 'AND OUR SERVERS ARE ALSO CRASHING' -wait what... why are you putting K sku desktop chips in your server. And the response of "but not everyone needs a xeon/epyc" is not valid -- this is exactly why you need a xeon.

If you show up at a construction site with your Home Depot Ryobi Drill that you bought for $70 and it burns out after 8 hours of usage EVERYONE there will make fun of you. They'll most likely do it as soon as you take it out of the bag. Use the right tool for the job.
Posted on Reply
#306
MikeSnow
phanbuey: But these are not "Server" cpus... so yes the desktops crashing is bad, and the laptop users is even worse (imo since it makes the reason of "voltages too high" make no sense). But the all of the sudden datacenter people coming out and being like 'AND OUR SERVERS ARE ALSO CRASHING' -wait what... why are you putting K sku desktop chips in your server. And the response of "but not everyone needs a xeon/epyc" is not valid -- this is exactly why you need a xeon.

If you show up at a construction site with your Home Depot Ryobi Drill that you bought for $70 and it burns out after 8 hours of usage EVERYONE there will make fun of you. They'll most likely do it as soon as you take it out of the bag. Use the right tool for the job.
Nobody said those people using them on servers expected no crashes on them, or Xeon level stability. Normal consumer level stability and failure rates would have been enough for those people, otherwise they wouldn't have used them. The problem is that these CPUs are not even consumer level, as they expected. In my opinion these CPUs are "do not use for any purpose unless you want to risk wasting time or money" level. At least until the new microcode is released.
Posted on Reply
#307
R0H1T
But "6Ghz" has never been normal. Why would you go with something absolutely brand new & having basically zero reliability or history to lean on :wtf:

The more you think this through the more you'll realize the guys doing this are a lot more at fault than you'd admit to!
Posted on Reply
#308
phanbuey
MikeSnow: Nobody said those people using them on servers expected no crashes on them, or Xeon level stability. Normal consumer level stability and failure rates would have been enough for those people, otherwise they wouldn't have used them. The problem is that these CPUs are not even consumer level, as they expected. In my opinion these CPUs are "do not use for any purpose unless you want to risk wasting time or money" level. At least until the new microcode is released.
Absolutely -- you're right they are consumer level and they failed on that level beyond any excuse.

They used them because they have a downtime tolerance and they save a ton of money and get substantially more performance/$. It's generally bad practice though, and many consider it in the "f@#% around and find out" territory. And oftentimes they find out -- not always through catastrophic degradation, but errata, memory issues, bugs etc.

This one guy tweeted "Production systems need to be stable" and in the same tweet thought that most of the failures were on ROG Strix Boards.

That's like saying - "our commercial vehicles need to be stable" and then being like "most of our failures have been on the pink Toys for Tots Jeeps."
Posted on Reply
#309
agent_x007
phanbuey: Production workloads, by definition, require stability and performance, but first and foremost stability. You can build a mac mini farm, or a rasberry pi farm, or a farm of blades running game servers run a specific workload and call it 'Production', but if your processor comes with marketing materials with the words 'Exxxtreme' or 'Overclocking' or 'Gamers' it is not a production class system.
That is a VERY dangerous statement, because you are telling non-pro consumers to expect a BSOD or a "locked" system once in a while, because it's "consumer grade" and not "pro" grade, if they did something "non-consumer" with it.

"Server workload", "consumer workload", is just that : A "workload".
If CPU can't handle a workload it supports under normal operating conditions, and using recommended/stock settings - it's simply defective.
Defective CPUs should be covered under warranty (unless it expired).

From past perspective, there are multiple users of old hardware (10 years+), that simply put a private Minecraft server on consumer grade hardware, and have no issues (only CPU side). Expecting something else from modern stuff is a downgrade to quality of CPUs that must NOT happen.
I DO NOT want a CPU that will die just after warranty, because it was designed to fail that way or was "not rated for something, so it died faster".
Same goes for "you run bad program on it, so it failed faster" <= this is NOT how things worked so far.

You can argue that manufacturers job is to keep in mind all workloads a client can run on it, and make sure CPUs they make can take them in minimum warranty period time frame. However, giving something extra (I own a working Pentium MMX CPU), is how Intel got to where it is today.
Giving up life expectancy of CPUs past warranty period for few more % of performance earlier in their life is insanely easy way to grind company market share into the ground in long term (20 years+). I just assume competition simply doesn't do it and who will buy CPU that is guaranteed to die right after warranty period ends - "pro-sumer" or "casual" doesn't mean anything here - when alternative will just work fine for just few more years.
Posted on Reply
#310
phanbuey
agent_x007: That is a VERY dangerous statement, because you are telling non-pro consumers to expect a BSOD or a "locked" system once in a while, because it's "consumer grade" and not "pro" grade, if they did something "non-consumer" with it.
that's not a dangerous statement - that's exactly the defense that will be used in court...

"We lost money by putting these cpus in our production servers"
Lawyer: "And are these SKUs designed for 24/7 operation and uptime in production servers?"
"No"
...
Posted on Reply
#312
Dredi
phanbuey: but your uptime will suffer. That is exactly why people that make money with their servers tend to use xeons.
Yep. They used to suffer like 1%, while xeons were at 0.1%, a calculated trade-off. Now they suffer like 50% and ”they should have known”. Like how?

no intel lawyer is going to state that the K sku’s are designed to only last a few months of actual use.
Posted on Reply
#313
thesmokingman
Lmao, from the other thread to this one, now the cope is blaming the chip users ie. devs using these chips in game servers. Lmao, the cope is at the max now.
Posted on Reply
#314
Dredi
phanbuey: This one guy tweeted "Production systems need to be stable" and in the same tweet thought that most of the failures were on ROG Strix Boards.

That's like saying - "our commercial vehicles need to be stable" and then being like "most of our failures have been on the pink Toys for Tots Jeeps."
Quite many production vehicles are renault clios, toyota yaris’s, etc. Pizza delivery. Security crews. Etc.

Toyota hilux is not necessary for such tasks.
thesmokingman: Lmao, from the other thread to this one, now the cope is blaming the chip users ie. devs using these chips in game servers. Lmao, the cope is at the max now.
”YoU aRe UsInG iT WrOnG”.

LMAO
Posted on Reply
#315
phanbuey
Dredi: Yep. They used to suffer like 1%, while xeons were at 0.1%, a calculated trade-off. Now they suffer like 50% and ”they should have known”. Like how?

no intel lawyer is going to state that the K sku’s are designed to only last a few months of actual use.
That's not what they're going to argue. They're going to argue that they're not liable for the damages because that product is not designed for that use - that their liability will stop at replacing the CPU.
Dredi: "YoU aRe UsInG [the] WrOnG [chip/platform for your servers]”.
Fixed.
Posted on Reply
#316
Dredi
phanbuey: That's not what they're going to argue. They're going to argue that they're not liable for the damages because that product is not designed for that use - that their liability will stop at replacing the CPU.
They are not even replacing the cpu’s …

that is the issue for some of the more audible affected users. The chips have been purchased in ’trays’, with limited warranties, and intel is being an ass and claiming that the failures are not defects covered by said limited warranties.
Posted on Reply
#317
phanbuey
Dredi: They are not even replacing the cpu’s …
They're 1000% getting sued though, so... they will.
Posted on Reply
#318
Dredi
phanbuey: They're 1000% getting sued though, so... they will.
Hope so. One game hosting service has about 1000 RMA’s denied.
Posted on Reply
#319
Makaveli
fevgatos: I'm not going to punish a company based on someone else's experience. Intel cpus have worked flawlessly for me so there is that, I'll keep buying until they don't.
This was probably the worst thing you could have posted in this thread.

So besides all the evidence lets just ignore it and keep our head in the sand?

All credibility out the window!
Posted on Reply
#320
SRB151
Tomorrow: I can also offer a worse example: Sony.
I bought their Linkbuds S wireless earbuds in 2022 only for these to develop a battery discharge issue a week after my two year warranty period ended (both go from 100% to empty within 15-30 minutes instead of usual ~8 hours).
Reading reviews and comments online there are many people facing the same issue with both the Linkbuds S and WF-1000 XM4 models produced and bought in 2022.

Yet Sony has not even acknowledged the issue nor provided any replacements for customers because in their eyes the warranty period for both products (for 2022 buyers at least) has ended and thus they feel they dont have to do anything.

A product failing a week after warranty ended feels like planned obsolescence...
If it was the same issue as my WF series buds, Sony did something similar to the auto manufacturers: they would replace the internal batteries at no charge if you sent them in for repair, without advertising or publishing it. Interestingly enough, kind of on topic here, one of the app updates sent out by Sony was the fix to keep the buds from wiping out the battery.

As for Intel, quiet options for handling this are no longer an option for them. Also, I read somewhere that as much as 1/3 of the 13th and 14th gen cpus are of the 13900 and 14900 models. Not sure if this number was for overall sales or just the boxed processors for end users. On top of that, one of the game developers were running about a 50% failure rate using W series server boards (read, no overclocking at all). I would assume with the latest bios patches as well.
What I don't get is why it has taken this long to "identify" the problem? Could Intel just be stalling for time to get past the Ryzen 9000 launch? Stall until Arrow Lake changes the narrative?

The other thing that has me wondering is that for Arrow Lake S, the latest news is that Intel is trying to push up the launch date by trying to get the QS samples out early and shorten the validation period. Sounds like they may not have learned any lessons from this experience.

P.S. For the record, the X370 bios issue was exactly about the size of the rom chips. When I upgraded my friends board, Gigabyte had two bios choices, early CPU's or later ones, If you updated for the newer chips, you could NOT load the older bios, and could no longer run 1st (and 2nd?) gen CPU's
Shrek: Not a Defect: Intel Blames 13th, 14th Gen CPU Crashes on Software Bug (msn.com)
The part of a CPU with the microcode is called FIRMWARE, not software. It's not like you can just load it at will from anywhere, it's part of the CPU and (as it should be) not easily accessed or reflashed.
Posted on Reply
#321
efikkan
agent_x007: "Server workload", "consumer workload", is just that : A "workload".
If CPU can't handle a workload it supports under normal operating conditions, and using recommended/stock settings - it's simply defective.
Defective CPUs should be covered under warranty (unless it expired).

From past perspective, there are multiple users of old hardware (10 years+), that simply put a private Minecraft server on consumer grade hardware, and have no issues (only CPU side).
The workload absolutely matters, and I think everyone knows that. Using consumer grade hardware for server use isn't going to cause legal problems for Intel.
Server grade hardware is certified for high load 24/7 for x years. Running a private Minecraft server isn't exactly something that causes sustained peak load all the time (but hosting a lot of them for a company may be a lot more). Running a machine 24/7 without a significant load isn't much of a problem, I've been doing that for most of my machines for over a decade, and like many others in the industry I too have a home "server" for files/media/git/building/etc., but it's nothing with sustained high load, if so I would have to choose hardware accordingly.
Posted on Reply
#322
MikeSnow
efikkan: The workload absolutely matters, and I think everyone knows that. Using consumer grade hardware for server use isn't going to cause legal problems for Intel.
Server grade hardware is certified for high load 24/7 for x years. Running a private Minecraft server isn't exactly something that causes sustained peak load all the time (but hosting a lot of them for a company may be a lot more). Running a machine 24/7 without a significant load isn't much of a problem, I've been doing that for most of my machines for over a decade, and like many others in the industry I too have a home "server" for files/media/git/building/etc., but it's nothing with sustained high load, if so I would have to choose hardware accordingly.
Ironically, in this particular case high load (multithreaded, all cores) could make these CPUs last longer, because they would reach the power limit and run the cores at lower frequencies than the maximum turbo boost, meaning also lower voltages. The worst would be loading just a core or two for long periods. And even keeping it mostly idle could result in periodic frequency and voltage spikes and some amount of degradation.

When I had issues with my 13900KF, which is now on its way to Intel, I could run prime95 on all the cores without any issues for hours. But even something as trivial as opening a new browser tab, with the computer idle, could result in a crashed tab due to access violations.

If I had a computer with a 13th gen or 14th gen CPU that is not currently experiencing stability issues I would refrain from using it until the microcode update is released. If I really had to use it, I would probably limit the maximum multiplier to x50 or less.
Posted on Reply
#323
Tek-Check
What is going to happen now that the genie is out of the bottle?
- will Intel publish serial numbers of CPUs affected by overvoltage and/or oxidation so that owners could identify them and file RMA?
- will second hand market go completely bonkers, as no buyer will know for sure whether they buy an affected CPU?
- how many online gaming companies will switch to AMD systems?
- will confidence in Intel brand and reliability suffer?
- so many questions...
Posted on Reply
#324
MikeSnow
Intel seems to be brewing something regarding CPU warranties:
Processor warranty validation will be down for planned maintenance on the Online Service Center during July 24th, 2024, 8:00 PM (PST) until July 24th, 2024, 09:30 PM (PST).
Posted on Reply
#325
trparky
I know that my confidence in the Intel brand has been shaken.
Posted on Reply