Tuesday, July 23rd 2024
Intel Statement on 13th and 14th Gen Core Instability: Faulty Microcode Causes Excessive Voltages, Fix Out Soon
Long-term reliability issues continue to plague Intel's 13th Gen and 14th Gen Core desktop processors based on the "Raptor Lake" microarchitecture, with users complaining that their processors have become unstable with heavy processing workloads, such as games. This includes the chips that have minor levels of performance tuning or overclocking. Intel had earlier isolated many of these stability issues to faulty CPU core frequency boosting algorithms, which it addressed through updates to the processor microcode that it got motherboard- and prebuilt manufacturers to distribute as UEFI firmware updates. The company has now come out with new findings of what could be causing these issues.
In a statement Intel posted on its website on Monday (22/07), the company said that it has been investigating the processors returned to it by users under warranty claims (which it has been replacing under the terms of its warranty). It has found that faulty processor microcode has been causing the processors to operate under excessive core voltages, leading to their structural degradation over time. "We have determined that elevated operating voltage is causing instability issues in some 13th/14th Gen desktop processors. Our analysis of returned processors confirms that the elevated operating voltage is stemming from a microcode algorithm resulting in incorrect voltage requests to the processor."
Modern processor power management runs on an intricate clockwork of collaboration between software, firmware, and hardware, with the software constantly telling the hardware what levels of performance it wants, and the hardware managing its power- and thermal budgets by rapidly altering the power and clock speeds of the various components, such as CPU cores, caches, fabric, and other on-die components. A faulty collaboration between any of the three key components could break this clockwork, as has happened in this case.
Intel is releasing yet another microcode update to its 13th- and 14th Gen Core processors, which will address not just the faulty boosting algorithm issue the company unearthed in June, but also the faulty voltage management the company discovered now. This new microcode should be released some time around mid-August to partners (motherboard manufacturers and PC OEMs), who will then need to validate it on their machines, before passing it along to end-users as UEFI firmware updates.
Meanwhile, an interesting issue has come to light, which is that some of Intel's processors built on the Intel 7 node are experiencing chemical oxidation of the die as they age. Intel responded to this, stating that it had discovered the oxidation manufacturing issue in 2023, and addressed it. The company also stated that die oxidation is not related to the stability issues it is embattled with.
Sources:
Intel Community, Intel (Reddit)
"Intel is delivering a microcode patch which addresses the root cause of exposure to elevated voltages. We are continuing validation to ensure that scenarios of instability reported to Intel regarding its Core 13th/14th Gen desktop processors are addressed. Intel is currently targeting mid-August for patch release to partners following full validation. Intel is committed to making this right with our customers, and we continue asking any customers currently experiencing instability issues on their Intel Core 13th/14th Gen desktop processors reach out to Intel Customer Support for further assistance," the company stated.
It's important to note here that the microcode update won't fix processors that are already experiencing instability; it only prevents the problem on chips that aren't yet affected. The instability is caused by irreversible physical degradation of the chip. Affected chips will, of course, be covered under warranty.
"We can confirm that the via Oxidation manufacturing issue affected some early Intel Core 13th Gen desktop processors. However, the issue was root caused and addressed with manufacturing improvements and screens in 2023. We have also looked at it from the instability reports on Intel Core 13th Gen desktop processors and the analysis to-date has determined that only a small number of instability reports can be connected to the manufacturing issue," the company stated.
If you feel your chip might be affected, you can file for an RMA.
387 Comments on Intel Statement on 13th and 14th Gen Core Instability: Faulty Microcode Causes Excessive Voltages, Fix Out Soon
Your typical production system is probably not something that matches said use. Based on your description of colleagues using the systems, you are likely describing some internal environment, and not a production one.
Not every counter strike server needs to be hosted on a triplicated, geographically redundant xeon system with ecc memory. They are still servers and running production workloads. 50% failure rate. And then you comment ”they should have known”.. like how??
With tech enthusiasts like on here maybe a few will not buy Intel for awhile. With the vast majority of consumers I don't think this is enough to switch. Probably what would have a greater effect on Intel going forward is an expensive class action suit. Shareholders don't like messy things like that.
That's the tradeoff -- they made that tradeoff purposely, and now that it's biting them, they're crying. This is like using Walgreens power strips instead of PDUs and then crying about uptime or when there's a defect in them. Yes, 99% of the time it's fine, and yes, if they're failing in people's homes it's really bad... but you put one in your rack when everyone warns against that -- so no, it's not your fault that it melted while you were using it 'in spec', but the decision to use it there, if you require any kind of uptime, is extremely questionable. But these are not "server" CPUs... so yes, your point stands: the CPU is 100% at fault, the desktops crashing is bad, and the laptop users crashing is even worse (imo, since it makes the "voltages too high" explanation make no sense). But then all of a sudden datacenter people come out saying 'AND OUR SERVERS ARE ALSO CRASHING' -- wait, what... why are you putting K SKU desktop chips in your server? And the response of "but not everyone needs a Xeon/EPYC" is not valid -- this is exactly why you need a Xeon.
If you show up at a construction site with your Home Depot Ryobi drill that you bought for $70 and it burns out after 8 hours of usage, EVERYONE there will make fun of you. They'll most likely do it as soon as you take it out of the bag. Use the right tool for the job.
The more you think this through, the more you'll realize the guys doing this are a lot more at fault than you'd admit to!
They used them because they have a downtime tolerance and they save a ton of money and get substantially more performance/$. It's generally bad practice though, and many consider it in the "f@#% around and find out" territory. And oftentimes they find out -- not always through catastrophic degradation, but errata, memory issues, bugs etc.
This one guy tweeted "Production systems need to be stable" and in the same tweet thought that most of the failures were on ROG Strix Boards.
That's like saying - "our commercial vehicles need to be stable" and then being like "most of our failures have been on the pink Toys for Tots Jeeps."
"Server workload", "consumer workload", is just that : A "workload".
If CPU can't handle a workload it supports under normal operating conditions, and using recommended/stock settings - it's simply defective.
Defective CPUs should be covered under warranty (unless it expired).
From past perspective, there are multiple users of old hardware (10 years+), that simply put a private Minecraft server on consumer grade hardware, and have no issues (only CPU side). Expecting something else from modern stuff is a downgrade to quality of CPUs that must NOT happen.
I DO NOT want a CPU that will die just after warranty, because it was designed to fail that way or was "not rated for something, so it died faster".
Same goes for "you run bad program on it, so it failed faster" <= this is NOT how things worked so far.
You can argue that the manufacturer's job is to keep in mind all the workloads a client can run, and to make sure its CPUs can handle them for at least the warranty period. However, giving something extra (I own a working Pentium MMX CPU) is how Intel got to where it is today.
Giving up CPU life expectancy past the warranty period for a few more % of performance early in life is an insanely easy way to grind a company's market share into the ground in the long term (20 years+). I just assume the competition simply doesn't do it, and who will buy a CPU that is guaranteed to die right after the warranty period ends - "pro-sumer" or "casual" doesn't mean anything here - when the alternative will just work fine for a few more years?
"We lost money by putting these cpus in our production servers"
Lawyer: "And are these SKUs designed for 24/7 operation and uptime in production servers?"
"No"
...
Q : Does Intel provide number for MTBF on CPUs ?
no intel lawyer is going to state that the K sku’s are designed to only last a few months of actual use.
Toyota hilux is not necessary for such tasks. ”YoU aRe UsInG iT WrOnG”.
LMAO
that is the issue for some of the more audible affected users. The chips have been purchased in ’trays’, with limited warranties, and intel is being an ass and claiming that the failures are not defects covered by said limited warranties.
So despite all the evidence, let's just ignore it and keep our heads in the sand?
All credibility out the window!
As for Intel, handling this quietly is no longer an option for them. Also, I read somewhere that as much as 1/3 of the 13th and 14th gen CPUs are 13900 and 14900 models. Not sure if this number was for overall sales or just the boxed processors for end users. On top of that, one of the game developers was seeing about a 50% failure rate using W series server boards (read: no overclocking at all). I would assume with the latest BIOS patches as well.
What I don't get is why it has taken this long to "identify" the problem? Could Intel just be stalling for time to get past the Ryzen 9000 launch? Stall until Arrow Lake changes the narrative?
The other thing that has me wondering is that for Arrow Lake S, the latest news is that Intel is trying to push up the launch date by trying to get the QS samples out early and shorten the validation period. Sounds like they may not have learned any lessons from this experience.
P.S. For the record, the X370 BIOS issue was exactly about the size of the ROM chips. When I upgraded my friend's board, Gigabyte had two BIOS choices, early CPUs or later ones. If you updated for the newer chips, you could NOT load the older BIOS, and could no longer run 1st (and 2nd?) gen CPUs. The part of a CPU with the microcode is called FIRMWARE, not software. It's not like you can just load it at will from anywhere, it's part of the CPU and (as it should be) not easily accessed or reflashed.
Server grade hardware is certified for high load 24/7 for x years. Running a private Minecraft server isn't exactly something that causes sustained peak load all the time (but hosting a lot of them for a company may be a lot more). Running a machine 24/7 without a significant load isn't much of a problem, I've been doing that for most of my machines for over a decade, and like many others in the industry I too have a home "server" for files/media/git/building/etc., but it's nothing with sustained high load, if so I would have to choose hardware accordingly.
When I had issues with my 13900KF, which is now on its way to Intel, I could run prime95 on all the cores without any issues for hours. But even something as trivial as opening a new browser tab, with the computer idle, could result in a crashed tab due to access violations.
If I had a computer with a 13th gen or 14th gen CPU that is not currently experiencing stability issues I would refrain from using it until the microcode update is released. If I really had to use it, I would probably limit the maximum multiplier to x50 or less.
- will Intel publish serial numbers of CPUs affected by overvoltage and/or oxidation so that owners could identify them and file RMA?
- will second hand market go completely bonkers, as no buyer will know for sure whether they buy an affected CPU?
- how many online gaming companies will switch to AMD systems?
- will confidence in Intel brand and reliability suffer?
- so many questions...