Tuesday, July 23rd 2024
Intel Statement on 13th and 14th Gen Core Instability: Faulty Microcode Causes Excessive Voltages, Fix Out Soon
Long-term reliability issues continue to plague Intel's 13th Gen and 14th Gen Core desktop processors based on the "Raptor Lake" microarchitecture, with users complaining that their processors have become unstable with heavy processing workloads, such as games. This includes the chips that have minor levels of performance tuning or overclocking. Intel had earlier isolated many of these stability issues to faulty CPU core frequency boosting algorithms, which it addressed through updates to the processor microcode that it got motherboard- and prebuilt manufacturers to distribute as UEFI firmware updates. The company has now come out with new findings of what could be causing these issues.
Sources:
Intel Community, Intel (Reddit)
In a statement Intel posted on its website on Monday (22/07), the company said that it has been investigating the processors returned to it by users under warranty claims (which it has been replacing under the terms of its warranty). It has found that faulty processor microcode has been causing the processors to operate under excessive core voltages, leading to their structural degradation over time. "We have determined that elevated operating voltage is causing instability issues in some 13th/14th Gen desktop processors. Our analysis of returned processors confirms that the elevated operating voltage is stemming from a microcode algorithm resulting in incorrect voltage requests to the processor."
Modern processor power management runs on an intricate clockwork of collaboration between software, firmware, and hardware, with the software constantly telling the hardware what levels of performance it wants, and the hardware managing its power- and thermal budgets by rapidly altering the power and clock speeds of the various components, such as CPU cores, caches, fabric, and other on-die components. A faulty collaboration between any of the three key components could break this clockwork, as has happened in this case.
Intel is releasing yet another microcode update to its 13th- and 14th Gen Core processors, which will address not just the faulty boosting algorithm issue the company unearthed in June, but also the faulty voltage management the company discovered now. This new microcode should be released some time around mid-August to partners (motherboard manufacturers and PC OEMs), who will then need to validate it on their machines, before passing it along to end-users as UEFI firmware updates.
"Intel is delivering a microcode patch which addresses the root cause of exposure to elevated voltages. We are continuing validation to ensure that scenarios of instability reported to Intel regarding its Core 13th/14th Gen desktop processors are addressed. Intel is currently targeting mid-August for patch release to partners following full validation. Intel is committed to making this right with our customers, and we continue asking any customers currently experiencing instability issues on their Intel Core 13th/14th Gen desktop processors reach out to Intel Customer Support for further assistance," the company stated.
It's important to note here that the microcode update won't fix the issues on processors already experiencing instability, but will prevent them on chips that aren't yet affected. The instability is caused by irreversible physical degradation of the chip. These chips will, of course, be covered under warranty.
Meanwhile, an interesting issue has come to light: some of Intel's processors built on the Intel 7 node are experiencing chemical oxidation of the die as they age. Intel responded to this, stating that it had discovered the oxidation manufacturing issue in 2023 and addressed it. The company also stated that die oxidation is not related to the stability issues it is embattled with.
"We can confirm that the via Oxidation manufacturing issue affected some early Intel Core 13th Gen desktop processors. However, the issue was root caused and addressed with manufacturing improvements and screens in 2023. We have also looked at it from the instability reports on Intel Core 13th Gen desktop processors and the analysis to-date has determined that only a small number of instability reports can be connected to the manufacturing issue," the company stated.
If you feel your chip might be affected, you can file for an RMA.
387 Comments on Intel Statement on 13th and 14th Gen Core Instability: Faulty Microcode Causes Excessive Voltages, Fix Out Soon
From what I understand, and from CPU only point of view (because topic) - there is no "time to fail" timeframe.
You can run whatever thing you want, for however long you want (using validated hardware and settings).
The only limitation from Intel side, is warranty period that CPU is guaranteed to work (with replacement option if it doesn't hold up that long).
If you can provide some Intel documents, that state "running X program will decrease your warranty due to CPU wear", or "using this consumer CPU for XYZ task voids your warranty on it", we can get somewhere - otherwise, there is no point.
Again, the x86 standard was made for a reason (both from a hardware and a software perspective).
"Consumer grade" hardware is made to work everywhere, regardless of use case.
"Prosumer" stuff gets extras that are wanted/required by companies (to guarantee whatever they need/want out of the hardware), and simplifies the process of holding Intel liable for damages (if something goes wrong on their end).
The former doesn't mean, Intel is immune to hardware failures of their own making on "consumer grade" hardware (which is what current situation is BTW). Process is just more lengthy, and more complicated in such cases.
Issue: highest frequency on TVB. The combination of high single-core usage + low running temps = high chance of degradation, due to the very high VID requested by the CPU itself to keep itself "stable" under such conditions.
^This is not "bad use case", it is bad manufacturer practices.
IF manufacturer knows what it's doing, there are no "bad use cases".
"Dont ask questions, here's the new cpu, no ecores, put it in there and don't ask questions"
selling/promising you 10nm was fine :wtf:
dosimeter/voltage meters only went to 1.35V.............
*tests TVB*
Engineer: Only reads 1.35V boss
Boss: Ship that f*cker.
I would expect a company the size of Intel who owns their own fabs to do this right and fix it as soon as possible, knowing which cpu's were affected. Except things like oxidization at the fabrication level, cpu failures after a year, or cpu's failing new out of the box aren't normal things to go wrong.
The BIOS updates aren't a fix for cpu's already degraded, and baseline power limits means lesser than claimed performance. I'm surprised someone hasn't started up a class action yet.
Also, we have yet to see if the August BIOS update is really a fix. Personally, I don't trust Intel's microcode to be a real fix rather than just postponing degradation until an out-of-warranty failure. That is fine for shareholders and OEMs, but not for consumers who have already gotten screwed with potentially defective CPUs. It's even worse for laptop owners: people are experiencing problems similar to those on desktop CPUs, yet Intel seems to have ignored that issue and instead blames OEMs.
Also, as I mentioned earlier, it's pretty common for these game hosting servers to use consumer grade CPUs, and Intel was perfectly happy to supply them with such. It's why AMD released EPYC 4004, so surely there's a market and this is not entirely atypical, albeit rare compared to the server market as a whole.
W680 are workstation, not server. I agree with that.
Also, I'm not speculating. Intel obviously knew about the oxidization, because they fixed it themselves. But they never disclosed it, and there's proof of them denying RMA to many vendors even though they knew full well there's a good chance those chips might have been melons. To release a statement saying there was a case of oxidization in 2023 AFTER third party reviewers are suspecting it just looks dodgy
Nothing new though... Intel has recently been pushing to scrape out every last bit of perf to win the race. Those high power consumption numbers at full throttle were always concerning and degradation was highly probable. It was only a matter of time before the shit hit the fan. It's unfortunate though... there was no need to push the already lavish perf we're seeing from both camps. For gamers, even 12th gen (or Zen 3 X3D) is a blast.
You know you can get the OS to show you 100% utilization even under a lighter workload, right? An FPU stress test or AVX workload is a different workload from gaming, yet gaming might easily show 100%.
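A minimal sketch of that point, assuming utilization is computed the way OS monitors commonly do it: as the share of elapsed time a core was not idle. The function name and the (busy, idle) tick tuples are hypothetical, made up for illustration.

```python
def utilization(prev, curr):
    """Percent of elapsed ticks that were not idle between two samples.

    Each sample is a hypothetical (busy_ticks, idle_ticks) pair of
    cumulative counters, similar in spirit to per-core OS time counters.
    """
    busy = curr[0] - prev[0]
    idle = curr[1] - prev[1]
    total = busy + idle
    # No elapsed ticks means nothing to report; avoid dividing by zero.
    return 100.0 * busy / total if total else 0.0


# A core that never idles reports 100% no matter which instructions it ran,
# so a game loop and an AVX stress test can look identical on this metric.
print(utilization((0, 0), (1000, 0)))
print(utilization((0, 0), (500, 500)))
```

The metric only counts time spent busy. It carries no information about voltage, current, or instruction mix, which is why "100%" by itself says little about how hard the silicon is being stressed.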
Workload theory would not stand a chance in a court. Simply, it's nowhere explicitly defined. What is the level of light and heavy workload? What instruction sets?
Is a heavy workload a scenario in which the CPU keeps reaching temps above 100°C for a specified amount of time? What if the cooler is not installed properly? You can get high temps that way even with a light workload. Let's monitor voltage, current and watts during a specified period of time. But then, uhm ... this might be a heavy workload for an Intel CPU, while an AMD CPU might handle it with far fewer resources. So what now? Well, it comes down to a per-chip workload definition: a different definition for every CPU SKU, because with different resources (cores, clocks) the workload limits would change as well. It is not defined and I doubt that anyone would be crazy enough to try to do it. (But I might be wrong. Please post a link if such definitions do exist.)
Again, there is nothing wrong with using a non-server (consumer grade) CPU for server tasks. It may be dumb but it's not wrong. However, it must endure anyway, be it consumer grade or not. It must implement various safeguards to prevent itself from being damaged by ANY type of workload. That's what current, thermal and power draw protections are there for. If under any workload such protection allows the CPU to degrade, well, then it's a shitty/pointless protection to me.
If there is enough knowledge about a particular CPU (or process node) that it is prone to degrade when stressed by more than 1.45V for XY minutes, then do something yourself to avoid it at all costs, and don't do it by passing the responsibility on to users or motherboard manufacturers. Apply some sort of "workload throttling" that would decrease clocks and voltages, so that the stress put onto the CPU is lowered after some time. But hey, isn't that what PL1 and PL2 are (kind of) there for?
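For anyone wondering what PL1/PL2 look like as logic, here is a rough sketch. It assumes a heavily simplified model: PL2 is a short-term cap, PL1 a sustained cap, and the switch between them is driven by an exponentially weighted moving average of package power over a time constant tau. The function name, the 1-second step, and the 125 W / 253 W / 56 s figures are illustrative assumptions, not values from the article, and real hardware is far more involved.

```python
import math


def simulate_power_limit(demand_w, pl1_w, pl2_w, tau_s, dt_s=1.0):
    """Return the power granted at each step for a list of demanded watts.

    Simplified, hypothetical model of PL1/PL2-style limiting.
    """
    # Weight of the newest sample in the moving average for one time step.
    alpha = 1.0 - math.exp(-dt_s / tau_s)
    avg_w = 0.0
    granted = []
    for want in demand_w:
        # Bursts up to PL2 are allowed while the running average is below
        # PL1; once the average reaches PL1, the package is held to PL1.
        cap = pl2_w if avg_w < pl1_w else pl1_w
        got = min(want, cap)
        avg_w += alpha * (got - avg_w)
        granted.append(got)
    return granted


# Sustained heavy load: a burst at PL2 first, then a drop to PL1.
limited = simulate_power_limit([253.0] * 120, pl1_w=125.0, pl2_w=253.0, tau_s=56.0)
print(limited[0], limited[-1])

# Limits set to "infinity": whatever is demanded is granted, indefinitely.
unlimited = simulate_power_limit([400.0] * 5, math.inf, math.inf, tau_s=56.0)
print(unlimited)
```

In this toy model the throttling only exists if the limits are finite. With both limits set to infinity the cap never applies, which is the scenario being complained about here.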
Let us overshadow AMD and dominate every benchmark by setting the limits to infinity (and beyond)! And then the shit hits the fan ...
Regardless of the fact that they used non-server CPU as a server CPU for game hosting service, the CPU was configured/allowed to operate in critical conditions by the manufacturer itself.
Well, Intel was supporting the whole Extreme profile theory for a while now and has been giving the motherboard makers free hand. This whole scandal was about to happen.
The main differentiating factors between consumer and enterprise CPU's are supported features, level of support and performance. Not their ability to run a certain workload for xx minutes on a specific SKU without failure with some kind of MTBF metric.
Indeed. If degradation really is happening due to excessive voltage and microcode like Intel says then ALL 13th and 14th gen CPU's that are in use are already degraded. It's not a matter of IF they will fail, it's a matter of WHEN they will fail. Any further fixes to voltage only slow down further degradation but the damage is already done. There is no way to reverse the damage that has already been done at silicon level.
10 cores, 4 memory channels, AVX512, Even Plays Ultra BluRay disks
Personally, I'd prefer getting my money back to getting a CPU replacement. It's been proven multiple times that there is more than a slight chance that you will get another affected CPU out of the RMA process. In some extreme cases those CPUs died within a week, while the old chips started to malfunction after a few months.
I just don't get it. This whole thing is about quality assurance: testing, testing, testing. I thought it was impossible for such an issue to emerge past the '90s. QA has really evolved since then. Intel just needed to thoroughly test a few SKUs. That actually seems to be quite an easy task compared to AMD's X3D SoC voltage issue, where testing the X3D CPUs in all the supported motherboards is close to impossible.
Not sure how a CPU helps play BluRay but ok.
So in summary, they were over volting CPUs (interesting as we have been discussing that board vendors have been applying under volts), and as a result "some" chips have degraded to the point they are unstable, Intel will grant RMA for these chips.
I would like to see a number of years added to that statement for how long they will approve RMA for, because in my opinion it needs to probably be 5 years minimum. Not a joke 1-2 years.
Luckily I under volt my chip on both cache and core, it is working fine currently. You can apply this to every tech produced, most PC related hardware is "trying something new" when it first appears.
If I didn't buy from a company again after I (or a friend) had problems, I would have blacklisted the following companies. There may be more; these are the ones I most easily remember.
AMD - FX degradation making stock unstable
Asus - Failed capacitors and unstable voltages in their "asus optimised" bios.
Asrock - Unsafe voltages in their bios when activating XMP and stock settings exceeding tjmax spec.
BenQ - Monitor with flawed DisplayPort (display won't wake up if turned off whilst using DisplayPort, system needs to be rebooted to wake up the port on the monitor).
Crucial - SSD shipped with flawed firmware.
EVGA - GPU shipped with unstable v/f curve out of the box.
Gigabyte - GPU shipped with unstable v/f curve out of the box on performance bios.
Kingston - Numerous SSDs failing.
Viewsonic - Failed monitor, and Monitor RMA switcheroo was a replacement with same fault.
My AMD Ryzen 7 7700 (non-X) CPU does everything I use my computer for pretty nicely with little power draw.
Source: Exclusive – Unstable Intel processors: failing 3 to 4 times more often, some permanently doomed - Les Numériques (lesnumeriques.com)
IMO, we're good in the CPU space. It's the GPU realm where I'd like to see AMD and/or Intel kick arse! Nvidia's got it too good, a license to bleed the consumer, which is never a good thing.
So true this.
Does that make Seagate horribad, avoid at all costs? I don't think so... I'm in an overwhelming minority, and I'm sure some people have had the same ill luck with WD drives dying on them. People just take their clubism to the extreme sometimes.