Friday, June 14th 2024

Intel Isolates Root Cause of Raptor Lake Stability Issues to a Faulty eTVB Microcode Algorithm

Intel has identified the root cause for stability issues being observed with certain high-end 13th- and 14th Gen Core "Raptor Lake" processor models, which were causing games and other compute-intensive applications to randomly crash. When the issues were first identified, Intel recommended a workaround that would reduce core-voltages and restrict the boost headroom of these processors, which would end up with reduced performance. The company has apparently discovered the root cause of the problem, as Igor's Lab learned from confidential documents.

The documents say that Intel isolated the problem to a faulty value in the microcode's end of the eTVB (enhanced thermal velocity boost) algorithm. "Root cause is an incorrect value in a microcode algorithm associated with the eTVB feature. Implication: Increased frequency and corresponding voltage at high temperature may reduce processor reliability. Observed: Found internally," the document says, mentioning "Raptor Lake-S" (13th Gen) and "Raptor Lake Refresh-S" (14th Gen) as the affected products.
The company goes on to elaborate on the issue in its Failure Analysis (FA) document:
Failure Analysis (FA) of 13th and 14th Generation K SKU processors indicates a shift in minimum operating voltage on affected processors resulting from cumulative exposure to elevated core voltages. Intel analysis has determined a confirmed contributing factor for this issue is elevated voltage input to the processor due to previous BIOS settings which allow the processor to operate at turbo frequencies and voltages even while the processor is at a high temperature. Previous generations of Intel K SKU processors were less sensitive to these type of settings due to lower default operating voltage and frequency.
Identifying the root cause of the problem isn't the only good news: Intel also has a new microcode ready for 13th Gen and 14th Gen Core processors (version: 0x125), for motherboard manufacturers and PC OEMs to encapsulate into UEFI firmware updates. This new microcode corrects the issue, which should restore the stability of these processors at their normal performance. Be on the lookout for UEFI firmware (BIOS) updates from your motherboard vendor or prebuilt OEM.
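To make the idea of a temperature-gated boost concrete, here is a minimal sketch of how a single wrong value in such an algorithm could let a chip keep boosting while hot. It is not Intel's microcode: the documents do not say which value was incorrect, and every name and number below (`BoostPolicy`, the frequencies, the temperature limits) is a hypothetical chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoostPolicy:
    """Hypothetical parameters; none of these values come from Intel documentation."""
    base_turbo_mhz: int = 5800       # assumed turbo ceiling without the thermal bonus
    tvb_bonus_mhz: int = 200         # assumed extra frequency granted while the chip is cool
    tvb_temp_limit_c: float = 70.0   # assumed temperature at which the bonus is withdrawn


def allowed_frequency_mhz(temp_c: float, policy: BoostPolicy) -> int:
    """Return the frequency ceiling for the given core temperature."""
    if temp_c < policy.tvb_temp_limit_c:
        # Cool enough: the thermal bonus is added on top of the turbo ceiling.
        return policy.base_turbo_mhz + policy.tvb_bonus_mhz
    # Too hot: fall back to the plain turbo ceiling.
    return policy.base_turbo_mhz


correct = BoostPolicy()
# Illustrative "incorrect value": a limit set so high that the bonus is rarely withdrawn.
faulty = BoostPolicy(tvb_temp_limit_c=100.0)

for temp in (60.0, 85.0):
    print(temp, allowed_frequency_mhz(temp, correct), allowed_frequency_mhz(temp, faulty))
```

With the assumed numbers, both policies agree while the chip is cool, but at 85 degrees only the faulty policy still grants the bonus, which is the "increased frequency and corresponding voltage at high temperature" pattern the document describes.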
Source: Igor's Lab

107 Comments on Intel Isolates Root Cause of Raptor Lake Stability Issues to a Faulty eTVB Microcode Algorithm

#76
Dr_b_
FoulOnWhite said: Intel were competing though, AMD had to resort to slapping some cache on the top to compete. Without the 3D V-Cache it's AMD who would be behind. In a straight non-V-Cache contest, regardless of power, Intel is better in everything.
What is AMD better at? AVX-512, no E-cores or problems shifting loads, price on some parts, lower power consumption (which means less heat), and of course it actually does have a V-Cache part that is more performant in gaming. Zen 5 also has more PCIe lanes than Alder Lake/Raptor Lake.

It's not really clear how Intel is "better in everything". What were you referring to specifically?
trparky said: Yeah, but at what cost? In a lot of cases, our electricity bills and potential future silicon degradation.
Yeah, it's a serious issue that cannot be discounted; power consumption and heat on these things are out of control.
#77
FoulOnWhite
trsttte said: How is it not their own tech? They are the ones who had the idea to add cache on top of the CPU and designed a working model of that idea, then used TSMC fabrication technology to put that into practice. Just like Intel is doing with Foveros and EMIB, except Intel is vertically integrated with their own fabs, so they have to design both parts of the solution. If AMD did nothing and just used someone else's tech, how come they're the only ones doing it?

If you want to use that stupid argument, well, neither of them does anything; they're all just using what ASML makes possible with their machines. It's a ridiculous idea.
Yeah, it was AMD's idea, sure.

www.techinsights.com/blog/amd-ships-3d-v-cache-processors

The company used two TSMC innovations to create it.


www.techpowerup.com/review/amd-ryzen-7-5800x3d/2.html

Without TSMC it would not exist.
#78
Tomorrow
FoulOnWhite said: Without TSMC it would not exist.
That's the same BS argument that Zen would not be successful without TSMC. Then I need to remind people that Zen actually started on GlobalFoundries' 14nm process before transitioning to TSMC with Zen 2 (3000 series). Sure, it made it better because it was 7nm vs 14nm first and foremost, but the groundwork had already been laid.

Also, slapping a heap of cache on top of the die is not a guaranteed success. HUB has videos exploring various Intel CPUs with varying amounts of cache, and while bigger=better helps, it's not as universal for Intel's architecture as higher clock speeds.

Also, 3D V-Cache is not an AMD-exclusive technology. Other TSMC customers can also use it, including Intel.
Die-thinning and TSVs are also not purely TSMC's innovation, as TSVs had been used in HBM memory before that by Korean memory makers.

Both AMD and Nvidia (I believe Intel too) are also using another TSMC technology that's in the news: CoWoS.
I don't see you downplaying them for some reason - just AMD.
#80
Dr. Dro
Tomorrow said: That's the same BS argument that Zen would not be successful without TSMC. Then I need to remind people that Zen actually started on GlobalFoundries' 14nm process before transitioning to TSMC with Zen 2 (3000 series). Sure, it made it better because it was 7nm vs 14nm first and foremost, but the groundwork had already been laid.

Also, slapping a heap of cache on top of the die is not a guaranteed success. HUB has videos exploring various Intel CPUs with varying amounts of cache, and while bigger=better helps, it's not as universal for Intel's architecture as higher clock speeds.

Also, 3D V-Cache is not an AMD-exclusive technology. Other TSMC customers can also use it, including Intel.
Die-thinning and TSVs are also not purely TSMC's innovation, as TSVs had been used in HBM memory before that by Korean memory makers.

Both AMD and Nvidia (I believe Intel too) are also using another TSMC technology that's in the news: CoWoS.
I don't see you downplaying them for some reason - just AMD.
But Zen would NOT be successful without TSMC. GlobalFoundries does not have a modern manufacturing process suitable to build these processors on, and more cache does not necessarily mean better, in fact, there are several scenarios where the Ryzen X3D chips regress in comparison to the standard models. This occurs because 3D V-Cache incurs a cycle penalty and data takes longer to be processed, which means the standard model is better if the data set fits within its capacity. Also, 3D V-Cache is an AMD technology, TSMC is just a foundry and builds chips to the specification of their customers.

Intel's 3D technology is called Foveros, which was first seen in the Lakefield processor. It can be used to integrate every component in an SoC. Lakefield was very much some sort of proof-of-concept that made it to the market (released as a mobile Core i5 in very limited quantities for one certain Samsung laptop) and, as an example, featured one P-core, four E-cores (both of the first-generation kind, similar to those seen in Rocket Lake), GPU and DRAM fully integrated on-die. It was some sort of Alder Lake prototype, in a certain way.

www.anandtech.com/show/16823/intel-accelerated-offensive-process-roadmap-updates-to-10nm-7nm-4nm-3nm-20a-18a-packaging-foundry-emib-foveros/4

CoWoS stands for Chip on Wafer on Substrate, and it's got nothing to do with 3D stacking technology, it's similar to Intel's EMIB, it's a 2.5D system.

3dfabric.tsmc.com/english/dedicatedFoundry/technology/cowos.htm



The breakthrough will be combining this 2.5D packaging with 3D stacked dies to maximize density.
trparky said: According to a report over at Techspot.com, Intel still doesn't know what's going on with the Core i9. My thoughts are that this is simply a symptom of Intel pushing a 15-year-old microarchitecture way past the breaking point.

At this point, I think Intel needs to recall every single last Core i9 ever sold and to issue refunds for selling what is a defective product.

Intel still doesn't know what is causing its i9 desktop chips to crash | TechSpot
Raptor Lake is Nehalem rehashed 15 times over every year in the same way Zen 4 is a direct descendant of the K5, yes. :kookoo:

I wasn't affected, but I can easily see where it's all going wrong: bad motherboards, bad real-world operating conditions, and underlying microcode bugs... no wonder it's the i9s that have a problem, while i7s with more down-to-earth clocks and no fancy thermal boost are largely immune.
#81
Tomorrow
Dr. Dro said: But Zen would NOT be successful without TSMC.
Zen was successful already on 14nm. 7nm by TSMC just made it better.
Dr. Dro said: GlobalFoundries does not have a modern manufacturing process suitable to build these processors on,
We don't know if GF would be competitive had they not axed their sub-10nm plans.
Dr. Dro said: and more cache does not necessarily mean better, in fact, there are several scenarios where the Ryzen X3D chips regress in comparison to the standard models.
Mostly clock speeds.
Dr. Dro said: This occurs because 3D V-Cache incurs a cycle penalty and data takes longer to be processed, which means the standard model is better if the data set fits within its capacity.
This penalty is very small. Standard models excel in tasks that benefit from raw clock speed.
Dr. Dro said: CoWoS stands for Chip on Wafer on Substrate, and it's got nothing to do with 3D stacking technology, it's similar to Intel's EMIB, it's a 2.5D system.
I was not comparing the two. I was giving an example of another technology that all three companies use.
#82
AusWolf
Dr. Dro said: But Zen would NOT be successful without TSMC.
Does it matter, though?

Nvidia wouldn't be successful without TSMC and Samsung, either. So what?
#83
Dr. Dro
AusWolf said: Does it matter, though?

Nvidia wouldn't be successful without TSMC and Samsung, either. So what?
A is true because B is true; so that means B is true because A is true :kookoo:

I do not see the correlation between other customers' portfolios and the fact that... you couldn't build a modern Zen CPU on GlobalFoundries' latest node.
#84
AusWolf
Dr. Dro said: A is true because B is true; so that means B is true because A is true :kookoo:

I do not see the correlation between other customers' portfolios and the fact that... you couldn't build a modern Zen CPU on GlobalFoundries' latest node.
AMD relies on TSMC for their CPUs, which is bad. Nvidia relies on TSMC for their GPUs, which is good. Am I the only one seeing a massive gaping contradiction here? :kookoo:
#85
the54thvoid
Super Intoxicated Moderator
This is the topic:

Intel Isolates Root Cause of Raptor Lake Stability Issues to a Faulty eTVB Microcode Algorithm

Please stick to it and stop the pointless tribal bickering.
#87
#22
I would like to finally see an example of somebody getting instability issues after having everything set correctly from the start. Maybe not even something as hardcore as a 125W PL1, but with all or at least most settings taken from Intel's blue table, like this thing shows, and with the motherboard's own inventions, e.g. multi-core enhancement, turned off. Boards have long been known for stupid "default" ideas, to the point that you can't trust them, so checking CPU behaviour is one of the first things to do after building a computer.
#88
N/A
It's not just the power or temperature; no CPU should ever be allowed to boost at 1.45 volts. My comfort limit for 7nm would be 1.35V, and 1.25V for 2nm and onwards.
#89
ir_cow
For once my old school overclocking of x freq x voltage is better :). Never have to worry about the boosting problems.
#90
InVasMani
I'm getting the impression that ICCMAX defaults and/or recommendations are one of the bigger instability faults. Intel really should've included ICCMAX in an easy-to-find location on its product page for its chip SKUs instead of burying it in an obscure PDF file somewhere that you can maybe find on the dark web region of its website if you're an internet archive website archeologist. Intel should know better than that. It's a huge oversight on their behalf not to do so, and that will probably be argued against them in any class action lawsuits that come out of this whole broken chip fiasco.

If they can figure it out and come up with a real solution w/o it arbitrarily impacting performance in a meaningful way, that would be ideal and nice, but I have my reservations about that actually happening. It seems a lot like another Spectre/Meltdown situation of sorts. That said, they got away with that mostly unscathed. I could still cope with that honestly, but I got a great deal on my CPU. If I'd paid through the nose for a 14900K, I wouldn't be too thrilled by it, even if it is just a minor scaling back of relative performance that's already very abundant.
#91
AusWolf
I'm starting to get the feeling that buyers of high-end CPUs or GPUs need to be prepared for disaster these days. RTX 3090s burning down with that Amazon game I can't remember, cooler issues with AMD-made 7900 XTXes, ASUS motherboards frying X3D CPUs, and then this malarkey with i9 stability... This is what you get in a world when every single soul and every company wants to be 1% ahead in everything all the time, I guess.
#92
InVasMani
That whole new PSU connector fiasco as well. One of my M.2s also mysteriously cooked itself and the label looked melted. Either that M.2 heat spreader label was conductive and shorted itself, or something else went wrong, perhaps to do with the PCIe 5.0 slot, though my older Gen 3.0 M.2 in that slot's been just fine. I think when I bought it the label was dodgy and I installed it anyway, and it worked fine initially, then fried after a month or two of some heating and cooling cycles. I could've sworn it looked a bit funky and almost returned it immediately, but didn't, and decided to just try it anyway. Certainly won't be taking that chance again in the future. It could be worse though; at least it wasn't a catastrophic PSU failure.
#93
Airbrushkid
Question for you all. So the 14th gen Core i9 is not affected by this mess up?
#94
Carillon
Airbrushkid said: Question for you all. So the 14th gen Core i9 is not affected by this mess up?
They are affected.
#95
AusWolf
Airbrushkid said: Question for you all. So the 14th gen Core i9 is not affected by this mess up?
It's in the title:

Raptor Lake Stability Issues

14th gen is Raptor Lake (as well as 13th gen).
#96
Airbrushkid
Yes, but what I read on other sites is the 13th + 14th Gen i5 and i7. But nowhere do they bring up or mention i9. Sorry but am old.
AusWolf said: It's in the title:

Raptor Lake Stability Issues

14th gen is Raptor Lake (as well as 13th gen).
#97
Chomiq

Wendell has an interesting analysis using the telemetry data from two game studios and feedback from data center companies and system integrators. We see an increased number of failures for 13900K and 14900K systems not only on the consumer side but also on the server side, where they're often used for hosting game servers that make use of high single-core performance, at stock settings, using W680 boards.

It reaches a point where game server hosting companies will charge you an extra $1000 for support if you opt for Intel:


#98
chrcoluk
Airbrushkid said: Yes, but what I read on other sites is the 13th + 14th Gen i5 and i7. But nowhere do they bring up or mention i9. Sorry but am old.
Other way round, TVB is on i9 chips only.
#99
InVasMani
chrcoluk said: Other way round, TVB is on i9 chips only.
They may have meant stability issues in general, which TVB would probably just exacerbate on the i9. Right now we haven't gotten a clear indication as to what the root of the problem is. One of the things I've speculated is that maybe the socket bending issue is part of the problem. That would absolutely be a larger issue with datacenter service providers purchasing pre-mades, since they wouldn't normally be installing anti-bending brackets. In fact, Wendell could probably try a cross-comparison between what datacenter service providers are seeing versus what Steam or a larger gaming company is seeing.

I would think that in the case of gaming, at least, you'd see a stronger likelihood of some of them using anti-bending brackets, more so than with datacenters. So, digging further, if the incidence of problems is actually higher on the datacenter side, it might be a good indicator that the socket bending issue is an underlying culprit. I'd say especially so given that gamers are also more likely to overclock and push memory clock speeds and things higher, so you'd actually expect instability to be inherently worse there by a decent amount based on that fact alone.

On the other hand, if the data shows the opposite, with much higher failure rates in gaming telemetry, it might point more towards memory and/or the ring bus, perhaps even the cache and the IMC in general, being pushed far beyond Intel's general recommendations around memory support. Most gamers are pretty guilty of doing that.

The fact that we still don't have a legitimate answer yet is crazy though. I mean, this issue's impacted people since 13th gen. How have they not pinpointed a cause by now? It's understandable that some finger pointing has happened at MB makers with questionable BIOS decisions, honestly, and they fully deserve that criticism, especially in light of a situation like this. It's a wake-up call not to do stupid, questionable things with default settings. Anyway, it is what it is, but it's insane that we still have no answers, though we've got some insight into the widespread severity of the problem.
#100
trparky
InVasMani said: How have they not pinpointed a cause by now?
Maybe they're just trying to sweep it under the rug until the next generation of chips comes out. Admitting fault would be devastating to their stock price and we all know that they can't have that. Right?