Friday, June 14th 2024
Intel Isolates Root Cause of Raptor Lake Stability Issues to a Faulty eTVB Microcode Algorithm
Intel has identified the root cause of the stability issues observed with certain high-end 13th and 14th Gen Core "Raptor Lake" processor models, which caused games and other compute-intensive applications to crash randomly. When the issues were first identified, Intel recommended a workaround that reduced core voltages and restricted the boost headroom of these processors, at the cost of reduced performance. The company has apparently now discovered the root cause of the problem, as Igor's Lab learned from confidential documents.
The documents say that Intel isolated the problem to a faulty value in the microcode's end of the eTVB (enhanced thermal velocity boost) algorithm. "Root cause is an incorrect value in a microcode algorithm associated with the eTVB feature. Implication: Increased frequency and corresponding voltage at high temperature may reduce processor reliability. Observed: Found internally," the document says, mentioning "Raptor Lake-S" (13th Gen) and "Raptor Lake Refresh-S" (14th Gen) as the affected products. The company goes on to elaborate on the issue in its Failure Analysis (FA) document:
Source:
Igor's Lab
"Failure Analysis (FA) of 13th and 14th Generation K SKU processors indicates a shift in minimum operating voltage on affected processors resulting from cumulative exposure to elevated core voltages. Intel analysis has determined a confirmed contributing factor for this issue is elevated voltage input to the processor due to previous BIOS settings which allow the processor to operate at turbo frequencies and voltages even while the processor is at a high temperature. Previous generations of Intel K SKU processors were less sensitive to these type of settings due to lower default operating voltage and frequency."
Identifying the root cause of the problem isn't the only good news: Intel also has a new microcode ready for 13th Gen and 14th Gen Core processors (version 0x125) for motherboard manufacturers and PC OEMs to package into UEFI firmware updates. This new microcode corrects the issue, which should restore the stability of these processors at their normal performance. Be on the lookout for UEFI firmware (BIOS) updates from your motherboard vendor or prebuilt OEM.
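For readers who want to see which microcode revision their system is currently running, here is a minimal sketch. It assumes a Linux system, where x86 kernels expose a "microcode" field in /proc/cpuinfo. The helper names and the comparison against 0x125 are our own illustration and not an Intel tool, and the comparison only makes sense on the affected Raptor Lake parts, since revision numbers are specific to each CPU family.

```python
import re
from pathlib import Path

# Microcode version named in the article. Treating any revision at or above
# this value as "fixed" is an assumption that only holds for the affected CPUs.
FIXED_REVISION = 0x125


def microcode_revisions(cpuinfo_text: str) -> set[int]:
    """Return the distinct microcode revisions found in /proc/cpuinfo-style text."""
    matches = re.findall(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)\s*$",
        cpuinfo_text,
        flags=re.MULTILINE,
    )
    return {int(value, 16) for value in matches}


def has_fixed_microcode(cpuinfo_text: str, minimum: int = FIXED_REVISION) -> bool:
    """True if every reported revision is at least `minimum` (hypothetical check)."""
    revisions = microcode_revisions(cpuinfo_text)
    return bool(revisions) and min(revisions) >= minimum


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        text = cpuinfo.read_text()
        found = sorted(hex(rev) for rev in microcode_revisions(text))
        if found:
            print("Microcode revision(s):", ", ".join(found))
            print("At or above 0x125:", has_fixed_microcode(text))
        else:
            print("No microcode field found in /proc/cpuinfo.")
    else:
        print("/proc/cpuinfo not available on this system.")
```

On Windows, the firmware version shown in the UEFI setup screen or in the motherboard vendor's release notes is the more practical thing to check.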
107 Comments on Intel Isolates Root Cause of Raptor Lake Stability Issues to a Faulty eTVB Microcode Algorithm
It's not really clear how Intel is "better in everything"; what were you referring to specifically? Yeah, it's a serious issue that cannot be discounted; power consumption and heat on these things are out of control.
www.techinsights.com/blog/amd-ships-3d-v-cache-processors
The company used two TSMC innovations to create it.
www.techpowerup.com/review/amd-ryzen-7-5800x3d/2.html
Without TSMC it would not exist.
Also, slapping a heap of cache on top of the die is not a guaranteed success. HUB has videos exploring various Intel CPUs with varying amounts of cache, and while more cache helps, it's not as universally beneficial for Intel's architecture as higher clock speeds.
Also 3D V-Cache is not an AMD exclusive technology. Other TSMC customers can also use it, including Intel.
Die-thinning and TSVs are also not purely TSMC's innovation, as TSVs had been used in HBM memory before that by Korean memory makers.
Both AMD and Nvidia (I believe Intel too) are also using another TSMC technology that's in the news: CoWoS.
I don't see you downplaying them for some reason - just AMD.
Do you even know anything about chip design?
Intel's 3D technology is called Foveros, which was first seen in the Lakefield processor. It can be used to integrate every component in an SoC. Lakefield was very much a proof of concept that made it to market (released as a mobile Core i5 in very limited quantities for one certain Samsung laptop) and, as an example, featured one P-core, four E-cores (both of the first-generation kind, similar to those seen in Rocket Lake), GPU and DRAM fully integrated on-die. It was some sort of Alder Lake prototype, in a certain way.
www.anandtech.com/show/16823/intel-accelerated-offensive-process-roadmap-updates-to-10nm-7nm-4nm-3nm-20a-18a-packaging-foundry-emib-foveros/4
CoWoS stands for Chip on Wafer on Substrate, and it's got nothing to do with 3D stacking technology. It's similar to Intel's EMIB: a 2.5D system.
3dfabric.tsmc.com/english/dedicatedFoundry/technology/cowos.htm
The breakthrough will be combining this 2.5D packaging with 3D stacked dies to maximize density.
Raptor Lake is Nehalem rehashed 15 times over every year in the same way Zen 4 is a direct descendant of the K5, yes. :kookoo:
I wasn't affected, but I can easily see where it's all going wrong: bad motherboards, bad real-world operating conditions, and underlying microcode bugs... no wonder it's the i9s that have a problem, while i7s with more down-to-earth clocks and no fancy thermal boost are largely immune.
Nvidia wouldn't be successful without TSMC and Samsung, either. So what?
I do not see the correlation between other customers' portfolios and the fact that... you couldn't build a modern Zen CPU on GlobalFoundries' latest node.
Intel Isolates Root Cause of Raptor Lake Stability Issues to a Faulty eTVB Microcode Algorithm
Please stick to it and stop the pointless tribal bickering.
If they can figure it out and come up with a real solution without it arbitrarily impacting performance in a meaningful way, that would be ideal, but I have my reservations about that actually happening. It seems a lot like another Spectre/Meltdown situation of sorts. That said, they got away with that mostly unscathed. I could still cope with that, honestly, but I got a great deal on my CPU; if I'd paid through the nose for a 14900K, I wouldn't be too thrilled by it, even if it is just a minor scaling back of performance that's already very abundant.
Raptor Lake Stability Issues
14th gen is Raptor Lake (as well as 13th gen).
Wendell has an interesting analysis using telemetry data from two game studios and feedback from data center companies and system integrators. We see an increased number of failures for 13900K and 14900K systems not only on the consumer side but also on the server side, where they're often used for hosting game servers that benefit from high single-core performance, running at stock settings on W680 boards.
It reaches a point where game server hosting companies will charge you an extra $1000 for support if you opt for Intel.
I would think that in the case of gaming you'd see a stronger likelihood of at least some of them using anti-bending brackets, more so than in data centers. So, digging further, if the incidence of the problem is actually higher in data centers, it might be a good indicator that socket bending is possibly an underlying culprit. I'd say especially so given that gamers are also more likely to overclock and push memory clock speeds and other things higher, so you'd actually expect instability among gamers to be inherently worse by a decent amount based on that fact alone.
On the other hand, if the data shows the opposite, with much higher failure rates in the gaming telemetry, it might point more towards memory and/or the ring bus, or perhaps even the cache and the IMC in general, being pushed far beyond Intel's general recommendations for memory support. That's something most gamers are pretty guilty of doing.
The fact that we still don't have a legitimate answer is crazy, though. I mean, this issue has impacted people since 13th gen. How have they not pinpointed a cause by now? It's understandable that some finger-pointing has happened at motherboard makers with questionable BIOS decisions, honestly, and they fully deserve that criticism, especially in light of a situation like this. It's a wake-up call not to do stupid, questionable things with default settings. Anyway, it is what it is, but it's insane that we still have no answers, though we've got some insight into the widespread severity of the problem.