Friday, September 25th 2020

RTX 3080 Crash to Desktop Problems Likely Connected to AIB-Designed Capacitor Choice

Igor's Lab has posted an interesting investigative article in which he advances a possible reason for the recent crash-to-desktop problems reported by RTX 3080 owners. For one, Igor mentions that the launch timings were much tighter than usual, with NVIDIA's AIB partners having much less time than would be adequate to prepare and thoroughly test their designs. One reason this apparently happened is that NVIDIA released the compatible driver stack to AIB partners much later than usual; this meant that their testing and QA of produced RTX 3080 graphics cards was mostly limited to power-on and voltage-stability checks, rather than actual gaming/graphics workload testing. That may have allowed some less-than-stellar chip samples to end up on some of the companies' OC products, which, with their higher operating frequencies and the consequent broadband frequency mixtures, hit the apparent 2 GHz frequency wall that produces the crash to desktop.

Another reason for this, according to Igor, is the actual "reference board" PG132 design, which serves as the reference "Base Design" partners architect their custom cards around. The catch is that NVIDIA's BOM apparently leaves open choices for the capacitors mounted for power filtering and regulation. The Base Design features six mandatory capacitors for filtering high frequencies on the voltage rails (NVVDD and MSVDD). There are a number of capacitor choices that can be installed here, with varying levels of capability: POSCAPs (Conductive Polymer Tantalum Solid Capacitors) are generally worse than SP-CAPs (Conductive Polymer Aluminium Electrolytic Capacitors), which are in turn bettered by MLCCs (Multilayer Ceramic Chip Capacitors, which have to be deployed in groups). Below is the circuitry arrangement employed beneath the BGA array where NVIDIA's GA102 chip is seated, which corresponds to the central area on the back of the PCB.
In the images below, you can see how NVIDIA and its AIBs designed this regulator circuitry (NVIDIA Founders Edition, MSI Gaming X, ZOTAC Trinity, and ASUS TUF Gaming OC, in order, from our reviews' high-resolution teardowns). NVIDIA's Founders Edition design uses a hybrid capacitor deployment, with four SP-CAPs and two MLCC groups of ten individual capacitors each in the center. MSI uses a single MLCC group in the central arrangement, with five SP-CAPs handling the rest of the filtering duties. ZOTAC went the cheapest way (which may be one of the reasons their cards are also among the cheapest), with a six-POSCAP design (which are worse than MLCCs, remember). ASUS, however, designed their TUF with six MLCC groups; no corners were cut in this part of the power circuitry.

It's likely that the crash-to-desktop problems are related to both these issues, and this would also explain why some cards stop crashing when underclocked by 50-100 MHz: at lower frequencies (which generally keep boost clocks below the 2 GHz mark) there is less broadband frequency mixture, which means the POSCAP solutions can do their job, even if just barely.
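For a rough sense of why the capacitor type matters here, below is a minimal sketch of the filtering argument; the capacitance, ESR, and ESL figures are generic ballpark assumptions, not values from Igor's article or any actual RTX 3080 BOM. It compares the impedance of a single large polymer capacitor against a bank of ten small MLCCs across frequency:

```python
import math

def impedance_ohms(f_hz, c_farads, esr_ohms, esl_henries):
    """Magnitude of a capacitor's impedance, modelled as a series R-L-C."""
    x_c = 1.0 / (2 * math.pi * f_hz * c_farads)   # capacitive reactance
    x_l = 2 * math.pi * f_hz * esl_henries        # inductive reactance from ESL
    return math.sqrt(esr_ohms ** 2 + (x_l - x_c) ** 2)

# Assumed ballpark values, NOT taken from any real RTX 3080 bill of materials:
poscap = dict(c_farads=470e-6, esr_ohms=5e-3, esl_henries=1.0e-9)   # one large polymer cap
mlcc   = dict(c_farads=10e-6,  esr_ohms=2e-3, esl_henries=0.3e-9)   # one small ceramic cap

for f_hz in (1e6, 10e6, 100e6, 500e6, 2e9):
    z_poscap = impedance_ohms(f_hz, **poscap)
    # ten MLCCs in parallel: capacitance adds, ESR and ESL divide by ten
    z_mlcc10 = impedance_ohms(f_hz, 10 * mlcc["c_farads"],
                              mlcc["esr_ohms"] / 10, mlcc["esl_henries"] / 10)
    print(f"{f_hz/1e6:7.0f} MHz   1x POSCAP: {z_poscap*1e3:8.2f} mOhm   "
          f"10x MLCC: {z_mlcc10*1e3:8.2f} mOhm")
```

With numbers in this ballpark, the parallel MLCC bank stays in the low-milliohm range well into the hundreds of MHz, while the single polymer capacitor's impedance climbs with frequency as its parasitic inductance takes over, which is the gist of the high-frequency filtering argument above.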
Source: Igor's Lab

297 Comments on RTX 3080 Crash to Desktop Problems Likely Connected to AIB-Designed Capacitor Choice

#226
StrikerRocket
Well, Asus went the whole hog and implemented 6 MLCCs in their design, which simply suggests Nvidia partners *knew* about possible weaknesses in this area...
What is the only logical conclusion here? I leave it to you...
Posted on Reply
#227
clopezi
StrikerRocketWell, Asus went the whole hog and implemented 6 MLCCs in their design, which simply suggests Nvidia partners *knew* about possible weaknesses in this area...
What is the only logical conclusion here? I leave it to you...
Asus TUF and FE also have problems...


I hope that tomorrow Nvidia releases a statement on this issue and adds some information about it, because today it is all rumour and noise... if FE cards are having the same problems, I'm inclined to doubt that this is a HW problem...
Posted on Reply
#228
Julhes
clopeziAsus TUF and FE also have problems...


I hope that tomorrow Nvidia releases a statement on this issue and adds some information about it, because today it is all rumour and noise... if FE cards are having the same problems, I'm inclined to doubt that this is a HW problem...
A source to share with us?
Posted on Reply
#229
BoboOOZ
JulhesA source to share with us?
www.techpowerup.com/forums/threads/rtx-3080-crash-to-desktop-problems-likely-connected-to-aib-designed-capacitor-choice.272591/post-4357012
mtcn77It is not. I'm actually attributing it to Samsung. I think you would remember ChipGate?
www.techturtle.net/after-last-years-bendgate-its-now-chipgate-for-apple/
Whatever it is, it seems to be related to the quality of the node. Whether downclocking or overvolting will be the result, it seems to come from the fact that Nvidia is asking more from the silicon than it is capable of giving. The correction will most likely result in either somewhat increased TDP or somewhat decreased performance, or a bit of both. There will be no recall for this, because the clock speeds at which these problems arise are way above what is advertised on the box.
Posted on Reply
#230
trsskater63
BoboOOZwww.techpowerup.com/forums/threads/rtx-3080-crash-to-desktop-problems-likely-connected-to-aib-designed-capacitor-choice.272591/post-4357012


Whatever it is, it seems to be related to the quality of the node. Whether downclocking or overvolting will be the result, it seems to come from the fact that Nvidia is asking more from the silicon than it is capable of giving. The correction will most likely result in either somewhat increased TDP or somewhat decreased performance, or a bit of both. There will be no recall for this, because the clock speeds at which these problems arise are way above what is advertised on the box.
I agree with you. They are only required to deliver what the box promises, and GPU Boost is designed just to make your purchase more valuable. So their solution will surely be a downclock. It looks like someone tested doing his own overclock and was still able to get an increase of 4 to 6 fps in games without crashing. They probably did juice these cards from the start to make the generational leap as large as it is.
Posted on Reply
#231
SoftwareRocketScientist
Why are you guys complaining? You chose to be beta testers for Nvidia. Even when you download a new driver there is a box you can check if you wish to send them data about your crashes. Of course it was a rush to market, like with any other manufacturer of any product, to increase their stock price and demand and to satisfy the board and their shareholders. We saw the result of that in crashes. They didn't do enough alpha and beta testing. The story repeats itself.
Posted on Reply
#232
BoboOOZ
SoftwareRocketScientistWhy are you guys complaining? You chose to be beta testers for Nvidia.
Nobody signed up to be a beta tester; the advertising was "It just works!"
Posted on Reply
#233
Frick
Fishfaced Nincompoop
SoftwareRocketScientistWhy are you guys complaining? You chose to be beta testers for Nvidia. Even when you download a new driver there is a box you can check if you wish to send them data about your crashes. Of course it was a rush to market, like with any other manufacturer of any product, to increase their stock price and demand and to satisfy the board and their shareholders. We saw the result of that in crashes. They didn't do enough alpha and beta testing. The story repeats itself.
GPUs have been released without these issues for a very long time. Sometimes there have been problems (remember the 8800GT, even though those problems didn't arise when they were new) but in general GPUs have been pretty solid things, as they should be. Electronics in power delivery is a solved problem.
BoboOOZNobody signed up to be a beta tester; the advertising was "It just works!"
Was it though? Because that would be silly. It's a finished product. If it doesn't work it's defective, saying it works is like saying it isn't defective.
Posted on Reply
#234
BoboOOZ
FrickWas it though? Because that would be silly. It's a finished product. If it doesn't work it's defective, saying it works is like saying it isn't defective.
Hey, if you don't like the jingle, take it up with Jensen :p.
Posted on Reply
#235
Totally
StrikerRocketWell, Asus went the whole hog and implemented 6 MLCCs in their design, which simply suggests Nvidia partners *knew* about possible weaknesses in this area...
What is the only logical conclusion here? I leave it to you...
From the little info that was leaked: the partners get a reference design with recommended and minimum specs from Nvidia, tweak it according to their design goals, then make the card and test it. Some found out they were having issues, but at that point it was past the point of no return. Some implemented band-aid fixes (Zotac) and others shipped as-is or were unaware. Since they seem to be under a gag order, there is no telling who knew what.
Posted on Reply
#236
trsskater63
SoftwareRocketScientistWhy are you guys complaining? You chose to be beta testers for Nvidia. Even when you download a new driver there is a box you can check if you wish to send them data about your crashes. Of course it was a rush to market, like with any other manufacturer of any product, to increase their stock price and demand and to satisfy the board and their shareholders. We saw the result of that in crashes. They didn't do enough alpha and beta testing. The story repeats itself.
If anything, Turing was the beta test for this whole thing. And it didn't have problems. It just wasn't good enough. If a new architecture makes you a beta tester, then we have been beta testers for the past 6 years at least, as far back as I have been watching these Nvidia cards. When are we going to be out of beta?
Posted on Reply
#237
StrikerRocket
clopeziAsus TUF and FE also have problems...


I hope that tomorrow Nvidia releases a statement on this issue and adds some information about it, because today it is all rumour and noise... if FE cards are having the same problems, I'm inclined to doubt that this is a HW problem...
Ok then, I just misread or misunderstood something... If that is true, then this might only be a part of the problem, and a big embarrassment for card manufacturers!
Did they push the silicon too far from the start? Then they will have to release updated firmware with some downclock or something like that...
Sticking with my 1080 for now, I'll wait till the 20 series sells dirt cheap on eBay before pulling the trigger!
Posted on Reply
#238
mtcn77
I get the feeling AMD could help out Nvidia a little bit. Though SenseMI is proprietary, it does work to curtail power. Threadripper exists for the sole reason that PBO works as intended. Sudden spikes are very damaging to the customer base, as noted here.
Posted on Reply
#239
OneMoar
There is Always Moar
People are having a hard time grasping how many corners AIBs cut with their boards these days

basically they cut every corner they can, and it's just enough to push the stability envelope past the limit when you are trying to hit that magical 2 GHz marketing number

ever since Nvidia started making their own boards, the AIBs have been cutting every corner possible; if you want a reliable card, you buy the Nvidia-made one (unless you wanna spend the money for the absolute top-tier cards like a Strix, Hall of Fame, or K1NGP1N)

this is a complete reversal from how it used to be

it used to be that AIB cards offered more bang for the buck, with better overclocking and better cooling; this is frankly no longer the case, and the short of it is that unless Nvidia relaxes some of the restrictions, the AIBs' only continued reason to exist is to make cards for Nvidia
Posted on Reply
#240
kiriakost
mtcn77I get the feeling AMD could help out Nvidia a little bit. Though SenseMI is proprietary, it does work to curtail power. Threadripper exists for the sole reason that PBO works as intended. Sudden spikes are very damaging to the customer base, as noted here.
I would prefer this simply not to happen; AMD should focus its efforts on its own products, which waste more on-board card memory than an NVIDIA card uses in the same game at the same resolution.

We are not shooting at NVIDIA's legs here; we are simply trying to get a bit of encyclopedic understanding of what went wrong.
Posted on Reply
#241
mtcn77
kiriakostI would prefer this simply not to happen; AMD should focus its efforts on its own products, which waste more on-board card memory than an NVIDIA card uses in the same game at the same resolution.

We are not shooting at NVIDIA's legs here; we are simply trying to get a bit of encyclopedic understanding of what went wrong.
A few years back, Linus made a comment that Nvidia's polling method was precise but less frequent than AMD Radeon's faster guesswork. Nvidia's number was 33 microsecond latency, afaik.
This could be related to slow responses to monitored events, wouldn't you say? We are talking about 2.5 GHz chips that were previously impossible when this monitoring software first took over.
www.extremetech.com/gaming/170542-amds-radeon-r9-290-has-a-problem-but-nvidias-smear-attack-is-heavy-handed/2

PS: I indeed think it is as simple as that: vdroop that occurs faster than 30 times a second, which is beyond the monitoring resolution. As with CPU overclocking, a higher base voltage or LLC would further complicate the power requirements. The solution is definitely good, but it has to stay inside the frame of its parametrization. Something is voiding the algorithm.
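To make the sampling argument concrete, here is a bare-bones sketch; the droop depth, droop duration, and the ~30 Hz polling rate are all made-up illustrative numbers, not NVIDIA's actual telemetry behaviour. A transient that lasts a couple hundred microseconds simply falls between the samples of a slow software monitor:

```python
SIM_STEP_S = 10e-6                                  # simulate the rail every 10 microseconds
POLL_EVERY_STEPS = int((1.0 / 30) / SIM_STEP_S)     # ~30 Hz software polling, as discussed

def rail_voltage(t_s):
    """Nominal 0.9 V rail with one 200 microsecond droop at t = 50 ms (made-up numbers)."""
    return 0.80 if 0.050 <= t_s < 0.050 + 200e-6 else 0.90

polled, everything = [], []
for step in range(int(0.1 / SIM_STEP_S)):           # 100 ms of simulated time
    v = rail_voltage(step * SIM_STEP_S)
    everything.append(v)
    if step % POLL_EVERY_STEPS == 0:                 # the monitor only sees these samples
        polled.append(v)

print("minimum the ~30 Hz monitor reports:", min(polled), "V")
print("actual minimum on the rail:        ", min(everything), "V")
```

The slow monitor reports a perfectly clean 0.90 V while the rail actually dipped to 0.80 V, which is the kind of event that would destabilize the GPU without ever showing up in the logs.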
Posted on Reply
#242
Unregistered
StrikerRocketOk then, I just misread or misunderstood something... If that is true, then this might only be a part of the problem, and a big embarrassment for card manufacturers!
Did they push the silicon too far from the start? Then they will have to release updated firmware with some downclock or something like that...
Sticking with my 1080 for now, I'll wait till the 20 series sells dirt cheap on eBay before pulling the trigger!
Me too, sticking with my 1080 for now. Mine does 2140/5670 fine in my custom loop.
Posted on Reply
#243
kiriakost
OneMoarPeople are having a hard time grasping how many corners AIBs cut with their boards these days

basically they cut every corner they can, and it's just enough to push the stability envelope past the limit when you are trying to hit that magical 2 GHz marketing number
This text does not make much sense; and from now on, everyone, please use the AIBS or AIB acronym in its full form so confusion is avoided.
a) AIB refers to 'non-reference' graphics card designs.
b) An AIB supplier or AIB partner is a company that buys the AMD (or Nvidia) graphics processing unit to put on a board, and then brings a complete and usable graphics card, or AIB, to market.
Posted on Reply
#244
Shatun_Bear
StrikerRocketWell, Asus went the whole hog and implemented 6 MLCCs in their design, which simply suggests Nvidia partners *knew* about possible weaknesses in this area...
What is the only logical conclusion here? I leave it to you...
There is evidence that it's not just the MLCCs that are to blame. It could be a couple of hardware faults plus Nvidia driver problems on top.
Posted on Reply
#245
trsskater63
OneMoarPeople are having a hard time grasping how many corners AIBs cut with their boards these days

basically they cut every corner they can, and it's just enough to push the stability envelope past the limit when you are trying to hit that magical 2 GHz marketing number

ever since Nvidia started making their own boards, the AIBs have been cutting every corner possible; if you want a reliable card, you buy the Nvidia-made one (unless you wanna spend the money for the absolute top-tier cards like a Strix, Hall of Fame, or K1NGP1N)

this is a complete reversal from how it used to be

it used to be that AIB cards offered more bang for the buck, with better overclocking and better cooling; this is frankly no longer the case, and the short of it is that unless Nvidia relaxes some of the restrictions, the AIBs' only continued reason to exist is to make cards for Nvidia
Do you have any evidence of this? I'm trying to look this up and I can't find any information for or against it. I know companies will cut corners wherever possible, but in my experience the AIB cards have always delivered better performance, and I buy a new graphics card every year. I have had a Gigabyte Windforce, an Asus Strix, and an MSI Gaming X over the last 5 years. Are those cards normally made better? I usually get good temperatures and overclocks with them.
Posted on Reply
#246
kiriakost
mtcn77A few years back, Linus made a comment that Nvidia's polling method was precise but less frequent than AMD Radeon's faster guesswork. Nvidia's number was 33 microsecond latency, afaik.
This could be related to slow responses to monitored events, wouldn't you say? We are talking about 2.5 GHz chips that were previously impossible when this monitoring software first took over.
In 1996 my first graphics card could do 2D and 25 fps of video, and no 3D, as that came later.
20 years ago we complained (me too) that NVIDIA was flooding the market with VGA card releases when the positive performance scaling was just 12%.
The TNT series, then TNT2, and so on... more money spent without real benefit.
Since 2012 I stopped closely following 3D card development; I used the storage capacity of my brain for other, far more productive thoughts.

Software development is always a second, supporting step; if NVIDIA had not added the relevant power-usage monitoring sensors, no one would be able to see power-related information (electrical measurements).
Posted on Reply
#247
mtcn77
kiriakostSoftware development is always a second, supporting step; if NVIDIA had not added the relevant power-usage monitoring sensors, no one would be able to see power-related information (electrical measurements).
The sensors have a temporal resolution, which is what I'm saying. Props to Nvidia again; I used to go heavy on overclocking methods. This one, throttling near the voltage threshold limit, is the best (nothing else saves power while still giving maximum performance), but the drawback is you have to act quickly.
If only we still had @The Stilt around.

The method I would suggest on a big-die GPU is still the same: try incrementally in 50 MHz steps and see if there is a cutoff point where this behaviour starts. 1600 MHz, 1650 MHz, 1700 MHz... I'm not a metrologist (a science I highly respect), but I can at least go down to the minimum resolution (1 MHz) until the problem begins.
I used to combine it with ATi Tray Tools, since most software did not come with its own error counter. I would monitor the GPU frequency time log in its overclock test and watch for the card to spit out errors in ATT (you had to select OSD error check to monitor it live in the lower corner).
It was great fun, but such old software has a habit of damaging your card when continuously running at 5000 fps, lol.
I cannot be of much other help outside of pointing out which software I used to get a frame of reference.
I hope they fix it, because it rekindles the good old times I spent dialing just single digits in MSI Afterburner.
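For what it's worth, here's a rough Python sketch of that stepping procedure; set_core_clock_mhz and run_stress_test are hypothetical placeholders for whatever OC utility and error counter you actually drive (Afterburner, ATT, a render loop), not real APIs:

```python
def find_stability_cutoff(set_core_clock_mhz, run_stress_test,
                          start_mhz=1600, stop_mhz=2100,
                          coarse_step=50, fine_step=1):
    """Walk the core clock up in coarse steps until errors appear, then refine
    at the minimum resolution to find the last stable clock."""
    last_good = None
    clock = start_mhz
    while clock <= stop_mhz:
        set_core_clock_mhz(clock)
        if run_stress_test(duration_s=120):      # True = no errors, no crash to desktop
            last_good = clock
            clock += coarse_step
        else:
            break                                 # coarse cutoff found
    if last_good is None:
        return None                               # unstable even at the starting clock
    if clock > stop_mhz:
        return last_good                          # never failed within the tested range
    # refine between the last good clock and the failing one, 1 MHz at a time
    for clock in range(last_good + fine_step, last_good + coarse_step, fine_step):
        set_core_clock_mhz(clock)
        if not run_stress_test(duration_s=120):
            return clock - fine_step
        last_good = clock
    return last_good
```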
Posted on Reply
#248
kiriakost
mtcn77The sensors have a temporal resolution, which is what I'm saying. Props to Nvidia again; I used to go heavy on overclocking methods. This one, throttling near the voltage threshold limit, is the best (nothing else saves power while still giving maximum performance), but the drawback is you have to act quickly.
If only we still had @The Stilt around.

The method I would suggest on a big-die GPU is still the same: try incrementally in 50 MHz steps and see if there is a cutoff point where this behaviour starts. 1600 MHz, 1650 MHz, 1700 MHz... I'm not a metrologist (a science I highly respect), but I can at least go down to the minimum resolution (1 MHz) until the problem begins.
I used to combine it with ATi Tray Tools, since most software did not come with its own error counter. I would monitor the GPU frequency time log in its overclock test and watch for the card to spit out errors in ATT (you had to select OSD error check to monitor it live in the lower corner).
It was great fun, but such old software has a habit of damaging your card when continuously running at 5000 fps, lol.
I cannot be of much other help outside of pointing out which software I used to get a frame of reference.
I hope they fix it, because it rekindles the good old times I spent dialing just single digits in MSI Afterburner.
Well, I am not an electrical metrologist either; they respect the rules of science and they never, ever do overclocking. :D
My current AMD HD 5770 has an internal scanning method to determine max OC limits all by itself.
I never cared to learn the scaling up in megahertz steps.

RTX 3000, and whatever follows after it, is a different animal; I sense that major OC software and utilities will not be required any more.
This new hardware is now made to restrict careless handling on the users' side.
It's a new car with no gear stick, and with a limiter at the allowed top speed.
Anyone who disagrees with the new reality should never get one.
Posted on Reply
#249
mtcn77
kiriakostMy current AMD HD 5770 has an internal scanning method to determine max OC limits all by itself.
I never cared to learn the scaling up in megahertz steps.
Thanks for making it easier for me to give an example.
Since this is about power delivery, it has to "match" the power requirement of the normal operating behaviour.
Since the testing utility is good, but doesn't test at the same temperature ranges an overclocked case can rise to, we'll have to settle for more moderate speeds than what the utility would have us believe.
This is why I mentioned ATT; it follows the same fan temperature curve as the normal operating behaviour.
This is mainly about vdroop, and temperature beyond that. The way I used it was: I would start the 3D renderer, let go of tuning a little bit, wait until the card reached previously noted temperature points where it destabilized, and switch to 'manual' from there (never expected ATT to sound this cool).
The leakage brought about by temperature, and the climbing power requirements due to faster shader operation, would get you to a sweet spot where this cutoff was easy to pinpoint. From there, I would play either with the fan curve, or voltage, or, on the CPU, with LLC (you couldn't pick its temperature gradient if you hadn't logged everything up until here), but basically I find it more exciting to bust cards using this method than to use them daily, lol.
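If anyone wants to reproduce that kind of run, here is a bare-bones sketch of the logging side; read_gpu_temp_c and read_render_error_count are hypothetical stand-ins for whatever sensor readout and ATT-style error counter you have available, not real library calls:

```python
import time

def log_destabilization_temp(read_gpu_temp_c, read_render_error_count,
                             poll_s=1.0, max_minutes=30):
    """While a 3D load runs, poll temperature and the error counter and report
    the core temperature at which the first render errors show up."""
    baseline = read_render_error_count()
    deadline = time.time() + max_minutes * 60
    while time.time() < deadline:
        temp_c = read_gpu_temp_c()
        errors = read_render_error_count()
        if errors > baseline:
            print(f"first errors at {temp_c:.1f} C ({errors - baseline} new errors)")
            return temp_c
        time.sleep(poll_s)
    print("no errors observed over the run; card stayed stable")
    return None
```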
Posted on Reply
#250
John Naylor
And yet again ... the folks who have to be the first on the block to get the new shiny thing get hammered. Almost every new generation has problems, some big, some small ... some affect one brand, less often the whole series. Sometimes they are easy to fix (i.e. MSI's extremely aggressive adhesive on the tape holding the fans still during shipping); sometimes they are significant but fixable (i.e. EVGA's missing thermal pads on the 1xxx series); sometimes they require design changes (i.e. 1/3 of EVGA's 9xx series heat sinks "missing" the GPU). Sometimes these just affect one AIB design ... sometimes they are series-wide, like AMD's inadequate 6-pin connector on the 480.

As the saying goes ... good things come to those who wait ... if the PCBs are indeed faulty, they will be redesigned, and those who choose to wait won't have to deal with a first-stepping design issue. The alleged "cutting corners" by AIBs is simply not supported by history ... the AIB offerings, for the most part, have always outperformed the reference and FE designs. Yes, we have the deficient EVGA designs (9xx heat sink, 1xxx missing thermal pads, 2xxx Black series non-"A" GPU) which didn't measure up, but that's the exception rather than the rule. I have commented a few times asking what MSI did differently such that theirs is the only card to deliver more fps than the FE. I did note that they had one of the lowest power limits ... perhaps the problem arises when that limit is exceeded? In any case, hopefully folks who were unable to snag one before they were sold out will now cancel their orders, sit and wait until the problem is defined and which cards it affects, and see the issue addressed in later offerings.
Posted on Reply