Friday, September 25th 2020

RTX 3080 Crash to Desktop Problems Likely Connected to AIB-Designed Capacitor Choice

Igor's Lab has posted an interesting investigative article in which he advances a possible reason for the recent crash-to-desktop problems reported by RTX 3080 owners. For one, Igor mentions how the launch timings were much tighter than usual, with NVIDIA's AIB partners having much less time than would be adequate to prepare and thoroughly test their designs. One reason this apparently happened was that NVIDIA released the compatible driver stack much later than usual for AIB partners; this meant that their testing and QA for produced RTX 3080 graphics cards was mostly limited to power-on and voltage-stability testing, rather than actual gaming/graphics workload testing. That might have allowed some less-than-stellar chip samples to be employed on some of the companies' OC products which, at higher operating frequencies and the resulting broadband frequency mixtures, hit the apparent 2 GHz wall that produces the crash to desktop.

Another reason for this, according to Igor, is the "reference board" PG132 design, which serves as a "Base Design" for partners to architect their custom cards around. The thing here is that NVIDIA's BOM apparently left open choices in terms of power filtering and regulation for the mounted capacitors. The Base Design features six mandatory capacitors for filtering high frequencies on the voltage rails (NVVDD and MSVDD). There are a number of capacitor choices that can be installed here, with varying levels of capability: POSCAPs (Conductive Polymer Tantalum Solid Capacitors) are generally worse than SP-CAPs (Conductive Polymer Aluminium Electrolytic Capacitors), which are in turn surpassed in quality by MLCCs (Multilayer Ceramic Chip Capacitors, which have to be deployed in groups). Below is the circuitry arrangement employed beneath the BGA array where NVIDIA's GA102 chip is seated, which corresponds to the central area on the back of the PCB.
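To get a feel for why the capacitor choice matters for high-frequency filtering, here is a minimal sketch comparing the impedance of a single bulk polymer capacitor against a group of ten MLCCs across frequency, using the usual series R-L-C capacitor model. The part values (capacitance, ESR, ESL) are illustrative assumptions, not figures from any card's actual BOM.

```python
import math

def impedance(f_hz, c_farad, esr_ohm, esl_henry):
    """Magnitude of a real capacitor's impedance, modelled as a series R-L-C."""
    w = 2 * math.pi * f_hz
    reactance = w * esl_henry - 1.0 / (w * c_farad)
    return math.hypot(esr_ohm, reactance)

# Illustrative (assumed) part values, not from any actual BOM:
poscap = dict(c_farad=470e-6, esr_ohm=6e-3, esl_henry=2e-9)    # one bulk polymer cap
mlcc   = dict(c_farad=47e-6,  esr_ohm=2e-3, esl_henry=0.5e-9)  # one ceramic cap

for f in (1e6, 10e6, 100e6):
    z_polymer = impedance(f, **poscap)
    # Ten MLCCs in parallel: capacitance adds, ESR and ESL divide by the count.
    z_mlcc10 = impedance(f, mlcc["c_farad"] * 10,
                         mlcc["esr_ohm"] / 10, mlcc["esl_henry"] / 10)
    print(f"{f/1e6:>5.0f} MHz  polymer: {z_polymer*1000:8.2f} mOhm"
          f"   10x MLCC: {z_mlcc10*1000:8.2f} mOhm")
```

With these assumed values, the MLCC group holds the supply impedance roughly an order of magnitude lower in the tens-of-MHz range, because its effective loop inductance is divided across the parallel parts; that is the kind of high-frequency content these six filter positions are there to absorb.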
In the images below, you can see how NVIDIA and its AIBs designed this regulator circuitry (NVIDIA Founders Edition, MSI Gaming X, ZOTAC Trinity, and ASUS TUF Gaming OC, in order, from our reviews' high-resolution teardowns). NVIDIA's Founders Edition design uses a hybrid capacitor deployment, with four SP-CAPs and two MLCC groups of 10 individual capacitors each in the center. MSI uses a single MLCC group in the central arrangement, with five SP-CAPs handling the rest of the filtering duties. ZOTAC went the cheapest way (which may be one of the reasons their cards are also among the cheapest), with a six-POSCAP design (which, remember, are worse than MLCCs). ASUS, however, designed their TUF with six MLCC arrangements; no corners were cut in this part of the power circuitry.

It's likely that the crash-to-desktop problems are related to both these issues, which would also explain why some cards stop crashing when underclocked by 50-100 MHz: at lower frequencies (which generally keep boost clocks below the 2 GHz mark), there is less broadband frequency mixture happening, so the POSCAP-only solutions can do their job, even if just barely.
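For anyone who wants to try the 50-100 MHz workaround before a driver or BIOS fix arrives, the usual route is a negative core offset in a tool like MSI Afterburner; a scriptable alternative is nvidia-smi's clock-locking option, sketched below. The 1,905 MHz ceiling is an arbitrary example value, not an official recommendation, and the command needs administrator/root rights plus a driver that exposes clock locking for the card.

```python
import subprocess

# Assumed, illustrative clock range: keep boost clearly under the ~2 GHz region.
MIN_CLOCK_MHZ = 210
MAX_CLOCK_MHZ = 1905

def lock_boost_ceiling(gpu_index: int = 0) -> None:
    """Cap the GPU core clock range via nvidia-smi (requires admin/root and a
    driver/GPU combination that supports --lock-gpu-clocks)."""
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index),
         f"--lock-gpu-clocks={MIN_CLOCK_MHZ},{MAX_CLOCK_MHZ}"],
        check=True)

def reset_clocks(gpu_index: int = 0) -> None:
    """Undo the lock and hand control back to the normal boost algorithm."""
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "--reset-gpu-clocks"],
        check=True)

if __name__ == "__main__":
    lock_boost_ceiling()
```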
Source: Igor's Lab

297 Comments on RTX 3080 Crash to Desktop Problems Likely Connected to AIB-Designed Capacitor Choice

#251
trsskater63
John NaylorAnd yet again ... the folks who have to be the 1st one on the block to get the new shiny thing get hammered. Almost every new generation has problems, some big some small..... some affected 1 brand, less often all in the series. Sometimes they are easy to fix (i.e. MSI's extremely aggressive adhesive on the tape holding the fans still during shipping); sometimes they are significant but fixable (i.e. EVGA's missing thermal pads on the 1xxx series); sometimes they require design changes (i.e. 1/3 of EVGA's 9xx series heat sink "missing" the GPU). Sometimes these just affect one AIB design ... sometimes they are series wide like AMD's inadequate 6 pin connector on the 480.

As the saying goes .... good things come to those who wait .... if the PCBs are indeed faulty, they will be redesigned and those who choose to wait won't have to deal with a 1st stepping design issue. The alleged "cutting corners" by AIBs is simply not supported by history... the AIB offerings, for the most part, have always outperformed the reference and FE designs. Yes, we have the deficient EVGA designs (9xx heat sink, 1xxx missing thermal pads, 2xxx Black series non "A" GPU) which didn't measure up, but that's the exception rather than the rule. I have commented a few times, "what did MSI do differently that they are the only card to deliver more fps than the FE?" I did note that they had one of the lowest power limits ... perhaps the problem arises when that limit is exceeded? In any case, hopefully folks who were unable to snag one before they were sold out will now cancel their orders, sit and wait till the problem is defined, it's known which cards it affects, and the issue is addressed in later offerings
I'm pretty sure the fix is going to be an underclock, since they still perform above the promise on the box when underclocked. I don't think they would go back and retool their design unless they are going to make a new model that will cost more. Usually the easier fix, when acceptable, will be the path chosen. I hope they will make a better version; I would like to overclock since I find it fun to do even if it's not really needed. I think what you are saying about the power limit is the issue. Is this the highest power draw from a card to date? Or maybe it's a problem of the high power combined with the die shrink. From what I learned about CPUs, on a die shrink they don't need as much power to function because they become more power efficient. But efficiency flew out the window here.
Posted on Reply
#252
SoftwareRocketScientist
Remember what Igor said: “By the way, you also have to praise a company here that recognized the whole thing from the start and didn’t even let it touch them, as the Asus TUF RTX 3080 Gaming consequently did without POSCAPs and only used MLCC groups. My compliments, it fits!” ASUS did a fantastic job. They knew what the problem was, or at least they predicted right. Their quality control caught the problem and they went the best-quality route at $50 more. Lesson learnt.
Posted on Reply
#253
mtcn77
trsskater63I'm pretty sure the fix is going to be an underclock
I also think a VRM PWM frequency change, or a monitoring-software polling change, might be due. I like to keep my testing simple.
Posted on Reply
#254
Caring1
BoboOOZNobody signed to be a beta tester, the advertising was "It just works!"
Saying it just works is the same as saying it barely works, exact same meaning. ;)
Posted on Reply
#255
Vayra86
Animalpak3000 series already looked too good to be true...
Oh, it'll get to the point where it's worthwhile, I'm not too worried, as this is too-big-to-fail territory.

But it'll take a while, and time is on our side really. The longer and harder Nvidia struggles, the more they will need to watch the AMD space. Lacking supply can also be an easy ticket to switch camps; at some point people do need a GPU, even if the one available is not their first choice - which remains to be seen, mind.

If Nvidia needs a downclock or a limit to peak clocks, they're losing % against the competition, which might just nudge things in Navi's favor. Interesting times! I hope our resident reviewer is happy to revisit those FE's.... :D
John NaylorAnd yet again ... the folks who have to be the 1st one on the block to get the new shiny thing get hammered. Almost every new generation has problems, some big some small..... some affected 1 brand, less often all in the series. Sometimes they are easy to fix (i.e. MSI's extremely aggressive adhesive on the tape holding the fans still during shipping); sometimes they are significant but fixable (i.e. EVGA's missing thermal pads on the 1xxx series); sometimes they require design changes (i.e. 1/3 of EVGA's 9xx series heat sink "missing" the GPU). Sometimes these just affect one AIB design ... sometimes they are series wide like AMD's inadequate 6 pin connector on the 480.

As the saying goes .... good things come to those who wait .... if the PCBs are indeed faulty, they will be redesigned and those who choose to wait won't have to deal with a 1st stepping design issue. The alleged "cutting corners" by AIBs is simply not supported by history... the AIB offerings, for the most part, have always outperformed the reference and FE designs. Yes, we have the deficient EVGA designs (9xx heat sink, 1xxx missing thermal pads, 2xxx Black series non "A" GPU) which didn't measure up, but that's the exception rather than the rule. I have commented a few times, "what did MSI do differently that they are the only card to deliver more fps than the FE?" I did note that they had one of the lowest power limits ... perhaps the problem arises when that limit is exceeded? In any case, hopefully folks who were unable to snag one before they were sold out will now cancel their orders, sit and wait till the problem is defined, it's known which cards it affects, and the issue is addressed in later offerings
You'd think folks would know better by now, but no. So it's well deserved really. Early adopting is great, as long as it's not me ;)
Caring1Saying it just works is the same as saying it barely works, exact same meaning. ;)
The power of emphasis in speech... haha
lexluthermiesterTrue. Finding the optimal balance between cost, quality and value to the end user can be a very serious challenge. Everyone wants to make money and as much as possible. In the case of video card AIBs, they want to make money but also boost their brand. Most actually care about making a quality product and hate it when things like the problems being faced currently happen.
You don't see the inside of big companies a lot, do you...

I do... and yes, 'they hate it'... until they get in the car and drive home. It's a 9-to-5 job, this hating of the work people have or haven't done, and the bottom line is just people screwing up and management not giving it enough mind to fix it. Or management killing the workforce with too much work and/or too little time. The assumption that everyone can do his job properly is a bad one; the assumption should be 'double check everything or it will likely go wrong'. This is what you do when you release software or code, too. You make sure there is no room for error through well-defined processes - and even then, something minor might just slip through the cracks.

ALL of this is self-inflicted, conscious, well-calculated risk management - even that last 1% that does get past and goes wrong. The bottom line is cost/benefit; it just doesn't always work out like people think it does. In the end, it is only and always the company producing something that is fully responsible. Nobody should ever have to find excuses for any company making mistakes. They're not mistakes. They were thoroughly looked at, and some people in suits together said 'We'll run with this', and poof, the consumer can start shoveling poop. Meanwhile, a healthy profit margin was already secured as 'the bottom line'...

Case in point here, because the only reason this is happening is that cards get pushed beyond or too close to the edge. That is directly, and only, a cost/benefit scenario: performance per dollar. Even despite this capacitor detail, really, which kind of comes on top of it. The fact that the line is this thin is telling in terms of overall product longevity as well. That, along with the heat of the memory and several other decisions made with this 3080, really keeps me FAR away from it, so far.

It doesn't look good at all. It's a bit like cheap sports cars. Lots of HP for not a lot of cash... but your seat is shit, the tank is empty before you've reached the end of the street, and after a year you're replacing half the engine.
Posted on Reply
#256
Darmok N Jalad
Late to the party, but with these multi-billion transistor GPUs, I think they get pushed a little harder than they probably should be. I bet new drivers or firmware will just dial back the boost algorithm for the sake of stability. The cards can still push to the advertised boost on an easy task, like Luxmark Mirrorball, but you will rarely see it in games.
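A quick way to check that on your own card is to poll the reported graphics clock during a game session versus a light workload. A minimal sketch, assuming nvidia-smi is on the PATH (one-second polling will miss short spikes, so treat the result as a rough picture):

```python
import subprocess, time

THRESHOLD_MHZ = 2000   # the region where the reported crashes cluster
SAMPLES = 300          # roughly five minutes at one sample per second

above = 0
for _ in range(SAMPLES):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=clocks.gr",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    clock_mhz = int(out.strip().splitlines()[0])  # first GPU only
    if clock_mhz >= THRESHOLD_MHZ:
        above += 1
    time.sleep(1)

print(f"{above}/{SAMPLES} samples at or above {THRESHOLD_MHZ} MHz")
```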
Posted on Reply
#257
lexluthermiester
Darmok N JaladI bet new drivers or firmware will just dial back the boost algorithm for the sake of stability.
This is very likely.
Posted on Reply
#258
John Naylor
trsskater63I'm pretty sure the fix is going to be an underclock, since they still perform above the promise on the box when underclocked. I don't think they would go back and retool their design unless they are going to make a new model that will cost more. Usually the easier fix, when acceptable, will be the path chosen. I hope they will make a better version; I would like to overclock since I find it fun to do even if it's not really needed. I think what you are saying about the power limit is the issue. Is this the highest power draw from a card to date? Or maybe it's a problem of the high power combined with the die shrink. From what I learned about CPUs, on a die shrink they don't need as much power to function because they become more power efficient. But efficiency flew out the window here.
Consider ...

AMD did both .... the immediate fix on the 480 was to cut power delivery with BIOS and driver updates, but later on, vendors switched to 8 pin designs
EVGA did with the 970 ... 1st they argued that they 'designed it that way', but later they came out with a new design
EVGA did again, with the malfunctioning 1060 - 1080s ... 1st offer was a recall or a thermal pad kit you could install yaself ... later all cards came with thermal pads.
lexluthermiesterThis is very likely.
I think that's an automatic ... as above, AMD did the same thing with the 6 pin 480 fiasco ... but they followed with a move to 8 pin cards later on. Same with EVGA's mishaps ... I just don't see everyone sitting and leaving this alone .... at the next board meeting, there will be at least one person in the room saying "we need to take this step to distinguish ourselves above the others" ... but the reality is there will be one of those guys in every boardroom. I'm still curious as to why no one was able to beat the FE fps-wise .... while most of the other AIBs allowed for a greater wattage limit. MSI left theirs 20 watts BELOW the FE .... maybe MSI saw something no one else picked up?
Posted on Reply
#259
Minus Infinity
newtekie1Early adopters are beta testers these days.
Alas, that's pretty much true of any product these days and a lot of software. It's a pathetic situation, and Nvidia rushed the product out to try and get a lot of hype generated and garner quick sales before Big Navi came along. All to the customers' detriment.
Posted on Reply
#260
AsRock
TPU addict
theoneandonlymrkNo company shouts more about their work with partners, Devs and AIB.
The reference spec design they passed to AIBs was different to their own reference card's.
And they compressed development and testing time to near zero.
And they allowed such design variation in their development reference kit instead of both knowing that it needed specific voltage conditioning and informing AIB partners or limiting those AIB designs.

It's not all on Nvidia but they share the blame.
But if the AIBs actually tested them fully they would have hit the issue; maybe they knew about it and thought fck it.
Posted on Reply
#261
HD64G
FYI, all models can CTD when approaching or just surpassing 2 GHz...

Posted on Reply
#262
Zubasa
HD64GFYI, all models can CTD when approaching or just surpassing 2 GHz...

At some point, people need to realize that Ampere just doesn't clock quite as well as Turing at ambient.
Also, by default the 3080 uses the lower-bin GA102 dies.
Posted on Reply
#263
BoboOOZ
ZubasaAt some point, people need to realize that Ampere just doesn't clock quite as well as Turing at ambient.
That's a big fail for the "largest generational leap", though. Reminds me of this great video:
Posted on Reply
#264
nguyen
BoboOOZThat's a big fail for the "largest generational leap", though. Reminds me of this great video:
Samsung 8N is still a superior node to TSMC 12nm, which Nvidia used for Turing to beat the living crap outta Navi 7nm :D. If you think Navi is as efficient as Turing, look at the laptop GPU segment, where mobile Navi is almost non-existent.

Samsung 8N is fine; they seem to run cooler even with increased power consumption compared to TSMC 12nm FFN.

I expect all these CTDs will be fixed with a newer driver; it's not like the cause of these CTDs is that mysterious anyway. As for SP-CAP vs MLCC, it sounds like Asus did an excellent job with their TUF line, kudos to them, and I guess they can't make 3080/3090s fast enough. I asked my local retailer and they said they won't have the 3090 TUF in stock for at least 2 months ~_~.
Posted on Reply
#265
BoboOOZ
nguyenSamsung 8N is still a superior node to TSMC 12nm, which Nvidia used for Turing to beat the living crap outta Navi 7nm :D. If you think Navi is as efficient as Turing, look at the laptop GPU segment, where mobile Navi is almost non-existent.

Samsung 8N is fine; they seem to run cooler even with increased power consumption compared to TSMC 12nm FFN.
It's a poor node no matter how you look at it; it runs cool just because the coolers are very high quality and huge. As the video points out, it was a poor choice for Nvidia, and I wonder what the yields are on it.

It will also make a horrible node for any mobile GPU. I'm curious what Nvidia will do to come up with reasonable SKUs for laptops, because these gobble way too much power as they are.
Posted on Reply
#266
Assimilator
HD64GFYI, all models can CTD when approaching or just surpassing 2 GHz...

It's almost like computer silicon gets unstable when you clock it past its limits.
Almost like this has been true since silicon has been used in computers.
Almost like overclock instability related to silicon limits has nothing to do with capacitor choice.
Almost like this is a non-issue that has been blown way out of proportion.

As for those people who will say "but some people get over 2GHz": silicon lottery.
As for those people who will say "but MUH CLOCKS NVIDIA IS RIPPING ME OFF": NVIDIA never guaranteed you'd get over 2GHz boost, NVIDIA in fact never even guaranteed you'd get anything more than the rated base or boost clocks. Nobody does.
Posted on Reply
#267
nguyen
BoboOOZIt's a poor node no matter how you look at it; it runs cool just because the coolers are very high quality and huge. As the video points out, it was a poor choice for Nvidia, and I wonder what the yields are on it.

It will also make a horrible node for any mobile GPU. I'm curious what Nvidia will do to come up with reasonable SKUs for laptops, because these gobble way too much power as they are.
The AdoredTV video did not account for the fact that Nvidia has the whole Samsung 8N capacity to themselves; they can produce many more Ampere chips with Samsung 8N than they could with TSMC 7nm+. Nvidia was a late customer to TSMC 7nm and wouldn't have been able to secure much capacity.

On the subject of thermals and noise:
3080 TUF has better thermals and noise than the 2080 Ti Strix
3080 Gaming X Trio the same, better than the 2080 Ti Trio
3080 Zotac Trinity, same thing

So far all reviewed samples of the 3080 show very good thermal and noise characteristics; the 3090 samples are hotter and louder, but that is to be expected.
Ampere has around 20% higher perf/watt than Turing; yes, it is a little on the low side, but it is a compromise people have to accept to get better perf/dollar. I expect any AMD GPU owner would understand this :D
Posted on Reply
#268
steen
BoboOOZIt's a poor node no matter how you look at it; it runs cool just because the coolers are very high quality and huge. As the video points out, it was a poor choice for Nvidia, and I wonder what the yields are on it.

It will also make a horrible node for any mobile GPU. I'm curious what Nvidia will do to come up with reasonable SKUs for laptops, because these gobble way too much power as they are.
It will certainly be interesting to see how Samsung 8LPU matures for Nv. GA102 is still drawing >500 W peaks @ ~20 ms, so it's likely a number of factors, including the PSU (esp. split rails). The transients of a 23b-xtor die, esp. the lower bin tiers, are likely causing conniptions at board/mb/psu level. The stock boost algo will likely need to be less aggressive & the max P-state lowered/locked. The above-linked review focuses on temps as an arbiter of stability, for some reason, not power. Perhaps if it wasn't an open-air testbed...
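To put those transient figures in perspective, here is a back-of-the-envelope check of what a ~20 ms, 500 W-class excursion means for a 12 V rail with a typical per-rail over-current limit. All numbers below are assumptions for illustration, not measurements of any particular card or PSU.

```python
# Illustrative, assumed figures - not measurements of any specific hardware.
RAIL_VOLTAGE_V   = 12.0
CARD_AVG_W       = 320.0   # rated board power of the RTX 3080
CARD_PEAK_W      = 500.0   # short (~20 ms) transient excursion, as discussed above
RAIL_OCP_LIMIT_A = 30.0    # example per-rail OCP limit on a multi-rail PSU

avg_current  = CARD_AVG_W  / RAIL_VOLTAGE_V
peak_current = CARD_PEAK_W / RAIL_VOLTAGE_V

print(f"average draw : {avg_current:5.1f} A")
print(f"transient    : {peak_current:5.1f} A "
      f"({peak_current / RAIL_OCP_LIMIT_A:.0%} of a {RAIL_OCP_LIMIT_A:.0f} A OCP limit)")
# A large single-rail unit with a 60-70 A limit shrugs this off; a split-rail
# design whose 30 A rail also feeds other loads is much closer to tripping
# on a spike of that length.
```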
nguyenAmpere has around 20% higher perf/watt than Turing, yes it is a little on the low side but it is a compromise people have to accept to get a better perf/dollar
Only if you drink the Tu pricing koolaid. As for Samsung 8N all to themselves - if you define it that way, I guess...
Posted on Reply
#269
kiriakost
trsskater63I'm pretty sure the fix is going to be an underclock, since they still perform above the promise on the box when underclocked. I don't think they would go back and retool their design unless they are going to make a new model that will cost more.
Personally, I think this occasion will be a good test of how each brand responds to its customers.
Real and responsible brands will offer a specific pack of solutions or choices to their customers.

The low-end ones might simply hide their heads in the sand; an underclock will be their only offering, or a refund if you are a lucky one.
Posted on Reply
#270
TheoneandonlyMrK
kiriakostThis text does not make any sense, and from now on, everyone please use the AIBS or AIB acronym in its full form so confusion is avoided.
a) AIB to refer to 'non reference' graphics card designs.
b) An AIB supplier or an AIB partner is a company that buys the AMD (or Nvidia) Graphics Processing Unit to put on a board and then bring a complete and usable Graphics Card or AIB to market.
You must be unaware of any sort of irony.
AsRockBut if the AIBs actually tested them fully they would have hit the issue; maybe they knew about it and thought fck it.
The IFs are massive, but your idea of creating blame based on maybes doesn't sit right with me.

Nvidia rushed their own FE development, yet gave time to AIBs.

Yeah right, it's a rushed launch. I'm sure blame will be thrown about, but I'm not buying, so my concern and care levels are minimal. I have an opinion, yes, but I have said it; leave me out of the debate until you have something other than your opinion to discuss, because I don't give a shit what you Think. I stated Facts.
Posted on Reply
#271
kiriakost
mtcn77Thanks for making it easier for me to give an example.
Since this is about power delivery, it has to "match" the power requirement of the normal operating behaviour.
Since the testing utility is good, but doesn't test at the same temperature ranges an overclocked case can rise up to, we'll have to reserve ourselves to more moderate speeds than what the utility can have us believe... From there, I would play either with the fan curve, or the voltage, or, if on the CPU, with LLC (you couldn't pick its temperature gradient if you didn't log everything up until here), but basically I find it more exciting to bust cards using this method than to use them daily, lol.
You are welcome; this is the old pack of OC / VGA-hacking how-tos.
Let's return to today and the latest edge of GPU architecture.
The RTX 3080, due to its high pricing, is now considered an investment.
NVIDIA did use additional tricks to protect its work (product) so as to minimize the fail rate; it is extremely costly to handle a 1000-Euro VGA card being returned to base for an exchange.
I would not be surprised if people later discover that even BIOS flashing on those cards is locked by a password.

I wrote too much in this topic; now I will simply take a seat at the back of the bus and wait to inspect the quality of product support that all the major brands will deliver to their customers.
theoneandonlymrkYou must be unaware of any sort of irony
There are no schools good enough to teach us foreigners to detect sentiment in written text.
My advice to Americans: use neutral, clear text to describe the true point you are trying to make.
TPU is read internationally; this is not a neighborhood of Dallas, Texas.
Posted on Reply
#272
TheoneandonlyMrK
kiriakostYou are welcome; this is the old pack of OC / VGA-hacking how-tos.
Let's return to today and the latest edge of GPU architecture.
The RTX 3080, due to its high pricing, is now considered an investment.
NVIDIA did use additional tricks to protect its work (product) so as to minimize the fail rate; it is extremely costly to handle a 1000-Euro VGA card being returned to base for an exchange.
I would not be surprised if people later discover that even BIOS flashing on those cards is locked by a password.

I wrote too much in this topic; now I will simply take a seat at the back of the bus and wait to inspect the quality of product support that all the major brands will deliver to their customers.


There are no schools good enough to teach us foreigners to detect sentiment in written text.
My advice to Americans: use neutral, clear text to describe the true point you are trying to make.
TPU is read internationally; this is not a neighborhood of Dallas, Texas.
My advice: don't pull someone up for using AIBS and then rant at us telling us we have to use the same abbreviation you just pulled someone up for.
You are not the English language police; you can tell me how to do nothing, sir.
And I'm English, not American.
Posted on Reply
#273
nguyen
steenOnly if you drink the Tu pricing koolaid. As for Samsung 8N all to themselves - if you define it that way, I guess...
The 3080 is like 90% faster than the 1080 Ti while selling at the same price; this is a very sizeable performance gain for just 2 generations apart. If you skipped Turing, then Ampere is the logical upgrade from Pascal, which Jensen Huang specifically pointed out during his presentation :D.

Yeah, Turing was known for its terrible perf/dollar; lucky for Nvidia that Navi was not that much better anyway...
Posted on Reply
#274
mtcn77
I just wonder how hard it would be to write a green team version of ATi Tray Tools with its built-in overclock error monitoring tool? Nvidia could even purchase the software wholesale from Mr. Ray Adams. Not that big of a deal. There are people who would enjoy breaking the cards for them. As a point of reference, try running with 'vsync on' unless you want to break the solder joints too soon.
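The core of such a tool would be a simple step-and-test loop. Below is a rough sketch of that idea only; apply_core_offset() and run_stress_pass() are hypothetical placeholders standing in for whatever vendor API sets the clock offset and whatever artifact-scanning render loop judges a pass, which is the part ATi Tray Tools actually did well.

```python
import time

def apply_core_offset(mhz: int) -> None:
    """Hypothetical placeholder: set the GPU core clock offset (in practice this
    would wrap a vendor API such as NVAPI, or a vendor tool on the platform)."""
    raise NotImplementedError

def run_stress_pass(seconds: int = 60) -> bool:
    """Hypothetical placeholder: render a known scene for a while and return
    False on a crash, a driver reset, or a detected artifact."""
    raise NotImplementedError

def find_stable_offset(start: int = 0, step: int = 15, limit: int = 150) -> int:
    """Walk the core offset up until a pass fails, then back off one step."""
    offset = start
    while offset + step <= limit:
        apply_core_offset(offset + step)
        if not run_stress_pass():
            apply_core_offset(offset)   # fall back to the last known-good offset
            break
        offset += step
        time.sleep(5)                   # let temperatures settle between passes
    return offset
```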
Posted on Reply
#275
Vayra86
theoneandonlymrkMy advice: don't pull someone up for using AIBS and then rant at us telling us we have to use the same abbreviation you just pulled someone up for.
You are not the English language police; you can tell me how to do nothing, sir.
And I'm English, not American.
You're responding to someone who hasn't managed a single correct English sentence to save his life... I nearly fell off my chair :roll::roll::roll: Dafuq is happening to the world?
AssimilatorIt's almost like computer silicon gets unstable when you clock it past its limits.
Almost like this has been true since silicon has been used in computers.
Almost like overclock instability related to silicon limits has nothing to do with capacitor choice.
Almost like this is a non-issue that has been blown way out of proportion.

As for those people who will say "but some people get over 2GHz": silicon lottery.
As for those people who will say "but MUH CLOCKS NVIDIA IS RIPPING ME OFF": NVIDIA never guaranteed you'd get over 2GHz boost, NVIDIA in fact never even guaranteed you'd get anything more than the rated base or boost clocks. Nobody does.
Small caveat: these cards boost beyond 2 GHz without touching the dials. So out of the box, they can simply boost to oblivion. This is not right, and the end result is you're going to find a performance limitation to avoid that. GPU Boost should be able to account for differences in the silicon lottery, or it should be tweaked. Either way, it's a handicap (and whatever is rated on the box is irrelevant in that sense, right? We know better by now, and cards aren't reviewed on base clocks either).

It's not a non-issue at all. Previous generations worked a lot more smoothly, with GPU Boost peaking up high at the beginning of a load, and sustained too. The ripoff part... myeah... it's not substantial in any way. But it does tell us a great deal about the quality of this generation and the design choices they've been making for it.

The whole rock-solid GPU Boost perception we used to have... has been smashed to pieces with this. For me at least. It's a big stain on Nvidia's rep, if you ask me.
Posted on Reply