Monday, November 18th 2024

NVIDIA "Blackwell" NVL72 Servers Reportedly Require Redesign Amid Overheating Problems

According to The Information, NVIDIA's latest "Blackwell" processors are reportedly encountering significant thermal management issues in high-density server configurations, potentially affecting deployment timelines for major tech companies. The challenges emerge specifically in GB200 NVL72 racks, which house 72 Blackwell GPUs, can consume up to 120 kilowatts of power per rack, and weigh a "mere" 3,000 pounds (about 1.5 US tons). These thermal concerns have prompted NVIDIA to revisit and modify its server rack designs multiple times to prevent performance degradation and potential hardware damage. Hyperscalers like Google, Meta, and Microsoft, which rely heavily on NVIDIA GPUs for training their advanced language models, have allegedly expressed concerns about possible delays in their data center deployment schedules.

The thermal management issues follow earlier setbacks related to a design flaw in the Blackwell production process. The problem stemmed from the complex CoWoS-L packaging technology, which connects dual chiplets using an RDL interposer and LSI bridges. Thermal expansion mismatches between various components led to warping issues, requiring modifications to the GPU's metal layers and bump structures. A company spokesperson characterized these modifications as part of the standard development process, noting that a new photomask resolved this issue. The Information states that mass production of the revised Blackwell GPUs began in late October, with shipments expected to commence in late January. However, NVIDIA has not confirmed these timelines, and some server makers, such as Dell, have confirmed that their liquid-cooled GB200 NVL72 systems are shipping now, not in January, with GPU cloud provider CoreWeave as a customer. The original report could be based on older information, as Dell is one of NVIDIA's most significant partners and among the first in the supply chain to gain access to new GPU batches.
Sources: The Information, Michael Dell

15 Comments on NVIDIA "Blackwell" NVL72 Servers Reportedly Require Redesign Amid Overheating Problems

#1
Gucky
120000W are not easy to cool, when some people have problems cooling a 250W CPU...
#2
Quicks
Burn baby burn.

Trying to beat Intel at who has the highest wattage and highest heat.
#3
Space Lynx
Astronaut
Greed blinds all men it turns out, even the smart ones. I guess the ancient philosophers were right all along.
#4
Daven
Wasn't this so easy to see coming?!

And even if these problems are solved this generation, what about the next one? Rubin on 2 or 3 nm consuming 2000W per GPU? What is our species thinking?
#5
VulkanBros
A new GTX 480 it seems ..... I smell fried eggs :laugh:
#6
_roman_
Gucky wrote: 120000W are not easy to cool, when some people have problems cooling a 250W CPU...
Please stick to the metric prefixes -> 10^3 = 1000 = kilo = k

up to 120 kW vs up to 0.35 kW for an Intel CPU (according to some tech news; I think I have also read 447 W = 0.447 kW)

en.wikipedia.org/wiki/Metric_prefix

It all depends on the transistor count. I do not know the NVIDIA processor in question that well.
#7
TheinsanegamerN
Daven wrote: Wasn't this so easy to see coming?! And even if these problems are solved this generation, what about the next one? Rubin on 2 or 3 nm consuming 2000W per GPU? What is our species thinking?
Our species is thinking "hmm, with double the power we can do triple the work. FANTASTIC"

Don't worry, let the engineers handle it.
#8
Wirko
Gucky wrote: 120000W are not easy to cool, when some people have problems cooling a 250W CPU...
Just imagine you had an empty 42U server rack and a pile of 300 unboxed 4090 GPUs next to it, with an assignment to stack them all inside the rack. It would be possible but little empty space would remain.
Daven wrote: Wasn't this so easy to see coming?! And even if these problems are solved this generation, what about the next one? Rubin on 2 or 3 nm consuming 2000W per GPU? What is our species thinking?
Immersion cooling in molten bitumen?
#9
Punkenjoy
Gucky wrote: 120000W are not easy to cool, when some people have problems cooling a 250W CPU...
We can cool cars that dissipate more heat than that. They do run at higher temperatures, but still (look at truck towing tests where they drive uphill at maximum load). Some could say we cool megawatts of thermal energy in power plant cooling towers.

So it's not really the amount of heat that is the problem. The density doesn't help, but engines are smaller than a rack. The main issue is probably trying to cool this with a very small delta from ambient. You would need a lot of flow in your water loop.
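
The point about flow can be made concrete with the basic heat balance Q = m_dot * c_p * dT. The sketch below is only a rough estimate, not a description of NVIDIA's actual cooling design: it assumes plain water as the coolant (c_p of about 4186 J/(kg·K), density of about 1 kg/L), treats the article's "up to 120 kW" as pure heat going into the loop, and uses illustrative temperature rises. The function name `coolant_flow_l_per_min` is hypothetical.

```python
# Rough estimate only: water flow needed to carry away a given heat load.
# Assumptions (not from the article): water coolant, c_p ~ 4186 J/(kg*K),
# density ~ 1 kg/L, and illustrative coolant temperature rises.

WATER_SPECIFIC_HEAT_J_PER_KG_K = 4186.0
WATER_DENSITY_KG_PER_L = 1.0


def coolant_flow_l_per_min(heat_load_w: float, delta_t_k: float) -> float:
    """Return the water flow (litres per minute) that absorbs heat_load_w
    while warming by delta_t_k, from Q = m_dot * c_p * dT."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_per_s = heat_load_w / (WATER_SPECIFIC_HEAT_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L * 60.0


if __name__ == "__main__":
    rack_heat_w = 120_000.0  # "up to 120 kW" per rack, as quoted in the article
    for delta_t in (5.0, 10.0, 20.0):  # hypothetical coolant temperature rises
        flow = coolant_flow_l_per_min(rack_heat_w, delta_t)
        print(f"dT = {delta_t:4.1f} K -> about {flow:6.1f} L/min")
```

Under these assumptions, a 10 K rise needs roughly 172 L/min for one rack, and halving the allowed rise doubles the required flow, which is why a small delta from ambient is the hard part.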
#10
Dr. Dro
TheinsanegamerN wrote: Our species is thinking "hmm, with double the power we can do triple the work. FANTASTIC" Don't worry, let the engineers handle it.
Double the power and triple the work is actually a very good deal, efficiency-wise. All recent GPU designs, irrespective of vendor, raised their nominal power to maximize their generational improvement; it's a sign that Moore's law is slowing down.
Space Lynx wrote: Greed blinds all men it turns out, even the smart ones. I guess the ancient philosophers were right all along.
The machines are obviously built to the demand and specification of the clients. There's no "greed" here beyond the usual corporate behavior, and we're enabling them by using the services they provide.
#11
Space Lynx
Astronaut
Dr. Dro wrote: Double the power and triple the work is actually a very good deal, efficiency-wise. All recent GPU designs irrespective of vendor raised their nominal power to maximize their generational improvement, it's a sign that Moore's law is slowing down. The machines are obviously built to the demand and specification of the clients. There's no "greed" here beyond the usual corporate behavior, and we're enabling them by using the services they provide.
The strange thing is, they aren't really making money off the Common Man. ChatGPT does not show me any ads, does not cost me anything, and it's the only AI that the Common Man uses.
#12
Dr. Dro
Space Lynx wrote: The strange thing is, they aren't really making money off the Common Man. ChatGPT does not show me any ads, does not cost me anything, and it's the only AI that the Common Man uses.
Well, the companies that order systems like these are either in big tech or specialized AI vendors that power the AI engines that go in just about every platform nowadays, ranging from Google, Apple, Microsoft, OpenAI, Meta, X, etc. to even relatively small, local businesses that now use LLMs in their portfolio. For example, I subscribe to an educational resource to help me with law school and they recently added an AI prompt for it to evaluate and solve complex questions, and I gotta say, it's actually remarkably good at it. The servers that process things like these are all provided by one of those big companies.

AI is... everywhere nowadays, it would seem. It's every bit as big an industry as it's claimed to be, and I'd wager it went even farther than the cryptocurrency thing ever hoped to.
#13
R-T-B
VulkanBros wrote: A new GTX 480 it seems ..... I smell fried eggs :laugh:
Please. I LIKED Fermi, but these new wattages cannot even be compared. They are madness.
Dr. Dro wrote: The machines are obviously built to the demand and specification of the clients. There's no "greed" here beyond the usual corporate behavior
So greed, got it.
#14
Dr. Dro
R-T-B wrote: So greed, got it.
As long as we are capitalists, greed is the driving force behind innovation :)
#15
R-T-B
Dr. Dro wrote: As long as we are capitalists, greed is the driving force behind innovation :)
True, but just pointing out the obvious I guess, heh.

Also, greed driven innovation sometimes needs some tempering, or our planet will end up with both a lot of issues and an energy crisis. Lookin' at you, AI chatbot.