Monday, November 18th 2024
NVIDIA "Blackwell" NVL72 Servers Reportedly Require Redesign Amid Overheating Problems
According to The Information, NVIDIA's latest "Blackwell" processors are encountering significant thermal management issues in high-density server configurations, potentially affecting deployment timelines for major tech companies. The challenges emerge specifically in GB200 NVL72 racks housing 72 Blackwell GPUs, which can consume up to 120 kilowatts of power per rack while weighing a "mere" 3,000 pounds (or about 1.5 tons). These thermal concerns have prompted NVIDIA to revisit and modify its server rack designs multiple times to prevent performance degradation and potential hardware damage. Hyperscalers like Google, Meta, and Microsoft, which rely heavily on NVIDIA GPUs for training their advanced language models, have allegedly expressed concerns about possible delays in their data center deployment schedules.
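For a sense of scale, the 120 kW rack figure can be turned into a rough coolant flow estimate using the standard heat-transport relation Q = m * c_p * delta T. The sketch below is only a back-of-the-envelope calculation: it assumes plain water as the coolant, assumes all rack heat leaves through the liquid loop, and uses illustrative temperature rises that do not come from the report. The function name `required_flow_lpm` is hypothetical.

```python
# Back-of-the-envelope coolant flow for a liquid-cooled rack.
# Assumptions (not from the report): plain water as coolant, all rack heat
# removed by the liquid loop, and illustrative coolant temperature rises.

WATER_SPECIFIC_HEAT = 4186.0  # J/(kg*K), approximate value for liquid water
WATER_DENSITY = 1.0           # kg/L, approximate value for liquid water


def required_flow_lpm(heat_watts: float, delta_t_kelvin: float) -> float:
    """Return the water flow (litres per minute) needed to carry away
    heat_watts with a coolant temperature rise of delta_t_kelvin."""
    if delta_t_kelvin <= 0:
        raise ValueError("delta_t_kelvin must be positive")
    # Q = m_dot * c_p * delta_T, solved for the mass flow m_dot in kg/s.
    mass_flow_kg_s = heat_watts / (WATER_SPECIFIC_HEAT * delta_t_kelvin)
    # Convert kg/s to L/min using the density of water.
    return mass_flow_kg_s / WATER_DENSITY * 60.0


RACK_POWER_WATTS = 120_000  # "up to 120 kilowatts" per rack, from the report

for delta_t in (5, 10, 20):  # hypothetical coolant temperature rises, in K
    flow = required_flow_lpm(RACK_POWER_WATTS, delta_t)
    print(f"delta T = {delta_t:2d} K -> about {flow:.0f} L/min")
```

Under these assumptions, halving the allowed temperature rise doubles the required flow, which is why a small margin between coolant and ambient temperature makes dense racks hard to cool.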
The thermal management issues follow earlier setbacks related to a design flaw in the Blackwell production process. The problem stemmed from the complex CoWoS-L packaging technology, which connects dual chiplets using an RDL interposer and LSI bridges. Thermal expansion mismatches between various components led to warping issues, requiring modifications to the GPU's metal layers and bump structures. A company spokesperson characterized these modifications as part of the standard development process, noting that a new photomask resolved this issue. The Information states that mass production of the revised Blackwell GPUs began in late October, with shipments expected to commence in late January. However, these timelines are unconfirmed by NVIDIA, and some server makers like Dell have confirmed that these GB200 NVL72 liquid-cooled systems are shipping now, not in January, with GPU cloud provider CoreWeave as a customer. The original report could be using older information, as Dell is one of NVIDIA's most significant partners and among the first in the supply chain to gain access to new GPU batches.
Sources:
The Information, Michael Dell
15 Comments on NVIDIA "Blackwell" NVL72 Servers Reportedly Require Redesign Amid Overheating Problems
Trying to beat Intel at who has the highest wattage and highest heat.
And even if these problems are solved this generation, what about the next one? Rubin on 2 or 3 nm consuming 2000W per GPU? What is our species thinking?
Up to 120 kW vs. up to 0.35 kW for an Intel CPU (according to some tech news; I think I have also read 447 W, i.e. 0.447 kW).
en.wikipedia.org/wiki/Metric_prefix
It all depends on the transistor count. I do not know the NVIDIA processor in question that well.
Don't worry, let the engineers handle it.
So it's not really the amount of heat that is the problem. The density doesn't help, but an engine is smaller than a rack. The main issue is probably trying to cool this with a very small delta from ambient. You would need a lot of flow in your water loop.
AI is... everywhere nowadays, it would seem. It's every bit as big an industry as it's claimed to be, and I'd wager it went even farther than the cryptocurrency thing ever hoped to.
Also, greed-driven innovation sometimes needs some tempering, or our planet will end up with both a lot of issues and an energy crisis. Lookin' at you, AI chatbot.