
NVIDIA "Blackwell" NVL72 Servers Reportedly Require Redesign Amid Overheating Problems

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,626 (0.98/day)
According to The Information, NVIDIA's latest "Blackwell" processors are reportedly encountering significant thermal management issues in high-density server configurations, potentially affecting deployment timelines for major tech companies. The challenges emerge specifically in GB200 NVL72 racks, which house 72 Blackwell GPUs. Each rack can consume up to 120 kilowatts of power and weighs a "mere" 3,000 pounds (about 1.5 short tons). These thermal concerns have prompted NVIDIA to revisit and modify its server rack designs multiple times to prevent performance degradation and potential hardware damage. Hyperscalers like Google, Meta, and Microsoft, which rely heavily on NVIDIA GPUs for training their advanced language models, have allegedly expressed concerns about possible delays in their data center deployment schedules.

The thermal management issues follow earlier setbacks related to a design flaw in the Blackwell production process. The problem stemmed from the complex CoWoS-L packaging technology, which connects dual chiplets using an RDL interposer and LSI bridges. Thermal expansion mismatches between various components led to warping issues, requiring modifications to the GPU's metal layers and bump structures. A company spokesperson characterized these modifications as part of the standard development process, noting that a new photomask resolved this issue. The Information states that mass production of the revised Blackwell GPUs began in late October, with shipments expected to commence in late January. However, NVIDIA has not confirmed these timelines, and some server makers such as Dell have confirmed that GB200 NVL72 liquid-cooled systems are shipping now, not in January, with GPU cloud provider CoreWeave as a customer. The original report could be using older information, as Dell is one of NVIDIA's most significant partners and among the first in the supply chain to gain access to new GPU batches.

 
Joined
Mar 27, 2018
Messages
129 (0.05/day)
Processor AMD Ryzen 5 3600
Motherboard Asus ROG Strix X470-F
Cooling Reeven RC-1205
Memory G.Skill F4-3200C16D-16GTZKW TridentZ 16GB (2x8GB)
Video Card(s) Powercolor x470 red devil
Storage Mushkin MKNSSDPL500GB-D8 Pilot 500GB
Display(s) Samsung 23"
Case Phanteks PH-EC300PTG
Audio Device(s) SupremeFX S1220A
Power Supply Super Flower SF-650F14MT(BK) Leadex 650W 80 Plus Silver
Mouse Cooler master m530
Keyboard Cheapo
Burn baby burn.

Trying to beat Intel at who has the highest wattage and highest heat.
 

Space Lynx

Astronaut
Joined
Oct 17, 2014
Messages
17,392 (4.69/day)
Location
Kepler-186f
Processor 7800X3D -25 all core
Motherboard B650 Steel Legend
Cooling Frost Commander 140
Video Card(s) Merc 310 7900 XT @3100 core -.75v
Display(s) Agon 27" QD-OLED Glossy 240hz 1440p
Case NZXT H710 (Red/Black)
Audio Device(s) Asgard 2, Modi 3, HD58X
Power Supply Corsair RM850x Gold
Greed blinds all men it turns out, even the smart ones. I guess the ancient philosophers were right all along.
 
Joined
Dec 12, 2016
Messages
1,914 (0.66/day)
Wasn't this so easy to see coming?!

And even if these problems are solved this generation, what about the next one? Rubin on 2 or 3 nm consuming 2000W per GPU? What is our species thinking?
 
Joined
Jan 31, 2005
Messages
2,095 (0.29/day)
Location
gehenna
System Name Commercial towing vehicle "Nostromo"
Processor 5800X3D
Motherboard X570 Unify
Cooling EK-AIO 360
Memory 32 GB Fury 3666 MHz
Video Card(s) 4070 Ti Eagle
Storage SN850 NVMe 1TB + Renegade NVMe 2TB + 870 EVO 4TB
Display(s) 25" Legion Y25g-30 360Hz
Case Lian Li LanCool 216 v2
Audio Device(s) Razer Blackshark v2 Hyperspeed / Bowers & Wilkins Px7 S2e
Power Supply HX1500i
Mouse Harpe Ace Aim Lab Edition
Keyboard Scope II 96 Wireless
Software Windows 11 23H2 / Fedora w. KDE
A new GTX 480 it seems ..... I smell fried eggs :laugh:
 
Joined
Jul 31, 2024
Messages
412 (3.07/day)
120000W are not easy to cool, when some people have problems cooling a 250W CPU...

Please stick to the metric prefixes -> 10^3 = 1000 = kilo = k

up to 120 kW vs up to 0.35 kW for an Intel CPU (according to some tech news; I think I have also read 447 W = 0.447 kW)


It all depends on the transistor count. I do not know the NVIDIA processor in question that well.
 
Joined
Dec 28, 2012
Messages
3,909 (0.90/day)
System Name Skunkworks 3.0
Processor 5800x3d
Motherboard x570 unify
Cooling Noctua NH-U12A
Memory 32GB 3600 mhz
Video Card(s) asrock 6800xt challenger D
Storage Sabarent rocket 4.0 2TB, MX 500 2TB
Display(s) Asus 1440p144 27"
Case Old arse cooler master 932
Power Supply Corsair 1200w platinum
Mouse *squeak*
Keyboard Some old office thing
Software Manjaro
Wasn't this so easy to see coming?!

And even if these problems are solved this generation, what about the next one? Rubin on 2 or 3 nm consuming 2000W per GPU? What is our species thinking?
Our species is thinking "hmm, with double the power we can do triple the work. FANTASTIC"

Don't worry, let the engineers handle it.
 
Joined
Jan 3, 2021
Messages
3,562 (2.48/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
120000W are not easy to cool, when some people have problems cooling a 250W CPU...
Just imagine you had an empty 42U server rack and a pile of 300 unboxed 4090 GPUs next to it, with an assignment to stack them all inside the rack. It would be possible but little empty space would remain.

Wasn't this so easy to see coming?!

And even if these problems are solved this generation, what about the next one? Rubin on 2 or 3 nm consuming 2000W per GPU? What is our species thinking?
Immersion cooling in molten bitumen?
 
Joined
Oct 12, 2005
Messages
709 (0.10/day)
120000W are not easy to cool, when some people have problems cooling a 250W CPU...
We can cool cars that dissipate more heat than that. They do run at higher temperatures, but still. (Look at truck towing tests where they drive at maximum load uphill.) Some could say we cool megawatts of thermal energy in power plant cooling towers.

So the amount of heat is not really the problem. The density doesn't help, but engines are smaller than a rack. The main issue is probably trying to cool this with a very small delta from ambient. You would need a lot of flow in your water loop.
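
To put a rough number on the flow point, here is a minimal back-of-the-envelope sketch using Q = m_dot * c_p * delta_T. It assumes plain water as the coolant and that the entire 120 kW from the article ends up in the liquid loop; both are simplifications, and the function name and the delta-T values are picked only for illustration.

```python
# Heat carried by a coolant loop: Q = m_dot * c_p * delta_T
# Assumptions (not from the article): pure water, all rack heat goes into the loop.
WATER_CP_J_PER_KG_K = 4186.0    # approximate specific heat of liquid water
WATER_DENSITY_KG_PER_L = 1.0    # approximate density near room temperature


def required_flow_l_per_min(heat_w: float, delta_t_k: float) -> float:
    """Water flow in litres per minute needed to carry heat_w watts
    when the coolant warms up by delta_t_k kelvin across the rack."""
    if delta_t_k <= 0:
        raise ValueError("delta_t_k must be positive")
    mass_flow_kg_per_s = heat_w / (WATER_CP_J_PER_KG_K * delta_t_k)
    return mass_flow_kg_per_s / WATER_DENSITY_KG_PER_L * 60.0


RACK_HEAT_W = 120_000  # "up to 120 kW" per rack, as quoted in the article

if __name__ == "__main__":
    # Hypothetical coolant temperature rises, chosen only to show the trend.
    for delta_t in (5, 10, 20):
        flow = required_flow_l_per_min(RACK_HEAT_W, delta_t)
        print(f"delta_T = {delta_t:>2} K -> {flow:.0f} L/min")
```

With a 10 K rise this works out to roughly 172 L/min for one rack, and halving the allowed temperature rise doubles the required flow, which is why a small delta from ambient is the hard part.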
 
Joined
Dec 25, 2020
Messages
6,934 (4.79/day)
Location
São Paulo, Brazil
System Name "Icy Resurrection"
Processor 13th Gen Intel Core i9-13900KS Special Edition
Motherboard ASUS ROG MAXIMUS Z790 APEX ENCORE
Cooling Noctua NH-D15S upgraded with 2x NF-F12 iPPC-3000 fans and Honeywell PTM7950 TIM
Memory 32 GB G.SKILL Trident Z5 RGB F5-6800J3445G16GX2-TZ5RK @ 7600 MT/s 36-44-44-52-96 1.4V
Video Card(s) ASUS ROG Strix GeForce RTX™ 4080 16GB GDDR6X White OC Edition
Storage 500 GB WD Black SN750 SE NVMe SSD + 4 TB WD Red Plus WD40EFPX HDD
Display(s) 55-inch LG G3 OLED
Case Pichau Mancer CV500 White Edition
Power Supply EVGA 1300 G2 1.3kW 80+ Gold
Mouse Microsoft Classic Intellimouse
Keyboard Generic PS/2
Software Windows 11 IoT Enterprise LTSC 24H2
Benchmark Scores I pulled a Qiqi~
Our species is thinking "hmm, with double the power we can do triple the work. FANTASTIC"

Don't worry, let the engineers handle it.

Double the power and triple the work is actually a very good deal, efficiency-wise. All recent GPU designs, irrespective of vendor, raised their nominal power to maximize their generational improvement; it's a sign that Moore's law is slowing down.

Greed blinds all men it turns out, even the smart ones. I guess the ancient philosophers were right all along.

The machines are obviously built to the demand and specification of the clients. There's no "greed" here beyond the usual corporate behavior, and we're enabling them by using the services they provide.
 

Space Lynx

Astronaut
Joined
Oct 17, 2014
Messages
17,392 (4.69/day)
Double the power and triple the work is actually a very good deal, efficiency-wise. All recent GPU designs, irrespective of vendor, raised their nominal power to maximize their generational improvement; it's a sign that Moore's law is slowing down.



The machines are obviously built to the demand and specification of the clients. There's no "greed" here beyond the usual corporate behavior, and we're enabling them by using the services they provide.

The strange thing is, they aren't really making money off the Common Man. ChatGPT does not show me any ads, does not cost me anything, and it's the only AI that the Common Man uses.
 
Joined
Dec 25, 2020
Messages
6,934 (4.79/day)
The strange thing is, they aren't really making money off the Common Man. ChatGPT does not show me any ads, does not cost me anything, and it's the only AI that the Common Man uses.

Well, the companies that order systems like these are either in big tech or specialized AI vendors that power the AI engines that go in just about every platform nowadays, ranging from Google, Apple, Microsoft, OpenAI, Meta, X, etc. to even relatively small, local businesses that now use LLMs in their portfolio. For example, I subscribe to an educational resource to help me with law school and they recently added an AI prompt for it to evaluate and solve complex questions, and I gotta say, it's actually remarkably good at it. The servers that process things like these are all provided by one of those big companies.

AI is... everywhere nowadays, it would seem. It's every bit as big an industry as it's claimed to be, and I'd wager it went even farther than the cryptocurrency thing ever hoped to.
 
Joined
Aug 20, 2007
Messages
21,503 (3.40/day)
System Name Pioneer
Processor Ryzen R9 9950X
Motherboard GIGABYTE Aorus Elite X670 AX
Cooling Noctua NH-D15 + A whole lotta Sunon and Corsair Maglev blower fans...
Memory 64GB (4x 16GB) G.Skill Flare X5 @ DDR5-6000 CL30
Video Card(s) XFX RX 7900 XTX Speedster Merc 310
Storage Intel 905p Optane 960GB boot, +2x Crucial P5 Plus 2TB PCIe 4.0 NVMe SSDs
Display(s) 55" LG 55" B9 OLED 4K Display
Case Thermaltake Core X31
Audio Device(s) TOSLINK->Schiit Modi MB->Asgard 2 DAC Amp->AKG Pro K712 Headphones or HDMI->B9 OLED
Power Supply FSP Hydro Ti Pro 850W
Mouse Logitech G305 Lightspeed Wireless
Keyboard WASD Code v3 with Cherry Green keyswitches + PBT DS keycaps
Software Gentoo Linux x64 / Windows 11 Enterprise IoT 2024
A new GTX 480 it seems ..... I smell fried eggs :laugh:
Please. I LIKED Fermi, but these new wattages cannot even be compared. They are madness.

The machines are obviously built to the demand and specification of the clients. There's no "greed" here beyond the usual corporate behavior
So greed, got it.
 
Joined
Dec 25, 2020
Messages
6,934 (4.79/day)
Joined
Aug 20, 2007
Messages
21,503 (3.40/day)
As long as we are capitalists, greed is the driving force behind innovation :)
True, but just pointing out the obvious I guess, heh.

Also, greed-driven innovation sometimes needs some tempering, or our planet will end up with both a lot of issues and an energy crisis. Lookin' at you, AI chatbot.
 