Wednesday, October 9th 2024

NVIDIA "Blackwell" GB200 Server Dedicates Two-Thirds of Space to Cooling at Microsoft Azure

Late Tuesday, Microsoft Azure shared an interesting picture on the social media platform X, showcasing the pinnacle of GPU-accelerated servers: NVIDIA "Blackwell" GB200-powered AI systems. Microsoft is one of NVIDIA's largest customers and often receives products first to integrate into its cloud and internal infrastructure. NVIDIA even takes feedback from companies like Microsoft into account when designing future products, especially ones like the now-canceled NVL36x2 system. The picture below shows a massive cluster in which compute occupies roughly one-third of the system, while the remaining two-thirds is dedicated to closed-loop liquid cooling.

The entire system is connected using InfiniBand networking, a standard for GPU-accelerated clusters thanks to its low-latency packet transfer. While details are scarce, we can see that the integrated closed-loop liquid cooling allows the GPU servers to fit in a 1U form factor for increased density. Since these systems will go into the wider Microsoft Azure data centers, they need to be easy to maintain and cool. There are limits to the power draw and heat output that Microsoft's data centers can handle, so systems like this are typically built to internal specifications that Microsoft defines. More compute-dense configurations exist, of course, such as NVIDIA's NVL72, but hyperscalers usually opt for custom solutions that fit their data center specifications. Finally, Microsoft noted that more details about its GB200-powered AI systems will be shared at the upcoming Microsoft Ignite conference in November.
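To get a rough sense of why cooling dominates the floor space, consider a simple heat-balance estimate. This is a minimal sketch with illustrative assumptions only; per-rack power in the GB200 NVL class has been quoted at roughly 120 kW, but Azure has not disclosed its exact configuration or coolant temperatures.

# Back-of-envelope coolant flow for a liquid-cooled GB200-class rack.
# All numbers are illustrative assumptions, not Azure's actual figures.

rack_power_kw = 120.0         # assumed IT load per rack (GB200 NVL72-class)
coolant_delta_t_c = 10.0      # assumed coolant temperature rise across the rack, in degrees C
water_specific_heat = 4186.0  # J/(kg*K)
water_density = 1000.0        # kg/m^3

# Energy balance: P = m_dot * c_p * dT  ->  m_dot = P / (c_p * dT)
mass_flow_kg_s = (rack_power_kw * 1000.0) / (water_specific_heat * coolant_delta_t_c)
volume_flow_l_min = mass_flow_kg_s / water_density * 1000.0 * 60.0

print(f"Coolant flow needed: {mass_flow_kg_s:.2f} kg/s (~{volume_flow_l_min:.0f} L/min) per rack")

At around 170 L/min per rack under these assumptions, the pumps, manifolds, and heat exchangers needed to move and reject that heat quickly add up to the kind of footprint seen in the photo.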
Source: Microsoft on X

11 Comments on NVIDIA "Blackwell" GB200 Server Dedicates Two-Thirds of Space to Cooling at Microsoft Azure

#1
Carillon
Water costs less than wells
Posted on Reply
#2
StimpsonJCat
This is what happens when you take the easy option and do not make architectural changes and smarter designs, and just overclock and overvolt for a "free upgrade". NV haven't made any major architectural updates to their GPU for many years now - they just bolt on more of the same, max it up to the reticle limit, then OC it to meet the performance goal. Very cheap and fast to do, but we end up with this monstrosity.

NV will need to actually come up with a new architecture to move the needle on the next chip, as TSMC is at their limits now, and nothing new that can manufacture a GPU of this size for NV is coming for at least another two years.

NV really need to separate their AI and GPU business and make optimized versions of each.
Posted on Reply
#3
JIWIL
So how long before the cooling needs of our AI datacenters can provide steam turbine power for our industry needs to provide more AI power to power our AI overlords?
Posted on Reply
#4
LittleBro
Excuse me, what other purpose do these chips serve except generating heat? Well, if they power Microsoft's Copilot-like stuff, LLMs and generative AI, then the heat is the better purpose. As they say in GoT: "Winter is coming".
Posted on Reply
#6
Endymio
StimpsonJCat: This is what happens when you take the easy option and do not make architectural changes and smarter designs, and just overclock and overvolt for a "free upgrade".
What smart "architectural changes" would you make? Be specific, with calculated details on their effects on manufacturing costs, yield rates, and power:performance ratios.
Posted on Reply
#7
TheinsanegamerN
StimpsonJCat: This is what happens when you take the easy option and do not make architectural changes and smarter designs, and just overclock and overvolt for a "free upgrade". NV haven't made any major architectural updates to their GPU for many years now - they just bolt on more of the same, max it up to the reticle limit, then OC it to meet the performance goal. Very cheap and fast to do, but we end up with this monstrosity.

NV will need to actually come up with a new architecture to move the needle on the next chip, as TSMC is at their limits now, and nothing new that can manufacture a GPU of this size for NV is coming for at least another two years.

NV really need to separate their AI and GPU business and make optimized versions of each.
No arch changes? Really? You're saying that Ampere, Ada, and Pascal are the same now?

:laugh::roll::laugh::banghead::laugh::roll::laugh:
JIWIL: So how long before the cooling needs of our AI datacenters can provide steam turbine power for our industry needs to provide more AI power to power our AI overlords?
Sadly never, because these chips don't have anywhere near the thermal output or maximum temperature needed to make high-pressure steam.
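A quick Carnot-limit sketch illustrates why low-grade coolant heat is a poor fit for steam turbines; the temperatures here are illustrative assumptions, not measured values.

# Carnot-limit sketch: why warm coolant water can't usefully drive a steam turbine.
# All temperatures are illustrative assumptions.

def carnot_efficiency(t_hot_c: float, t_cold_c: float) -> float:
    """Maximum theoretical efficiency of a heat engine between two reservoirs."""
    return 1.0 - (t_cold_c + 273.15) / (t_hot_c + 273.15)

coolant_return_c = 55.0   # assumed liquid-cooling return temperature
ambient_c = 25.0          # assumed cold-side (condenser/ambient) temperature
steam_inlet_c = 550.0     # typical superheated-steam turbine inlet, for comparison

print(f"Datacenter coolant: {carnot_efficiency(coolant_return_c, ambient_c):.1%} theoretical max")
print(f"Steam power plant:  {carnot_efficiency(steam_inlet_c, ambient_c):.1%} theoretical max")

Under those assumptions the coolant loop tops out below 10% theoretical efficiency versus over 60% for a real steam cycle, before any practical losses.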
Posted on Reply
#8
Wirko
Beertintedgoggles: Anyone else notice the towel at the bottom of the radiator?
I'm afraid this is not even a radiator, just a water-water heat exchanger. The thick pipes at the top connect to the really big radiator outside the building.
Posted on Reply
#9
Daven
Beertintedgoggles: Anyone else notice the towel at the bottom of the radiator?
Some engineer must be a fan of Hitchhiker’s Guide to the Galaxy or there is a leak.
Posted on Reply
#10
phanbuey
Wirko: I'm afraid this is not even a radiator, just a water-water heat exchanger. The thick pipes at the top connect to the really big radiator outside the building.
AFAIK the Cornell datacenter uses one of the Finger Lakes as a reservoir for the second part of that -- I'm sure there are others that do this.
Posted on Reply
#11
SoppingClam
Get a diploma in AI refrigeration mechanical engineering maintenance for new multi-point failure water cooling server farms. AI. It's hip!
Posted on Reply