Meta Shows Open-Architecture NVIDIA "Blackwell" GB200 System for Data Center

AleksandarK · Oct 18, 2024

During the Open Compute Project (OCP) Summit 2024, Meta, one of the prime members of the OCP, showed its NVIDIA "Blackwell" GB200 systems for its massive data centers. We previously covered Microsoft's Azure server rack with GB200 GPUs, which dedicates one-third of the rack space to computing and two-thirds to cooling. A few days later, Google showed off its smaller GB200 system, and today, Meta is showing off its GB200 system, the smallest of the bunch. To train a dense transformer large language model with 405B parameters and a context window of up to 128k tokens, like the Llama 3.1 405B, Meta must redesign its data center infrastructure to run a distributed training job on two 24,000 GPU clusters. That is 48,000 GPUs used for training a single AI model.

Called "Catalina," the system is built on the NVIDIA Blackwell platform, emphasizing modularity and adaptability while incorporating the latest NVIDIA GB200 Grace Blackwell Superchip. To address the escalating power requirements of GPUs, Catalina introduces the Orv3, a high-power rack capable of delivering up to 140kW. The comprehensive liquid-cooled setup encompasses a power shelf supporting various components, including a compute tray, switch tray, the Orv3 HPR, Wedge 400 fabric switch with 12.8 Tbps switching capacity, management switch, battery backup, and a rack management controller. Interestingly, Meta also upgraded its "Grand Teton" system with AMD Instinct MI300X for internal workloads such as deep learning recommendation models (DLRMs) and content understanding. Those systems are used to run inference on internal models, and the MI300X appears to provide the best performance per dollar for inference. According to Meta, the computational demand stemming from AI will continue to increase exponentially, so more NVIDIA and AMD GPUs are needed, and we can't wait to see what the company builds.

MacZ · Oct 18, 2024

we can't wait to see what the company builds.

We have yet to see what productivity improvement all these investments lead to.

Right now, this seems to be an exercise in buying as much hardware as possible.

And as for AI for consumers:
- It didn't do much for Apple's iPhones, it seems.
- Nobody forced electricity on people for it to succeed.
- Nobody forced the car on people for it to succeed.

Just saying.

Aquinus · Oct 18, 2024

MacZ said:
Right now, this seems to be an exercise in buying as much hardware as possible.

...and roasting as much power as possible on it.

AleksandarK said:
a high-power rack capable of delivering up to 140kW.

You could run a neighborhood off of 140kW. The average US household uses 30kWh a day. 140kW sustained for 24 hours is 3,360kWh. That's 112 average households that fit within that power budget, assuming they all draw at a constant rate over time. Either way, you get the idea.
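
For anyone who wants to check the math, here is a rough sketch in Python. It only uses the numbers from the post (140kW per rack, 30kWh per household per day) and assumes the rack runs at its full rated power around the clock, which is a simplification. The function and constant names are made up for illustration.

```python
# Back-of-the-envelope check of the rack vs. household comparison.
# Assumption: the rack draws its full rated power for all 24 hours.
RACK_POWER_KW = 140          # rack power figure quoted from the article
HOUSEHOLD_KWH_PER_DAY = 30   # average US household figure used in the post
HOURS_PER_DAY = 24


def households_per_rack(rack_kw: float, household_kwh_per_day: float) -> float:
    """Return how many average households use the same energy per day
    as one rack running at rack_kw for a full day."""
    rack_kwh_per_day = rack_kw * HOURS_PER_DAY
    return rack_kwh_per_day / household_kwh_per_day


rack_kwh = RACK_POWER_KW * HOURS_PER_DAY
print(f"Rack energy per day: {rack_kwh:,} kWh")  # Rack energy per day: 3,360 kWh
print(f"Equivalent households: {households_per_rack(RACK_POWER_KW, HOUSEHOLD_KWH_PER_DAY)}")  # Equivalent households: 112.0
```

Put another way, 30kWh a day is an average draw of 1.25kW per household, and 140 / 1.25 gives the same 112.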

igormp · Oct 18, 2024

MacZ said:
We have yet to see what productivity improvement all these investments lead to.

Right now, this seems to be an exercise in buying as much hardware as possible.

Meta does tons of research and their team does make use of those GPUs quite a lot. All the Llama models out there, along with the tons of models they made freely available are proof of that:

https://huggingface.co/facebook

MacZ · Oct 18, 2024

igormp said:
Meta does tons of research and their team does make use of those GPUs quite a lot. All the Llama models out there, along with the tons of models they made freely available are proof of that:

https://huggingface.co/facebook

Money spent and money returned.

Spending a lot of money is the easy part. Everybody helps you to do it.

Meta also spent a ton of money on the Metaverse. How is it going?

igormp · Oct 18, 2024

MacZ said:
Money spent and money returned.

Spending a lot of money is the easy part. Everybody helps you to do it.

Meta also spent a ton of money on the Metaverse. How is it going?

Their financials seem to be doing great:

Meta Reports Second Quarter 2024 Results

Meta Platforms, Inc. (Nasdaq: META) today reported financial results for the quarter ended June 30, 2024. "We had a strong quarter, and Meta AI is on track to be the most used AI assistant in the world by the end of the year," said Mark Zuckerberg, Meta founder and CEO. "We've released the first...

investor.fb.com

I don't think it's as easy as saying "we have X many more GPUs, our revenue increased Y% because of that".
R&D is hard to measure in such a way.

MacZ · Oct 18, 2024

igormp said:
Their financials seem to be doing great:

Meta Reports Second Quarter 2024 Results

Meta Platforms, Inc. (Nasdaq: META) today reported financial results for the quarter ended June 30, 2024. "We had a strong quarter, and Meta AI is on track to be the most used AI assistant in the world by the end of the year," said Mark Zuckerberg, Meta founder and CEO. "We've released the first...

investor.fb.com

I don't think it's as easy as saying "we have X many more GPUs, our revenue increased Y% because of that".
R&D is hard to measure in such a way.

That's why I asked what productivity improvement was made.

And your lack of answer to this simple question means it is, for now, none.

What we have is a lot of FOMO investing throwing money around, hoping something sticks and that their money sticks to the next Amazon and not the next pets.com.

ML is very useful in a lot of domains (the IRS used it to detect fraud to great effect recently). But LLMs are very resource intensive without any clear path to profit.

The AI gnomes:
1. LLM
2. ???
3. Profit

For some reason, when people/websites talk about AI, they only talk about LLMs.

igormp · Oct 18, 2024

MacZ said:
That's why I asked what productivity improvement was made.

And your lack of answer to this simple question means it is, for now, none.

My lack of a proper answer is simply because I do not work at Meta, I can't say what difference it makes internally.

MacZ said:
ML is very useful in a lot of domains (the IRS used it to detect fraud to great effect recently). But LLMs are very resource intensive without any clear path to profit.

Meta works on many models other than LLMs, just see their Hugging Face page. I've personally used many of their vision models.
In their own post, they do mention how this platform will be used for their recommendation models, which I'm pretty sure is what drives most of their earnings in one way or another. But, again, I can't put those into numbers since I don't work there.

MacZ · Oct 18, 2024

igormp said:
My lack of a proper answer is simply because I do not work at Meta, I can't say what difference it makes internally.

If any LLMs made any real difference, it would be public knowledge by now, because there are a lot of models freely or cheaply available.

igormp said:
Meta works on many models other than LLMs, just see their Hugging Face page. I've personally used many of their vision models.
In their own post, they do mention how this platform will be used for their recommendation models, which I'm pretty sure is what drives most of their earnings in one way or another. But, again, I can't put those into numbers since I don't work there.

Yet I doubt a recommendation model requires 50,000 GPUs.

Zareek · Oct 18, 2024

The new way to waste massive amounts of energy!

Solaris17 · Oct 19, 2024

MacZ said:
If any LLMs made any real difference, it would be public knowledge by now, because there are a lot of models freely or cheaply available.

Yet I doubt a recommendation model requires 50,000 GPUs.

It doesn't matter. Your basis is fundamentally flawed. You compared Meta AI to the likes of:

MacZ said:
And as for AI for consumers:
- It didn't do much for Apple's iPhones, it seems.
- Nobody forced electricity on people for it to succeed.
- Nobody forced the car on people for it to succeed.

But Apple told you Siri was useful, Google told you AI was useful, Microsoft told you Cortana was useful.

Meta is making AI for themselves and their ecosystem.

They don't owe you anything. They want it to be more, but they aren't "shoving it in your face." I don't have Llama on my phone, I don't have Meta AI on my PC, and it isn't trying to control my thermostat.

Besides, as mentioned, they also give a lot back, as the article suggests. Not only do they provide their pre-trained models, but all the hardware is OCP. You're not paying Dell, Nvidia, HPE, etc. to model your DC.

Open Compute Project

www.opencompute.org
