Friday, October 18th 2024

Meta Shows Open-Architecture NVIDIA "Blackwell" GB200 System for Data Center

During the Open Compute Project (OCP) Summit 2024, Meta, one of the prime members of the OCP, showed off its NVIDIA "Blackwell" GB200 systems for its massive data centers. We previously covered Microsoft's Azure server rack with GB200 GPUs, which dedicates one-third of the rack space to computing and two-thirds to cooling. A few days later, Google showed off its smaller GB200 system, and today Meta is showing its GB200 system, the smallest of the bunch. To train a dense transformer large language model with 405 billion parameters and a context window of up to 128K tokens, such as Llama 3.1 405B, Meta had to redesign its data center infrastructure to run a distributed training job across two 24,000-GPU clusters. That is 48,000 GPUs used for training a single AI model.

Called "Catalina," Meta's system is built on the NVIDIA Blackwell platform, emphasizing modularity and adaptability while incorporating the latest NVIDIA GB200 Grace Blackwell Superchip. To address the escalating power requirements of GPUs, Catalina introduces the Orv3, a high-power rack capable of delivering up to 140kW. The comprehensive liquid-cooled setup encompasses a power shelf supporting various components, including a compute tray, switch tray, the Orv3 HPR, a Wedge 400 fabric switch with 12.8 Tbps of switching capacity, a management switch, battery backup, and a rack management controller. Interestingly, Meta also upgraded its "Grand Teton" system for internal usage, such as deep learning recommendation models (DLRMs) and content understanding, with AMD Instinct MI300X accelerators. Those are used to run inference on internal models, and the MI300X appears to provide the best performance per dollar for inference. According to Meta, the computational demand stemming from AI will continue to increase exponentially, so more NVIDIA and AMD GPUs are needed, and we can't wait to see what the company builds.
Source: Meta
Add your own comment

11 Comments on Meta Shows Open-Architecture NVIDIA "Blackwell" GB200 System for Data Center

#1
MacZ
we can't wait to see what the company builds.
We have yet to see what productivity improvement all these investments lead to.

Right now, this seems to be an exercise in buying as much hardware as possible.

And as for AI for consumers:
- It didn't seem to do much for Apple's iPhones.
- Nobody forced electricity on people for it to succeed.
- Nobody forced the car on people for it to succeed.

Just saying.
Posted on Reply
#2
Aquinus
Resident Wat-man
MacZ: Right now, this seems to be an exercise in buying as much hardware as possible.
...and roasting as much power as possible on it.
AleksandarK: a high-power rack capable of delivering up to 140kW.
You could run a neighborhood off of 140kW. The average US household uses about 30kWh a day. 140kW sustained for 24 hours is 3,360kWh, so 112 average households fit within that power budget, assuming they all draw at a constant rate over time. Either way, you get the idea.
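The back-of-envelope math above can be sketched in a few lines; note the 30kWh/day household figure is the commenter's assumption, not an official statistic:

```python
# How many average US households fit in one 140 kW rack's power budget?
rack_kw = 140                  # Orv3 rack power budget from the article (kW)
household_kwh_per_day = 30     # assumed average US household daily use (kWh)

# 140 kW sustained for 24 hours
rack_kwh_per_day = rack_kw * 24
households = rack_kwh_per_day / household_kwh_per_day

print(rack_kwh_per_day)   # 3360
print(int(households))    # 112
```

This assumes the rack runs at its full 140kW limit around the clock; real sustained draw would be somewhat lower.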
#3
igormp
MacZ: We have yet to see what productivity improvement all these investments lead to.

Right now, this seems to be an exercise in buying as much hardware as possible.
Meta does tons of research, and their team makes heavy use of those GPUs. All the Llama models out there, along with the many models they have made freely available, are proof of that:
huggingface.co/facebook
huggingface.co/meta-llama
#4
MacZ
igormp: Meta does tons of research, and their team makes heavy use of those GPUs. All the Llama models out there, along with the many models they have made freely available, are proof of that:
huggingface.co/facebook
huggingface.co/meta-llama
Money spent and money returned.

Spending a lot of money is the easy part. Everybody helps you to do it.

Meta also spent a ton of money on the Metaverse. How is it going?
#5
igormp
MacZ: Money spent and money returned.

Spending a lot of money is the easy part. Everybody helps you to do it.

Meta also spent a ton of money on the Metaverse. How is it going?
Their financials seem to be doing great:
investor.fb.com/investor-news/press-release-details/2024/Meta-Reports-Second-Quarter-2024-Results/default.aspx

I don't think it's as easy as saying "we have X many more GPUs, our revenue increased Y% because of that".
R&D is hard to measure in that way.
#6
MacZ
igormp: Their financials seem to be doing great:
investor.fb.com/investor-news/press-release-details/2024/Meta-Reports-Second-Quarter-2024-Results/default.aspx

I don't think it's as easy as saying "we have X many more GPUs, our revenue increased Y% because of that".
R&D is hard to measure in that way.
That's why I asked what productivity improvement was made.

And your lack of an answer to this simple question means it is, for now, none.

What we have is a lot of FOMO investing, throwing money around and hoping it sticks to the next Amazon and not the next Pets.com.

ML is very useful in a lot of domains (the IRS used it to detect fraud to great effect recently). But LLMs are very resource-intensive without any clear path to profit.

The AI gnomes:
1. LLM
2. ???
3. Profit

For some reason, when people/websites talk about AI, they only talk about LLMs.
#7
igormp
MacZ: That's why I asked what productivity improvement was made.

And your lack of an answer to this simple question means it is, for now, none.
My lack of a proper answer is simply because I do not work at Meta; I can't say what difference it makes internally.
MacZ: ML is very useful in a lot of domains (the IRS used it to detect fraud to great effect recently). But LLMs are very resource-intensive without any clear path to profit.
Meta works on many models other than LLMs; just see their Hugging Face page. I've personally used many of their vision models.
In their own post, they do mention how this platform will be used for their recommendation models, which I'm pretty sure is what drives most of their earnings in one way or another. But, again, I can't put those into numbers since I don't work there.
#8
MacZ
igormp: My lack of a proper answer is simply because I do not work at Meta; I can't say what difference it makes internally.
If any LLMs made any real difference, it would be public knowledge by now, because there are a lot of models freely or cheaply available.
igormp: Meta works on many models other than LLMs; just see their Hugging Face page. I've personally used many of their vision models.
In their own post, they do mention how this platform will be used for their recommendation models, which I'm pretty sure is what drives most of their earnings in one way or another. But, again, I can't put those into numbers since I don't work there.
Yet I doubt a recommendation model requires 50,000 GPUs.
#9
Zareek
The new way to waste massive amounts of energy!
#10
Solaris17
Super Dainty Moderator
MacZ: If any LLMs made any real difference, it would be public knowledge by now, because there are a lot of models freely or cheaply available.



Yet I doubt a recommendation model requires 50,000 GPUs.
It doesn't matter. Your basis is fundamentally flawed. You compared Meta AI to the likes of:
MacZ: And as for AI for consumers:
- It didn't seem to do much for Apple's iPhones.
- Nobody forced electricity on people for it to succeed.
- Nobody forced the car on people for it to succeed.
But Apple told you Siri was useful, Google told you AI was useful, Microsoft told you Cortana was useful.

Meta is making AI for themselves and their ecosystem.



They don't owe you anything. They want it to be more, but they aren't "shoving it in your face." I don't have Llama on my phone, I don't have Meta AI on my PC, and it isn't trying to control my thermostat.

Besides, as mentioned, they also give a lot back, as the article suggests. Not only do they provide their pre-trained models, but all the hardware is OCP. You're not paying Dell, NVIDIA, HPE, etc. to design your DC.

www.opencompute.org/search?site_search%5Bquery%5D=meta

github.com/facebook

I mean, I don't use any of their products, but if you're going to compare, you're off base.
#11
MacZ
Solaris17: They don't owe you anything. They want it to be more, but they aren't "shoving it in your face." I don't have Llama on my phone, I don't have Meta AI on my PC, and it isn't trying to control my thermostat.

Besides, as mentioned, they also give a lot back, as the article suggests. Not only do they provide their pre-trained models, but all the hardware is OCP. You're not paying Dell, NVIDIA, HPE, etc. to design your DC.
Yeah, no.

The AI gnomes' equation is:

Phase 1: LLM
Phase 2a: Detect which speech is unwanted and has to be censored (by whom?)
Phase 2b: Inject new speech that serves the purpose (whose purpose?)
Phase 3: Profit by not being fined/regulated into oblivion.

And that aligns perfectly with MS trying its hardest to force Recall on its users, making damn sure that it will be next to impossible to remove.

No for-profit company will "give back" anything for free, especially if it could be used by competitors.

Enjoy the shreds of democracy that are left.