Friday, October 18th 2024
Meta Shows Open-Architecture NVIDIA "Blackwell" GB200 System for Data Center
During the Open Compute Project (OCP) Summit 2024, Meta, one of the prime members of the OCP project, showed its NVIDIA "Blackwell" GB200 systems for its massive data centers. We previously covered Microsoft's Azure server rack with GB200 GPUs, which dedicates one-third of the rack space to computing and two-thirds to cooling. A few days later, Google showed off its smaller GB200 system, and today, Meta is showing off its GB200 system—the smallest of the bunch. To train a dense transformer large language model with 405B parameters and a context window of up to 128K tokens, like Llama 3.1 405B, Meta had to redesign its data center infrastructure to run a distributed training job across two 24,000-GPU clusters. That is 48,000 GPUs used for training a single AI model.
Called "Catalina," the system is built on the NVIDIA Blackwell platform, emphasizing modularity and adaptability while incorporating the latest NVIDIA GB200 Grace Blackwell Superchip. To address the escalating power requirements of GPUs, Catalina introduces the Orv3, a high-power rack capable of delivering up to 140 kW. The comprehensive liquid-cooled setup encompasses a power shelf supporting various components, including a compute tray, switch tray, the Orv3 HPR, a Wedge 400 fabric switch with 12.8 Tbps of switching capacity, a management switch, battery backup, and a rack management controller. Interestingly, Meta also upgraded its "Grand Teton" system for internal usage, such as deep learning recommendation models (DLRMs) and content understanding, with the AMD Instinct MI300X. Those accelerators run inference on internal models, and the MI300X appears to provide the best performance per dollar for inference. According to Meta, the computational demand stemming from AI will continue to increase exponentially, so more NVIDIA and AMD GPUs are needed, and we can't wait to see what the company builds.
Source:
Meta
11 Comments on Meta Shows Open-Architecture NVIDIA "Blackwell" GB200 System for Data Center
Right now, this seems to be an exercise in buying as much hardware as possible.
And as for AI for consumers:
- It didn't do much for Apple's iPhones, it seems.
- Nobody forced electricity on people for it to succeed.
- Nobody forced the car on people for it to succeed.
Just saying.
huggingface.co/facebook
huggingface.co/meta-llama
Spending a lot of money is the easy part. Everybody helps you to do it.
Meta also spent a ton of money on the Metaverse. How is that going?
investor.fb.com/investor-news/press-release-details/2024/Meta-Reports-Second-Quarter-2024-Results/default.aspx
I don't think it's as easy as saying "we have X many more GPUs, our revenue increased Y% because of that".
R&D is hard to measure that way.
And your lack of answer to this simple question means it is, for now, none.
What we have is a lot of FOMO investing throwing money around, hoping something sticks and that their money sticks to the next Amazon and not the next pets.com.
ML is very useful in a lot of domains (the IRS used it to detect fraud to great effect recently). But LLMs are very resource intensive without any clear path to profit.
The AI gnomes:
1. LLM
2. ???
3. Profit
For some reason, when people/websites talk about AI, they only talk about LLMs.
In their own post, they do mention how this platform will be used for their recommendation models, which I'm pretty sure is what drives most of their earnings in one way or another. But, again, I can't put those into numbers since I don't work there.
Meta is making AI for themselves and their ecosystem.
They don't owe you anything. They want it to be more, but they aren't "shoving it in your face." I don't have Llama on my phone, I don't have Meta AI on my PC, and it isn't trying to control my thermostat.
Besides, as mentioned, they also give a lot back, as the article suggests. Not only do they provide their pre-trained models, but all the hardware is OCP. You're not paying Dell, NVIDIA, HPE, etc. to model your DC.
www.opencompute.org/search?site_search%5Bquery%5D=meta
github.com/facebook
I mean, I don't use any of their products, but if you're going to compare, you're off base.
The AI gnomes' equation is:
Phase 1: LLM
Phase 2a: Detect what speech is unwanted and has to be censored (by whom?)
Phase 2b: Inject new speech that serves the purpose (of whom?)
Phase 3: Profit by not being fined/regulated into oblivion.
And that aligns perfectly with MS trying its hardest to force Recall on its users, making damn sure that it will be next to impossible to remove.
No for-profit company will "give back" anything for free, especially if it could be used by competitors.
Enjoy the shreds of democracy that are left.