
NVIDIA Ethernet Networking Accelerates World's Largest AI Supercomputer, Built by xAI

AleksandarK
News Editor
NVIDIA today announced that xAI's Colossus supercomputer cluster, comprising 100,000 NVIDIA Hopper GPUs in Memphis, Tennessee, achieved this massive scale by using the NVIDIA Spectrum-X Ethernet networking platform for its Remote Direct Memory Access (RDMA) network. Spectrum-X is designed to deliver superior performance to multi-tenant, hyperscale AI factories over standards-based Ethernet.

Colossus, the world's largest AI supercomputer, is being used to train xAI's Grok family of large language models, with chatbots offered as a feature for X Premium subscribers. xAI is in the process of doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs.

The supporting facility and state-of-the-art supercomputer were built by xAI and NVIDIA in just 122 days, rather than the many months to years typical for systems of this size. It took 19 days from the time the first rack rolled onto the floor until training began.

While training the extremely large Grok model, Colossus achieves unprecedented network performance. Across all three tiers of the network fabric, the system has experienced zero application-latency degradation or packet loss due to flow collisions, and it has maintained 95% data throughput, enabled by Spectrum-X congestion control.

This level of performance cannot be achieved at scale with standard Ethernet, which creates thousands of flow collisions while delivering only 60% data throughput.
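
To put those utilization figures in perspective, here is a quick back-of-the-envelope sketch. The 400 Gb/s per-GPU link rate is an assumption for illustration only (the article does not state per-node link speeds); the 95% and 60% figures are the ones quoted above.

```python
# Rough aggregate-bandwidth comparison at the two utilization levels
# cited above. The 400 Gb/s per-GPU link rate is an assumed figure,
# used for illustration only -- the article does not state link speeds.
GPUS = 100_000
LINK_GBPS = 400  # assumed RDMA link rate per GPU

for label, utilization in [("Spectrum-X", 0.95), ("standard Ethernet", 0.60)]:
    effective_pbps = GPUS * LINK_GBPS * utilization / 1_000_000
    print(f"{label:>17}: {effective_pbps:.1f} Pb/s effective aggregate bandwidth")
```

Under these assumptions the gap between 95% and 60% utilization is worth roughly 14 Pb/s of fabric capacity, which is why congestion control is treated here as a headline feature rather than a tuning detail.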

"AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency," said Gilad Shainer, senior vice president of networking at NVIDIA. "The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions."

"Colossus is the most powerful training system in the world," said Elon Musk on X. "Nice work by xAI team, NVIDIA and our many partners/suppliers."

"xAI has built the world's largest, most-powerful supercomputer," said a spokesperson for xAI. "NVIDIA's Hopper GPUs and Spectrum-X allow us to push the boundaries of training AI models at a massive-scale, creating a super-accelerated and optimized AI factory based on the Ethernet standard."

At the heart of the Spectrum-X platform is the Spectrum SN5600 Ethernet switch, which supports port speeds of up to 800 Gb/s and is based on the Spectrum-4 switch ASIC. xAI chose to pair the Spectrum-X SN5600 switch with NVIDIA BlueField-3 SuperNICs for unprecedented performance.
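
For a sense of what a 64-port, 800 Gb/s switch buys at this scale, the sketch below runs the textbook numbers for a non-blocking three-tier fat-tree. This is purely illustrative; the article does not describe Colossus's actual topology, oversubscription ratio, or NIC-per-GPU layout.

```python
# Textbook k-ary fat-tree: k pods, each with k/2 edge and k/2 aggregation
# switches, plus (k/2)^2 core switches, supporting k^3/4 end hosts.
def fat_tree(k: int) -> dict:
    return {
        "hosts": k**3 // 4,
        "edge_switches": k * (k // 2),
        "aggregation_switches": k * (k // 2),
        "core_switches": (k // 2) ** 2,
    }

# k = 64 matches the SN5600's port count (64 x 800 GbE per switch).
print(fat_tree(64))
# -> {'hosts': 65536, 'edge_switches': 2048,
#     'aggregation_switches': 2048, 'core_switches': 1024}
```

A strictly non-blocking fat-tree of 64-port switches tops out at 65,536 endpoints, so a 100,000-GPU cluster implies some mix of oversubscription, multiple NICs per node, or a modified topology; whichever mix xAI used, the three fabric tiers mentioned above are consistent with this class of design.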

Spectrum-X Ethernet networking for AI brings advanced features that deliver highly effective and scalable bandwidth with low latency and short tail latency, previously exclusive to InfiniBand. These features include adaptive routing with NVIDIA Direct Data Placement technology, congestion control, as well as enhanced AI fabric visibility and performance isolation—all key requirements for multi-tenant generative AI clouds and large enterprise environments.
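
The flow collisions the article keeps coming back to are easy to reproduce in miniature. Classic ECMP hashes each flow onto a single fixed path, so by birthday-paradox statistics some links end up carrying several elephant flows while others sit idle; per-packet adaptive routing instead spreads every flow across all paths, with Direct Data Placement handling the resulting out-of-order arrival at the receiver. A toy simulation, with flow and link counts chosen arbitrarily:

```python
import random
from collections import Counter

random.seed(42)
FLOWS, LINKS = 64, 64  # arbitrary illustrative counts

# Classic ECMP: each long-lived flow is hashed onto one fixed link.
ecmp_load = Counter(random.randrange(LINKS) for _ in range(FLOWS))
busiest = max(ecmp_load.values())
idle = LINKS - len(ecmp_load)
print(f"ECMP: busiest link carries {busiest} flows; {idle} of {LINKS} links idle")

# Per-packet adaptive routing: every flow is sprayed across all links,
# so each link sees an even ~FLOWS/LINKS share of the traffic.
print(f"Adaptive: ~{FLOWS / LINKS:.1f} flows' worth of load per link")
```

Even in this tiny example, hashing typically leaves around a third of the links idle (the 1/e fraction) while the busiest carries several flows' worth of traffic; at 100,000 endpoints that imbalance is what drags standard Ethernet down toward the 60% throughput cited above.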

The supporting facility and state-of-the-art supercomputer were built by xAI and NVIDIA in just 122 days, rather than the many months to years typical for systems of this size.
Well, yes. Leasing an already constructed facility is much faster than starting one yourself. Whowuddathunkit?
But I suppose, skipping both dealing with the locals and waiting for infrastructure in favour of getting your juice the dirty way is also one way to speed things up. And let's not forget doing work off the clock and under the table, that would shave months off those progress charts. It ain't time spent if you are not counting it, right?

AI is becoming mission-critical and requires increased performance
A sad statement coming from someone whose job is handling things that are truly "mission critical." Even more so when we're talking about the most gimmicky subfield of AI.
 

>>...Well, yes. Leasing an already constructed facility is much faster than starting one yourself....

I don't know of another company that could build a supercomputer in 4 months...

As an example, take a look at the timeline of the Department of Energy's Aurora supercomputer: it took Intel almost 8 years to build it!
 