
Google Unveils Seventh-Generation AI Processor: Ironwood

Nomad76

News Editor
Staff member
Joined
May 21, 2024
Messages
1,144 (3.47/day)
Google has rolled out its seventh-generation AI chip, Ironwood, which aims to boost AI application performance. The processor focuses on "inference" computing: the quick calculations needed to produce chatbot answers and other AI outputs. Ironwood stands as one of the few real alternatives to NVIDIA's leading AI processors, the product of Google's decade-long, multi-billion-dollar push to develop its own silicon. These tensor processing units (TPUs) are available exclusively through Google's cloud service or to its internal engineers.

According to Google Vice President Amin Vahdat, Ironwood combines functions from previously separate designs while increasing memory capacity. The chip can operate in pods of up to 9,216 processors and delivers twice the performance per watt of last year's Trillium chip. Configured as a full 9,216-chip pod, Ironwood delivers 42.5 Exaflops of computing power, more than 24 times the computational capacity of El Capitan, currently the world's largest supercomputer at roughly 1.7 Exaflops.
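As a sanity check on those figures, here is a quick back-of-the-envelope calculation (a sketch in plain Python; the per-chip number it derives matches the 4,614 TFLOPS figure discussed in the comments below):

```python
# Sanity check on the pod figures quoted above.
POD_CHIPS = 9_216
POD_EXAFLOPS = 42.5

# Per-chip peak compute implied by the pod total (exa -> peta).
per_chip_pflops = POD_EXAFLOPS * 1_000 / POD_CHIPS
print(f"{per_chip_pflops:.2f} PFLOPS per chip")  # ~4.61 PFLOPS, i.e. ~4,614 TFLOPS

# Ratio to El Capitan's ~1.7 Exaflops. Note the caveat raised in the
# comments below: Ironwood's figure is low-precision AI compute, while
# El Capitan's 1.7 Exaflops is an FP64 measurement.
print(f"{POD_EXAFLOPS / 1.7:.1f}x El Capitan")  # ~25x with these rounded inputs
```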



Ironwood's key features
  • Ironwood perf/watt is 2x that of Trillium, Google's sixth-generation tensor processing unit (TPU) announced last year
  • Offers 192 GB of HBM per chip, 6x that of Trillium
  • Improved HBM bandwidth, reaching 7.2 TB/s per chip, 4.5x Trillium's
  • Enhanced Inter-Chip Interconnect (ICI) bandwidth, increased to 1.2 TB/s bidirectional, 1.5x Trillium's
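Taking the multipliers in that list at face value, the implied Trillium baselines work out as follows (a quick derivation, not official Trillium specs):

```python
# Trillium (v6) baselines implied by Ironwood's specs and the
# "x of Trillium" multipliers above -- derived values, not official specs.
ironwood    = {"HBM capacity (GB)": 192, "HBM bandwidth (TB/s)": 7.2, "ICI bandwidth (TB/s)": 1.2}
vs_trillium = {"HBM capacity (GB)": 6.0, "HBM bandwidth (TB/s)": 4.5, "ICI bandwidth (TB/s)": 1.5}

for spec, value in ironwood.items():
    print(f"Trillium {spec}: {value / vs_trillium[spec]:g}")
# -> 32 GB of HBM, 1.6 TB/s HBM bandwidth, 0.8 TB/s ICI bandwidth
```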

Google uses these proprietary chips to build and deploy its Gemini AI models. The manufacturer producing the Google-designed processors remains undisclosed.

View at TechPowerUp Main Site | Source
 
Joined
Oct 6, 2021
Messages
1,882 (1.46/day)
System Name Raspberry Pi 7 Quantum @ Overclocked.
It's not really a fair comparison. Supercomputers typically use FP64 as their performance metric, while the AI industry relies on lower-precision formats like FP8 or even FP4.

The MI300X's dense FP8 performance is around 2,615 TFLOPS, and its sparse FP8 is around 5,230 TFLOPS. The article, meanwhile, claims each individual Ironwood chip boasts peak compute of 4,614 TFLOPS and calls that a monumental leap in AI capability.
I assume Google's numbers are also sparse, as the industry generally defaults to reporting the highest figures. The advantage of an ASIC should be better effective utilization of those TFLOPS in real-world performance, whereas AMD has so far managed to extract only a small percentage of the MI300X's potential.
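For readers wondering why the sparse figure is exactly double the dense one: vendors' "sparse" numbers assume 2:4 structured sparsity, where only two of every four weights are non-zero and the hardware skips the zeroed multiplies. A minimal NumPy illustration of 2:4 pruning (illustrative only; this is not how the hardware implements it):

```python
import numpy as np

# 2:4 structured sparsity: in every group of 4 weights, keep only the
# 2 largest-magnitude entries and zero the rest. Hardware supporting
# this skips the zeroed multiplies, hence the 2x "sparse" FLOPS figures.
def prune_2_4(weights: np.ndarray) -> np.ndarray:
    groups = weights.reshape(-1, 4)
    # Indices of the two smallest-magnitude entries in each group of 4.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weights.shape)

w = np.random.randn(8)
print(prune_2_4(w))  # exactly half the entries are now zero
```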
 

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,956 (1.06/day)
It's not really a fair comparison. Supercomputers typically use FP64 as their performance metric, while the AI industry relies on lower-precision formats like FP8 or even FP4.

The MI300X's dense FP8 performance is around 2,615 TFLOPS, and its sparse FP8 is around 5,230 TFLOPS. The article, meanwhile, claims each individual Ironwood chip boasts peak compute of 4,614 TFLOPS and calls that a monumental leap in AI capability.
I assume Google's numbers are also sparse, as the industry generally defaults to reporting the highest figures. The advantage of an ASIC should be better effective utilization of those TFLOPS in real-world performance, whereas AMD has so far managed to extract only a small percentage of the MI300X's potential.
Google's numbers are most meaningful to... themselves! Google's internal workloads justify spending billions on ASIC R&D to deploy new accelerators. TPUs also have a networking/clustering advantage that is unique to Google's system architecture and not available to anyone else.

When it comes to precision, look at Llama 4: it was trained in FP8 only, whereas Llama 3 used FP16 and FP8. NVIDIA also pushes FP8 and FP4 because we get more FLOPS while mostly keeping the precision. So FP8 is the future for training, and FP4 for inference.
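To make the precision trade-off concrete, here is a rough emulation of FP8 E4M3 rounding in plain Python (a sketch only: real FP8 training relies on hardware support plus per-tensor scaling, and this ignores range saturation and special values):

```python
import math

# Crude FP8 E4M3 emulation: keep a 3-bit mantissa and round.
# Real FP8 pipelines also apply per-tensor scaling and handle
# saturation and NaN encoding; this only shows the precision loss.
def round_to_e4m3(x: float) -> float:
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)        # x = m * 2**e, with 0.5 <= |m| < 1
    m = round(m * 16) / 16      # 1 implicit + 3 explicit mantissa bits
    return math.ldexp(m, e)

for v in [0.1, 0.5, 1.7, 3.14159]:
    print(f"{v} -> {round_to_e4m3(v)}")
# 0.1 -> 0.1015625, 0.5 -> 0.5, 1.7 -> 1.75, 3.14159 -> 3.25
```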
 
Joined
Jan 8, 2017
Messages
9,794 (3.24/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 MHz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
NVIDIA also pushes FP8 and FP4 because we get more FLOPS while mostly keeping the precision.
That's not really accurate; the main motivation is to reduce memory footprint and memory bandwidth requirements.
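To put rough numbers on the memory argument: weights alone for a hypothetical 70B-parameter model at each precision (a back-of-the-envelope sketch; KV cache, activations, and optimizer state come on top):

```python
# Weight memory for a hypothetical 70B-parameter model at each precision.
# Rough figures: ignores KV cache, activations, and optimizer state.
PARAMS = 70e9
for fmt, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    print(f"{fmt}: {PARAMS * bytes_per_param / 1e9:.0f} GB")
# FP16: 140 GB, FP8: 70 GB, FP4: 35 GB -- each halving of precision also
# halves the bandwidth needed to stream the weights every forward pass.
```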
 

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,956 (1.06/day)
That's not really accurate; the main motivation is to reduce memory footprint and memory bandwidth requirements.
That too, but the point still stands. You get more done in less time.
 
Joined
May 10, 2023
Messages
831 (1.17/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
It's not really a fair comparison. Supercomputers typically use FP64 as their performance metric, while the AI industry relies on lower-precision formats like FP8 or even FP4.
Indeed, and we do have benchmarks for mixed precision:

That Google cluster would still top the charts.
The MI300X's dense FP8 performance is around 2,615 TFLOPS, and its sparse FP8 is around 5,230 TFLOPS. The article, meanwhile, claims each individual Ironwood chip boasts peak compute of 4,614 TFLOPS and calls that a monumental leap in AI capability.
I assume Google's numbers are also sparse, as the industry generally defaults to reporting the highest figures. The advantage of an ASIC should be better effective utilization of those TFLOPS in real-world performance, whereas AMD has so far managed to extract only a small percentage of the MI300X's potential.
Google's TPUs have a really interesting block on the die that turns dense vectors into sparse ones internally, so the numbers they present are agnostic to whether the input is sparse or not.
AMD's trouble extracting its theoretical FLOPS is mostly down to its software stack.
Anyhow, Nvidia's B200 manages ~5 PFLOPS of dense FP8, while the H100 does ~2 PFLOPS under the same conditions. Double those for sparse.

The major point of those TPUs is that Google can make hecking good use of them and runs its own software stack on top, so their TCO is way lower compared to other products.
When it comes to precision, look at Llama 4: it was trained in FP8 only, whereas Llama 3 used FP16 and FP8. NVIDIA also pushes FP8 and FP4 because we get more FLOPS while mostly keeping the precision. So FP8 is the future for training, and FP4 for inference.
Adding to that, the latest DeepSeek model was also trained using FP8.
That's not really accurate; the main motivation is to reduce memory footprint and memory bandwidth requirements.
I'd say it's both; your answers are complementary. Smaller data types during training mean more FLOPS, less memory usage (which also allows bigger models on the same hardware), fewer interconnect bottlenecks, and quality is often retained.
Not to be confused with a model trained in FP16 that someone then quantizes naively, without any training awareness. Training-aware quantization produces smaller models within a margin of error of the original, and the results are even better when the model is natively trained at the smaller data type.
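A minimal sketch of the difference, in NumPy: quantization-aware training runs this "fake quantization" inside the forward pass so the model learns around the rounding error, while naive post-training quantization applies it once at the end (the helper below is hypothetical, for illustration only):

```python
import numpy as np

# Fake quantization as used in quantization-aware training (QAT):
# weights stay in full precision, but the forward pass sees rounded
# values, so gradients steer the model around the rounding error.
def fake_quantize(w: np.ndarray, bits: int = 8) -> np.ndarray:
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)  # per-tensor scale
    return np.round(w / scale) * scale               # quantize-dequantize

w = np.random.randn(4, 4).astype(np.float32)
print("max rounding error:", np.abs(w - fake_quantize(w)).max())
# Naive post-training quantization applies this once to the final weights;
# QAT applies it on every forward pass (with a straight-through gradient).
```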
 
Joined
Jan 3, 2021
Messages
4,067 (2.60/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
With the mechanical optical circuit switches (OCS) again this time?
 
Joined
Oct 6, 2021
Messages
1,882 (1.46/day)
System Name Raspberry Pi 7 Quantum @ Overclocked.
Indeed, and we do have benchmarks for mixed precision:

That Google cluster would still top the charts.
The article mentions El Capitan, which is based on more modern hardware (~43k MI300A APUs) with FP8 support, unlike the other supercomputers in this table.

If each AMD MI300A delivers 2 petaflops of FP8 performance, then 43,808 units would achieve a total of about 87 exaflops. So the article's claim, as well as yours, is incorrect.
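The arithmetic behind that figure, taking the poster's 2 PFLOPS-per-chip FP8 assumption at face value:

```python
# Aggregate FP8 throughput implied by the per-chip figure above.
MI300A_COUNT = 43_808
FP8_PFLOPS_PER_CHIP = 2  # assumed peak; sparse vs. dense unstated
total_exaflops = MI300A_COUNT * FP8_PFLOPS_PER_CHIP / 1_000  # peta -> exa
print(f"{total_exaflops:.1f} Exaflops")  # ~87.6, roughly double Ironwood's 42.5 EF pod
```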
 
Joined
May 10, 2023
Messages
831 (1.17/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
The article mentions El Capitan, which is based on more modern hardware (~43k MI300A APUs) with FP8 support, unlike the other supercomputers in this table.

If each AMD MI300A delivers 2 petaflops of FP8 performance, then 43,808 units would achieve a total of about 87 exaflops. So the article's claim, as well as yours, is incorrect.
That's the theoretical peak, but El Capitan hasn't done a run of the HPL-AI benchmark yet.

FWIW, Google's number is also theoretical, so it is indeed kind of a moot comparison.

Nonetheless, that TPU is pretty close in performance to a B200 (currently the de facto fastest GPU for AI) and has slightly better efficiency, so it's a great product all in all and does create some competition for Nvidia.
 