
Google Unveils Seventh-Generation AI Processor: Ironwood

Nomad76

News Editor
Staff member
Joined
May 21, 2024
Messages
1,144 (3.47/day)
Google has rolled out its seventh-generation AI chip, Ironwood, which aims to boost AI application performance. The processor focuses on "inference" computing: the quick calculations needed to produce chatbot answers and other AI outputs. Ironwood stands as one of the few real alternatives to NVIDIA's leading AI processors, the product of Google's decade-long, multi-billion-dollar push to develop its own silicon. These tensor processing units (TPUs) are available exclusively through Google's cloud service or to its internal engineers.

According to Google Vice President Amin Vahdat, Ironwood combines functions from previously separate designs while increasing memory capacity. The chip can operate in pods of up to 9,216 processors and delivers twice the performance per watt of last year's Trillium chip. Configured as a full 9,216-chip pod, Ironwood delivers 42.5 Exaflops of computing power, more than 24 times the computational capacity of El Capitan, currently the world's largest supercomputer at roughly 1.7 Exaflops.
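As a sanity check on those figures, here is a quick back-of-the-envelope calculation (a sketch in plain Python; the per-chip number it derives matches the 4,614 TFLOPS figure discussed in the comments below):

```python
# Sanity check on the pod figures quoted above.
POD_CHIPS = 9_216
POD_EXAFLOPS = 42.5

# Per-chip peak compute implied by the pod total (exa -> peta).
per_chip_pflops = POD_EXAFLOPS * 1_000 / POD_CHIPS
print(f"{per_chip_pflops:.2f} PFLOPS per chip")  # ~4.61 PFLOPS, i.e. ~4,614 TFLOPS

# Ratio to El Capitan's ~1.7 Exaflops. Note the caveat raised in the
# comments below: Ironwood's figure is low-precision AI compute, while
# El Capitan's 1.7 Exaflops is an FP64 measurement.
print(f"{POD_EXAFLOPS / 1.7:.1f}x El Capitan")  # ~25x with these rounded inputs
```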



Ironwood's key features
  • Ironwood perf/watt is 2x that of Trillium, Google's sixth-generation tensor processing unit (TPU) announced last year
  • Offers 192 GB of HBM per chip, 6x that of Trillium
  • Improved HBM bandwidth, reaching 7.2 TB/s per chip, 4.5x Trillium's
  • Enhanced Inter-Chip Interconnect (ICI) bandwidth, increased to 1.2 TB/s bidirectional, 1.5x Trillium's
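Taking the multipliers in that list at face value, the implied Trillium baselines work out as follows (a quick derivation, not official Trillium specs):

```python
# Trillium (v6) baselines implied by Ironwood's specs and the
# "x of Trillium" multipliers above -- derived values, not official specs.
ironwood    = {"HBM capacity (GB)": 192, "HBM bandwidth (TB/s)": 7.2, "ICI bandwidth (TB/s)": 1.2}
vs_trillium = {"HBM capacity (GB)": 6.0, "HBM bandwidth (TB/s)": 4.5, "ICI bandwidth (TB/s)": 1.5}

for spec, value in ironwood.items():
    print(f"Trillium {spec}: {value / vs_trillium[spec]:g}")
# -> 32 GB of HBM, 1.6 TB/s HBM bandwidth, 0.8 TB/s ICI bandwidth
```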

Google uses these proprietary chips to build and deploy its Gemini AI models. The manufacturer producing the Google-designed processors remains undisclosed.

View at TechPowerUp Main Site | Source
 
Joined
Oct 6, 2021
Messages
1,882 (1.46/day)
System Name Raspberry Pi 7 Quantum @ Overclocked.
It's not really a fair comparison. Supercomputers typically use FP64 as their performance metric, while the AI industry relies on lower-precision formats like FP8 or even FP4.

The MI300X's dense FP8 performance is around 2,615 TFLOPS, and its sparse FP8 is around 5,230 TFLOPS. The article, meanwhile, claims each individual Ironwood chip boasts peak compute of 4,614 TFLOPS and calls that a monumental leap in AI capability.
I assume Google's numbers are also sparse, as the industry generally defaults to reporting the highest figures. The advantage of an ASIC should be better effective utilization of those TFLOPS in real-world performance, whereas AMD has so far managed to extract only a small percentage of the MI300X's potential.
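For readers wondering why the sparse figure is exactly double the dense one: vendors' "sparse" numbers assume 2:4 structured sparsity, where only two of every four weights are non-zero and the hardware skips the zeroed multiplies. A minimal NumPy illustration of 2:4 pruning (illustrative only; this is not how the hardware implements it):

```python
import numpy as np

# 2:4 structured sparsity: in every group of 4 weights, keep only the
# 2 largest-magnitude entries and zero the rest. Hardware supporting
# this skips the zeroed multiplies, hence the 2x "sparse" FLOPS figures.
def prune_2_4(weights: np.ndarray) -> np.ndarray:
    groups = weights.reshape(-1, 4)
    # Indices of the two smallest-magnitude entries in each group of 4.
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    pruned = groups.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(weights.shape)

w = np.random.randn(8)
print(prune_2_4(w))  # exactly half the entries are now zero
```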
 

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,956 (1.06/day)
It's not really a fair comparison. Supercomputers typically use FP64 as their performance metric, while the AI industry relies on lower-precision formats like FP8 or even FP4.

The MI300X's dense FP8 performance is around 2,615 TFLOPS, and its sparse FP8 is around 5,230 TFLOPS. The article, meanwhile, claims each individual Ironwood chip boasts peak compute of 4,614 TFLOPS and calls that a monumental leap in AI capability.
I assume Google's numbers are also sparse, as the industry generally defaults to reporting the highest figures. The advantage of an ASIC should be better effective utilization of those TFLOPS in real-world performance, whereas AMD has so far managed to extract only a small percentage of the MI300X's potential.
Google's numbers are most meaningful to... themselves! Google's internal workloads justify spending billions on ASIC R&D to deploy new accelerators. TPUs also have a networking/clustering advantage that is unique to Google's system architecture and not available to anyone else.

When it comes to precision, look at Llama 4: it was trained in FP8 only, whereas Llama 3 used FP16 and FP8. NVIDIA also pushes FP8 and FP4 because we get more FLOPS while mostly keeping the precision. So FP8 is the future for training, and FP4 for inference.
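To make the precision trade-off concrete, here is a rough emulation of FP8 E4M3 rounding in plain Python (a sketch only: real FP8 training relies on hardware support plus per-tensor scaling, and this ignores range saturation and special values):

```python
import math

# Crude FP8 E4M3 emulation: keep a 3-bit mantissa and round.
# Real FP8 pipelines also apply per-tensor scaling and handle
# saturation and NaN encoding; this only shows the precision loss.
def round_to_e4m3(x: float) -> float:
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)        # x = m * 2**e, with 0.5 <= |m| < 1
    m = round(m * 16) / 16      # 1 implicit + 3 explicit mantissa bits
    return math.ldexp(m, e)

for v in [0.1, 0.5, 1.7, 3.14159]:
    print(f"{v} -> {round_to_e4m3(v)}")
# 0.1 -> 0.1015625, 0.5 -> 0.5, 1.7 -> 1.75, 3.14159 -> 3.25
```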
 
Joined
Jan 8, 2017
Messages
9,794 (3.24/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 MHz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
NVIDIA also pushes FP8 and FP4 because we get more FLOPS while mostly keeping the precision.
That's not really accurate; the main motivation is to reduce memory footprint and memory bandwidth requirements.
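To put rough numbers on the memory argument: weights alone for a hypothetical 70B-parameter model at each precision (a back-of-the-envelope sketch; KV cache, activations, and optimizer state come on top):

```python
# Weight memory for a hypothetical 70B-parameter model at each precision.
# Rough figures: ignores KV cache, activations, and optimizer state.
PARAMS = 70e9
for fmt, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    print(f"{fmt}: {PARAMS * bytes_per_param / 1e9:.0f} GB")
# FP16: 140 GB, FP8: 70 GB, FP4: 35 GB -- each halving of precision also
# halves the bandwidth needed to stream the weights every forward pass.
```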
 

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,956 (1.06/day)
That's not really accurate; the main motivation is to reduce memory footprint and memory bandwidth requirements.
That too, but the point still stands. You get more done in less time.
 
Joined
May 10, 2023
Messages
831 (1.17/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
It's not really a fair comparison. Supercomputers typically use FP64 as their performance metric, while the AI industry relies on lower-precision formats like FP8 or even FP4.
Indeed, and we do have benchmarks for mixed precision:

That Google cluster would still top the charts.
The MI300X's dense FP8 performance is around 2,615 TFLOPS, and its sparse FP8 is around 5,230 TFLOPS. The article, meanwhile, claims each individual Ironwood chip boasts peak compute of 4,614 TFLOPS and calls that a monumental leap in AI capability.
I assume Google's numbers are also sparse, as the industry generally defaults to reporting the highest figures. The advantage of an ASIC should be better effective utilization of those TFLOPS in real-world performance, whereas AMD has so far managed to extract only a small percentage of the MI300X's potential.
Google's TPUs have a really interesting block on the die that turns dense vectors into sparse ones internally, so the numbers they present are agnostic to whether the input is sparse or not.
AMD's trouble extracting its theoretical FLOPS is mostly down to its software stack.
Anyhow, Nvidia's B200 manages ~5 PFLOPS of dense FP8, while the H100 does ~2 PFLOPS under the same conditions. Double those for sparse.

The major point of those TPUs is that Google can make hecking good use of them and runs its own software stack on top, so their TCO is way lower compared to other products.
When it comes to precision, look at Llama 4: it was trained in FP8 only, whereas Llama 3 used FP16 and FP8. NVIDIA also pushes FP8 and FP4 because we get more FLOPS while mostly keeping the precision. So FP8 is the future for training, and FP4 for inference.
Adding to that, the latest DeepSeek model was also trained using FP8.
That's not really accurate; the main motivation is to reduce memory footprint and memory bandwidth requirements.
I'd say it's both; your answers are complementary. Smaller data types during training mean more FLOPS, less memory usage (which also allows bigger models on the same hardware), fewer interconnect bottlenecks, and quality is often retained.
Not to be confused with a model trained in FP16 that someone then quantizes naively, without any training awareness. Training-aware quantization produces smaller models within a margin of error of the original, and the results are even better when the model is natively trained at the smaller data type.
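A minimal sketch of the difference, in NumPy: quantization-aware training runs this "fake quantization" inside the forward pass so the model learns around the rounding error, while naive post-training quantization applies it once at the end (the helper below is hypothetical, for illustration only):

```python
import numpy as np

# Fake quantization as used in quantization-aware training (QAT):
# weights stay in full precision, but the forward pass sees rounded
# values, so gradients steer the model around the rounding error.
def fake_quantize(w: np.ndarray, bits: int = 8) -> np.ndarray:
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)  # per-tensor scale
    return np.round(w / scale) * scale               # quantize-dequantize

w = np.random.randn(4, 4).astype(np.float32)
print("max rounding error:", np.abs(w - fake_quantize(w)).max())
# Naive post-training quantization applies this once to the final weights;
# QAT applies it on every forward pass (with a straight-through gradient).
```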
 
Joined
Jan 3, 2021
Messages
4,067 (2.60/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
With the mechanical optical circuit switches (OCS) again this time?
 
Joined
Oct 6, 2021
Messages
1,882 (1.46/day)
System Name Raspberry Pi 7 Quantum @ Overclocked.
Indeed, and we do have benchmarks for mixed precision:

That Google cluster would still top the charts.
The article mentions El Capitan, which is based on more modern hardware (~43k MI300A APUs) with FP8 support, unlike the other supercomputers in this table.

If each AMD MI300A delivers 2 petaflops of FP8 performance, then 43,808 units would achieve a total of about 87 exaflops. So the article's claim, as well as yours, is incorrect.
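The arithmetic behind that figure, taking the poster's 2 PFLOPS-per-chip FP8 assumption at face value:

```python
# Aggregate FP8 throughput implied by the per-chip figure above.
MI300A_COUNT = 43_808
FP8_PFLOPS_PER_CHIP = 2  # assumed peak; sparse vs. dense unstated
total_exaflops = MI300A_COUNT * FP8_PFLOPS_PER_CHIP / 1_000  # peta -> exa
print(f"{total_exaflops:.1f} Exaflops")  # ~87.6, roughly double Ironwood's 42.5 EF pod
```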
 
Joined
May 10, 2023
Messages
831 (1.17/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
The article mentions El Capitan, which is based on more modern hardware (~43k MI300A APUs) with FP8 support, unlike the other supercomputers in this table.

If each AMD MI300A delivers 2 petaflops of FP8 performance, then 43,808 units would achieve a total of about 87 exaflops. So the article's claim, as well as yours, is incorrect.
That's the theoretical peak, but El Capitan hasn't done a run of the HPL-AI benchmark yet.

FWIW, Google's number is also theoretical, so it is indeed kind of a moot comparison.

Nonetheless, that TPU is pretty close in performance to a B200 (currently the de facto fastest GPU for AI) and has slightly better efficiency, so it's a great product all in all and does create some competition for Nvidia.
 