Wednesday, April 9th 2025

Google Unveils Seventh-Generation AI Processor: Ironwood

Google has rolled out its seventh-generation AI chip, Ironwood, designed to speed up AI applications. The processor focuses on "inference" computing, the rapid calculations needed to produce chatbot answers and other AI outputs. The product of a roughly ten-year, multi-billion-dollar development effort, Ironwood stands as one of the few real alternatives to NVIDIA's leading AI processors. These tensor processing units (TPUs) are available only through Google's cloud service or to the company's own engineers.

According to Google Vice President Amin Vahdat, Ironwood combines functions from previously separate designs while increasing memory capacity. The chip can operate in clusters of up to 9,216 processors and delivers twice the performance per watt of last year's Trillium chip. Configured in a full 9,216-chip pod, Ironwood provides 42.5 Exaflops of compute, which Google says is more than 24 times the capacity of El Capitan, currently the world's largest supercomputer, at 1.7 Exaflops.
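As a back-of-the-envelope check, the pod-level figure follows from the 4,614 TFLOPS per-chip peak Google has quoted for Ironwood (a number that also comes up in the comments below); a minimal sketch:

# Rough sanity check of the pod-level compute figure.
# Assumes Google's quoted Ironwood peak of 4,614 TFLOPS per chip.
chips_per_pod = 9_216
per_chip_tflops = 4_614
pod_exaflops = chips_per_pod * per_chip_tflops / 1e6        # TFLOPS -> Exaflops

el_capitan_exaflops = 1.7
print(f"Pod compute: {pod_exaflops:.1f} Exaflops")           # ~42.5
print(f"vs. El Capitan: {pod_exaflops / el_capitan_exaflops:.0f}x")   # ~25x, i.e. "more than 24 times"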
Ironwood's key features
  • Ironwood's performance per watt is 2x that of Trillium, Google's sixth-generation TPU announced last year
  • Offers 192 GB of HBM per chip, 6x that of Trillium
  • Improved HBM bandwidth, reaching 7.2 TB/s per chip, 4.5x that of Trillium
  • Enhanced Inter-Chip Interconnect (ICI) bandwidth, increased to 1.2 TB/s bidirectional, 1.5x that of Trillium
Google uses these proprietary chips to build and deploy its Gemini AI models. The manufacturer producing the Google-designed processors remains undisclosed.
Sources: Reuters, Google Blog

8 Comments on Google Unveils Seventh-Generation AI Processor: Ironwood

#1
Denver
It's not really a fair comparison. Supercomputers typically use FP64 as their performance metric, while the AI industry relies on lower-precision formats like FP8 or even FP4.

The MI300X's dense FP8 performance is around 2,615 TFLOPS, and its sparse FP8 is around 5,230 TFLOPS. Each individual Ironwood chip boasts a peak compute of 4,614 TFLOPS. This represents a monumental leap in AI capability.
I assume these numbers are also sparse, as the industry generally defaults to reporting the highest figures. The advantage of ASICs should be better effective utilization of those TFLOPS in real-world performance, whereas AMD has so far managed to extract only a small percentage of the MI300X's potential.
#2
AleksandarK
News Editor
Denver said:
It's not really a fair comparison. Supercomputers typically use FP64 as their performance metric, while the AI industry relies on lower-precision formats like FP8 or even FP4.

The MI300X's dense FP8 performance is around 2,615 TFLOPS, and its sparse FP8 is around 5,230 TFLOPS. Each individual Ironwood chip boasts a peak compute of 4,614 TFLOPS. This represents a monumental leap in AI capability.
I assume these numbers are also sparse, as the industry generally defaults to reporting the highest figures. The advantage of ASICs should be better effective utilization of those TFLOPS in real-world performance, whereas AMD has so far managed to extract only a small percentage of the MI300X's potential.
Google's numbers are most meaningful to... themselves! Google's internal workloads justify spending billions on ASIC R&D to deploy new accelerators. TPUs also have a networking/clustering advantage that is unique to Google's system architecture and not available to anyone else.

When it comes to precision, look at Llama 4: it is trained in FP8 only, whereas Llama 3 used a mix of FP16 and FP8. NVIDIA also pushes FP8 and FP4 because you get more FLOPS while mostly keeping the accuracy. So FP8 is the future for training, and FP4 for inference.
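A crude way to see why low-precision formats "mostly keep the accuracy" is to simulate FP8 (E4M3) rounding; this is only an illustrative sketch that ignores subnormals and the per-tensor scaling real FP8 training relies on:

import numpy as np

def fake_e4m3(x, mantissa_bits=3, max_val=448.0):
    # Crude E4M3 simulation: clamp to the representable range and keep
    # 3 mantissa bits (plus the implicit leading bit).
    x = np.clip(x, -max_val, max_val)
    mant, exp = np.frexp(x)                      # x = mant * 2**exp, |mant| in [0.5, 1)
    scale = 2.0 ** (mantissa_bits + 1)
    mant = np.round(mant * scale) / scale
    return np.ldexp(mant, exp)

w = np.random.randn(1_000_000).astype(np.float32)
rel_err = np.abs(fake_e4m3(w) - w) / (np.abs(w) + 1e-12)
print(f"median relative error: {np.median(rel_err):.2%}")   # on the order of a couple percent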
#3
Vya Domus
AleksandarK said:
NVIDIA also pushes FP8 and FP4 because you get more FLOPS while mostly keeping the accuracy.
That's not really accurate; the main motivation is to reduce memory footprint and memory bandwidth requirements.
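Rough numbers make the point. A quick illustrative sketch (the 70B parameter count is hypothetical; the 7.2 TB/s figure is Ironwood's quoted per-chip HBM bandwidth from the article):

# Why smaller data types ease both capacity and bandwidth pressure (illustrative).
params = 70e9                               # hypothetical 70B-parameter model
hbm_bw_gbs = 7_200                          # Ironwood HBM bandwidth per chip, GB/s (7.2 TB/s)
for fmt, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("FP4", 0.5)]:
    weight_gb = params * bytes_per_param / 1e9
    stream_ms = weight_gb / hbm_bw_gbs * 1e3    # time to read every weight once
    print(f"{fmt}: {weight_gb:,.0f} GB of weights, ~{stream_ms:.0f} ms to stream from HBM")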
#4
AleksandarK
News Editor
Vya Domus said:
That's not really accurate; the main motivation is to reduce memory footprint and memory bandwidth requirements.
That too, but the point still stands. You get more done in less time.
#5
igormp
Denver said:
It's not really a fair comparison. Supercomputers typically use FP64 as their performance metric, while the AI industry relies on lower-precision formats like FP8 or even FP4.
Indeed, and we do have benchmarks for mixed precision:
hpl-mxp.org/results.md

That Google cluster would still top the charts.
Denver said:
The MI300X's dense FP8 performance is around 2,615 TFLOPS, and its sparse FP8 is around 5,230 TFLOPS. Each individual Ironwood chip boasts a peak compute of 4,614 TFLOPS. This represents a monumental leap in AI capability.
I assume these numbers are also sparse, as the industry generally defaults to reporting the highest figures. The advantage of ASICs should be better effective utilization of those TFLOPS in real-world performance, whereas AMD has so far managed to extract only a small percentage of the MI300X's potential.
Google's TPUs have a really interesting on-die unit that turns dense vectors into sparse ones internally, so the numbers they present are agnostic to whether the input is sparse or not.
AMD's issue with extracting the actual theoretical FLOPs is mostly due to their software stack.
Anyhow, Nvidia's B200 manages ~5 PFLOPS of dense FP8, while the H100 does ~2 PFLOPS under the same conditions. Double those for sparse.

The major point of those TPUs is that Google can make hecking good use of them and has its own software stack on top, so its TCO is way lower compared to other products.
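For context on the dense vs. sparse figures above: the "double for sparse" peaks assume 2:4 structured sparsity, i.e. at most two non-zero weights in every group of four so the hardware can skip the zeros. A small illustrative sketch of that pruning pattern (not how production models are actually sparsified):

import numpy as np

def prune_2_of_4(w):
    # Zero the 2 smallest-magnitude weights in every group of 4.
    g = w.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(g), axis=1)[:, :2]      # indices of the 2 smallest per group
    np.put_along_axis(g, drop, 0.0, axis=1)
    return g.reshape(w.shape)

w = np.random.randn(8, 16).astype(np.float32)
w_24 = prune_2_of_4(w)
print(f"fraction of zeros: {(w_24 == 0).mean():.0%}")   # 50%, which the hardware can skip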
AleksandarK said:
When it comes to precision, look at Llama 4: it is trained in FP8 only, whereas Llama 3 used a mix of FP16 and FP8. NVIDIA also pushes FP8 and FP4 because you get more FLOPS while mostly keeping the accuracy. So FP8 is the future for training, and FP4 for inference.
Adding to that, the latest DeepSeek was also trained using FP8.
Vya Domus said:
That's not really accurate; the main motivation is to reduce memory footprint and memory bandwidth requirements.
I'd say it's both; your answers are complementary. Smaller data types during training mean more FLOPS, less memory usage (which also means bigger models on the same amount of hardware), fewer interconnect bottlenecks, and quality is often retained.
Not to be confused with a model trained in FP16 that someone then quantizes arbitrarily without any training awareness. Quantization-aware training produces smaller models within a margin of error of the originals, and it's even better when the model is natively trained in a smaller data type.
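To illustrate what "training awareness" means in practice, here is a minimal fake-quantization sketch using a straight-through estimator, the basic trick behind quantization-aware training (an INT8-style toy example, not any particular framework's API):

import torch

class FakeQuant(torch.autograd.Function):
    # Round weights to a coarse grid in the forward pass,
    # but pass gradients straight through in the backward pass.
    @staticmethod
    def forward(ctx, w, scale):
        return torch.clamp(torch.round(w / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None                      # straight-through estimator

lin = torch.nn.Linear(16, 4)
x = torch.randn(8, 16)
scale = lin.weight.detach().abs().max() / 127
y = torch.nn.functional.linear(x, FakeQuant.apply(lin.weight, scale), lin.bias)
y.sum().backward()                                 # gradients still reach the FP32 weights
print(lin.weight.grad.shape)                       # torch.Size([4, 16])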
#6
Wirko
With mechanical optical circuit switches (OCS) again this time?
#7
Denver
igormp said:
Indeed, and we do have benchmarks for mixed precision:
hpl-mxp.org/results.md

That Google cluster would still top the charts.
The article mentions El Capitan, which is based on more modern hardware (about 43k MI300A APUs) with FP8 support, unlike the other supercomputers in that table.
en.m.wikipedia.org/wiki/El_Capitan_(supercomputer)

If each AMD MI300A chip delivers 2 petaflops of FP8 performance, then 43,808 units would achieve a total of roughly 87 exaflops. So the article's claim, as well as yours, is incorrect.
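For reference, the back-of-the-envelope arithmetic (taking the ~2 petaflops FP8 figure per MI300A at face value):

mi300a_count = 43_808
fp8_pflops_per_chip = 2                         # rough FP8 peak per MI300A assumed above
total_exaflops = mi300a_count * fp8_pflops_per_chip / 1e3
print(f"{total_exaflops:.1f} Exaflops")          # ~87.6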
#8
igormp
Denver said:
The article mentions El Capitan, which is based on more modern hardware (about 43k MI300A APUs) with FP8 support, unlike the other supercomputers in that table.
en.m.wikipedia.org/wiki/El_Capitan_(supercomputer)

If each AMD MI300A chip delivers 2 petaflops of FP8 performance, then 43,808 units would achieve a total of roughly 87 exaflops. So the article's claim, as well as yours, is incorrect.
That's the theoretical performance, but El Capitan hasn't done any run on the HPL-MxP (HPL-AI) benchmark yet.

FWIW, Google's number is also theoretical, so indeed it's kind of a moot comparison.

Nonetheless, that TPU is pretty close in performance to a B200 (currently the de facto fastest GPU for AI) and has slightly better efficiency, so it's a great product all in all and does create some competition for Nvidia.