Wednesday, April 9th 2025

Google Unveils Seventh-Generation AI Processor: Ironwood
Google has rolled out its seventh-generation AI chip, Ironwood, which aims to boost AI application performance. This processor focuses on "inference" computing, the quick calculations needed for chatbot answers and other AI outputs. Ironwood stands as one of the few real alternatives to NVIDIA's leading AI processors, the product of Google's decade-long, multi-billion-dollar development effort. These tensor processing units (TPUs) are available exclusively through Google's cloud service or to its internal engineers.
According to Google Vice President Amin Vahdat, Ironwood combines functions from previously separate designs while increasing memory capacity. The chip can operate in groups of up to 9,216 processors and delivers twice the performance per watt of last year's Trillium chip. When configured in pods of 9,216 chips, Ironwood delivers 42.5 Exaflops of computing power, which Google says is more than 24 times the computational capacity of El Capitan, currently the world's fastest supercomputer at 1.7 Exaflops.
Sources:
Reuters, Google Blog
Ironwood's key features
- 2x performance per watt relative to Trillium, Google's sixth-generation TPU announced last year
- Offers 192 GB of HBM per chip, 6x that of Trillium
- Improved HBM bandwidth, reaching 7.2 TB/s per chip, 4.5x Trillium's
- Enhanced Inter-Chip Interconnect (ICI) bandwidth, increased to 1.2 TB/s bidirectional, 1.5x Trillium's
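As a quick sanity check of the pod-level figures, here is a minimal back-of-the-envelope calculation. It assumes the 4,614-TFLOPs per-chip peak cited in the comments below, and note that El Capitan's 1.7 Exaflops is an FP64 (HPL) figure, so the two numbers are measured at different precisions.

```python
# Sanity check of the pod math (assumes the per-chip peak quoted below).
chips_per_pod = 9_216
per_chip_tflops = 4_614                               # reported peak, TFLOPs

pod_exaflops = chips_per_pod * per_chip_tflops / 1e6  # TFLOPs -> Exaflops
print(f"{pod_exaflops:.1f} Exaflops per pod")         # ~42.5, as the article says

el_capitan_exaflops = 1.7                             # FP64 HPL, not FP8
print(f"{pod_exaflops / el_capitan_exaflops:.0f}x El Capitan")  # ~25x
```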
8 Comments on Google Unveils Seventh-Generation AI Processor: Ironwood
Each individual Ironwood chip boasts peak compute of 4,614 TFLOPs, billed as a monumental leap in AI capability. For comparison, the MI300X's dense FP8 performance is around 2,615 TFLOPS, and its sparse FP8 around 5,230 TFLOPS.
I assume these numbers are also sparse, as the industry generally defaults to reporting the highest figures. The advantage of an ASIC should be better effective utilization of its TFLOPs in real-world workloads, whereas AMD has so far managed to extract only a small percentage of the MI300X's potential.
When it comes to precision, look at Llama 4: it was trained in FP8 only. Llama 3 used FP16 and FP8; Llama 4 is FP8-exclusive. NVIDIA also pushes FP8 and FP4 because you get more FLOPs while mostly keeping the precision there. So FP8 is the future for training, and FP4 for inference.
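To make the precision trade-off concrete, here is a minimal sketch (an illustration, not from the thread) that rounds values to FP8 E4M3, the 8-bit format typically used for training, assuming the standard layout of 4 exponent bits, 3 mantissa bits, and a maximum finite value of 448:

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value (sketch: saturates, ignores NaN)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    _, e = math.frexp(abs(x))        # abs(x) = m * 2**e with 0.5 <= m < 1
    e = max(min(e, 9), -5)           # clamp into E4M3's normal + subnormal range
    step = 2.0 ** (e - 4)            # spacing of representable values near x
    q = round(abs(x) / step) * step  # round-to-nearest-even
    return sign * min(q, 448.0)      # 448 is the largest finite E4M3 value

for v in [0.1234, 1.7, 300.0, 1000.0]:
    q = quantize_e4m3(v)
    print(f"{v:>8} -> {q:<8} rel. err {abs(q - v) / v:.3%}")
```

The relative error stays within a few percent across the dynamic range, which is why FP8 training can work at all; values beyond 448 saturate, which is why scaling factors matter so much in practice.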
hpl-mxp.org/results.md
That Google cluster would still top the charts. Google's TPUs have a really interesting on-die unit that turns dense vectors into sparse ones internally, so the numbers they present are agnostic to whether the input is sparse or not.
AMD's trouble extracting its theoretical FLOPs in practice is mostly down to its software stack.
Anyhow, Nvidia's B200 manages ~5 PFLOPS of dense FP8, while the H100 does ~2 PFLOPS under the same conditions. Double those for sparse.
The major point of those TPUs is that Google can make really good use of them and runs its own software stack on top, so its TCO is way lower compared to other products. Adding to that, the latest DeepSeek model was also trained using FP8. I'd say it's both; your answers are complementary. Smaller data types during training mean more FLOPs, less memory usage (which also means bigger models for the same amount of hardware), and fewer interconnect bottlenecks, while quality is often retained.
Not to be confused with the case where a model is trained in FP16 and someone applies blind post-training quantization without any training awareness. Quantization-aware training produces smaller models within a margin of error of the original, and it's even better when the model is natively trained at a smaller data type.
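For reference, the core trick behind quantization-aware training is "fake quantization" with a straight-through estimator. A minimal sketch, assuming a symmetric per-tensor int8 scheme (names are illustrative):

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Quantize to the int8 grid in the forward pass; pass gradients straight through."""
    @staticmethod
    def forward(ctx, x, scale):
        # Simulate int8 storage: round to the grid, clamp, then dequantize.
        return torch.clamp(torch.round(x / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: treat the rounding as identity, so the
        # loss "feels" the quantization error but gradients still reach x.
        return grad_out, None

w = torch.randn(4, 4, requires_grad=True)
scale = w.detach().abs().max() / 127        # per-tensor symmetric scale
loss = FakeQuant.apply(w, scale).sum()
loss.backward()                             # w.grad is populated despite round()
```

Because the weights are trained while experiencing quantization error, the final model degrades far less than one quantized blindly after the fact.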
en.m.wikipedia.org/wiki/El_Capitan_(supercomputer)
If each AMD MI300A chip delivers ~2 petaflops of FP8 performance, then El Capitan's 43,808 units would achieve a total of roughly 87 exaflops. So the article's claim, as well as yours, is incorrect.
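Spelling the arithmetic out (the ~2 PFLOPS per-chip figure is the commenter's; the 1.7 Exaflops headline number is El Capitan's FP64 HPL result, so the comparison mixes precisions):

```python
chips = 43_808               # MI300A APUs in El Capitan
fp8_per_chip = 2.0e15        # ~2 PFLOPS peak FP8 per chip (commenter's figure)

total_fp8 = chips * fp8_per_chip
print(f"{total_fp8 / 1e18:.1f} exaflops of theoretical FP8")  # ~87.6
print("vs the 1.7 exaflops headline figure, which is FP64 HPL")
```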
FWIW, Google's number is also theoretical, so indeed it's kind of a moot comparison.
Nonetheless, that TPU is pretty close in performance to a B200 (currently the de facto fastest GPU for AI) and has slightly better efficiency, so it's a great product all in all, and it does create some competition for Nvidia.