Tuesday, June 25th 2024
AI Startup Etched Unveils Transformer ASIC Claiming 20x Speed-up Over NVIDIA H100
A new startup emerged from stealth mode today to power the next generation of generative AI. Etched makes an application-specific integrated circuit (ASIC) built to process transformers. The transformer is a deep learning architecture developed by Google, and it is now the powerhouse behind models like OpenAI's GPT-4o in ChatGPT, Anthropic Claude, Google Gemini, and Meta's Llama family. Etched set out to create an ASIC that processes only transformer models, and the result is a chip called Sohu. The claim is that Sohu outperforms NVIDIA's latest and greatest by an entire order of magnitude. Where a server with eight NVIDIA H100 GPUs pushes Llama-3 70B at 25,000 tokens per second, and a server with eight of the latest B200 "Blackwell" GPUs pushes 43,000 tokens per second, a server with eight Sohu chips is claimed to output 500,000 tokens per second.
Why is this important? Not only does the ASIC outperform Hopper by 20x and Blackwell by 10x, but it also serves so many tokens per second that it enables an entirely new class of AI applications requiring real-time output. The Sohu architecture is so efficient that 90% of its FLOPS can be used, while traditional GPUs reach only a 30-40% FLOPS utilization rate. That low utilization translates into inefficiency and wasted power, which Etched hopes to solve by building an accelerator dedicated to running transformers (the "T" in GPT) at massive scale. Given that frontier model development costs more than one billion US dollars, and hardware costs are measured in tens of billions of US dollars, an accelerator dedicated to a specific application can help advance AI faster. AI researchers often say that "scale is all you need" (echoing the legendary "Attention Is All You Need" paper), and Etched wants to build on that.
However, there are some doubts going forward. While it is generally believed that transformers are the "future" of AI development, an ASIC solves the problem only until the operations change. This is reminiscent of the crypto mining craze, which brought a few cycles of ASIC miners that are now worthless pieces of sand. Ethereum miners, for example, were built to mine ETH under proof of work, and now that Ethereum has transitioned to proof of stake, ETH mining ASICs are worthless.
Nonetheless, Etched wants the success formula to be simple: run transformer-based models on the Sohu ASIC with an open-source software ecosystem and scale it to massive sizes. While details are scarce, we know that the ASIC carries 144 GB of HBM3E memory and that the chip is manufactured on TSMC's 4 nm process. Etched says this enables AI models with 100 trillion parameters, more than 55x bigger than GPT-4's reported 1.8 trillion parameter design.
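As a quick sanity check on the multipliers quoted above, the arithmetic can be sketched in a few lines of Python. The throughput, utilization, and parameter figures are the ones given in the article (Etched's claims, not independent measurements); the names `TOKENS_PER_S` and `speedup` are made up for this illustration.

```python
# Figures quoted in the article (Etched's claims for eight-chip servers
# running Llama-3 70B); nothing here is independently measured.
TOKENS_PER_S = {"8x H100": 25_000, "8x B200": 43_000, "8x Sohu": 500_000}


def speedup(new: float, old: float) -> float:
    """Return how many times larger `new` is than `old`."""
    return new / old


sohu = TOKENS_PER_S["8x Sohu"]
vs_h100 = speedup(sohu, TOKENS_PER_S["8x H100"])
vs_b200 = speedup(sohu, TOKENS_PER_S["8x B200"])

# Utilization gap: 90% claimed for Sohu versus 30-40% quoted for GPUs.
util_gain_low = speedup(90, 40)
util_gain_high = speedup(90, 30)

# Model-size comparison: 100 trillion versus 1.8 trillion parameters.
param_ratio = speedup(100e12, 1.8e12)

print(f"Sohu vs H100: {vs_h100:.1f}x")
print(f"Sohu vs B200: {vs_b200:.1f}x")
print(f"Utilization gain: {util_gain_low:.2f}x to {util_gain_high:.2f}x")
print(f"Parameter ratio: {param_ratio:.1f}x")
```

With the quoted numbers this gives 20x over the H100 server, about 11.6x over the B200 server (rounded down to 10x in the text), and a parameter ratio of about 55.6. The utilization gap by itself works out to 2.25x to 3x.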
Source:
Etched
37 Comments on AI Startup Etched Unveils Transformer ASIC Claiming 20x Speed-up Over NVIDIA H100
That would be something...
In some cases even less than that. Eventually, the limitations of silicon might compel companies to rethink their GPUs completely, but it's hard to say for sure.
The fundamental fact is that NVidia GPUs are doing FP16 4x4 matrix multiplications as their basis. You can gain significantly more efficiency by going to an 8x8 or 16x16 matrix. (Or go TPU and go to a full 256x256 matrix.) The matrices in these "Deep Learning AI" models are all huge, so processing bigger and bigger matrices at a time leads to more efficiency in power, area, etc.
The issue is that the 4x4 matrix multiplication was chosen because it fits in a GPU's register space. It's basically the best a general-purpose GPU can do on NVidia's architecture, for various reasons. I'd expect that if a few more registers (or 64-way CDNA cores from AMD) were used, then maybe 8x4 or 8x8 sizes could be possible, but even AMD is doing 4x4 matrix sizes on their GPUs. So 4x4 is it.
Anyone can just take a bigger fundamental matrix, write software that efficiently splits up the work as 8x8 or 16x16 (or Google TPU it to 256x256 splits) and get far better efficiency. It's not a secret, and such "systolic arrays" are cake to do from an FPGA perspective. The issue is that these bigger architectures are "not GPUs" anymore, and will be useless outside of AI. And furthermore, you have even more competitors (Google TPU in particular) who you actually should be gunning for.
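To make the "split the work into bigger blocks" idea concrete, here is a minimal pure-Python sketch of a blocked matrix multiply. It is only an illustration of the general technique, not how any vendor's hardware or library works; the function name `tiled_matmul` and the "operands loaded" counter are assumptions for this example, used as a rough proxy for data movement.

```python
def tiled_matmul(a, b, tile):
    """Multiply square matrices a and b using tile x tile blocks.

    Returns (c, tile_ops, operand_loads). `operand_loads` counts the input
    elements fetched per block multiply, a rough proxy for data movement.
    Assumes len(a) is a multiple of `tile`.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    tile_ops = 0
    operand_loads = 0
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # One block multiply: reads a tile of a and a tile of b.
                tile_ops += 1
                operand_loads += 2 * tile * tile
                for i in range(i0, i0 + tile):
                    for j in range(j0, j0 + tile):
                        c[i][j] += sum(
                            a[i][k] * b[k][j] for k in range(k0, k0 + tile)
                        )
    return c, tile_ops, operand_loads


n = 16
a = [[i + j for j in range(n)] for i in range(n)]
b = [[i - j for j in range(n)] for i in range(n)]
for tile in (4, 8, 16):
    c, ops, loads = tiled_matmul(a, b, tile)
    macs = n ** 3  # total multiply-accumulates is the same for every tile size
    print(f"{tile}x{tile}: {ops} block multiplies, "
          f"{macs / loads:.0f} MACs per operand loaded")
```

The total number of multiply-accumulates stays at n^3 regardless of block size, but each doubling of the block edge doubles the work done per operand fetched (2, 4, and 8 here), which is the reuse argument behind larger matrix units.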
No one is buying NVidia GPUs to lead in AI. They're buying GPUs so that they have something else to do if the AI bubble pops. It's a hedged bet. If you go 100% AI with your ASIC chip (like Google or this "Etched" company), you're absolutely going to get eff'd when the AI bubble pops, as all those chips suddenly become worthless. The NVidia GPUs will lose valuation, but there are still other compute projects you can do with them afterwards.
You are not really doing much else with A100s and later, because they are not V100s. They are very, very much specialized for AI calculations and economically worthless for anything else.
As such, the H100 is still a hedge. If AI collapses tomorrow, I'd rather have an H100 than an "Etched" AI ASIC. Despite being a hedge, the H100 is still the market leader in practical performance, thanks to all the software optimizations in CUDA. (Even if the fundamental low-level 4x4 matrix multiplication routines are much smaller and less efficient than the large 8x8 or 16x16 designs of competitors.)
Nvidia really doesn't treat these as anything more than ML accelerators, despite them technically still being "GPUs". They have far inferior FP64/FP16 performance compared to MI300, for example.
That will eventually change.
Nah, they're just as dumb as anyone else, otherwise Nvidia wouldn't be the most valuable company in the world right now.
How does Intel's XMX architecture fare with that?
I don't think it's about hedging their bets. I think it's just a case of what's available and easy to start with, because of all the work Nvidia already put towards a robust software stack.
I can claim the sea and the sun, but without actual proof and 3rd party confirmation, I'm just dust in the wind...
If you know how memory is going to move, then you can hardwire the data movements to occur. A hardwired data movement is just that: a wire. It's not even a transistor... a dumb wire is the cheapest thing in cost, power and has instantaneous performance.
The problem with hardwired data movements is that they're hardwired. They literally cannot do anything else. If it's add then multiply, the hardwired path will only ever do add then multiply. (Unlike a CPU or GPU, which can change the order, change the data, and do other things.)
I can certainly believe that a large systolic array is vastly faster at this job. But its downside is that it's... hardwired. Unchanging. Inflexible.
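For anyone who hasn't seen one, here is a toy software model of an output-stationary systolic array doing a matrix multiply. It is a sketch of the textbook idea only, with a made-up function name (`systolic_matmul`), and says nothing about how Sohu or any TPU is actually wired. Each cell does one fixed thing per cycle: multiply the two values passing through it, add to its accumulator, and hand the values on to its right and lower neighbours.

```python
def systolic_matmul(a, b):
    """Simulate an n x m grid of multiply-accumulate cells computing a @ b.

    a is n x k, b is k x m. Rows of a enter from the left and columns of b
    enter from the top, each skewed by one cycle per row/column, so matching
    operands meet in the right cell at the right time.
    Returns (result, cycles).
    """
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]     # one accumulator per cell
    a_reg = [[0] * m for _ in range(n)]   # values travelling left to right
    b_reg = [[0] * m for _ in range(n)]   # values travelling top to bottom
    cycles = n + m + k - 2
    for t in range(cycles):
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:
                    s = t - i  # row i is delayed by i cycles
                    new_a[i][j] = a[i][s] if 0 <= s < k else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]  # fixed wire from the left
                if i == 0:
                    s = t - j  # column j is delayed by j cycles
                    new_b[i][j] = b[s][j] if 0 <= s < k else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]  # fixed wire from above
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)
```

Notice there is no instruction stream and no address calculation inside the loop body of a cell: the data path is the program. That is where the efficiency comes from, and also why it cannot be repurposed for a different order of operations.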
--------
Systolic arrays were deployed for error correction back in the CD-ROM days, since the error correction always had the same order of math in a regular matrix multiplication pattern. Same with hardware RAID or other ASICs. They've been superior in performance for decades.
The question of ASIC AI accelerators is not about the performance benefits. The pure question is if it is a worthy $Billion-ish investment and business plan. It's only a good idea if it makes all the money back.
other than mining, you get to keep running that old workload for maybe a cheaper subscription fee as it’s still worth something
before the as a service model, buying something and having it do that specific thing until you bought something new was normal
For the most part, people don't want to port off CUDA for minor gains that AMD's hardware represents. The HPC / Supercomputer guys probably aren't even using ROCm for the most part, but are instead writing programs at higher-levels and relying upon a smaller team of specialists to port just elements of their kernel to ROCm one step at a time. (A structure only possible because National Labs have much more $$$ to afford specialist programmers like this).
I think AMD is making good progress. They've found that NVidia is lagging on traditional SIMD compute and have carved out a niche for themselves. But NVidia still "wins" because of the overall software package in practice.