Tuesday, June 25th 2024
AI Startup Etched Unveils Transformer ASIC Claiming 20x Speed-up Over NVIDIA H100
A new startup emerged from stealth mode today, aiming to power the next generation of generative AI. Etched makes an application-specific integrated circuit (ASIC) that processes only transformers. The transformer is a deep learning architecture developed at Google and is now the powerhouse behind models like OpenAI's GPT-4o in ChatGPT, Anthropic's Claude, Google's Gemini, and Meta's Llama family. Etched set out to build an ASIC dedicated to transformer models alone, a chip called Sohu, and claims it outperforms NVIDIA's latest and greatest by an entire order of magnitude. Where a server with eight NVIDIA H100 GPUs pushes Llama-3 70B at 25,000 tokens per second, and one with eight of the latest B200 "Blackwell" GPUs pushes 43,000 tokens/s, a server with eight Sohu chips manages to output 500,000 tokens per second.
Why is this important? Not only does the ASIC outperform Hopper by 20x and Blackwell by 10x, it also serves so many tokens per second that it enables an entirely new class of AI applications requiring real-time output. The Sohu architecture is so efficient that 90% of its FLOPS can be put to use, while traditional GPUs typically achieve only 30-40% FLOPS utilization. That gap translates into inefficiency and wasted power, which Etched hopes to solve by building an accelerator dedicated to powering transformers (the "T" in GPT) at massive scale. Given that frontier model development costs more than one billion US dollars, and hardware costs are measured in tens of billions of US dollars, an accelerator dedicated to a single workload could help advance AI faster. AI researchers often say that "scale is all you need" (echoing the legendary "Attention Is All You Need" paper), and Etched wants to build on exactly that. However, there are some doubts going forward. While transformers are widely believed to be the "future" of AI development, an ASIC only solves the problem until the underlying operations change. This is reminiscent of the crypto mining craze, which produced several generations of crypto ASIC miners that are now worthless pieces of sand: Ethereum ASICs were built to mine ETH under proof of work, and now that Ethereum has transitioned to proof of stake, those miners are worthless.
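As a quick sanity check, the claimed multiples follow directly from the cluster throughput figures quoted above (the figures themselves are Etched's own and not independently verified). A minimal sketch of the arithmetic:

    #include <cstdio>

    // Back-of-the-envelope check of the claimed speed-ups, using only the
    // 8-chip cluster throughput figures quoted above (all numbers are Etched's).
    int main() {
        const double h100_tps = 25000.0;   // 8x H100, Llama-3 70B
        const double b200_tps = 43000.0;   // 8x B200 "Blackwell"
        const double sohu_tps = 500000.0;  // 8x Sohu
        printf("Sohu vs H100: %.1fx\n", sohu_tps / h100_tps);  // ~20.0x
        printf("Sohu vs B200: %.1fx\n", sohu_tps / b200_tps);  // ~11.6x, rounded to "10x" in the claim
        return 0;
    }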
Nonetheless, Etched wants the success formula to be simple: run transformer-based models on the Sohu ASIC with an open-source software ecosystem, and scale it to massive sizes. While details are scarce, we know that the ASIC is paired with 144 GB of HBM3E memory and that the chip is manufactured on TSMC's 4 nm process. Etched says this will enable AI models with 100 trillion parameters, more than 55x bigger than GPT-4's reported 1.8 trillion-parameter design.
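For a sense of scale, here is a rough calculation under our own assumption of 16-bit weights (2 bytes per parameter; Etched has not disclosed a precision):

    #include <cstdio>

    // Rough scale arithmetic from the figures above. The 2 bytes/parameter
    // (FP16/BF16 weights) assumption is ours, not Etched's.
    int main() {
        const double gpt4_params   = 1.8e12;  // GPT-4's reported size
        const double target_params = 100e12;  // Etched's stated ceiling
        printf("scale factor: %.1fx\n", target_params / gpt4_params);  // ~55.6x
        const double weight_bytes = target_params * 2.0;  // FP16 weights only, no KV cache
        const double hbm_per_chip = 144e9;                // 144 GB HBM3E per Sohu
        printf("weights alone: ~%.0f TB, or ~%.0f chips just to hold them\n",
               weight_bytes / 1e12, weight_bytes / hbm_per_chip);  // ~200 TB, ~1,389 chips
        return 0;
    }

In other words, a model of that size would span well over a thousand Sohu chips for the weights alone, which is consistent with Etched pitching massive scale-out rather than single-chip inference.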
Source: Etched
Comments
AMD only officially supports HIP / ROCm on their most expensive MI GPUs, which sit in the $5,000+ tier. Cards like the RX 580 used to work for a while, but then the latest HIP / ROCm release drops support and the bugs start to creep in. So now what? You either throw away the RX 580 and upgrade to Vega (or whatever HIP supports at the time), only to find out that the Vega 64 loses support and it's time to upgrade to the RX 7800. Etc. etc.
NVidia's software support simply lasts long enough for your projects to actually work. Case in point: try to run Blender's HIP backend on an RX 580 or a Vega.
Then, try CUDA on an NVidia 1080 Ti, which still works.
----------
Even MI level chips, like the MI60, lose support faster than NVidia chips.
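If you want to check what your own box still supports, here's a minimal sketch against the standard CUDA runtime API (nothing Blender-specific, just the device query):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Minimal check of whether the installed CUDA runtime still sees a GPU.
    // On a still-supported card (e.g. a 1080 Ti) this prints its name and
    // compute capability; on a dropped card the runtime reports an error.
    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess || count == 0) {
            printf("no usable CUDA device: %s\n", cudaGetErrorString(err));
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("device %d: %s (compute capability %d.%d)\n",
                   i, prop.name, prop.major, prop.minor);
        }
        return 0;
    }

HIP mirrors this API almost one-to-one (hipGetDeviceCount / hipGetDeviceProperties), which is exactly the point: porting isn't the pain, the support window is.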
Over the last couple of years I've heard news like this from time to time. It is very impressive; however, experts always look at benchmarks, and in most cases companies do not release benchmarks, since those would show the real performance of the hardware, which can be very different from internal, in-house evaluations!
Next, if you're interested in seeing more hardware news like this, take a look at The Linley Group YouTube channel:
www.youtube.com/@LinleygroupVideos/videos
The channel has 109 videos (I watched all of them!), it is in a frozen state (the last video was uploaded three years ago), and almost every second company featured was making statements like "...We made it better than NVIDIA!..."
I didn't follow up on these companies, since it would have been a waste of time for me, but I think most of them are not doing well. Meanwhile, NVIDIA made $22.6 billion in its last quarter (ended April 28th, 2024), and keeps making more and more money.
As anyone in any computing project knows: it's the software that's expensive. The hardware is almost an afterthought these days. Even at NVidia prices, the software is where the bulk of the costs go.
AMD has always made fine hardware. It's just the software support that's lacking. And yes, making sure that MI60 works for more than 5 years is important.
In this thread, people are talking about how T100 or other older NVidia cards can be used instead of H100 or other more recent chips.
Do you see anyone, anywhere, ever saying the same about MI25? MI60? MI100?
I get that AMD doesn't have the resources to keep software support on all of their GPUs. But... This crap is important to the people spending $100,000,000+ on software development on GPU platforms. You can't just cut support like AMD does every few years and expect a community to grow.
Eventually, AMD will make enough money to build a stable software platform for its GPUs. ROCm is better these days, but people are still nervous about getting burned again after previous losses of software support.
It says a lot about AMD's software support that AMD cannot support a professional-level $5,000+ card like the MI60 for as long as NVidia can support a consumer card like the GTX 1080.
blender/intern/cycles/kernel/device/gpu/kernel.h at main · blender/blender · GitHub
This is the CUDA / HIP / OneAPI source code for the Blender Cycles renderer kernel. You can see that it's largely the same code between AMD, NVidia, and Intel.
Note: AMD's original contribution to Blender was the OpenCL kernels, which, if you haven't noticed, were completely thrown away by the Blender team as of Blender 4.0. The NVidia CUDA code, in contrast, has remained in place, and it serves as the basis for both HIP and OneAPI today.
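To illustrate the single-source pattern (a generic sketch, not Blender's actual macros or code): one kernel body, thin per-backend glue, and the same source compiles under both nvcc and hipcc.

    // Generic sketch of the single-source GPU kernel pattern. The structure
    // is illustrative; Cycles' real kernels use their own compat headers.
    #if defined(__HIPCC__)
      #include <hip/hip_runtime.h>  // hipcc: HIP supplies CUDA-style builtins
    #endif
    // nvcc defines __CUDACC__ and provides __global__ / threadIdx natively.

    __global__ void scale(float *data, float k, int n) {
        // Identical index math on both backends: blockIdx, blockDim, and
        // threadIdx exist under both nvcc and hipcc, which is what lets one
        // kernel body serve two vendors.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= k;
    }

OneAPI/SYCL needs a thin wrapper on top, which is roughly what the kernel.h linked above does, just at much larger scale.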
I won't claim to be an expert on Blender's kernels or GPU code. I'm not a professional or anything, but I did spend some time studying this code to build up my hobbyist GPU skills. I've been reading it and following its development for years at this point (at a hobby level, admittedly, but it's seriously one of the best demonstrations of how GPU code evolves over time in a real project). The good, the bad, the lessons learned... the Blender team has experienced it all, and they've experienced it in public.
AMD's code and optimizations were thrown out along with OpenCL. That's the problem: the code reached a dead end and couldn't be built upon anymore. Blender's CUDA code, in contrast, has over a decade of growth and stability behind it.