
AI Startup Etched Unveils Transformer ASIC Claiming 20x Speed-up Over NVIDIA H100

AleksandarK

News Editor
A new startup emerged out of stealth mode today to power the next generation of generative AI. Etched is a company that makes an application-specific integrated circuit (ASIC) to process transformers. The transformer is an architecture for deep learning models developed at Google, and it is now the powerhouse behind models like OpenAI's GPT-4o in ChatGPT, Anthropic's Claude, Google's Gemini, and Meta's Llama family. Etched set out to build an ASIC that processes only transformer models, a chip called Sohu, and claims Sohu outperforms NVIDIA's latest and greatest by an entire order of magnitude. Where a server with eight NVIDIA H100 GPUs pushes Llama-3 70B at 25,000 tokens per second, and the latest eight-GPU B200 "Blackwell" cluster pushes 43,000 tokens/s, an eight-chip Sohu cluster manages 500,000 tokens per second.

Why is this important? Not only does the ASIC outperform Hopper by 20x and Blackwell by roughly 10x, but it also serves so many tokens per second that it enables an entirely new fleet of AI applications requiring real-time output. The Sohu architecture is so efficient that 90% of its FLOPS can be used, while traditional GPUs manage a 30-40% FLOPS utilization rate. The difference is inefficiency and wasted power, which Etched hopes to solve by building an accelerator dedicated to powering transformers (the "T" in GPT) at massive scale. Given that frontier model development costs more than one billion US dollars, and hardware costs are measured in tens of billions of US dollars, an accelerator dedicated to a single workload could help advance AI faster. AI researchers often say that "scale is all you need" (echoing the legendary "Attention Is All You Need" paper), and Etched wants to build on that.
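Taking the quoted throughput figures at face value, the claimed speed-up ratios are easy to sanity-check; a quick Python calculation with the numbers from the paragraphs above:

```python
# Claimed Llama-3 70B serving throughput for eight-chip servers,
# taken from the figures quoted above (tokens per second).
h100_8x = 25_000   # 8x NVIDIA H100 "Hopper"
b200_8x = 43_000   # 8x NVIDIA B200 "Blackwell"
sohu_8x = 500_000  # 8x Etched Sohu (claimed)

print(f"Sohu vs. H100: {sohu_8x / h100_8x:.1f}x")  # 20.0x
print(f"Sohu vs. B200: {sohu_8x / b200_8x:.1f}x")  # 11.6x, i.e. roughly 10x
```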




However, there are some doubts going forward. While transformers are widely believed to be the "future" of AI development, an ASIC only solves the problem until the operations change. This is reminiscent of the crypto-mining craze, which produced a few cycles of crypto ASIC miners that are now worthless pieces of sand: Ethereum miners were built to dig ETH under proof of work, and now that ETH has transitioned to proof of stake, those mining ASICs are worthless.

Nonetheless, Etched wants the success formula to be simple: run transformer-based models on the Sohu ASIC with an open-source software ecosystem and scale it to massive sizes. While details are scarce, we know that the ASIC is paired with 144 GB of HBM3E memory and is manufactured on TSMC's 4 nm process. Etched says this enables AI models with 100 trillion parameters, more than 55x bigger than GPT-4's reported 1.8-trillion-parameter design.

View at TechPowerUp Main Site | Source
 
RIP Nvidia?






That would be something...
 
"The Sohu architecture is so efficient that 90% of its FLOPS can be used, while traditional GPUs manage a 30-40% FLOPS utilization rate."

In some cases even less than that. Eventually, the limitations of silicon might compel companies to rethink their GPUs completely, but it's hard to say for sure.
 
GPUs can be used for many things, no matter how inefficient they might look. I think this is the reason why Intel's Gaudi isn't having as much success. While in the short term buying specialized hardware looks like the smart move from any perspective, that hardware could end up as a huge and highly expensive pile of garbage if things change somewhat, as mentioned in the article. With GPUs, you can adapt them or just redirect them to other computational tasks.
 
It comes down to $$$ at the end of the day. Can this do the same thing as NVIDIA GPUs for the same money, or preferably less? Since Etched is the relative unknown, and businesses are more comfortable sticking with the known, NVIDIA has the edge.
 
I wonder how long before somebody creates an "AI" company called "Grift".
 
"The Sohu architecture is so efficient that 90% of its FLOPS can be used, while traditional GPUs manage a 30-40% FLOPS utilization rate."

In some cases even less than that. Eventually, the limitations of silicon might compel companies to rethink their GPUs completely, but it's hard to say for sure.


Just look at how AMD screwed the 7900 XTX in benchmarking: dual-issue doesn't work unless it's explicitly in the code, meaning that while it performs great with game-ready drivers, generic benchmarks or unaware software can suffer half the performance in many situations. GPU hardware is slowly converging on a common standard, like x86-64 or ARM did; pretty soon it's going to be like ARM hardware, where you check the boxes for your application and the silicon (or whatever substrate) is shipped to you.


I wonder how long before somebody creates an "AI" company called "Grift".

The milk maids are a milking.......
 
Just look at how AMD screwed the 7900 XTX in benchmarking: dual-issue doesn't work unless it's explicitly in the code
NVIDIA architectures struggle with utilization just as much: AD102 has 30% more FP32 units than Navi 31 but is only about 20% faster in raster. In fact, every architecture does, CPUs included.
 
I wonder if ASICs that are out, or soon to come out, are the reason for NVIDIA's stock drop. Say what you want about tech investors, but they are a very tech-savvy, clued-in bunch, and maybe more ASIC products are on the way. Or maybe my tinfoil hat needs pressed, folded, and recycled.
 
I wonder if ASICs that are out, or soon to come out, are the reason for NVIDIA's stock drop. Say what you want about tech investors, but they are a very tech-savvy, clued-in bunch, and maybe more ASIC products are on the way. Or maybe my tinfoil hat needs pressed, folded, and recycled.
I sold all my NVDA stock once it hit 1100, or 110. It was a nice run. I still own AMD stock, which I bought right after Zen 1 was released, but I didn't buy very much of it.
 
There are a ton of architectures that are better than NVIDIA GPUs, or GPUs in general, for AI.

The fundamental fact is that NVIDIA GPUs do FP16 4x4 matrix multiplications as their basis. You can gain significantly more efficiency by going to an 8x8 or 16x16 matrix (or go the TPU route and use a full 256x256 matrix). The matrices in these deep learning AI models are all huge, so processing bigger and bigger matrices at a time leads to more efficiency in power, area, etc.

The issue is that the 4x4 matrix multiplication was chosen because it fits in a GPU register file. It's the best a general-purpose GPU can do on NVIDIA's architecture, for various reasons. I'd expect that if a few more registers (or AMD's 64-wide CDNA cores) were used, then maybe 8x4 or 8x8 sizes would be possible, but even AMD is doing 4x4 matrix sizes on its GPUs. So 4x4 it is.

Anyone can take a bigger fundamental matrix, write software that efficiently splits up the work as 8x8 or 16x16 (or, Google-TPU style, 256x256 splits), and get far better efficiency. It's not a secret, and such "systolic arrays" are cake to do from an FPGA perspective. The issue is that these bigger architectures are "not GPUs" anymore and will be useless outside of AI. And furthermore, you then have even more competitors (Google's TPU in particular) who you actually should be gunning for.

No one is buying NVIDIA GPUs to lead in AI. They're buying GPUs so that they have something else to do if the AI bubble pops. It's a hedged bet. If you go 100% AI with your ASIC chip (like Google or this "Etched" company), you're absolutely going to get eff'd when the AI bubble pops, as all those chips suddenly become worthless. The NVIDIA GPUs will lose value, but there are still other compute projects you can do with them afterwards.
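The tiling argument above can be illustrated in plain NumPy: the result is identical whatever the tile size, but a larger hardware tile means far fewer tile-level operations (and correspondingly less scheduling and data-movement overhead). This is only a sketch of the math, not how any GPU or TPU actually schedules work:

```python
import numpy as np

def tiled_matmul(A, B, T):
    """Multiply A @ B by accumulating T x T tiles, the scheme described
    above: the hardware fixes T (4 on current GPUs, up to 256 on a TPU)
    and software sweeps that tile over the full matrices."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    tiles = 0
    for i in range(0, n, T):
        for j in range(0, n, T):
            for k in range(0, n, T):
                # one "hardware" tile op: a T x T matrix multiply-accumulate
                C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
                tiles += 1
    return C, tiles

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))

C4, n4 = tiled_matmul(A, B, 4)     # 4x4 tiles:  (64/4)^3  = 4096 tile ops
C16, n16 = tiled_matmul(A, B, 16)  # 16x16 tiles: (64/16)^3 = 64 tile ops
assert np.allclose(C4, A @ B) and np.allclose(C16, A @ B)
print(n4, n16)  # 4096 64
```

Going from 4x4 to 16x16 tiles cuts the tile-op count by 64x here, which is the efficiency lever the post is pointing at.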
 
GPUs can be used for many things, no matter how inefficient they might look. I think this is the reason why Intel's Gaudi isn't having as much success. While in the short term buying specialized hardware looks like the smart move from any perspective, that hardware could end up as a huge and highly expensive pile of garbage if things change somewhat, as mentioned in the article. With GPUs, you can adapt them or just redirect them to other computational tasks.
When it gets to that point with an ASIC, you are probably many GPU generations ahead anyway, so keep buying GPU's for extortionate pricing or buy an ASIC and replace it when it becomes obsolete? people talking like GPU's don't become obsolete and become e-waste.... they indeed do when performance/efficiency/instruction sets/API's etc are behind the latest generation
 
There are a ton of architectures that are better than NVIDIA GPUs, or GPUs in general, for AI.

The fundamental fact is that NVIDIA GPUs do FP16 4x4 matrix multiplications as their basis. You can gain significantly more efficiency by going to an 8x8 or 16x16 matrix (or go the TPU route and use a full 256x256 matrix). The matrices in these deep learning AI models are all huge, so processing bigger and bigger matrices at a time leads to more efficiency in power, area, etc.

The issue is that the 4x4 matrix multiplication was chosen because it fits in a GPU register file. It's the best a general-purpose GPU can do on NVIDIA's architecture, for various reasons. I'd expect that if a few more registers (or AMD's 64-wide CDNA cores) were used, then maybe 8x4 or 8x8 sizes would be possible, but even AMD is doing 4x4 matrix sizes on its GPUs. So 4x4 it is.

Anyone can take a bigger fundamental matrix, write software that efficiently splits up the work as 8x8 or 16x16 (or, Google-TPU style, 256x256 splits), and get far better efficiency. It's not a secret, and such "systolic arrays" are cake to do from an FPGA perspective. The issue is that these bigger architectures are "not GPUs" anymore and will be useless outside of AI. And furthermore, you then have even more competitors (Google's TPU in particular) who you actually should be gunning for.

No one is buying NVIDIA GPUs to lead in AI. They're buying GPUs so that they have something else to do if the AI bubble pops. It's a hedged bet. If you go 100% AI with your ASIC chip (like Google or this "Etched" company), you're absolutely going to get eff'd when the AI bubble pops, as all those chips suddenly become worthless. The NVIDIA GPUs will lose value, but there are still other compute projects you can do with them afterwards.

Nobody is hedging with ASICs that cost more than $30,000 and are only worth that price for AI. That's why AMD is #1 on the TOP500 and Intel is #2.

You're not really doing much else with A100s and later, because they are not V100s. They are very, very much specialized for AI calculations and economically worthless for anything else.
 
Nobody is hedging with ASICs that cost more than $30,000 and are only worth that price for AI. That's why AMD is #1 on the TOP500 and Intel is #2.

You're not really doing much else with A100s and later, because they are not V100s. They are very, very much specialized for AI calculations and economically worthless for anything else.

The A100 and H100 are still better at FP64 and FP32 than their predecessors. They're outrageously expensive because of the AI boom, but the overall GPU performance (i.e., traditional FP64 physics-modeling performance) is still outstanding.

As such, the H100 is still a hedge. If AI collapses tomorrow, I'd rather have an H100 than an "Etched" AI ASIC. Despite being a hedge, the H100 is still the market leader in practical performance, thanks to all the software optimizations in CUDA (even if its fundamental low-level 4x4 matrix-multiplication routines are much smaller and less efficient than the larger 8x8 or 16x16 designs of competitors).
 
As such, the H100 is still a hedge.
I really doubt it; things like the MI300 look to be much faster in general-purpose compute and likely a lot cheaper. If demand for ML drops off a cliff, you don't want these on your hands; it will take ages to reach ROI.

NVIDIA really doesn't treat these as anything more than ML accelerators, despite them technically still being "GPUs"; they have far inferior FP64/FP16 performance compared to the MI300, for example.
 
GPUs can be used for many things, no matter how inefficient they might look. I think this is the reason why Intel's Gaudi isn't having as much success. While in the short time buying specialized hardware looks as the smart move from any perspective, that hardware could end up as a huge and highly expensive pile of garbage, if things somewhat change, as mentioned in the article. With GPUs you adapt them or just throw them to do other computational tasks.

Let's put things a different way: everything is an ASIC; the A (application) can just be more or less generic. A GPU is an ASIC designed for a wide range of applications; this startup's chip is designed for a very specific set of operations; a TPU or NPU is not as generic as a GPU, but also not as constrained as what's typically referred to as an ASIC, like this thing. Right now everyone is using NVIDIA GPUs because the software stack is very robust and things are still developing too quickly to become tied to a specific instruction set.

That will eventually change.

I wonder if ASICs that are out, or soon to come out, are the reason for NVIDIA's stock drop. Say what you want about tech investors, but they are a very tech-savvy, clued-in bunch, and maybe more ASIC products are on the way. Or maybe my tinfoil hat needs pressed, folded, and recycled.

Nah, they're just as dumb as anyone else; otherwise NVIDIA wouldn't be the most valuable company in the world right now.

Anyone can take a bigger fundamental matrix, write software that efficiently splits up the work as 8x8 or 16x16 (or, Google-TPU style, 256x256 splits), and get far better efficiency. It's not a secret, and such "systolic arrays" are cake to do from an FPGA perspective. The issue is that these bigger architectures are "not GPUs" anymore and will be useless outside of AI. And furthermore, you then have even more competitors (Google's TPU in particular) who you actually should be gunning for.

How does Intel's XMX architecture fare in that regard?

No one is buying NVIDIA GPUs to lead in AI. They're buying GPUs so that they have something else to do if the AI bubble pops. It's a hedged bet. If you go 100% AI with your ASIC chip (like Google or this "Etched" company), you're absolutely going to get eff'd when the AI bubble pops, as all those chips suddenly become worthless. The NVIDIA GPUs will lose value, but there are still other compute projects you can do with them afterwards.

I don't think it's about hedging their bets; I think it's just a case of what's available and easy to start with, because of all the work NVIDIA has already put into a robust software stack.
 
Are any of those claimed results verified by anybody?
I can claim the sea and the sun, but without actual proof and third-party confirmation, I'm just dust in the wind...
 
Are any of those claimed results verified by anybody?
I can claim the sea and the sun, but without actual proof and third-party confirmation, I'm just dust in the wind...

The benefits, and downsides, of a textbook systolic array architecture are well known and well studied.

If you know how memory is going to move, then you can hardwire the data movements. A hardwired data movement is just that: a wire. It's not even a transistor; a dumb wire is the cheapest thing in cost and power, and has instantaneous performance.

The problem with hardwired data movements is that they're hardwired. They literally cannot do anything else. If it's add-then-multiply, the hardware will only do add-then-multiply (unlike a CPU or GPU, which can change the order, change the data, and do other things).

I can certainly believe that a large systolic array is dramatically faster at this job. But its downside is that it's... hardwired. Unchanging. Inflexible.

--------

Systolic arrays were deployed for error correction back in the CD-ROM days, since the error correction always had the same order of math in a regular matrix-multiplication pattern. Same with hardware RAID and other ASICs. They have been, for decades, superior in performance.

The question for ASIC AI accelerators is not about the performance benefits. The real question is whether it is a worthy billion-dollar-ish investment and business plan. It's only a good idea if it makes all the money back.
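The "hardwired data movement" idea above can be sketched in a few lines of Python: each processing element holds one fixed weight and can only multiply-accumulate and pass data along, which is precisely what makes a systolic array both cheap and inflexible. This is purely illustrative; real arrays are 2D grids clocked in hardware:

```python
class PE:
    """One processing element: a hardwired weight plus a multiply-accumulate.
    It cannot reorder, branch, or do anything else; that inflexibility is
    the price of the speed, as described above."""
    def __init__(self, w):
        self.w = w

    def step(self, x_in, acc_in):
        # acc_out = acc_in + w * x_in; in hardware, x is also forwarded
        # to the next cell over a plain wire
        return acc_in + self.w * x_in

def systolic_row(weights, xs):
    """Stream the input vector through a fixed chain of PEs; the partial
    sum ripples down the chain one cell per step."""
    acc = 0
    for pe, x in zip((PE(w) for w in weights), xs):
        acc = pe.step(x, acc)
    return acc

# y = W @ x, one chain pass per output row
W = [[1, 2], [3, 4]]
x = [10, 20]
y = [systolic_row(row, x) for row in W]
print(y)  # [50, 110]
```

Because every cell only ever does the one operation it was built for, changing the workload (say, a successor to the transformer) means changing the silicon, which is the thread's whole worry about AI ASICs.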
 
Let's put things a different way, everything is an ASIC, the A (application) can just be more or less generic.
The A (application) is nothing more than a generic word that has no meaning in this context until it is placed next to the S (specific). "Application-specific" is very, VERY different from "application" here, and I would personally take ASIC and all four of its letters/words together as one, because that is how it is meant to be understood and used. I have not read anything you wrote beyond this point, because all of your arguments hinge on this "application" vs. "application-specific" distinction in discussing this proposed new ASIC product.
 
And what if someone comes up with a better model? The transformer is probably just the beginning.
 
And what if someone comes up with a better model? The transformer is probably just the beginning.
Like mining, you'd need new machines.

Unlike mining, though, you get to keep running that old workload, maybe for a cheaper subscription fee, as it's still worth something.

Before the as-a-service model, buying something and having it do one specific thing until you bought something new was the norm.
 