
Reports Suggest DeepSeek Running Inference on Huawei Ascend 910C AI GPUs

T0@st

News Editor
Joined
Mar 7, 2023
Messages
2,289 (3.29/day)
Location
South East, UK
Huawei's Ascend 910C AI chip was positioned as one of the better Chinese-developed alternatives to NVIDIA's H100 accelerator—reports from last autumn suggested that samples were being sent to highly important customers. The likes of Alibaba, Baidu, and Tencent have long relied on Team Green enterprise hardware for all manner of AI crunching, but trade sanctions have severely limited the supply and potency of Western-developed AI chips. NVIDIA's region-specific B20 "Blackwell" accelerator is due for release this year, but industry watchdogs reckon that the Ascend 910C AI GPU is a strong rival. The latest online rumblings have pointed to another major Huawei customer—DeepSeek—having Ascend silicon in their back pockets.

DeepSeek's recent unveiling of its R1 open-source large language model has disrupted international AI markets. A lot of press attention has focused on DeepSeek's CEO stating that his team can access up to 50,000 NVIDIA H100 GPUs, but many have not looked into the company's (alleged) pool of domestically made chips. Yesterday, Alexander Doria—an LLM enthusiast—shared an interesting insight: "I feel this should be a much bigger story—DeepSeek has trained on NVIDIA H800, but is running inference on the new home Chinese chips made by Huawei, the 910C." Experts believe that there will be a plentiful supply of Ascend 910C GPUs—estimates from last September posit that 70,000 chips (worth around $2 billion) were in the mass production pipeline. Additionally, industry whispers suggest that Huawei is already working on a—presumably, even more powerful—successor.



View at TechPowerUp Main Site | Source
 
Joined
Feb 11, 2009
Messages
5,656 (0.97/day)
System Name Cyberline
Processor Intel Core i7 2600k -> 12600k
Motherboard Asus P8P67 LE Rev 3.0 -> Gigabyte Z690 Aorus Elite DDR4
Cooling Tuniq Tower 120 -> Custom Watercooling Loop
Memory Corsair (4x2) 8 GB 1600 MHz -> Crucial (8x2) 16 GB 3600 MHz
Video Card(s) AMD RX480 -> RX7800XT
Storage Samsung 750 Evo 250GB SSD + WD 1TB x 2 + WD 2TB -> 2TB NVMe SSD
Display(s) Philips 32-inch LPF5605H (television) -> Dell S3220DGF
Case Antec 600 -> Thermaltake Tenor HTPC case
Audio Device(s) Focusrite 2i4 (USB)
Power Supply Seasonic 620watt 80+ Platinum
Mouse Elecom EX-G
Keyboard Rapoo V700
Software Windows 10 Pro 64bit
"Alexander Doria—an LLM enthusiast" ah so literally the most boring person in the world
 
Joined
Dec 24, 2022
Messages
105 (0.14/day)
Processor Ryzen 5 5600
Motherboard ASRock B450M Steel Legend
Cooling bequiet! Pure Rock Slim (BK008)
Memory 16GB DDR4 GoodRAM
Video Card(s) ASUS Expedition RX570 4GB
Storage WD Blue 500GB SSD
Display(s) iiyama ProLite T2252MTS
Case CoolerMaster Silencio 352
Power Supply bequiet! Pure Power 12M 650W
Mouse Logitech M590
Keyboard Logitech K270
Software Linux Mint
Would their choice of the Nvidia assembly language for R1 be influenced by the hardware? I don't know much about AI training in general. It's interesting to read that the training was done on Huawei's hardware, though.
 
Joined
Mar 16, 2017
Messages
252 (0.09/day)
Location
behind you
Processor Threadripper 1950X
Motherboard ASRock X399 Professional Gaming
Cooling IceGiant ProSiphon Elite
Memory 48GB DDR4 2934MHz
Video Card(s) MSI GTX 1080
Storage 4TB Crucial P3 Plus NVMe, 1TB Samsung 980 NVMe, 1TB Inland NVMe, 2TB Western Digital HDD
Display(s) 2x 4K60
Power Supply Cooler Master Silent Pro M (1000W)
Mouse Corsair Ironclaw Wireless
Keyboard Corsair K70 MK.2
VR HMD HTC Vive Pro
Software Windows 10, QubesOS
Would their choice of the Nvidia assembly language for R1 be influenced by the hardware? I don't know much about AI training in general. It's interesting to read that the training was done on Huawei's hardware, though.
Training done on Nvidia, inference done on Huawei.
 
Joined
May 10, 2023
Messages
553 (0.88/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
That claim is misleading. The picture that served as the source only applies to the distilled, smaller models, some of which you can even run on your smartphone.

Shouldn't it be NPU instead of GPU, since there are no graphics being processed with LLMs?
NPUs are solely meant for inference on quantized types (like INT4 and INT8); they're not that useful for training, or for inference with higher-precision weights.
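To make the INT8 point concrete, here is a minimal, hypothetical sketch of symmetric per-tensor quantization, the kind of low-precision arithmetic NPUs are built around. The weight values are made up for illustration; this is not any vendor's actual API:

```
// Minimal sketch of symmetric INT8 weight quantization (toy values,
// not any vendor's actual API). Compiles as plain C++ or as a .cu file.
#include <cstdio>
#include <cmath>

int main() {
    const int n = 8;
    float w[n] = {0.12f, -0.50f, 0.33f, -0.07f, 0.91f, -0.88f, 0.25f, -0.41f};

    // Per-tensor scale: map the largest |weight| onto the INT8 range [-127, 127].
    float maxabs = 0.0f;
    for (int i = 0; i < n; ++i) maxabs = fmaxf(maxabs, fabsf(w[i]));
    float scale = maxabs / 127.0f;

    // Quantize: real weight -> 8-bit integer. This is what gets stored and
    // what the NPU's integer units actually multiply.
    signed char q[n];
    for (int i = 0; i < n; ++i) q[i] = (signed char)lrintf(w[i] / scale);

    // Dequantize and show the rounding error inference has to live with;
    // training needs more precision, which is why NPUs are a poor fit for it.
    for (int i = 0; i < n; ++i)
        printf("w = % .4f  q = %4d  back = % .4f\n", w[i], (int)q[i], q[i] * scale);
    return 0;
}
```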
Would their choice of the Nvidia assembly language for R1 be influenced by the hardware? I don't know much about AI training in general. It's interesting to read that the training was done on Huawei's hardware, though.
They used PTX only for a small portion of the pipeline. From their own paper:
Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.
Most of the code was likely in CUDA.
Also keep in mind that OP is talking about inference, not training.
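For anyone curious what "customized PTX instructions" look like in practice, here is an illustrative sketch of inline PTX embedded in an ordinary CUDA kernel. This is not DeepSeek's code (their paper only describes the technique); the cache-streaming load hint below is just one stand-in example of the kind of cache-control tweak hand-written PTX gives you:

```
// Illustrative only: how hand-written PTX gets embedded in an otherwise
// ordinary CUDA kernel. Not DeepSeek's code; their paper just says they
// used customized PTX instructions to tune communication and cache use.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void copy_streaming(float* dst, const float* src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v;
        // "ld.global.cs" = load with the cache-streaming hint: the data is
        // expected to be touched once, so the GPU is told not to keep it
        // resident in cache — similar in spirit to the L2-pressure trick
        // described in the paper, though not the same instruction sequence.
        asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(v) : "l"(src + i));
        dst[i] = v;
    }
}

int main() {
    const int n = 1 << 20;
    float *src, *dst;
    cudaMalloc(&src, n * sizeof(float));
    cudaMalloc(&dst, n * sizeof(float));
    copy_streaming<<<(n + 255) / 256, 256>>>(dst, src, n);
    cudaDeviceSynchronize();
    printf("kernel done: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```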
 
Joined
Dec 24, 2022
Messages
105 (0.14/day)
Processor Ryzen 5 5600
Motherboard ASRock B450M Steel Legend
Cooling bequiet! Pure Rock Slim (BK008)
Memory 16GB DDR4 GoodRAM
Video Card(s) ASUS Expedition RX570 4GB
Storage WD Blue 500GB SSD
Display(s) iiyama ProLite T2252MTS
Case CoolerMaster Silencio 352
Power Supply bequiet! Pure Power 12M 650W
Mouse Logitech M590
Keyboard Logitech K270
Software Linux Mint
Training done on Nvidia, inference done on Huawei.

That claim is misleading. The picture that served as the source only applies to the distilled, smaller models, some of which you can even run on your smartphone.


NPUs are solely meant for inference on quantized types (like INT4 and INT8); they're not that useful for training, or for inference with higher-precision weights.

They used PTX only for a small portion of the pipeline. From their own paper:

Most of the code was likely in CUDA.
Also keep in mind that OP is talking about inference, not training.
I see. So they're limited to US hardware in some way.
I had to check what inference and training mean in regard to AI. I thought they were the same, or maybe English as a second language plays a role :)
 
Joined
Mar 16, 2017
Messages
252 (0.09/day)
Location
behind you
Processor Threadripper 1950X
Motherboard ASRock X399 Professional Gaming
Cooling IceGiant ProSiphon Elite
Memory 48GB DDR4 2934MHz
Video Card(s) MSI GTX 1080
Storage 4TB Crucial P3 Plus NVMe, 1TB Samsung 980 NVMe, 1TB Inland NVMe, 2TB Western Digital HDD
Display(s) 2x 4K60
Power Supply Cooler Master Silent Pro M (1000W)
Mouse Corsair Ironclaw Wireless
Keyboard Corsair K70 MK.2
VR HMD HTC Vive Pro
Software Windows 10, QubesOS
I see. So they're limited to US hardware in some way.
I had to check what inference and training mean in regard to AI. I thought they were the same, or maybe English as a second language plays a role :)
Basically, training is creating the AI model (based on an architectural blueprint), and inference is running a model that already exists.
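A toy example of that difference, with a one-parameter "model" and made-up data (hypothetical values, just to make the two phases visible):

```
// Toy illustration of training vs. inference: "training" fits a parameter
// by gradient descent; "inference" just applies the frozen parameter.
#include <cstdio>

int main() {
    // Data generated by the unknown function y = 3x.
    float xs[4] = {1, 2, 3, 4}, ys[4] = {3, 6, 9, 12};
    float w = 0.0f;                      // the model: y = w * x

    // Training: iterate over the data many times, nudging w downhill on the
    // squared error. The expensive part, done once on big hardware.
    for (int epoch = 0; epoch < 200; ++epoch)
        for (int i = 0; i < 4; ++i) {
            float grad = 2.0f * (w * xs[i] - ys[i]) * xs[i];
            w -= 0.01f * grad;
        }

    // Inference: a single forward pass with the frozen w. Cheap per query,
    // but performed billions of times once the model is deployed.
    printf("learned w = %.3f, model(5) = %.3f\n", w, w * 5.0f);
    return 0;
}
```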
 
Joined
Jun 29, 2023
Messages
118 (0.20/day)
One reassures oneself however one can.

Training is more intensive, but you perform inference far more often.

The hardware functions required for neural networks are basic mathematical operations on huge matrices and vectors. This is neither rocket science nor secret sauce.
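As a minimal sketch of that claim, here is the naive version of the operation that dominates both training and inference, a matrix multiply, written as a CUDA kernel (illustrative only; real stacks use heavily tuned libraries like cuBLAS rather than this loop):

```
// The core workload of a neural network boils down to multiply-accumulate
// loops over large matrices. Naive CUDA kernel, for illustration only.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];  // the whole "secret sauce"
        C[row * N + col] = acc;
    }
}

int main() {
    const int N = 512;
    float *A, *B, *C;
    cudaMallocManaged(&A, N * N * sizeof(float));
    cudaMallocManaged(&B, N * N * sizeof(float));
    cudaMallocManaged(&C, N * N * sizeof(float));
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
    matmul<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();
    printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * N);  // 1*2 summed N times

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```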

The performance difference between a 3/4 nm chip and a 7 nm one (which, as I understand it, the Chinese are able to produce) does not seem that important.

And really, chasing smaller nodes and larger numbers of GPUs is just trying to brute-force the problem.

DeepSeek just demonstrated that cleverness and algorithmic optimizations provide much more benefit than brute force.

That's why I think that the use of NVIDIA chips is related more to the software stack that comes with them (CUDA) than to the hardware itself, as it gives you better capacity to easily add, remove, and change things while developing the model. Inference, once you have the model with its layers and weights, is straightforward and thus does not need such flexible development.
 
Joined
Jan 11, 2022
Messages
1,056 (0.95/day)
DeepSeek just demonstrated that cleverness and algorithmic optimizations provide much more benefit than brute force.
that's the joke: now that the cat is out of the bag, everyone can use it, and brute force matters again.
the market reaction to this is dumb and extremely short-sighted.

the problem with the bubble still remains: how are they going to make money with it? it's not 60 queries a month for $20
 
Joined
Jun 29, 2023
Messages
118 (0.20/day)
that's the joke: now that the cat is out of the bag, everyone can use it, and brute force matters again.
the market reaction to this is dumb and extremely short-sighted.

the problem with the bubble still remains: how are they going to make money with it? it's not 60 queries a month for $20

You have to pay for brute force, and that may prove difficult if you are competing with a free model you can run at home.

Hence all the media chatter trying to smear it or fearmonger over it.

Also, you can always find better optimizations, and that is not something that the US can somehow restrict China from developing.

And the fact that it is China that did it, while the US was stuck on brute force, is quite a problem, I think.
 