
Intel Gaudi2 Accelerator Beats NVIDIA H100 at Stable Diffusion 3 by 55%

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
47,297 (7.53/day)
Location
Hyderabad, India
System Name RBMK-1000
Processor AMD Ryzen 7 5700G
Motherboard ASUS ROG Strix B450-E Gaming
Cooling DeepCool Gammax L240 V2
Memory 2x 8GB G.Skill Sniper X
Video Card(s) Palit GeForce RTX 2080 SUPER GameRock
Storage Western Digital Black NVMe 512GB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) ASUS SupremeFX S1220A
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Gamdias Hermes E2
Software Windows 11 Pro
Stability AI, the developers behind the popular Stable Diffusion generative AI model, have run some first-party performance benchmarks for Stable Diffusion 3 using popular data-center AI GPUs, including the NVIDIA H100 "Hopper" 80 GB, A100 "Ampere" 80 GB, and Intel's Gaudi2 96 GB accelerator. Unlike the H100, which is a super-scalar CUDA+Tensor core GPU, the Gaudi2 is purpose-built to accelerate generative AI and LLMs. Stability AI published its performance findings in a blog post, which reveals that the Intel Gaudi2 96 GB posts roughly 56% higher performance than the H100 80 GB.

With 2 nodes, 16 accelerators, and a constant batch size of 16 per accelerator (256 in all), the Intel Gaudi2 array is able to generate 927 images per second, compared to 595 images per second for the H100 array and 381 images per second for the A100 array, keeping accelerator and node counts constant. Scaling things up to 32 nodes and 256 accelerators, at a batch size of 16 per accelerator (a total batch size of 4,096), the Gaudi2 array posts 12,654 images per second, or 49.4 images per second per device, compared to 3,992 images per second, or 15.6 images per second per device, for the older-gen A100 "Ampere" array.
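
For anyone who wants to sanity-check the per-device figures, here is a quick back-of-the-envelope normalization of the throughput numbers quoted above (nothing here is measured; it simply re-divides Stability AI's published totals):

Code:
# Back-of-the-envelope normalization of Stability AI's reported throughput.
# All numbers are taken from the blog post quoted above; nothing is measured here.
configs = {
    # name: (total images/s, accelerator count)
    "Gaudi2 x16":  (927,    16),
    "H100 x16":    (595,    16),
    "A100 x16":    (381,    16),
    "Gaudi2 x256": (12654, 256),
    "A100 x256":   (3992,  256),
}

for name, (imgs_per_s, devices) in configs.items():
    print(f"{name:>12}: {imgs_per_s / devices:5.1f} images/s per accelerator")

# Relative advantage at 16 accelerators, roughly 56% in Gaudi2's favour:
print(f"Gaudi2 vs H100 (16 devices): {927 / 595 - 1:.1%}")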



There is a big caveat to this: the results were obtained using base PyTorch, and Stability AI admits that with TensorRT optimization, A100 chips produce images up to 40% faster than Gaudi2. "On inference tests with the Stable Diffusion 3 8B parameter model the Gaudi2 chips offer inference speed similar to Nvidia A100 chips using base PyTorch. However, with TensorRT optimization, the A100 chips produce images 40% faster than Gaudi2. We anticipate that with further optimization, Gaudi2 will soon outperform A100s on this model. In earlier tests on our SDXL model with base PyTorch, Gaudi2 generates a 1024x1024 image in 30 steps in 3.2 seconds, versus 3.6 seconds for PyTorch on A100s and 2.7 seconds for a generation with TensorRT on an A100." Stability AI credits the faster interconnect and larger 96 GB memory with making the Intel chips competitive.
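
For context, the "base PyTorch" SDXL figure quoted above corresponds to plain eager-mode inference of the kind sketched below. This is a minimal illustration using the Hugging Face diffusers library, not Stability AI's benchmark harness; the model ID, dtype, and prompt are assumptions chosen to match the 30-step, 1024x1024 test described.

Code:
import time
import torch
from diffusers import StableDiffusionXLPipeline

# Plain eager-mode ("base PyTorch") SDXL inference, roughly matching the
# 30-step, 1024x1024 setting quoted above. Model ID and dtype are assumptions.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"

start = time.perf_counter()
image = pipe(prompt, num_inference_steps=30, height=1024, width=1024).images[0]
print(f"Generated one 1024x1024 image in {time.perf_counter() - start:.2f} s")
image.save("sdxl_sample.png")

The TensorRT and Gaudi2 results come from pushing the same workload through each vendor's optimized stack rather than this eager path, which is the difference the 40% figure above refers to.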

Stability AI plans to incorporate Gaudi2 accelerators into Stability Cloud.

View at TechPowerUp Main Site | Source
 

Space Lynx

Astronaut
Joined
Oct 17, 2014
Messages
17,425 (4.69/day)
Location
Kepler-186f
Processor 7800X3D -25 all core
Motherboard B650 Steel Legend
Cooling Frost Commander 140
Video Card(s) Merc 310 7900 XT @3100 core -.75v
Display(s) Agon 27" QD-OLED Glossy 240hz 1440p
Case NZXT H710 (Red/Black)
Audio Device(s) Asgard 2, Modi 3, HD58X
Power Supply Corsair RM850x Gold
Regardless of the caveat, competition is heating up, baby. I wonder what Nvidia's stock price will look like 1-2 years from now when more competing fabs are online and running.
 

Fourstaff

Moderator
Staff member
Joined
Nov 29, 2009
Messages
10,079 (1.83/day)
Location
Home
System Name Orange! // ItchyHands
Processor 3570K // 10400F
Motherboard ASRock z77 Extreme4 // TUF Gaming B460M-Plus
Cooling Stock // Stock
Memory 2x4Gb 1600Mhz CL9 Corsair XMS3 // 2x8Gb 3200 Mhz XPG D41
Video Card(s) Sapphire Nitro+ RX 570 // Asus TUF RTX 2070
Storage Samsung 840 250Gb // SX8200 480GB
Display(s) LG 22EA53VQ // Philips 275M QHD
Case NZXT Phantom 410 Black/Orange // Tecware Forge M
Power Supply Corsair CXM500w // CM MWE 600w
Looks like Intel needs its own equivalent of the TensorRT package. I wonder how long that will take to design and publish.
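
For what it's worth, Intel's current path on the software side is the SynapseAI stack plus the optimum-habana integration for Hugging Face pipelines. A rough sketch of how a Gaudi-targeted diffusion pipeline is typically set up is below; this is not the code Stability AI used, and the model ID, Gaudi config name, and options are assumptions that may differ between releases.

Code:
# Hypothetical sketch of Stable Diffusion inference on Gaudi via optimum-habana.
# Model ID, Gaudi config name, and options are assumptions; check the
# optimum-habana documentation for the exact API in your release.
from optimum.habana.diffusers import GaudiStableDiffusionPipeline

pipe = GaudiStableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    use_habana=True,        # run on the Gaudi HPU rather than CPU/GPU
    use_hpu_graphs=True,    # capture HPU graphs to cut host-side launch overhead
    gaudi_config="Habana/stable-diffusion",
)

images = pipe("a lighthouse at dawn", num_inference_steps=30).images
images[0].save("gaudi_sample.png")

Whether that ever closes the gap the way TensorRT does for Nvidia is the open question.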
 

wolf

Better Than Native
Joined
May 7, 2007
Messages
8,246 (1.28/day)
System Name MightyX
Processor Ryzen 9800X3D
Motherboard Gigabyte X650I AX
Cooling Scythe Fuma 2
Memory 32GB DDR5 6000 CL30
Video Card(s) Asus TUF RTX3080 Deshrouded
Storage WD Black SN850X 2TB
Display(s) LG 42C2 4K OLED
Case Coolermaster NR200P
Audio Device(s) LG SN5Y / Focal Clear
Power Supply Corsair SF750 Platinum
Mouse Corsair Dark Core RBG Pro SE
Keyboard Glorious GMMK Compact w/pudding
VR HMD Meta Quest 3
Software case populated with Artic P12's
Benchmark Scores 4k120 OLED Gsync bliss
Seems like Nvidia needs to focus more of its chip on Tensor/AI rather than CUDA, at least for this specific AI-focused application? Correct me if I'm wrong, but I'd wager the H100/A100 are the more general-purpose solutions, with a broader range of uses and applications that can be accelerated by them.

I don't really know all that much about data-center-specific hardware like this, though; it could just be that Intel has the better overall solution here, and you know what, good on them too.
 
Joined
Jan 2, 2019
Messages
153 (0.07/day)
>>...We anticipate that with further optimization, Gaudi2 will soon outperform A100s...

There are several versions of NVIDIA's x100-series accelerators (non-PCIe and PCIe variants), and it is not clear which were actually used.
NVIDIA A100 accelerators are almost four years old (released in 2020).
Tests using base PyTorch instead of TensorRT cannot be taken seriously for NVIDIA's x100-series accelerators.
The first test for Intel Gaudi2 uses twice as many accelerators as the NVIDIA setup (32 vs. 16); the final results, that is, images per second, should be normalized.

>>...Stability AI admits that with the TensorRT optimization, A100 chips produce images up to 40% faster than Gaudi2...

Once again, tests using base PyTorch instead of TensorRT cannot be taken seriously for NVIDIA's x100-series accelerators.
 

Essaudio

New Member
Joined
Aug 25, 2022
Messages
7 (0.01/day)
The goal of this was probably to give Nvidia and other purchasers a wake-up call. I agree that the results seem a little biased, but they also point to an upcoming future where there are more good options for acceleration. I doubt it will make Nvidia change course much until they have to, if that was the goal.
 
Joined
Mar 12, 2024
Messages
58 (0.20/day)
System Name SOCIETY
Processor AMD Ryzen 9 7800x3D
Motherboard MSI MAG X670E TOMAHAWK
Cooling Arctic Liquid Freezer II 420
Memory 64GB 6000mhz
Video Card(s) Nvidia RTX 3090
Storage WD SN850X 4TB, Micron 1100 2TB, ZFS NAS over 10gbe network
Display(s) 27" Dell S2721DGF, 24" ASUS IPS, 24" Dell IPS
Case Corsair 750D
Power Supply Cooler Master 1200W Gold
Mouse Razer Deathadder
Keyboard ROG Falchion
VR HMD Pimax 8KX
Software Windows 10 with Debian VM
As long as this translates into something I can use at home! Give us Battlemage!
Also, this is the first I've heard of Stable Diffusion 3. I hope that one is good, because a lot of people are still on 1.5, having skipped 2 and not had the processing power to tune XL.
 
Joined
May 3, 2018
Messages
2,881 (1.19/day)
So what? Tell us all about the Intel software environment and their history of support. Who uses SYCL? Most in the field believe Intel will drop it after a few years anyway. They don't trust Intel, but they do trust AMD and ROCm/HIP. That is the only true competition they see for Nvidia.
 