Intel Outs First Xeon Scalable "Sapphire Rapids" Benchmarks, On-package Accelerators Help Catch Up with AMD EPYC

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
47,300 (7.53/day)
Location
Hyderabad, India
System Name RBMK-1000
Processor AMD Ryzen 7 5700G
Motherboard ASUS ROG Strix B450-E Gaming
Cooling DeepCool Gammax L240 V2
Memory 2x 8GB G.Skill Sniper X
Video Card(s) Palit GeForce RTX 2080 SUPER GameRock
Storage Western Digital Black NVMe 512GB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) ASUS SupremeFX S1220A
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Gamdias Hermes E2
Software Windows 11 Pro
On the second day of its InnovatiON event, Intel turned its attention to its next-generation Xeon Scalable "Sapphire Rapids" server processors, and demonstrated on-package accelerators. These are fixed-function hardware components that accelerate specific kinds of popular server workloads (i.e. run them faster than a CPU core can). With these, Intel hopes to close the CPU core-count gap it has with AMD EPYC, with the upcoming "Zen 4" EPYC chips expected to launch with up to 96 cores per socket in their conventional variant, and up to 128 cores per socket in their cloud-optimized variant.

Intel's on-package accelerators include AMX (Advanced Matrix Extensions), which accelerates recommendation engines, natural language processing (NLP), image recognition, etc.; DLB (Dynamic Load Balancer), which accelerates security-gateway and load-balancing workloads; DSA (Data Streaming Accelerator), which speeds up the network stack, guest OS, and migration; IAA (In-Memory Analytics Accelerator), which speeds up big-data (Apache Hadoop), IMDB, and warehousing applications; a feature-rich implementation of the AVX-512 instruction set for a plethora of content-creation and scientific applications; and lastly, QAT (QuickAssist Technology), with speed-ups for data compression, OpenSSL, nginx, IPsec, etc. Unlike with "Ice Lake-SP," QAT is now implemented on the processor package instead of the PCH.
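As a rough illustration of how software might tell whether the instruction-set features in this list are present, the sketch below parses Linux `/proc/cpuinfo` text. It only covers AMX and AVX-512, since those are CPU instruction-set features; to our understanding DLB, DSA, IAA, and QAT show up as separate devices rather than CPU flags, so they are left out. The helper names `parse_flags` and `detect_features` are made up for this example.

```python
from pathlib import Path

# CPU flag names as the Linux kernel lists them in /proc/cpuinfo.
# Grouping them under "AMX" / "AVX-512" is a choice made for this sketch.
ISA_FLAGS = {
    "AMX": ("amx_tile", "amx_bf16", "amx_int8"),
    "AVX-512": ("avx512f",),
}


def parse_flags(cpuinfo_text: str) -> set[str]:
    """Return the CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def detect_features(cpuinfo_text: str) -> dict[str, bool]:
    """Report, per feature group, whether every required flag is present."""
    flags = parse_flags(cpuinfo_text)
    return {
        name: all(flag in flags for flag in needed)
        for name, needed in ISA_FLAGS.items()
    }


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():  # Linux only; elsewhere the sketch prints nothing
        print(detect_features(cpuinfo.read_text()))
```

A result of `False` here only means the kernel does not report the flag; it says nothing about whether a given application has actually been built to use the feature.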



Intel's benchmarks for the Xeon Scalable "Sapphire Rapids" focus on each of the above accelerators, and how they help the processor work "smarter" than AMD EPYC and overcome the CPU core deficit, while saving power along the way. The first set of benchmarks focuses on Intel AMX, and the speed-up it offers with ResNet50v1.5 TensorFlow AI Image Classification benchmarks. The second set of benchmarks showcases the data-compression speed-up with QAT implemented at scale, with QATzip Level 1. The third set focuses on big-data analysis accelerated by IAA, using ClickHouse and RocksDB. The next set shows off the SPDK NVMe TCP Storage Performance acceleration provided by DSA. Finally, QAT is shown accelerating nginx and IPsec encryption.



 
Joined
Feb 15, 2019
Messages
1,666 (0.78/day)
System Name Personal Gaming Rig
Processor Ryzen 7800X3D
Motherboard MSI X670E Carbon
Cooling MO-RA 3 420
Memory 32GB 6000MHz
Video Card(s) RTX 4090 ICHILL FROSTBITE ULTRA
Storage 4x 2TB Nvme
Display(s) Samsung G8 OLED
Case Silverstone FT04
I thought Intel was making general-compute CPUs, but instead they made a bunch of ASICs squeezed together?
 
Joined
Feb 11, 2009
Messages
5,572 (0.96/day)
System Name Cyberline
Processor Intel Core i7 2600k -> 12600k
Motherboard Asus P8P67 LE Rev 3.0 -> Gigabyte Z690 Auros Elite DDR4
Cooling Tuniq Tower 120 -> Custom Watercoolingloop
Memory Corsair (4x2) 8gb 1600mhz -> Crucial (8x2) 16gb 3600mhz
Video Card(s) AMD RX480 -> RX7800XT
Storage Samsung 750 Evo 250gb SSD + WD 1tb x 2 + WD 2tb -> 2tb MVMe SSD
Display(s) Philips 32inch LPF5605H (television) -> Dell S3220DGF
Case antec 600 -> Thermaltake Tenor HTCP case
Audio Device(s) Focusrite 2i4 (USB)
Power Supply Seasonic 620watt 80+ Platinum
Mouse Elecom EX-G
Keyboard Rapoo V700
Software Windows 10 Pro 64bit
Yeah, it feels a bit... silly... and inefficient.
Honestly, kind of disappointed with something like this, with Alder Lake and Rocket Lake being pretty solid.
But I guess this has been in development hell for a while and is only now coming out, dated on release.
 
Joined
Apr 30, 2020
Messages
999 (0.59/day)
System Name S.L.I + RTX research rig
Processor Ryzen 7 5800X 3D.
Motherboard MSI MEG ACE X570
Cooling Corsair H150i Cappellx
Memory Corsair Vengeance pro RGB 3200mhz 32Gbs
Video Card(s) 2x Dell RTX 2080 Ti in S.L.I
Storage Western digital Sata 6.0 SDD 500gb + fanxiang S660 4TB PCIe 4.0 NVMe M.2
Display(s) HP X24i
Case Corsair 7000D Airflow
Power Supply EVGA G+1600watts
Mouse Corsair Scimitar
Keyboard Cosair K55 Pro RGB
Why, when they're probably building some kind of A.I. inside the CPU to run certain program code that runs much faster on the ASIC or other units?
I said this was the future and people thought it was silly. -_-
 

AsRock

TPU addict
Joined
Jun 23, 2007
Messages
19,107 (2.99/day)
Location
UK\USA
Yeah, it feels a bit... silly... and inefficient.
Honestly, kind of disappointed with something like this, with Alder Lake and Rocket Lake being pretty solid.
But I guess this has been in development hell for a while and is only now coming out, dated on release.

Maybe they're aiming for a cut of the mining?
 
Joined
Dec 7, 2020
Messages
62 (0.04/day)
Color me skeptical.
It looks like this is the last moment to publish data showing meaningful improvement from dedicated accelerators: before Genoa(-X) hits the market at the same time SPR does (or earlier).
Genoa (-X) will have 1.5x cores at higher IPC (vs. Milan shown as comparison) plus AVX-512.
Judging by Milan's (non-X) results it looks like Genoa(-X) can be:
- probably competitive with AMX at ResNet (because of AVX-512 and 1.5x cores)
- slightly better than IAA at LZ4 (more cores)
- competitive with DSA at CRC32, at least on smaller blocks (charts look like they selected comparison on larger blocks, probably because of some overhead or latency, as 16k improvements are lower);
- 8 fewer occupied cores at QAT (or even 50 in the second chart) is somewhat moot if you have 32 more cores on package (so +64 in 2P); and the cores are faster...
So probably the QATzip will be the only one not matched/overpowered by Genoa's general-purpose cores; and Genoa won't need software rewrites.

And it's kind of worrying (for SPR) to see Milan slightly better than SPR at QAT/OpenSSL in OOB configuration (same result at 67 vs. 70 cores) and in compression (both zlib and ISA/L).
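To make the core-count arithmetic above concrete, here is a small sketch. The 96-core Genoa figure is from the article; the 64-core Milan figure is my inference from "1.5x cores" and "32 more cores", and applying the 8- and 50-core savings to a 2P system is an assumption, since I am not sure how the charts were configured. The function names are just for illustration.

```python
GENOA_CORES = 96  # per socket, from the article
MILAN_CORES = 64  # assumption: inferred from 96 / 1.5


def extra_cores(sockets: int) -> int:
    """Cores Genoa adds over Milan for a given socket count."""
    return (GENOA_CORES - MILAN_CORES) * sockets


def net_core_advantage(sockets: int, cores_freed_by_accel: int) -> int:
    """Extra Genoa cores left after matching the cores an accelerator frees up."""
    return extra_cores(sockets) - cores_freed_by_accel


if __name__ == "__main__":
    print(GENOA_CORES / MILAN_CORES)   # core-count ratio
    print(extra_cores(1), extra_cores(2))
    print(net_core_advantage(2, 8), net_core_advantage(2, 50))
```

This only counts cores. It ignores per-core speed, power, and whether the freed cores would have been busy anyway, so treat it as a back-of-the-envelope check and nothing more.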
 
Joined
Oct 27, 2009
Messages
1,190 (0.21/day)
Location
Republic of Texas
System Name [H]arbringer
Processor 4x 61XX ES @3.5Ghz (48cores)
Motherboard SM GL
Cooling 3x xspc rx360, rx240, 4x DT G34 snipers, D5 pump.
Memory 16x gskill DDR3 1600 cas6 2gb
Video Card(s) blah bigadv folder no gfx needed
Storage 32GB Sammy SSD
Display(s) headless
Case Xigmatek Elysium (whats left of it)
Audio Device(s) yawn
Power Supply Antec 1200w HCP
Software Ubuntu 10.10
Benchmark Scores http://valid.canardpc.com/show_oc.php?id=1780855 http://www.hwbot.org/submission/2158678 http://ww
Wow, I went back to find a quote and they cleaned it up... Time for the Wayback Machine...
The quote was found here...

"We would have liked more of that gap, more of that leadership window for our customers in terms of when we originally forecasted the product to be out and ramping in high volume, but because of the additional platform validation that we're doing, that window is a bit shorter. So it will be leadership — it depends on where the competition lands," (Sandra L. Rivera is executive vice president and general manager of the Datacenter and AI Group at Intel)

So Intel expects to beat Milan, but get stomped by Genoa.
 
Joined
Jul 5, 2019
Messages
318 (0.16/day)
Location
Berlin, Germany
System Name Workhorse
Processor 13900K 5.9 Ghz single core (2x) 5.6 Ghz Allcore @ -0.15v offset / 4.5 Ghz e-core -0.15v offset
Motherboard MSI Z690A-Pro DDR4
Cooling Arctic Liquid Cooler 360 3x Arctic 120 PWM Push + 3x Arctic 140 PWM Pull
Memory 2 x 32GB DDR4-3200-CL16 G.Skill RipJaws V @ 4133 Mhz CL 18-22-42-42-84 2T 1.45v
Video Card(s) RX 6600XT 8GB
Storage PNY CS3030 1TB nvme SSD, 2 x 3TB HDD, 1x 4TB HDD, 1 x 6TB HDD
Display(s) Samsung 34" 3440x1400 60 Hz
Case Coolermaster 690
Audio Device(s) Topping Dx3 Pro / Denon D2000 soon to mod it/Fostex T50RP MK3 custom cable and headband / Bose NC700
Power Supply Enermax Revolution D.F. 850W ATX 2.4
Mouse Logitech G5 / Speedlink Kudos gaming mouse (12 years old)
Keyboard A4Tech G800 (old) / Apple Magic keyboard
Why, when they're probably building some kind of A.I. inside the CPU to run certain program code that runs much faster on the ASIC or other units?
I said this was the future and people thought it was silly. -_-
It's not the future.
The problem with ASICs is that they are very inefficient software-wise. For each and every new ASIC you need to code from scratch, trying to play to the strengths of that particular ASIC. This is very time consuming, never mind expensive.

Now imagine if every software company out there with a strong software suite (Adobe, Autodesk, Blender, various scientific software, etc.) had to code from scratch for different ASICs just in the hope that some of them might get used. See? ASICs make sense from a hardware point of view; from a software point of view, not so much, if at all.

Hell, even the name says it all: ASIC, application-specific integrated circuit. It's application-specific hardware, meaning hardware made for a specific application or a small class of those. This is very much opposed to the CPU, which is very generic and can run any software as long as there is a compiler for the CPU architecture (this is pretty much a given nowadays). As a SW company (from my first example) you can work on your software without having to worry about which hardware it will run on, as you provide that software in multiple binary or distribution forms, but not for any ASICs (obviously). :laugh:
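To show what that extra work looks like in practice, here is a minimal sketch of an application that has to carry a separate code path for an accelerator while keeping the generic CPU path as a fallback. Everything accelerator-related here is hypothetical: `Offload` and `fake_offload` are stand-ins I made up, and no real accelerator API is modeled. Only `zlib` is real.

```python
import zlib
from typing import Callable, Optional

# Hypothetical type: a function that compresses bytes on an accelerator and
# returns zlib-compatible output.
Offload = Callable[[bytes], bytes]


def compress(data: bytes, offload: Optional[Offload] = None) -> bytes:
    """Use the accelerator path if one is supplied and works, else the CPU."""
    if offload is not None:
        try:
            return offload(data)
        except OSError:
            pass  # accelerator missing or busy: fall through to the CPU path
    return zlib.compress(data, 1)  # generic path, runs on any CPU


def fake_offload(data: bytes) -> bytes:
    """Stand-in for an accelerator; it just calls zlib so the sketch runs."""
    return zlib.compress(data, 1)


if __name__ == "__main__":
    payload = b"example " * 1000
    assert zlib.decompress(compress(payload)) == payload
    assert zlib.decompress(compress(payload, fake_offload)) == payload
```

The generic branch is one line and works everywhere. The offload branch is the part somebody has to write, test, and maintain again for every different accelerator.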
 
Joined
Apr 30, 2020
Messages
999 (0.59/day)
System Name S.L.I + RTX research rig
Processor Ryzen 7 5800X 3D.
Motherboard MSI MEG ACE X570
Cooling Corsair H150i Cappellx
Memory Corsair Vengeance pro RGB 3200mhz 32Gbs
Video Card(s) 2x Dell RTX 2080 Ti in S.L.I
Storage Western digital Sata 6.0 SDD 500gb + fanxiang S660 4TB PCIe 4.0 NVMe M.2
Display(s) HP X24i
Case Corsair 7000D Airflow
Power Supply EVGA G+1600watts
Mouse Corsair Scimitar
Keyboard Cosair K55 Pro RGB
I could easily refute this, but I'm not going to.

I would very much like some in-order operation in the CPU's instruction handling; however, this isn't possible in the current OoO (out-of-order) design. I mean, sure, there are tricks programmers can use to try to get code to run in order on an out-of-order core. But why not use an A.I. or an algorithm to sense which works faster for certain code? It would run the code out of order on one try and in order on another try. Not like the speculation that's currently done at the instruction cache; no, this would be measured at the end of execution, not the beginning.
 
Joined
Jul 5, 2019
Messages
318 (0.16/day)
Location
Berlin, Germany
System Name Workhorse
Processor 13900K 5.9 Ghz single core (2x) 5.6 Ghz Allcore @ -0.15v offset / 4.5 Ghz e-core -0.15v offset
Motherboard MSI Z690A-Pro DDR4
Cooling Arctic Liquid Cooler 360 3x Arctic 120 PWM Push + 3x Arctic 140 PWM Pull
Memory 2 x 32GB DDR4-3200-CL16 G.Skill RipJaws V @ 4133 Mhz CL 18-22-42-42-84 2T 1.45v
Video Card(s) RX 6600XT 8GB
Storage PNY CS3030 1TB nvme SSD, 2 x 3TB HDD, 1x 4TB HDD, 1 x 6TB HDD
Display(s) Samsung 34" 3440x1400 60 Hz
Case Coolermaster 690
Audio Device(s) Topping Dx3 Pro / Denon D2000 soon to mod it/Fostex T50RP MK3 custom cable and headband / Bose NC700
Power Supply Enermax Revolution D.F. 850W ATX 2.4
Mouse Logitech G5 / Speedlink Kudos gaming mouse (12 years old)
Keyboard A4Tech G800 (old) / Apple Magic keyboard
What are you talking about?
Out of order would always be faster than in order unless no instructions could be re-ordered, which is almost never the case.
There is a reason why OoO was introduced in the first place. It's to increase performance by eliminating bottlenecks and avoiding NOPs, so the core doesn't have to wait for the result of one instruction before starting work on the next ones.
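Here is a toy model of that point, as a sketch and not a description of any real core. It assumes a single-issue machine, an unlimited reorder window, and made-up latencies. `Instr` and `run` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    latency: int                # cycles until the result is available
    deps: tuple[int, ...] = ()  # indices of earlier instructions it reads


def run(program: list[Instr], out_of_order: bool) -> int:
    """Return the cycle at which the last result is ready (single-issue)."""
    for i, ins in enumerate(program):
        if any(d >= i for d in ins.deps):
            raise ValueError("deps must point to earlier instructions")
    done_at: dict[int, int] = {}         # index -> cycle its result is ready
    pending = list(range(len(program)))  # not yet issued, in program order
    cycle = 0
    while pending:
        # In-order may only consider the oldest pending instruction.
        candidates = pending if out_of_order else pending[:1]
        for i in candidates:
            deps = program[i].deps
            if all(d in done_at and done_at[d] <= cycle for d in deps):
                done_at[i] = cycle + program[i].latency
                pending.remove(i)
                break                    # at most one issue per cycle
        cycle += 1                       # nothing ready means a stall cycle
    return max(done_at.values(), default=0)


if __name__ == "__main__":
    # A slow load, an add that needs it, then two independent instructions.
    mixed = [Instr(4), Instr(1, (0,)), Instr(1), Instr(1)]
    # A chain where every instruction needs the previous result.
    chain = [Instr(2), Instr(2, (0,)), Instr(2, (1,))]
    print(run(mixed, out_of_order=False), run(mixed, out_of_order=True))
    print(run(chain, out_of_order=False), run(chain, out_of_order=True))
```

In the mixed program the in-order run sits idle waiting on the load, while the out-of-order run fills those cycles with the independent instructions. In the chain nothing can be reordered, so both finish at the same time, which is the "almost never the case" exception.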
 