Monday, April 11th 2022
NVIDIA Claims Grace CPU Superchip is 2X Faster Than Intel Ice Lake
When NVIDIA announced its Grace CPU Superchip, the company officially showed its effort to create an HPC-oriented processor to compete with Intel and AMD. The Grace CPU Superchip combines two Grace CPU modules connected with NVLink-C2C technology to deliver 144 Arm v9 cores and 1 TB/s of memory bandwidth. Each core is an Arm Neoverse N2 Perseus design, configured for the highest throughput and bandwidth. As far as performance is concerned, the only detail NVIDIA provides on its website is an estimated SPECrate 2017_int_base score of over 740. Thanks to the colleagues over at Tom's Hardware, we have another performance figure to look at.
NVIDIA has published a slide comparing the chip with Intel's Ice Lake server processors. One Grace CPU Superchip was compared to two Xeon Platinum 8360Y Ice Lake CPUs configured in a dual-socket server node. The Grace CPU Superchip outperformed the Ice Lake configuration by a factor of two and delivered 2.3 times the efficiency in a WRF simulation. This HPC application is CPU-bound, allowing the new Grace CPU to show off, which NVIDIA credits to the Arm v9 Neoverse N2 cores' combination of efficiency and performance. NVIDIA also made a graph showcasing the HPC applications running on Arm today, with many more to come, which you can see below. Remember that this information comes from NVIDIA itself, so we will have to wait for the 2023 launch to see the chip in action.
Source: Tom's Hardware
30 Comments on NVIDIA Claims Grace CPU Superchip is 2X Faster Than Intel Ice Lake
For CPUs aimed at the same goal, the ISA has minimal impact; it's really the design (the microarchitecture) of the CPU that matters. People have compared high-performance CPU efficiency with cellphone CPUs and declared that Arm was more efficient, but when Arm designs target high performance, they already become much less power efficient. The M1, for example, had to run on a more advanced node than Intel and AMD to stay barely ahead. We will see how comparable architectures on the same or similar nodes (Raptor Lake, Zen 4) do against the M1.
As Jim Keller said, the ISA does not really matter much these days because CPUs are so complex. Once the instructions get decoded, it's a level field for everyone, and Arm has no specific advantage after that. The overhead of decoding x86-64 versus Arm is not significant for a CPU as complex as what we have today.
Also, NVIDIA is joining the trend of comparing old things with things not even released yet. If this gets released a year from now, Ice Lake Xeons will be two years old at that point, so I should hope it's faster. (And it's convenient that they don't compare it against EPYC, the actual performance leader right now.)
Oh yeah, they haven't announced shit
Best case, Sapphire Rapids will arrive at the end of the year (mass availability halfway through 2023!)
Those guys/gals in marketing have paychecks to earn and simulators to play with. :p
They're comparing an unreleased, not-even-taped-out chip with things that have already been available on the market for quite some time.
Companies that do that (not only NVIDIA; Intel and AMD have done it too in the past) are generally companies that know they will fall behind and try to build some hype before the competition is released. Companies with a huge lead aim for a big, impactful launch day instead, to capture as much mind share as possible.
And that doesn't resolve the main issue: they are comparing against a year-old, second-place CPU. The lead is currently held by Milan and Milan-X, and by the time Grace ships we will have Genoa (Zen 4) available.

The thing is, there are very few real RISC CPUs. If you look at how large the Arm instruction set has grown, it's hard to call it a reduced instruction set anymore.
Probably the slight advantage Arm has in simplicity over x86-64 is its fixed instruction length versus the variable instruction length of x86. That allows a somewhat simpler front-end instruction decoder. But again, this has a marginal impact on the overall CPU these days because CPUs are huge, massive, and complex. If things were much simpler, like in the '90s and early 2000s, it could actually have made a significant difference.
Until NVIDIA produces working chips that can be bought and verified by third parties, I'mma call this a big fat LIE.
Intel vs AMD vs VIA/Centaur CPUs show just how much CPU performance and power usage can vary on the same ISA.
Similarly, ARM's N1 core vs Apple's M1 vs Fujitsu's A64FX all land at different points of CPU performance versus power usage. The Fujitsu A64FX is a freaking GPU-like design of all things (heavily focused on 512-bit SVE instructions) and powers the current #1 supercomputer in the world, but it has strict RAM limitations because it's stuck on HBM2 (roughly 32 GB of RAM per chip), while CPUs on DDR4 or LPDDR5 have access to far more RAM.
-----
The 10% that matters comes down to memory-model details that almost all programmers are ignorant of. If you know about load-acquire and store-release, maybe you'll prefer the ARM instruction set over the x86 instruction set. But this is extremely, extremely niche and irrelevant in the vast majority of programs. In fact, x86's slightly better-designed AES/crypto instructions are probably more important in practice: a single "AESENC" instruction performs an AES round on x86, while on ARM you need "AESE + AESMC" (two instructions per AES round).
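For a concrete look at that claim, here is a minimal sketch of my own (not from the post) of one AES round written with each ISA's intrinsics; it assumes a compiler building with AES-NI enabled on x86-64 and the ARMv8 Crypto Extensions on AArch64:

```c
#include <stdint.h>

#if defined(__x86_64__) && defined(__AES__)
#include <wmmintrin.h>   /* AES-NI intrinsics */

/* One AES encryption round on x86: a single AESENC instruction performs
 * ShiftRows, SubBytes, MixColumns, and then XORs in the round key. */
static __m128i aes_round_x86(__m128i state, __m128i round_key)
{
    return _mm_aesenc_si128(state, round_key);
}
#endif

#if defined(__aarch64__) && (defined(__ARM_FEATURE_AES) || defined(__ARM_FEATURE_CRYPTO))
#include <arm_neon.h>    /* NEON + Crypto Extension intrinsics */

/* The same round on AArch64 takes two instructions: AESE (XOR the round key,
 * then SubBytes and ShiftRows) followed by AESMC (MixColumns). Note the round
 * key is applied at the start of the round here but at the end on x86, so a
 * real implementation arranges its key schedule differently on the two ISAs.
 * Many ARM cores fuse adjacent AESE/AESMC pairs, so the throughput gap is
 * smaller than the raw instruction count suggests. */
static uint8x16_t aes_round_arm(uint8x16_t state, uint8x16_t round_key)
{
    return vaesmcq_u8(vaeseq_u8(state, round_key));
}
#endif
```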
-------
ARM's N1 / N2 / V1 don't even seem to be as good as current-generation AMD EPYC or Intel designs, IMO. Apple's M1 is the only outstanding design, but even the M1 has a lot of trade-offs (an absolutely HUGE core, bigger than anyone else's; it's so physically large that it won't scale to higher core counts very easily).
Well... okay, the Fujitsu A64FX is an incredible ARM-based design for supercomputers. But almost nobody wants an ARM chip grafted onto HBM2 RAM; that's just too niche.
I guess this is why Intel wanted, or is still designing, a brand-new x86 architecture that deletes all those legacy modes and puts the transistors to work on modern apps.
The great thing about RISC is that it's very efficient when the compiled code is properly optimized. When it's not, operations that don't map cleanly onto the reduced instruction set have to be completed in software instead of hardware, which is MUCH slower. CISC is not as efficient, but most compiled code can run in hardware instead of software. It's FAR more complicated than this brief explanation, but you get the general idea.
Which ISA you choose will depend greatly on what you want your code to do and how fast.

That would not be correct. Yes, RISC SoCs are more complex than they were in the past, but so too is CISC. RISC CPU hardware instruction counts have roughly doubled in the last 20 years, while CISC (x86/x64) has at least quadrupled in the same time.
So while ARM designs and instructions have become more complex, they are still very much "reduced" in comparison to x86/x64 and even PowerPC.
EDIT: There's a reason it's called "ARMv8" (as well as ARMv8.1, ARMv8.2, ARMv8.3...): because there were ARMv1, ARMv2, ARMv3... through ARMv7. And that's ignoring all the dead ends, like the Jazelle instructions (aka Java instructions for ARM), Thumb v1, Thumb v2, etc.
Even the SSE/AVX mistake is being repeated by ARM, because ARM made the NEON instructions (128-bit) while Intel/AMD were working on 256-bit AVX. Those NEON instructions are now being obsoleted as ARM moves to SVE.
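To make the NEON-to-SVE shift concrete, here is a sketch of my own (not from the post; it assumes an AArch64 compiler with SVE enabled, e.g. -march=armv8-a+sve) of the same float-add loop written against fixed 128-bit NEON and against vector-length-agnostic SVE:

```c
#include <stdint.h>
#include <arm_neon.h>   /* fixed 128-bit vectors */
#include <arm_sve.h>    /* vector-length-agnostic SVE */

void add_neon(float *dst, const float *a, const float *b, int64_t n)
{
    int64_t i = 0;
    /* NEON always processes exactly 4 floats (128 bits) per iteration. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
    for (; i < n; i++)   /* scalar tail */
        dst[i] = a[i] + b[i];
}

void add_sve(float *dst, const float *a, const float *b, int64_t n)
{
    /* SVE code adapts to whatever vector width the hardware implements
     * (128 to 2048 bits); the predicate pg masks off the loop tail, so no
     * scalar remainder loop is needed. */
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        svbool_t pg = svwhilelt_b32_s64(i, n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
    }
}
```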
Do you even work with ARM instructions? ARM is a CISC processor at this point. Do you know what the ARM "FJCVTZS" instruction does? Do you know the history behind it?
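For context (my own aside, not part of the comment): FJCVTZS, added in ARMv8.3, is "Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero". It exists because JavaScript's ToInt32 conversion wraps out-of-range doubles modulo 2^32, which AArch64 code previously had to emulate with a multi-instruction sequence. A rough C sketch of those semantics:

```c
#include <stdint.h>
#include <math.h>

/* Approximate ECMAScript ToInt32 -- the conversion FJCVTZS provides in a
 * single instruction. NaN and infinity map to 0; everything else is
 * truncated toward zero and wrapped modulo 2^32. */
static int32_t to_int32(double d)
{
    if (isnan(d) || isinf(d))
        return 0;
    double t = trunc(d);                  /* round toward zero             */
    double m = fmod(t, 4294967296.0);     /* wrap modulo 2^32              */
    if (m < 0)
        m += 4294967296.0;                /* map into [0, 2^32)            */
    uint32_t u = (uint32_t)m;
    return (int32_t)u;                    /* reinterpret as signed; wraps on
                                             typical two's-complement targets */
}
```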
--------
RISC vs CISC has played out, and CISC won. All RISC instruction sets are glorified CISC processors now, with macro-op fusion (e.g. the AESE + AESMC instructions in ARM, merged into one macro-op), SIMD instructions (NEON and its various incarnations), multiple memory models (lol, ARMv7 started with load-consume / store-release, which turned out to be an awful memory model, so ARMv8 had to introduce a whole slew of new load/store instructions for load-acquire / store-release), etc.
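As a hedged aside of my own (not the poster's): in C11-atomics terms, load-acquire and store-release are what ARMv8's LDAR/STLR instructions implement directly, while on x86 ordinary loads and stores already carry those orderings under its stronger TSO memory model. A classic message-passing sketch:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* On AArch64 the release store typically compiles to STLR and the acquire
 * load to LDAR; on x86-64 plain MOVs already provide these orderings, so no
 * extra fencing is emitted on either side. */
static _Atomic bool ready = false;
static int payload;                      /* ordinary, non-atomic data */

void producer(void)
{
    payload = 42;                        /* 1: write the data            */
    atomic_store_explicit(&ready, true,  /* 2: publish it: store-release */
                          memory_order_release);
}

int consumer(void)
{
    while (!atomic_load_explicit(&ready, /* 3: wait for it: load-acquire */
                                 memory_order_acquire))
        ;                                /* spin */
    return payload;                      /* 4: guaranteed to observe 42  */
}
```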
CPUs are very hard. They have to constantly change their instruction sets: ARM, x86, all of them. The only CPUs that don't change are dead ones (e.g. MIPS, may it rest in peace). CPUs are all turd piled on top of more turd, used as lipstick on very ugly pigs. ARM tried to be RISC but has effectively turned into a giant mess of a core, much like x86. Everything turns into CISC as time goes on; that's just the nature of this industry.
--------
EDIT: Instruction sets become complicated over time because it's really easy to decode instructions compared to everything else the CPU does. CPUs today are complicated beasts: super-scalar (multiple instructions executing per clock cycle, as many as 8 per clock on Apple's M1), hyperthreaded (each core juggling 2, 4, or 8 threads at a time), pipelined (each instruction split into 30+ steps handled by different parts of the processor), out-of-order (literally executing "later" instructions before "earlier" ones), and cache-coherence-snooping (spying on other cores' memory reads and writes to keep their own view of memory consistent).
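As a toy sketch of my own (hypothetical, not from the comment): a wide super-scalar, out-of-order core can only use its width if the instruction stream contains independent work. Both loops below do the same additions, but the first is one long dependency chain, while the second keeps four independent accumulators in flight that such a core can execute in parallel:

```c
#include <stddef.h>
#include <stdint.h>

/* One long dependency chain: every add must wait for the previous one. */
uint64_t sum_serial(const uint64_t *v, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four independent chains: an out-of-order core can keep several adds in
 * flight per cycle, so this typically runs much closer to the core's width. */
uint64_t sum_parallel(const uint64_t *v, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i + 0];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)        /* leftover tail */
        s0 += v[i];
    return s0 + s1 + s2 + s3;
}
```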
This whole RISC vs CISC thing is a question of how complicated a decoder you want to make. But decoders aren't even that big on today's CPUs, because everything else the CPU does is far more complicated and costly in terms of area, power, price, and silicon. I think I can safely declare RISC vs CISC a dead discussion. Today's debate is really CPU vs GPU (or really: mostly single-threaded with a bit of SIMD, like x86/ARM/POWER... vs SIMD-first, like Turing/RDNA2).
Qualcomm Snapdragon 855 - Benchmark, Test and specs (cpu-monkey.com)
Qualcomm Snapdragon 855 SoC - Benchmarks and Specs - NotebookCheck.net Tech
Benchmarks are surprisingly inconsistent these days; there are a lot of conflicting reports and conflicting information. My opinion of stock ARM is pretty low, actually. The Neoverse cores look decent, but they're still a bit out of date. ARM designs from Apple and Fujitsu are world-class processors, though.
I'm all for good competition, but there's a reason the industry has continued to use Intel Xeon / AMD EPYC for power-constrained datacenter workloads: in practice, AMD EPYC is the most power-efficient system (followed by Intel Xeon at #2; very close competition, but AMD has the lead this year).
But in any case, the process difference (12 nm vs 7 nm) is pretty big; that's roughly a 50% cut in power, IIRC, so it's a valuable point to bring up. Manufacturing differences are the big reason we techies talk about nanometers so much...
-------
IIRC, there was something about cellphones disabling their power limiters when they detected Geekbench (!!!!), so there's also a lack of apples-to-apples comparability when it comes to benchmarking. Don't trust the specs: something can be a 5 W CPU on paper but disable its power limiter and temporarily draw 10 or 15 watts during a benchmark. That leads to grossly different performance characteristics when different people run different benchmarks on their own systems.
The only way to get the truth is to hook up wires and measure the power usage of the CPU (or the whole system) during a benchmark, like what's done here on TPU (or on AnandTech and other testing sites). I don't really trust random benchmark numbers on the internet anymore.
Power10 120C = 1700 / 2170 (base / peak)
EPYC 7773X 128C = 864 / 928
Xeon 8380H 224C = 1570 / 1620
Ampere Altra 160C = 596