
AMD and Xilinx Announce a New World Record for AI Inference

btarunr

Editor & Senior Moderator
Staff member
Joined
Oct 9, 2007
Messages
47,205 (7.56/day)
Location
Hyderabad, India
System Name RBMK-1000
Processor AMD Ryzen 7 5700G
Motherboard ASUS ROG Strix B450-E Gaming
Cooling DeepCool Gammax L240 V2
Memory 2x 8GB G.Skill Sniper X
Video Card(s) Palit GeForce RTX 2080 SUPER GameRock
Storage Western Digital Black NVMe 512GB
Display(s) BenQ 1440p 60 Hz 27-inch
Case Corsair Carbide 100R
Audio Device(s) ASUS SupremeFX S1220A
Power Supply Cooler Master MWE Gold 650W
Mouse ASUS ROG Strix Impact
Keyboard Gamdias Hermes E2
Software Windows 11 Pro
At today's Xilinx Developer Forum in San Jose, Calif., our CEO, Victor Peng, was joined by AMD CTO Mark Papermaster for a Guinness. But not the kind that comes in a pint - the kind that comes in a record book. The companies revealed that AMD and Xilinx have been jointly working to connect AMD EPYC CPUs and the new Xilinx Alveo line of acceleration cards for high-performance, real-time AI inference processing. To back it up, they revealed a world-record inference throughput of 30,000 images per second!

The impressive system, which will be featured in the Alveo ecosystem zone at XDF today, leverages two AMD EPYC 7551 server CPUs with their industry-leading PCIe connectivity, along with eight of the freshly announced Xilinx Alveo U250 acceleration cards. The inference performance is powered by Xilinx ML Suite, which allows developers to optimize and deploy accelerated inference and supports numerous machine learning frameworks such as TensorFlow. The benchmark was performed on GoogLeNet, a widely used convolutional neural network.
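As a rough sanity check on the headline figure, the total throughput can be divided across the eight cards. This is only a sketch: it assumes the load is spread evenly across the cards, which the announcement does not state, and the function name is made up for illustration.

```python
# Figures taken from the announcement above.
TOTAL_IMAGES_PER_SECOND = 30_000
NUM_CARDS = 8


def per_card_throughput(total: float, cards: int) -> float:
    """Average images/s per card, assuming an even split (an assumption)."""
    if cards <= 0:
        raise ValueError("cards must be positive")
    return total / cards


per_card = per_card_throughput(TOTAL_IMAGES_PER_SECOND, NUM_CARDS)
print(f"{per_card:.0f} images/s per card on average")
```

Under that even-split assumption, each Alveo U250 would be handling 3,750 images per second.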



AMD and Xilinx have shared a common vision around the evolution of computing to heterogeneous system architecture and have a long history of technical collaboration. Both companies have optimized drivers and tuned the performance for interoperability between AMD EPYC CPUs and Xilinx FPGAs. We are also collaborating with others in the industry on cache coherent interconnect for accelerators (the CCIX Consortium - pronounced "see-six"), focused on enabling cache coherency and shared memory across multiple processors.

AMD EPYC is the perfect CPU platform for accelerating artificial intelligence and high-performance computing workloads. With 32 cores, 64 threads, 8 memory channels with up to 2 TB of memory per socket, and 128 PCIe lanes coupled with the industry's first hardware-embedded x86 server security solution, EPYC is designed to deliver the memory capacity, bandwidth, and processor cores to efficiently run memory-intensive workloads commonly seen with AI and HPC. With EPYC, customers can collect and analyze larger data sets much faster, helping them significantly accelerate complex problems.

Xilinx and AMD see a bright future in their technology collaboration. There is strong alignment in our roadmaps, which pair the high-performance AMD EPYC server and graphics processors with Xilinx acceleration platforms across the Alveo accelerator cards, as well as the forthcoming Versal portfolio.

So, raise a pint to the future of AI inference and innovation for heterogeneous computing platforms. And don't forget to stop by and see the system in action in the Alveo ecosystem zone at the Fairmont hotel.

 

FordGT90Concept

"I go fast!1!11!1!"
Joined
Oct 13, 2008
Messages
26,259 (4.47/day)
Location
IA, USA
System Name BY-2021
Processor AMD Ryzen 7 5800X (65w eco profile)
Motherboard MSI B550 Gaming Plus
Cooling Scythe Mugen (rev 5)
Memory 2 x Kingston HyperX DDR4-3200 32 GiB
Video Card(s) AMD Radeon RX 7900 XT
Storage Samsung 980 Pro, Seagate Exos X20 TB 7200 RPM
Display(s) Nixeus NX-EDG274K (3840x2160@144 DP) + Samsung SyncMaster 906BW (1440x900@60 HDMI-DVI)
Case Coolermaster HAF 932 w/ USB 3.0 5.25" bay + USB 3.2 (A+C) 3.5" bay
Audio Device(s) Realtek ALC1150, Micca OriGen+
Power Supply Enermax Platimax 850w
Mouse Nixeus REVEL-X
Keyboard Tesoro Excalibur
Software Windows 10 Home 64-bit
Benchmark Scores Faster than the tortoise; slower than the hare.
Joined
Apr 30, 2012
Messages
3,881 (0.85/day)
Those are the accelerators

Alveo U250 Data Center Accelerator Card

At the heart of the Xilinx Alveo U200 and U250 accelerator cards are custom-built UltraScale+ FPGAs that run optimally (and exclusively) on Alveo.
 

the54thvoid

Super Intoxicated Moderator
Staff member
Joined
Dec 14, 2009
Messages
13,037 (2.39/day)
Location
Glasgow - home of formal profanity
Processor Ryzen 7800X3D
Motherboard MSI MAG Mortar B650 (wifi)
Cooling be quiet! Dark Rock Pro 4
Memory 32GB Kingston Fury
Video Card(s) Gainward RTX4070ti
Storage Seagate FireCuda 530 M.2 1TB / Samsumg 960 Pro M.2 512Gb
Display(s) LG 32" 165Hz 1440p GSYNC
Case Asus Prime AP201
Audio Device(s) On Board
Power Supply be quiet! Pure POwer M12 850w Gold (ATX3.0)
Software W10
I don't think they are Vega-based. There are a lot of FPGA cards out there; my brother (and a team) designed one last year. Stratix (developed by Altera) is the chip inside his company's GPU.
 
Joined
Jun 28, 2016
Messages
3,595 (1.17/day)
So are those Vega-based chips?
Why would it be?
It's FPGA-based, so the chip is designed precisely for inference. It's a few times faster than a GPU would be (4x faster than V100, graph below).

It's important to state that the actual Xilinx product is the Alveo accelerator and it's performing the inference tasks. CPUs are here just to run the platform and push data around. It might as well be using Xeons.
It was most likely AMD's initiative to be mentioned here.
It may have been an easy decision for Xilinx, since their main competitor is Intel as well (Stratix-based accelerator was launched just a few days ago).

As for performance, here's a comparison to some alternatives from a Xilinx whitepaper.
1538549704654.png

https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf

And now the fun part
Arria-10 is a competing FPGA product... but not the latest one from Altera/Intel. Xilinx says:
"3. Arria-10 numbers taken Intel White Paper, "Accelerating Deep Learning with the OpenCL™ Platform and Intel Stratix 10 FPGAs."
https://builders.intel.com/docs/aib...pencl-platform-and-intel-stratix-10-fpgas.pdf."

But you know what's also in this white paper? Surprise... it's Stratix 10 performance! :-D
1538550531654.png
 
Joined
Jun 28, 2016
Messages
3,595 (1.17/day)
Maybe for InfiniBand? It would be quite useful for a cluster with several of these computers - and also mean that they could break this record again in the future, if the algorithms parallelize and scale well enough.
Fairly unlikely, since inference is latency-critical, i.e. it's usually important to get each result quickly (low latency, which is not the same as high throughput).
Also, inference is not exactly a good candidate for parallelism.
Multi-thread gain here is mostly achieved by batching, i.e. you're performing calculations on many samples at the same time.
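The latency-versus-throughput trade-off behind batching can be sketched with a toy cost model. Everything here is an assumption for illustration: the model (a fixed per-batch overhead plus a per-sample cost) and both timing constants are made up, not measurements of any card in this thread.

```python
def batch_stats(batch_size: int, fixed_overhead_ms: float, per_sample_ms: float):
    """Return (latency_ms, throughput_per_s) for one batch.

    Assumed toy model: a batch costs a fixed overhead plus a cost
    per sample. Every sample in the batch waits for the whole batch.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    latency_ms = fixed_overhead_ms + per_sample_ms * batch_size
    throughput_per_s = batch_size / (latency_ms / 1000.0)
    return latency_ms, throughput_per_s


# Hypothetical constants: 2 ms overhead per batch, 0.25 ms per sample.
for size in (1, 8, 64):
    latency, throughput = batch_stats(size, fixed_overhead_ms=2.0, per_sample_ms=0.25)
    print(f"batch={size:3d}  latency={latency:6.2f} ms  throughput={throughput:8.1f} samples/s")
```

In this model larger batches raise throughput because the fixed overhead is amortized, but each individual result arrives later, which is why a batch=1 figure matters for latency-critical inference.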
 
Joined
Dec 27, 2013
Messages
887 (0.22/day)
Location
somewhere
Why would it be?
It's FPGA-based, so the chip is designed precisely for inference. It's a few times faster than a GPU would be (4x faster than V100, graph below).

It's important to state that the actual Xilinx product is the Alveo accelerator and it's performing the inference tasks. CPUs are here just to run the platform and push data around. It might as well be using Xeons.
It was most likely AMD's initiative to be mentioned here.
It may have been an easy decision for Xilinx, since their main competitor is Intel as well (Stratix-based accelerator was launched just a few days ago).

As for performance, here's a comparison to some alternatives from a Xilinx whitepaper.
View attachment 107960
https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf

And now the fun part
Arria-10 is a competing FPGA product... but not the latest one from Altera/Intel. Xilinx says:
"3. Arria-10 numbers taken Intel White Paper, "Accelerating Deep Learning with the OpenCL™ Platform and Intel Stratix 10 FPGAs."
https://builders.intel.com/docs/aib...pencl-platform-and-intel-stratix-10-fpgas.pdf."

But you know what's also in this white paper? Surprise... it's Stratix 10 performance! :-D
View attachment 107961
Xeons have less on-chip I/O for these cards, so EPYC CPUs are naturally better suited for these configurations. I apologise if I'm just being overly sensitive, but it always seems like people have to discredit AMD at any given opportunity.
 

breubreubreu

New Member
Joined
Oct 3, 2018
Messages
2 (0.00/day)
Fairly unlikely, since inference is latency-critical, i.e. it's usually important to get the result quickly (<> fast).
Also, inference is not exactly a material for parallelism.
Multi-thread gain here is mostly done by batching, i.e. you're performing calculations on many samples at the same time.

Derp, mixed up the inference with the training.

Maybe these ports can be used to access the cards in a "cluster" if there aren't enough PCIe slots? Even this is a stretch, though.
 
Joined
Jan 17, 2018
Messages
64 (0.03/day)
Why would it be?
It's FPGA-based, so the chip is designed precisely for inference. It's a few times faster than a GPU would be (4x faster than V100, graph below).

It's important to state that the actual Xilinx product is the Alveo accelerator and it's performing the inference tasks. CPUs are here just to run the platform and push data around. It might as well be using Xeons.
It was most likely AMD's initiative to be mentioned here.
It may have been an easy decision for Xilinx, since their main competitor is Intel as well (Stratix-based accelerator was launched just a few days ago).

As for performance, here's a comparison to some alternatives from a Xilinx whitepaper.
View attachment 107960
https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf

And now the fun part
Arria-10 is a competing FPGA product... but not the latest one from Altera/Intel. Xilinx says:
"3. Arria-10 numbers taken Intel White Paper, "Accelerating Deep Learning with the OpenCL™ Platform and Intel Stratix 10 FPGAs."
https://builders.intel.com/docs/aib...pencl-platform-and-intel-stratix-10-fpgas.pdf."

But you know what's also in this white paper? Surprise... it's Stratix 10 performance! :-D
View attachment 107961

It amazes me the lengths people will go to in order to be a fanboy. Stratix 10 isn't released, and won't be until sometime next year. That graph is either a projection, or based on an engineering sample. To be taken with a large grain of salt at this point. Especially with no system or real world tests.

Ordering Information
Ordering Contact Engineering Sample Contact an Intel® sales representative

OEM Partner Server Model Hewlett Packard Enterprise (HPE) Available 1H 2019
 
Joined
Jul 16, 2014
Messages
8,197 (2.17/day)
Location
SE Michigan
System Name Dumbass
Processor AMD Ryzen 7800X3D
Motherboard ASUS TUF gaming B650
Cooling Artic Liquid Freezer 2 - 420mm
Memory G.Skill Sniper 32gb DDR5 6000
Video Card(s) GreenTeam 4070 ti super 16gb
Storage Samsung EVO 500gb & 1Tb, 2tb HDD, 500gb WD Black
Display(s) 1x Nixeus NX_EDG27, 2x Dell S2440L (16:9)
Case Phanteks Enthoo Primo w/8 140mm SP Fans
Audio Device(s) onboard (realtek?) - SPKRS:Logitech Z623 200w 2.1
Power Supply Corsair HX1000i
Mouse Steeseries Esports Wireless
Keyboard Corsair K100
Software windows 10 H
Benchmark Scores https://i.imgur.com/aoz3vWY.jpg?2
Xeons have less on chip I/O for these cards so EPYC are naturally better suited for these configurations. I apologise if i'm just being overly sensitive but it always seems like people have discredit AMD at any given opportunity.
no need to apologize, he is a known shill, a step above fanboi.
 
Joined
Feb 3, 2017
Messages
3,742 (1.32/day)
Processor Ryzen 7800X3D
Motherboard ROG STRIX B650E-F GAMING WIFI
Memory 2x16GB G.Skill Flare X5 DDR5-6000 CL36 (F5-6000J3636F16GX2-FX5)
Video Card(s) INNO3D GeForce RTX™ 4070 Ti SUPER TWIN X2
Storage 2TB Samsung 980 PRO, 4TB WD Black SN850X
Display(s) 42" LG C2 OLED, 27" ASUS PG279Q
Case Thermaltake Core P5
Power Supply Fractal Design Ion+ Platinum 760W
Mouse Corsair Dark Core RGB Pro SE
Keyboard Corsair K100 RGB
VR HMD HTC Vive Cosmos
As for performance, here's a comparison to some alternatives from a Xilinx whitepaper.
View attachment 107960
Where is MI25 in comparison? It is sold with a good reason for inferencing.
Also, have to wonder about the FP16/FP32 note on V100. Wasn't V100 capable of INT8 inferencing? :)
 
Joined
Mar 10, 2014
Messages
1,793 (0.46/day)
Hmm, I'm kind of baffled by this. That Alveo U250 ain't that good on paper: 33.3 TOPS INT8 peak with a 225 W TDP, which is low compared to NVIDIA. Is GoogLeNet v1 at batch=1 some kind of corner case for NVIDIA's solutions?

Where is MI25 in comparison? It is sold with a good reason for inferencing.
Also, have to wonder about the FP16/FP32 note on V100. Wasn't V100 capable of INT8 inferencing? :)

It's taken from NVIDIA's marketing materials; inferencing at INT8 ain't the V100's targeted use. They have smaller Teslas for that: the P4 and T4. It would be interesting to see if GoogLeNet can take advantage of the T4's 130 INT8 Tensor TOPS.
 

HTC

Joined
Apr 1, 2008
Messages
4,664 (0.77/day)
Location
Portugal
System Name HTC's System
Processor Ryzen 5 5800X3D
Motherboard Asrock Taichi X370
Cooling NH-C14, with the AM4 mounting kit
Memory G.Skill Kit 16GB DDR4 F4 - 3200 C16D - 16 GTZB
Video Card(s) Sapphire Pulse 6600 8 GB
Storage 1 Samsung NVMe 960 EVO 250 GB + 1 3.5" Seagate IronWolf Pro 6TB 7200RPM 256MB SATA III
Display(s) LG 27UD58
Case Fractal Design Define R6 USB-C
Audio Device(s) Onboard
Power Supply Corsair TX 850M 80+ Gold
Mouse Razer Deathadder Elite
Software Ubuntu 20.04.6 LTS
Hmh I'm kind of baffled by this. That Alveo U250 ain't that good on the paper, 33.3 TOPS int8 peak with 225W TDP, which is low compared to nvidia. Is that GoogleNET V1 batch=1 some kind of corner case for nvidia's solutions?

I know about zero concerning these types of cards.

With that disclaimer out of the way, perhaps the application these cards run manages to take full advantage of the card's capabilities while that doesn't happen with nVidia cards?
 
Joined
Mar 10, 2014
Messages
1,793 (0.46/day)
I know about zero concerning these types of cards.

With that disclaimer out of the way, perhaps the application these cards run manages to take full advantage of the card's capabilities while that doesn't happen with nVidia cards?

Well, yeah, I don't know. On paper the Tesla P4 has 22 INT8 TOPS vs. 33.3 for the U250, but the latter is almost five times faster in that use case.
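The size of that gap can be worked out from the two peak figures quoted above. This is only a sketch: the measured speedup is rounded up to 5.0 from "almost five times", and the variable names are made up for illustration.

```python
# Peak INT8 figures as quoted in the post above.
P4_PEAK_TOPS = 22.0
U250_PEAK_TOPS = 33.3

# "Almost five times faster", rounded up to 5.0 (an assumption).
MEASURED_SPEEDUP = 5.0

# How much faster the U250 should be if both cards hit their peak.
paper_ratio = U250_PEAK_TOPS / P4_PEAK_TOPS

# The part of the measured speedup that peak TOPS does not explain.
unexplained_factor = MEASURED_SPEEDUP / paper_ratio

print(f"on-paper advantage:  {paper_ratio:.2f}x")
print(f"measured advantage:  {MEASURED_SPEEDUP:.2f}x")
print(f"unexplained factor:  {unexplained_factor:.2f}x")
```

With these inputs the peak numbers account for roughly 1.5x, leaving a factor of about 3.3x that has to come from something else, such as how well each card is utilized at batch=1.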
 

HTC

Joined
Apr 1, 2008
Messages
4,664 (0.77/day)
Location
Portugal
Well yeah I don't know. In paper Tesla P4 has 22 int8 TOPS vs U250 33.3 but the latter is almost five times faster on that use case.

Actually 5 times faster vs. roughly 50% faster on paper? There's definitely something hindering performance on nVidia's card. That, or the INT8 TOPS performance nVidia claims for this card is far higher than what it actually is, no?
 

cdawall

where the hell are my stars
Joined
Jul 23, 2006
Messages
27,680 (4.14/day)
Location
Houston
System Name All the cores
Processor 2990WX
Motherboard Asrock X399M
Cooling CPU-XSPC RayStorm Neo, 2x240mm+360mm, D5PWM+140mL, GPU-2x360mm, 2xbyski, D4+D5+100mL
Memory 4x16GB G.Skill 3600
Video Card(s) (2) EVGA SC BLACK 1080Ti's
Storage 2x Samsung SM951 512GB, Samsung PM961 512GB
Display(s) Dell UP2414Q 3840X2160@60hz
Case Caselabs Mercury S5+pedestal
Audio Device(s) Fischer HA-02->Fischer FA-002W High edition/FA-003/Jubilate/FA-011 depending on my mood
Power Supply Seasonic Prime 1200w
Mouse Thermaltake Theron, Steam controller
Keyboard Keychron K8
Software W10P
I know about zero concerning these types of cards.

With that disclaimer out of the way, perhaps the application these cards run manages to take full advantage of the card's capabilities while that doesn't happen with nVidia cards?

It's an FPGA; these are getting programmed to do a specific task. You aren't seeing that with an NVIDIA/AMD card, which is more general-purpose.

These cards are out wrecking shop in the mining world as well, posting huge numbers. I don't know who said the Stratix 10 isn't out. I actually held one in my hands a couple of months back and almost purchased a set, but it requires a much higher level of programming to set up than I was willing to put in.
 

HTC

Joined
Apr 1, 2008
Messages
4,664 (0.77/day)
Location
Portugal
It's an FPGA these are getting programmed to do a specific task. You aren't seeing that with an nv/amd card,which are more general purpose.

These cards are out wrecking shop in the mining world as well posting huge numbers. I don't know who said the stratix 10 isn't out I have actually like held one in my hands and stuff a couple months back almost purchased a set, but it requires a much higher level of programming to set up than I was willing to put in.

Sort of like consoles, right?

Still, on paper the difference is about 50%, versus an actual difference of about 500% ... that's a whole order of magnitude there ... something's not right, right?
 

cdawall

where the hell are my stars
Joined
Jul 23, 2006
Messages
27,680 (4.14/day)
Location
Houston
Sort of like consoles, right?

Still, since on paper the difference is about 50% VS 500% actual difference ... that's a whole order of magnitude there ... something's not right, right?

I guess you could compare to consoles in a way. Specific targeted coding for one specific item allows someone to fully take advantage of a product.
 

HTC

Joined
Apr 1, 2008
Messages
4,664 (0.77/day)
Location
Portugal
I guess you could compare to consoles in a way. Specific targeted coding for one specific item allows someone to fully take advantage of a product.

Still, that difference ... what could be the cause of such a massive difference between advertised and actual?
 

cdawall

where the hell are my stars
Joined
Jul 23, 2006
Messages
27,680 (4.14/day)
Location
Houston
Still, that difference ... what could be the cause of such a massive difference between advertised and actual?

Ask AMD.
 
Joined
Jan 17, 2018
Messages
64 (0.03/day)
Of course it is. Stratix 10 has been around for a while, in multiple variants.
https://www.tomshardware.co.uk/intel-stratix-10-tx-fpga,news-57969.html

What you're thinking about is the latest Intel-built accelerator.
No, at least not the ones Intel has been plugging lately. They have 3 versions; the HBM version that they've been plugging since last year has yet to be released. None of the charts specify which card they show, but every piece of marketing I've seen has been about the HBM version, which I believe is the MX?

It's an FPGA these are getting programmed to do a specific task. You aren't seeing that with an nv/amd card,which are more general purpose.

These cards are out wrecking shop in the mining world as well posting huge numbers. I don't know who said the stratix 10 isn't out I have actually like held one in my hands and stuff a couple months back almost purchased a set, but it requires a much higher level of programming to set up than I was willing to put in.
My bad, I didn't specify. The version they've been plugging since last year, the one with HBM, isn't out, from everything I've read. Since Intel didn't say which one it benchmarked, I assumed it was the latest and greatest, especially since the chips are identical except for memory and some support for specific applications. They've been crowing about the HBM version for almost a year. If it is released to customers, I stand corrected.

BTW, not to knock the white paper, but if it actually is night and day faster, how could the record have been broken? From the whitepaper, Stratix should have it with half the cards in anybody's system.
 
Joined
Oct 27, 2009
Messages
1,179 (0.21/day)
Location
Republic of Texas
System Name [H]arbringer
Processor 4x 61XX ES @3.5Ghz (48cores)
Motherboard SM GL
Cooling 3x xspc rx360, rx240, 4x DT G34 snipers, D5 pump.
Memory 16x gskill DDR3 1600 cas6 2gb
Video Card(s) blah bigadv folder no gfx needed
Storage 32GB Sammy SSD
Display(s) headless
Case Xigmatek Elysium (whats left of it)
Audio Device(s) yawn
Power Supply Antec 1200w HCP
Software Ubuntu 10.10
Benchmark Scores http://valid.canardpc.com/show_oc.php?id=1780855 http://www.hwbot.org/submission/2158678 http://ww
Xeons have less on chip I/O for these cards so EPYC are naturally better suited for these configurations. I apologise if i'm just being overly sensitive but it always seems like people have discredit AMD at any given opportunity.

AMD does have more raw lanes, but without PCIe switches to buffer between CPU and accelerator they easily get bottlenecked by the x16 GMI lanes between dies.
130ns between dies on the same socket, and 250ns between dies on opposing sockets... so card 1 to card 2 is 130ns... card 1 to card 4 is 250ns, and card 2 to card 4 is 380ns... So long as things don't have to talk to each other, and you aren't hanging nvme as well... you can survive without a switch, otherwise you will quickly saturate the internal gmi and xgmi interconnects. Those are idle latencies btw...
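The three card-to-card figures above add up if the longest path crosses one same-socket link and one cross-socket link. Here is a minimal sketch of that arithmetic as a shortest-path lookup. The topology (which card hangs off which die, and which dies are directly linked) is hypothetical and chosen only to reproduce the numbers in this post; it is not a map of a real EPYC system.

```python
import heapq

# Idle latencies in nanoseconds, as quoted in the post above.
SAME_SOCKET_NS = 130
CROSS_SOCKET_NS = 250

# Hypothetical links between dies (an assumption, see lead-in).
LINKS = {
    ("die_A", "die_B"): SAME_SOCKET_NS,   # same socket
    ("die_A", "die_C"): CROSS_SOCKET_NS,  # opposite socket
}

# Hypothetical card placement.
CARD_TO_DIE = {"card1": "die_A", "card2": "die_B", "card4": "die_C"}


def build_graph(links):
    """Turn the undirected link table into an adjacency list."""
    graph = {}
    for (a, b), ns in links.items():
        graph.setdefault(a, []).append((b, ns))
        graph.setdefault(b, []).append((a, ns))
    return graph


def latency_ns(src_card: str, dst_card: str) -> int:
    """Lowest summed link latency between two cards (Dijkstra)."""
    graph = build_graph(LINKS)
    start, goal = CARD_TO_DIE[src_card], CARD_TO_DIE[dst_card]
    best = {start: 0}
    queue = [(0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, ns in graph.get(node, []):
            cand = dist + ns
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                heapq.heappush(queue, (cand, nxt))
    raise ValueError(f"no path from {src_card} to {dst_card}")


for pair in (("card1", "card2"), ("card1", "card4"), ("card2", "card4")):
    print(pair, latency_ns(*pair), "ns")
```

Under this assumed layout, card 2 to card 4 costs 130 + 250 = 380 ns because traffic has to cross both links, which matches the figure quoted above.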

Rome should solve most of this by straight up doubling the PCIe lanes, bumping the RAM frequency and making a separate interconnect for accelerators.

Epyc is very competitive, but not without weakness.
 