
Linux Performance of AMD Rome vs Intel Cascade Lake, 1 Year On

Joined
Oct 2, 2015
Messages
3,152 (0.93/day)
Location
Argentina
System Name Ciel / Akane
Processor AMD Ryzen R5 5600X / Intel Core i3 12100F
Motherboard Asus Tuf Gaming B550 Plus / Biostar H610MHP
Cooling ID-Cooling 224-XT Basic / Stock
Memory 2x 16GB Kingston Fury 3600MHz / 2x 8GB Patriot 3200MHz
Video Card(s) Gainward Ghost RTX 3060 Ti / Dell GTX 1660 SUPER
Storage NVMe Kingston KC3000 2TB + NVMe Toshiba KBG40ZNT256G + HDD WD 4TB / NVMe WD Blue SN550 512GB
Display(s) AOC Q27G3XMN / Samsung S22F350
Case Cougar MX410 Mesh-G / Generic
Audio Device(s) Kingston HyperX Cloud Stinger Core 7.1 Wireless PC
Power Supply Aerocool KCAS-500W / Gigabyte P450B
Mouse EVGA X15 / Logitech G203
Keyboard VSG Alnilam / Dell
Software Windows 11
You add too much latency over PCIe.
 
Joined
Jun 10, 2014
Messages
3,016 (0.78/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
I hope not. Very wide SIMD is a fallacy in modern computer architecture design. SIMD was introduced in the days when other massively parallel compute hardware didn't exist and everyone thought frequency and transistor counts would just scale forever with ever-lower power consumption.
The point of SIMD is to do the same logic across a larger vector of data, saving a lot of unnecessary logic.

GPUs make CPU SIMD redundant. I can't think of a single application that couldn't be scaled up from x86 AVX to CUDA/OpenCL; in fact, the latter are way more robust anyway.
As I said in post #21, it has to do with overhead.
AVX is like having a tiny "GPU" with practically zero overhead, mixed in with other instructions across the execution ports, while an actual GPU is a separate processor that costs you thousands of clock cycles to talk to and has its own memory system. Skipping between the CPU and the GPU every other instruction is never going to be possible. Even if the GPU were on-die, there would always be a work-size threshold below which it's not worth sending something to the GPU. This should be obvious to anyone who has developed with this technology.
AVX and GPUs are both SIMD, but SIMD at different scales solving different problems.
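To put rough numbers on the work-size threshold, here is a minimal sketch of a break-even model. Every figure in it is a made-up assumption for illustration (a fixed 20,000-cycle dispatch cost, 1 cycle per element on the CPU, 0.05 on the GPU), not a measurement of any real hardware, and the function names are hypothetical.

```python
def cpu_time(n, cycles_per_elem=1.0):
    """Cycles to process n elements in place with CPU SIMD (no setup cost)."""
    return n * cycles_per_elem


def gpu_time(n, overhead_cycles=20_000.0, cycles_per_elem=0.05):
    """Cycles to process n elements on a GPU, including a fixed dispatch cost."""
    return overhead_cycles + n * cycles_per_elem


def break_even(overhead_cycles=20_000.0, cpu_cpe=1.0, gpu_cpe=0.05):
    """Smallest element count at which the GPU path is strictly faster.

    Returns None when the GPU is not faster per element, because then the
    fixed overhead can never be recovered.
    """
    if gpu_cpe >= cpu_cpe:
        return None
    # Solve overhead + n * gpu_cpe < n * cpu_cpe for the smallest integer n.
    return int(overhead_cycles / (cpu_cpe - gpu_cpe)) + 1


if __name__ == "__main__":
    n = break_even()
    print("GPU wins from", n, "elements upward under these assumptions")
```

Below the break-even size the CPU path wins no matter how fast the GPU is per element, which is the point about overhead: the threshold moves with the dispatch cost but never disappears.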
 
Joined
Jan 8, 2017
Messages
9,599 (3.27/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
You add too much latency over PCIe.

That's inconsequential for many data-parallel algorithms. If the data set is non-trivial, it won't matter that it took you 50 ms or whatever to move a couple of GB to a GPU if you are then going to iterate over it using thousands of threads; in fact, that was the entire philosophy behind GPGPU. These days there is practically no worthwhile data-parallel problem that a GPU wouldn't be able to solve faster. If host-device latency mattered that much, no one would be using GPUs for compute. Just to prove a point, I wrote a solver for a particular type of linear system, and by the time the data was something like 8-10 MB the GPU version was already faster, including the time it took for the data to be transferred over. Keep in mind that's not even as big as the CPU cache, and no one has any use for a linear system that small.
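A rough way to see why the transfer gets amortized is to count how many passes are made over the data once it sits in device memory. The sketch below is only a toy model: the link speed, latency, and memory bandwidths are assumed round numbers, not figures from the solver mentioned above, and the function names are hypothetical.

```python
def transfer_seconds(n_bytes, link_gb_per_s=12.0, latency_s=10e-6):
    """One host-to-device copy: fixed latency plus bytes over link bandwidth."""
    return latency_s + n_bytes / (link_gb_per_s * 1e9)


def gpu_total_seconds(n_bytes, passes, gpu_gb_per_s=400.0):
    """Upload once, then sweep the data `passes` times from device memory."""
    return transfer_seconds(n_bytes) + passes * n_bytes / (gpu_gb_per_s * 1e9)


def cpu_total_seconds(n_bytes, passes, cpu_gb_per_s=40.0):
    """Sweep the data `passes` times from host memory, with no transfer."""
    return passes * n_bytes / (cpu_gb_per_s * 1e9)


if __name__ == "__main__":
    size = 8e6  # 8 MB, the low end of the range mentioned in the post
    for passes in (1, 10, 100):
        print(passes, cpu_total_seconds(size, passes), gpu_total_seconds(size, passes))
```

Under these assumptions a single pass over 8 MB is faster on the CPU, while an iterative method that makes a hundred passes is faster on the GPU even with the copy included. The outcome depends on how much work is done per byte moved, so both sides of the argument can be right for different workloads.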

 
Joined
Oct 2, 2015
Messages
3,152 (0.93/day)
Location
Argentina
System Name Ciel / Akane
Processor AMD Ryzen R5 5600X / Intel Core i3 12100F
Motherboard Asus Tuf Gaming B550 Plus / Biostar H610MHP
Cooling ID-Cooling 224-XT Basic / Stock
Memory 2x 16GB Kingston Fury 3600MHz / 2x 8GB Patriot 3200MHz
Video Card(s) Gainward Ghost RTX 3060 Ti / Dell GTX 1660 SUPER
Storage NVMe Kingston KC3000 2TB + NVMe Toshiba KBG40ZNT256G + HDD WD 4TB / NVMe WD Blue SN550 512GB
Display(s) AOC Q27G3XMN / Samsung S22F350
Case Cougar MX410 Mesh-G / Generic
Audio Device(s) Kingston HyperX Cloud Stinger Core 7.1 Wireless PC
Power Supply Aerocool KCAS-500W / Gigabyte P450B
Mouse EVGA X15 / Logitech G203
Keyboard VSG Alnilam / Dell
Software Windows 11
That's inconsequential for many data-parallel algorithms. If the data set is non-trivial, it won't matter that it took you 50 ms or whatever to move a couple of GB to a GPU if you are then going to iterate over it using thousands of threads; in fact, that was the entire philosophy behind GPGPU. These days there is practically no worthwhile data-parallel problem that a GPU wouldn't be able to solve faster. If host-device latency mattered that much, no one would be using GPUs for compute. Just to prove a point, I wrote a solver for a particular type of linear system, and by the time the data was something like 8-10 MB the GPU version was already faster, including the time it took for the data to be transferred over. Keep in mind that's not even as big as the CPU cache, and no one has any use for a linear system that small.

For our use case in yuzu, it would be too much latency, and we are already very bandwidth limited. FMA and AVX2 already boost speed by a nice 40%, and AVX-512 would help a lot, far more than using GPGPU capabilities. That said, we could do ASTC decoding via OpenCL/CUDA if desktop GPUs never add support; that would beat any CPU instruction set.
 
Joined
Apr 30, 2011
Messages
2,722 (0.54/day)
Location
Greece
Processor AMD Ryzen 5 5600@80W
Motherboard MSI B550 Tomahawk
Cooling ZALMAN CNPS9X OPTIMA
Memory 2*8GB PATRIOT PVS416G400C9K@3733MT_C16
Video Card(s) Sapphire Radeon RX 6750 XT Pulse 12GB
Storage Sandisk SSD 128GB, Kingston A2000 NVMe 1TB, Samsung F1 1TB, WD Black 10TB
Display(s) AOC 27G2U/BK IPS 144Hz
Case SHARKOON M25-W 7.1 BLACK
Audio Device(s) Realtek 7.1 onboard
Power Supply Seasonic Core GC 500W
Mouse Sharkoon SHARK Force Black
Keyboard Trust GXT280
Software Win 7 Ultimate 64bit/Win 10 pro 64bit/Manjaro Linux
Even FX CPUs on Linux were much closer in performance to the Intel CPUs of that era than they were on Windows. Intel's compilers on Windows made Intel look better than it was, until recently, when AMD invested heavily in Ryzen and the software platform around Windows.
 
Joined
Jan 8, 2017
Messages
9,599 (3.27/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
For our use case in yuzu, it would be too much latency, and we are already very bandwidth limited. FMA and AVX2 already boost speed by a nice 40%, and AVX-512 would help a lot, far more than using GPGPU capabilities.

See, that's the thing. That means AVX-512 wouldn't help at all. 40% scaling with AVX2 is already pretty bad, and it indicates that the limiting factor is not compute but rather memory bandwidth or branching. That's the problem with wider SIMD: it needs more bandwidth, which is already scarce.
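The bandwidth argument can be illustrated with a simple roofline-style bound, where attainable throughput is the lower of peak compute and memory bandwidth times arithmetic intensity. This is a sketch under assumed numbers (40 GB/s of memory bandwidth, 100 versus 200 GFLOP/s peak for the narrower and wider units); none of these are yuzu measurements.

```python
def attainable_gflops(peak_gflops, mem_gb_per_s, flops_per_byte):
    """Roofline bound: compute ceiling or bandwidth ceiling, whichever is lower."""
    return min(peak_gflops, mem_gb_per_s * flops_per_byte)


if __name__ == "__main__":
    bandwidth = 40.0                     # GB/s, assumed
    narrow_peak, wide_peak = 100.0, 200.0  # GFLOP/s, assumed 256-bit vs 512-bit

    # Low arithmetic intensity: 0.25 FLOP per byte loaded (memory-bound).
    print(attainable_gflops(narrow_peak, bandwidth, 0.25),
          attainable_gflops(wide_peak, bandwidth, 0.25))

    # High arithmetic intensity: 16 FLOP per byte loaded (compute-bound).
    print(attainable_gflops(narrow_peak, bandwidth, 16.0),
          attainable_gflops(wide_peak, bandwidth, 16.0))
```

In the memory-bound case the two widths hit the same ceiling, so doubling the vector width buys nothing. In the compute-bound case the wider unit is twice as fast. Which case a given workload falls into has to be measured; the model only shows why a modest gain from AVX2 can hint at a bandwidth limit.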
 
Joined
Oct 2, 2015
Messages
3,152 (0.93/day)
Location
Argentina
System Name Ciel / Akane
Processor AMD Ryzen R5 5600X / Intel Core i3 12100F
Motherboard Asus Tuf Gaming B550 Plus / Biostar H610MHP
Cooling ID-Cooling 224-XT Basic / Stock
Memory 2x 16GB Kingston Fury 3600MHz / 2x 8GB Patriot 3200MHz
Video Card(s) Gainward Ghost RTX 3060 Ti / Dell GTX 1660 SUPER
Storage NVMe Kingston KC3000 2TB + NVMe Toshiba KBG40ZNT256G + HDD WD 4TB / NVMe WD Blue SN550 512GB
Display(s) AOC Q27G3XMN / Samsung S22F350
Case Cougar MX410 Mesh-G / Generic
Audio Device(s) Kingston HyperX Cloud Stinger Core 7.1 Wireless PC
Power Supply Aerocool KCAS-500W / Gigabyte P450B
Mouse EVGA X15 / Logitech G203
Keyboard VSG Alnilam / Dell
Software Windows 11
Don't worry, the industry will "solve it" with 8GHz RAM sticks.
 
Joined
Jan 8, 2017
Messages
9,599 (3.27/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
They wish. Even if you get faster memory, the latency only gets bigger; GPUs are immune to that because of the way threads are scheduled. It's a losing battle anyway, because the number of cores will keep increasing, so you'll never have enough bandwidth for super-wide SIMD. Companies like Intel will have to accept that CPUs should remain CPUs and stop emulating GPUs. Actually, they probably already have; I'm willing to bet we won't see anything past 512-bit SIMD for a very, very long time.
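As a back-of-the-envelope check on the cores-versus-bandwidth point, the sketch below compares what every core would need if it streamed one full vector from memory per cycle against a theoretical DRAM peak. The configuration (64 cores at 3 GHz, eight channels of DDR4-3200 at 8 bytes per transfer) is an assumption chosen for illustration, and real code gets much of its data from cache, so treat this as a worst case rather than a prediction.

```python
def streaming_demand_gb_per_s(cores, simd_bits, ghz, loads_per_cycle=1.0):
    """GB/s needed if every core loads `loads_per_cycle` full vectors per cycle."""
    bytes_per_load = simd_bits / 8
    return cores * bytes_per_load * loads_per_cycle * ghz


def dram_peak_gb_per_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak DRAM bandwidth in GB/s for the given channel setup."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


if __name__ == "__main__":
    supply = dram_peak_gb_per_s(channels=8, mega_transfers_per_s=3200)
    for bits in (256, 512):
        demand = streaming_demand_gb_per_s(cores=64, simd_bits=bits, ghz=3.0)
        print(bits, "bit:", demand, "GB/s wanted vs", supply, "GB/s available")
```

Doubling the vector width doubles the worst-case demand while the supply stays the same, and adding cores widens the gap further.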
 