AMD Patents Chiplet Architecture for Radeon GPUs

Joined
Feb 3, 2017
Messages
3,879 (1.33/day)
Processor Ryzen 7800X3D
Motherboard ROG STRIX B650E-F GAMING WIFI
Memory 2x16GB G.Skill Flare X5 DDR5-6000 CL36 (F5-6000J3636F16GX2-FX5)
Video Card(s) INNO3D GeForce RTX™ 4070 Ti SUPER TWIN X2
Storage 2TB Samsung 980 PRO, 4TB WD Black SN850X
Display(s) 42" LG C2 OLED, 27" ASUS PG279Q
Case Thermaltake Core P5
Power Supply Fractal Design Ion+ Platinum 760W
Mouse Corsair Dark Core RGB Pro SE
Keyboard Corsair K100 RGB
VR HMD HTC Vive Cosmos
It does look like that's the case but it's still strange that they chose to do it this way. This means the chiplet connected to the bus has to do a lot of extra scheduling, and since I imagine they'd all have to be identical, this seems rather wasteful and could generate a lot of overhead.
Why does the first chiplet need to do scheduling? First die can do communication and maybe some basic scheduling, leaving rest to schedulers on the chiplets. Didn't GPU in Xbox One have multiple GCPs?
 
Joined
Jan 8, 2017
Messages
9,627 (3.28/day)
System Name Good enough
Processor AMD Ryzen R9 7900 - Alphacool Eisblock XPX Aurora Edge
Motherboard ASRock B650 Pro RS
Cooling 2x 360mm NexXxoS ST30 X-Flow, 1x 360mm NexXxoS ST30, 1x 240mm NexXxoS ST30
Memory 32GB - FURY Beast RGB 5600 Mhz
Video Card(s) Sapphire RX 7900 XT - Alphacool Eisblock Aurora
Storage 1x Kingston KC3000 1TB 1x Kingston A2000 1TB, 1x Samsung 850 EVO 250GB , 1x Samsung 860 EVO 500GB
Display(s) LG UltraGear 32GN650-B + 4K Samsung TV
Case Phanteks NV7
Power Supply GPS-750C
Why does the first chiplet need to do scheduling? First die can do communication and maybe some basic scheduling, leaving rest to schedulers on the chiplets. Didn't GPU in Xbox One have multiple GCPs?

If by GCP you mean the "Graphics Command Processor" then that's a no, that's a single central block which interprets and schedules onto the next blocks what it receives off the PCIe bus. With a monolithic GPU there is only one of those and there are no problems. The question is how this is handled in an MCM design, and there are really only two ways: each chiplet uses its own GCP independently, which implies they all have access to the bus, or they use the GCP of just the first chiplet, but then that puts pressure only on the scheduling logic of that first chiplet.

Having independent GCPs would have made a lot more sense, but it was probably more complicated on the host side to distribute commands.
 
Joined
May 2, 2017
Messages
7,762 (2.75/day)
Location
Back in Norway
System Name Hotbox
Processor AMD Ryzen 7 5800X, 110/95/110, PBO +150Mhz, CO -7,-7,-20(x6),
Motherboard ASRock Phantom Gaming B550 ITX/ax
Cooling LOBO + Laing DDC 1T Plus PWM + Corsair XR5 280mm + 2x Arctic P14
Memory 32GB G.Skill FlareX 3200c14 @3800c15
Video Card(s) PowerColor Radeon 6900XT Liquid Devil Ultimate, UC@2250MHz max @~200W
Storage 2TB Adata SX8200 Pro
Display(s) Dell U2711 main, AOC 24P2C secondary
Case SSUPD Meshlicious
Audio Device(s) Optoma Nuforce μDAC 3
Power Supply Corsair SF750 Platinum
Mouse Logitech G603
Keyboard Keychron K3/Cooler Master MasterKeys Pro M w/DSA profile caps
Software Windows 10 Pro
If by GCP you mean the "Graphics Command Processor" then that's a no, that's a single central block which interprets and schedules onto the next blocks what it receives off the PCIe bus. With a monolithic GPU there is only one of those and there are no problems. The question is how this is handled in an MCM design, and there are really only two ways: each chiplet uses its own GCP independently, which implies they all have access to the bus, or they use the GCP of just the first chiplet, but then that puts pressure only on the scheduling logic of that first chiplet.

Having independent GCPs would have made a lot more sense, but it was probably more complicated on the host side to distribute commands.
Given the use of passive interconnects I am reading this as an attempt to make this work as close to a monolithic GPU as possible, likely precisely to avoid the timing issues of active interconnects and multiple schedulers. As long as the latency of the interconnects is low enough that should be much simpler to do than distributed scheduling, after all. If I read the patent right the LLC is coherent across all chiplets, which should make such an approach relatively easy (all relevant data in cache everywhere, so any instruction can be executed anywhere) at the cost of die area for that cache and the power cost of sending that data out to more than one chiplet.
 
Joined
Dec 26, 2006
Messages
3,907 (0.59/day)
Location
Northern Ontario Canada
Processor Ryzen 5700x
Motherboard Gigabyte X570S Aero G R1.1 BiosF5g
Cooling Noctua NH-C12P SE14 w/ NF-A15 HS-PWM Fan 1500rpm
Memory Micron DDR4-3200 2x32GB D.S. D.R. (CT2K32G4DFD832A)
Video Card(s) AMD RX 6800 - Asus Tuf
Storage Kingston KC3000 1TB & 2TB & 4TB Corsair MP600 Pro LPX
Display(s) LG 27UL550-W (27" 4k)
Case Be Quiet Pure Base 600 (no window)
Audio Device(s) Realtek ALC1220-VB
Power Supply SuperFlower Leadex V Gold Pro 850W ATX Ver2.52
Mouse Mionix Naos Pro
Keyboard Corsair Strafe with browns
Software W10 22H2 Pro x64
Cool stuff.

I remember the 4870 X2 where it was 2 chips on 1 pcb connected with a PLX chip or something?
 
Joined
Sep 28, 2012
Messages
983 (0.22/day)
System Name Poor Man's PC
Processor Ryzen 7 9800X3D
Motherboard MSI B650M Mortar WiFi
Cooling Thermalright Phantom Spirit 120 with Arctic P12 Max fan
Memory 32GB GSkill Flare X5 DDR5 6000Mhz
Video Card(s) XFX Merc 310 Radeon RX 7900 XT
Storage XPG Gammix S70 Blade 2TB + 8 TB WD Ultrastar DC HC320
Display(s) Xiaomi G Pro 27i MiniLED
Case Asus A21 Case
Audio Device(s) MPow Air Wireless + Mi Soundbar
Power Supply Enermax Revolution DF 650W Gold
Mouse Logitech MX Anywhere 3
Keyboard Logitech Pro X + Kailh box heavy pale blue switch + Durock stabilizers
VR HMD Meta Quest 2
Benchmark Scores Who need bench when everything already fast?
Basically Crossfire on same die. FP32 ray tracing gonna benefit from this, but I doubt that gonna impact traditional raster. I see AMD added more of fixed function, and partially offload FMA to CPU as well.
 
Joined
Feb 11, 2009
Messages
5,642 (0.97/day)
System Name Cyberline
Processor Intel Core i7 2600k -> 12600k
Motherboard Asus P8P67 LE Rev 3.0 -> Gigabyte Z690 Auros Elite DDR4
Cooling Tuniq Tower 120 -> Custom Watercoolingloop
Memory Corsair (4x2) 8gb 1600mhz -> Crucial (8x2) 16gb 3600mhz
Video Card(s) AMD RX480 -> RX7800XT
Storage Samsung 750 Evo 250gb SSD + WD 1tb x 2 + WD 2tb -> 2tb MVMe SSD
Display(s) Philips 32inch LPF5605H (television) -> Dell S3220DGF
Case antec 600 -> Thermaltake Tenor HTCP case
Audio Device(s) Focusrite 2i4 (USB)
Power Supply Seasonic 620watt 80+ Platinum
Mouse Elecom EX-G
Keyboard Rapoo V700
Software Windows 10 Pro 64bit
Cool stuff.

I remember the 4870 X2 where it was 2 chips on 1 pcb connected with a PLX chip or something?

and the 3870 X2 before it but that was basically just crossfire on a single card, like the 7950 GX2....well that was actually two cards.... ok the 9800 GX2...wait that was also really 2 cards....whatever, chiplet design is different!
 
Joined
Jul 16, 2014
Messages
8,224 (2.14/day)
Location
SE Michigan
System Name Dumbass
Processor AMD Ryzen 7800X3D
Motherboard ASUS TUF gaming B650
Cooling Artic Liquid Freezer 2 - 420mm
Memory G.Skill Sniper 32gb DDR5 6000
Video Card(s) GreenTeam 4070 ti super 16gb
Storage Samsung EVO 500gb & 1Tb, 2tb HDD, 500gb WD Black
Display(s) 1x Nixeus NX_EDG27, 2x Dell S2440L (16:9)
Case Phanteks Enthoo Primo w/8 140mm SP Fans
Audio Device(s) onboard (realtek?) - SPKRS:Logitech Z623 200w 2.1
Power Supply Corsair HX1000i
Mouse Steeseries Esports Wireless
Keyboard Corsair K100
Software windows 10 H
Benchmark Scores https://i.imgur.com/aoz3vWY.jpg?2
It has been longer. 3dfx's SLI, some (rare) dual-GPU Voodoo2 cards and their planned multi-GPU Voodoo4/5 cards were a thing since 1998 or so. With the changes in GPUs and rendering methods, that naive an implementation no longer worked. The holy grail for gaming is still an MCM GPU that would appear as a single GPU to software or the API.

MCM was specifically stated as a possible goal with R600 and small chips strategy. It didn't pan out too well (for any manufacturer so far).
Ahh, I'd forgotten about 3dfx, I had one of their cards once upon a time. Matrox tried to do something with them, I think.
 
Joined
Jul 13, 2016
Messages
3,437 (1.10/day)
Processor Ryzen 7800X3D
Motherboard ASRock X670E Taichi
Cooling Noctua NH-D15 Chromax
Memory 32GB DDR5 6000 CL30
Video Card(s) MSI RTX 4090 Trio
Storage P5800X 1.6TB 4x 15.36TB Micron 9300 Pro 4x WD Black 8TB M.2
Display(s) Acer Predator XB3 27" 240 Hz
Case Thermaltake Core X9
Audio Device(s) JDS Element IV, DCA Aeon II
Power Supply Seasonic Prime Titanium 850w
Mouse PMM P-305
Keyboard Wooting HE60
VR HMD Valve Index
Software Win 10
There just wasn't a need for them. The only proper use of chiplets is when you've exhausted every other trick in the book and you simply can't make a faster chip without colossal costs, that's why AMD debuted chiplets in CPUs with their server products first because it was the only way to beat Intel in a cost effective fashion.

Chiplets aren't just cost effective though. The University of Toronto demonstrated that with an active interposer and the right network topology on that interposer, latencies of a chiplet-based CPU can be better than those of a monolithic design, especially as core counts increase.

Chiplet based CPUs are also far better for binning. Compared to a monolithic design where a single defect can reduce the value of the entire die, with chiplets you can just bin the best dies for the best CPUs. With chiplets you also get to pick which specific dies to use in each specific CPU model. This is what allows AMD to make CPUs like the 5950X. They are cherry picking the best dies to achieve a lower power consumption and higher performance.

Other reasons also include cost and yield. Suffice it to say, implying that AMD adopted chiplets only because it was cost effective is misleading. I think it's pretty clear AMD knew this design's other advantages from the outset.
 
Joined
Feb 3, 2017
Messages
3,879 (1.33/day)
Chiplets aren't just cost effective though. The University of Toronto demonstrated that with an active interposer and the right network topology on that interposer, latencies of a chiplet-based CPU can be better than those of a monolithic design, especially as core counts increase.
I seriously doubt it. Do you have a link to the paper?
 
Joined
May 2, 2017
Messages
7,762 (2.75/day)
Basically Crossfire on same die. FP32 ray tracing gonna benefit from this, but I doubt that gonna impact traditional raster. I see AMD added more of fixed function, and partially offload FMA to CPU as well.
Doesn't sound like CF. This seems to present itself to the system as a single GPU, in other words it won't look like anything special to games, hopefully requiring no tuning beyond drivers. CF support was highly reliant on developer support, which is of course why it (and SLI) died. If they're actually implementing this for consumer use it would need to be a low-to-no-effort solution.
 
Joined
Apr 24, 2020
Messages
2,797 (1.61/day)
What do you mean it's not global ? You do know monolithic GPUs already have multiple memory controllers that serve different CUs, right ? If you divide the chip nothing changes, you still have multiple controllers to feed multiple CUs in each chiplet.

Point of the patent ? I don't know, go ask their lawyers. Apple patented a rectangle with rounded edges so I don't think there is much point in wondering why something gets patented.

Erm... no. We've discussed this before Vya, but your understanding of memory controllers on GPUs is wrong. All the memory controllers on Vega and RDNA / RDNA2 serve a set of L2 caches, and those L2 caches then serve all CUs.

Memory is global: all CUs on a GPU (like MI100 or RDNA2) have the potential to access any memory location. Now, L2 caches are pretty smart: the L2 cache could be pulling from PCIe 4.0 (aka: CPU RAM), and the CUs are none-the-wiser. But keeping that data in-sync across a GPU is very important within a ROCm or OpenCL kernel. Heck, even across kernels, you may share some VRAM and keeping those views consistent requires a bit of effort. (Mostly because L1 caches aren't very coherent on GPUs... so you need to worry about keeping Constant-L1 vs __shared__ vs Vector-L1 in sync).

On AMD architectures at least, the memory model seems to be coherency in the L2 cache. But if you split things up between chiplets, you're going to have separate L2 caches, and things are going to get a LOT more complicated. I can't think of any automated way of keeping separate L2 caches in sync: the programmer simply has to adapt to the new architecture. Maybe some "easy cases" can be detected (two kernels touching two different memory sets independently)... but it's non-obvious to me how that kind of system would work.

Alternatively: AMD could keep L2 caches coherent by creating a very large-and-bulky communication mechanism between L2 caches (a "GPU Infinity Fabric"). Which is what this patent appears to be doing with that HBX-crosslink.

[Attached image: 1609804674940.png]


Hmmm... L3 cache? That's different. RDNA is L0, L1, and L2 (where L1 and L2 are in the proper place in the above diagram). L3 is new.

-------

I don't quite remember the last time we talked about this issue. But I'll bring up the same "Framebuffer" issue as before: the "Framebuffer" is the final rendering of any image on the GPU: the sum of transparency / fog / rasterization / raytracing / all results right before it enters the monitor. The Framebuffer is a hard-hit portion of RAM which must be shared and even has some order-dependent effects going on (transparency is strongly ordered: A (80% alpha) on top of B (50% alpha) on top of C will result in a grossly different picture of C on top of B on top of A). That's the kind of memory sync and memory sharing that happens every render of every frame in every video game.
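To make the ordering point concrete, here is a minimal sketch of the standard "over" blend applied to two translucent layers on an opaque background. Only the 80% and 50% alpha values come from the post; the colours and the function name `over` are made up for illustration, and the example swaps the two translucent layers rather than reversing all three.

```python
def over(src_rgb, src_alpha, dst_rgb):
    """Blend a translucent source colour over an opaque destination colour."""
    return tuple(src_alpha * s + (1.0 - src_alpha) * d
                 for s, d in zip(src_rgb, dst_rgb))

# Hypothetical colours; only the alpha values mirror the post.
A = ((1.0, 0.0, 0.0), 0.8)   # red layer, 80% alpha
B = ((0.0, 0.0, 1.0), 0.5)   # blue layer, 50% alpha
C = (0.0, 1.0, 0.0)          # opaque green background

# A on top of B on top of C: blend B onto C first, then A onto that result.
a_over_b = over(A[0], A[1], over(B[0], B[1], C))

# Same layers, opposite order: B on top of A on top of C.
b_over_a = over(B[0], B[1], over(A[0], A[1], C))

print("A over B over C:", a_over_b)
print("B over A over C:", b_over_a)
```

The two results differ, which is why every writer to a given pixel has to agree on the order in which the blends land.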
 
Joined
Apr 19, 2013
Messages
296 (0.07/day)
System Name Darkside
Processor R7 3700X
Motherboard Aorus Elite X570
Cooling Deepcool Gammaxx l240
Memory Thermaltake Toughram DDR4 3600MHz CL18
Video Card(s) Gigabyte RX Vega 64 Gaming OC
Storage ADATA & WD 500GB NVME PCIe 3.0, many WD Black 1-3TB HD
Display(s) Samsung C27JG5x
Case Thermaltake Level 20 XL
Audio Device(s) iFi xDSD / micro iTube2 / micro iCAN SE
Power Supply EVGA 750W G2
Mouse Corsair M65
Keyboard Corsair K70 LUX RGB
Benchmark Scores Not sure, don't care
I remember years ago when I put forth a very similar concept and I got laughed out of the thread. Glad to see AMD trying to implement what I assumed was a logical path forward, and given their track record over the past 4 years I believe they can make this work. Exciting times! :cool:
 
Joined
Jan 8, 2017
Messages
9,627 (3.28/day)
On AMD architectures at least, the memory model seems to be coherency in the L2 cache. But if you split things up between chiplets, you're going to have separate L2 caches, and things are going to get a LOT more complicated. I can't think of any automated way of keeping separate L2 caches in sync: the programmer simply has to adapt to the new architecture. Maybe some "easy cases" can be detected (two kernels touching two different memory sets independently)... but it's non-obvious to me how that kind of system would work.

Alternatively: AMD could keep L2 caches coherent by creating a very large-and-bulky communication mechanism between L2 caches (a "GPU Infinity Fabric"). Which is what this patent appears to be doing with that HBX-crosslink.

No, I think you do not understand that there is no limitation here. When exactly do you need coherency across chiplets for L2 caches ? In which scenario are two wavefronts from separate chiplets going to need to, for example, write to the exact same address that is cached in each L2 cache. Scratch that, how would both of those caches end up with a copy of that data in the first place ? Shaders simply do not generate conflicts like those, they're SPMD, that's why they run so fast and hide latencies so well. If that wasn't the case, the performance hit of having CUs needing to read/write from shared memory in each and every other CU would be catastrophic.

I don't quite remember the last time we talked about this issue. But I'll bring up the same "Framebuffer" issue as before: the "Framebuffer" is the final rendering of any image on the GPU: the sum of transparency / fog / rasterization / raytracing / all results right before it enters the monitor. The Framebuffer is a hard-hit portion of RAM which must be shared and even has some order-dependent effects going on (transparency is strongly ordered: A (80% alpha) on top of B (50% alpha) on top of C will result in a grossly different picture of C on top of B on top of A). That's the kind of memory sync and memory sharing that happens every render of every frame in every video game.

I still can't see how that has anything to do with this, the order in which A B and C are applied matters but how each stage is written into memory isn't because it's data independent anyway.
 
Joined
Apr 24, 2020
Messages
2,797 (1.61/day)
When exactly do you need coherency across chiplets for L2 caches ? In which scenario are two wavefronts from separate chiplets going to need to, for example, write to the exact same address that is cached in each L2 cache.

Pretty much every stage of the modern GPU pipeline has some major synchronization step.

Vertex shaders will read and write all the data in parallel. Then when the geometry shaders run after those, they'll all read and write the data together. Finally, pixel shaders process the data all in parallel and eventually deposit their results into the framebuffer.

Yeah I'm skipping a few steps, but that's the gist of it. A modern GPU shares a lot of information in parallel. At a minimum, all of those shaders are performing shared reads, and there needs to be a synchronization point where all of that data is merged together (especially because any transparency algorithm must be Z-order sorted and then applied from furthest-away to closest ala the Painter's algorithm).

There's only one framebuffer (maybe two if you're playing VR for your two eyes: but those two framebuffers must remain in sync regardless). Any geometry shader which tessellates... any vertex shader custom-animation thing... any shared data at all must be synchronized across all compute units. There's only one mipmapped copy of Gordon Freeman's face in VRAM, and that copy is shared to all wavefronts that may want to render any piece of Gordon Freeman's face (texture, vertex, tessellation data).

-------

There's a reason why Crossfire and SLI failed: the data synchronization issue is actually pretty difficult to solve. There was a time when we thought that simple data-sharing patterns were good enough... but as effects got more complicated (more transparency, more sharing of shadows, more complicated tessellation patterns or whatever), data-sharing became more and more important.

I'm not saying this data-synchronization problem is unsolvable, but there's a reason why AMD is putting a big fat "HBX Crosslink" between their chiplets in this patent. Maybe this HBX-crosslink is good enough, or maybe not. AMD is surely going to build a prototype now (if they haven't already). If it's good enough, maybe we'll see it in a product in a few years. If not, then it will be tossed into the trashbin and another design will be made.

-------

EDIT: I realize I'm still being handwavy. Here's a specific example which clearly relies upon shared-data states: Claybook (https://gpuopen.com/wp-content/uplo...rial_clay_sim_and_raytracing_in_claybook.pptx).

The physics + fluids of Claybook were implemented on the GPU. It's clear that the world-state (where tracks are, for example) is shared between all parallel processors and that various parts of the scene (shadows, vertices, etc. etc.) morph to that shared state.
 
Joined
Apr 11, 2008
Messages
93 (0.02/day)
Processor AMD 1700X
Motherboard Gigabyte X570 Aorus Master
Cooling Corsair H110iGT
Memory 16GB Corsair VENGEANCE LPX 3200 CL16
Video Card(s) VEGA 64 w/ Corsair H100
Storage 480GB Patriot Hellfire -- WB Blue SN550 1TB -- WD Red 2TB Raid 0 -- WD Black 4TB
Display(s) Dual Benq 24"
Case Corsair C70 VENGEANCE
Crossfire support without needing Crossfire software.
NIiiiiiiiiice!
 
Joined
Sep 28, 2012
Messages
983 (0.22/day)
Doesn't sound like CF. This seems to present itself to the system as a single GPU, in other words it won't look like anything special to games, hopefully requiring no tuning beyond drivers. CF support was highly reliant on developer support, which is of course why it (and SLI) died. If they're actually implementing this for consumer use it would need to be a low-to-no-effort solution.

Figuratively :D
Implicit DX12 mGPU has been around for years and can theoretically present a system with multiple adapters as just one GPU, yet no developers take advantage of it. As you said, EA's Battlefield is a prime example: I could run Battlefield 1 with Vega 56 Crossfire, but it was suddenly broken in Battlefield V, even though that game needs more GPU power. Several things could be in play; the most obvious is that temporal anti-aliasing will not work and developers need to write another thousand lines of code. This approach would eliminate the need for API support and bloated code, since the hardware itself handles it.
Still, this is too much for an average Joe like me. I mean, you could "easily" parallelize a serial processor (chiplet design on Ryzen CPUs), but what does it take to parallelize an already parallel processor (GPU), or to arrange all those parallel processors as serial? :confused:
 
Joined
Jan 8, 2017
Messages
9,627 (3.28/day)
Vertex shaders will read and write all the data in parallel. Then when the geometry shaders run after those, they'll all read and write the data together. Finally, pixel shaders process the data all in parallel and eventually deposit their results into the framebuffer.

Yes, they read and write data in parallel but they do so with no dependencies outside a single CU, you still didn't provide an actual example when this occurs in practice.
There's only one framebuffer (maybe two if you're playing VR for your two eyes: but those two framebuffers must remain in sync regardless). Any geometry shader which tessellates... any vertex shader custom-animation thing... any shared data at all must be synchronized across all compute units. There's only one mipmapped copy of Gordon Freeman's face in VRAM, and that copy is shared to all wavefronts that may want to render any piece of Gordon Freeman's face (texture, vertex, tessellation data).

You can't synchronize globally from within a compute unit, that's the point. Well, you can attempt to, but nothing guarantees that you won't end up in a deadlock. The mipmap is going to generate a texture reference which will have to be read-only, therefore it won't pose any problems when multiple wavefronts need to fetch it. Reading and writing simultaneously from a single texture, for instance, is undefined behavior because there are no mechanisms to ensure coherency; hell, until relatively recently APIs wouldn't even let you write to a texture. The reason for all of this is that there is no proper way to know if a thread from compute unit A finishes executing before or after one in compute unit B.


But I have to point out again that if every wavefront needing to fetch something would require synchronization the performance implications would be ridiculous.
 
Joined
Apr 24, 2020
Messages
2,797 (1.61/day)
You can't synchronize globally from within a compute unit, that's the point. Well, you can attempt to, but nothing guarantees that you won't end up in a deadlock.

You know GPUs have globally ordered atomic reads/writes to the L2 cache, right? Both NVidia and AMD.

you still didn't provide an actual example when this occurs in practice.

So you didn't read the Claybook powerpoint. Maybe you don't have powerpoint?

[Attached image: 1609863047684.png]


Atomic reads/writes to the same locations in memory. Some of it is optimized to an atomic read/write to __local__ memory, but some of it must happen globally synchronized.

That's a global write across compute units being consolidated into the same memory locations. The only way to synchronize that effectively is through atomic operations, which modern GPUs can actually do very quickly.

------

It's hard to come up with a single example, because most examples of writing to the same memory location are considered trivial. I don't think people talk about it because it's happening all the freaking time. That's the very reason atomics + memory barriers are implemented on modern GPUs.

The hard part is finding ways to localize and minimize the naive approach. The Claybook devs have noted a lot of areas where they were able to synchronize to __local__ or __shared__ memory instead (which is much faster for this kind of stuff). But that doesn't change the fact that the ultimate data must be synchronized to the global state eventually... and such synchronization is going to happen in different CUs.
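The "accumulate locally, then merge globally" pattern described above can be sketched on a CPU. This is only an analogy under stated assumptions: Python threads stand in for compute units, a private list stands in for __local__ / __shared__ memory, and a `threading.Lock` stands in for hardware L2 atomics. None of the names below come from Claybook or from any GPU API.

```python
import threading

NUM_BINS = 4
global_hist = [0] * NUM_BINS      # shared state every worker must agree on
global_lock = threading.Lock()    # stand-in for a hardware atomic add


def worker(values):
    # Accumulate privately first, with no synchronization at all.
    local_hist = [0] * NUM_BINS
    for v in values:
        local_hist[v % NUM_BINS] += 1
    # One synchronized merge per worker instead of one per value.
    with global_lock:
        for i, count in enumerate(local_hist):
            global_hist[i] += count


def run(num_workers=8, items_per_worker=1000):
    for i in range(NUM_BINS):
        global_hist[i] = 0
    threads = [
        threading.Thread(
            target=worker,
            args=(range(w * items_per_worker, (w + 1) * items_per_worker),),
        )
        for w in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(global_hist)


result = run()
print(result, sum(result))
```

The local step cuts the number of synchronized operations, but the final merge still touches one shared structure, which is the part that has to stay coherent no matter which worker (or chiplet) produced the data.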

I dunno, do you need to see the simpler programming examples to believe me?

 
Deleted member 24505

Guest
That chip to chip interconnect better be the speed of light.

It won't be. it's not on the cpu's so why would this be different.

Is there a good reason for them to use the same design on the gpu as the cpu, or are they just stroking themselves because it seems to be working so well on the cpu's?

Is there actually an advantage to this on the gpu? why is monolithic gpu design suddenly cack? or is everyone gonna just go hoorah for Amd as Intel are now the "underdog"
 
Joined
Jan 8, 2017
Messages
9,627 (3.28/day)
You know GPUs have globally ordered atomic reads/writes to the L2 cache, right? Both NVidia and AMD.

What do atomic operations have to do with any of that? Atomic operations on GPUs simply work by serializing the execution of the threads, not by synchronizing them globally. Again, for the last time, it would be insane to synchronize potentially hundreds of thousands of threads, with most of them not even being in-flight.

And they are dog slow by the way for that reason.
 
Joined
Apr 24, 2020
Messages
2,797 (1.61/day)
What do atomic operations have to do with any of that? Atomic operations on GPUs simply work by serializing the execution of the threads, not by synchronizing them globally. Again, for the last time, it would be insane to synchronize potentially hundreds of thousands of threads, with most of them not even being in-flight.

If the threads are not in-flight, then there's absolutely no chance of memory race conditions. Atomics only need to serialize in-flight threads.

My point remains: there are global data-structures on GPUs that must remain synchronized. Currently, that synchronization mechanism is L2 cache atomics. Your proposal to split the data between chiplets ignores the synchronization mechanism of typical GPU code.

"Render Texture" in modern game engines (see https://docs.unity3d.com/Manual/class-RenderTexture.html) is another situation where multiple CUs have to write to the same memory (or at least: the same texture). You almost certainly want to render that texture as parallel as possible. From my understanding, render-to-texture at runtime is used in all sorts of advanced effects (https://blog.theknightsofunity.com/practical-use-of-render-textures/).

GPU effects on modern video games are now a complex web of dependencies. That's just where the industry has gone in the past 10 years: these sorts of advanced effects are cheap to calculate and look really good.
 
Joined
May 2, 2017
Messages
7,762 (2.75/day)
It won't be. it's not on the cpu's so why would this be different.
CPUs use an IF interconnect through a PCB substrate. This uses a non-IF interconnect through an active (presumably silicon) interposer. It would thus be different because it's using different technology. The use of an interposer is likely to allow for far higher bandwidth, lower latency, and better signal integrity.
Is there a good reason for them to use the same design on the gpu as the cpu, or are they just stroking themselves because it seems to be working so well on the cpu's?
They aren't. See above. Also, CPU cores are individually visible to the system; GPU cores are hidden behind one monolithic "device". This seems to be continuing the latter, just across several chiplets, not implementing a form of mGPU (several GPU devices visible to the system).
Is there actually an advantage to this on the gpu? why is monolithic gpu design suddenly cack? or is everyone gonna just go hoorah for Amd as Intel are now the "underdog"
Advantage? Well, current GPUs are really expensive to manufacture, as they are huge, high-performance, high-power dice on cutting-edge production nodes. Navi 21 with 80 CUs is roughly 520mm² on 7nm, making it likely very, very expensive to produce. Now imagine next-gen 100-120CU GPUs on 5nm. Sure, density will increase, but you'll be talking about far more transistors and still a huge area. Splitting that into, for example, three ~250mm² dice would dramatically lower production costs, while likely allowing for higher overall CU counts (as you can get more functional parts with lower risks of defects). Smaller dice also let you utilize more of the wafer, further increasing the number of good dice per wafer. Chiplets are undoubtedly the way of the future for all high-performance computing; it's just a question of getting them to work as well as monolithic parts.

An example - these die sizes are made up, and I'm assuming linear CU scaling per area here which is a bit on the optimistic side, but reasonably representative. Yield rate is based on published TSMC yield data of 0.09 defects/cm². Yield calculations from here.
- A 15x15mm die (225mm²) fits 256 dice per 300mm wafer with ~46 (partially or fully) defective dice.
- A 15x30mm die (450mm²) fits 116 dice per 300mm wafer with ~38 (partially or fully) defective dice.
In other words, halving the die size increases dice per wafer by 120%, while defective dice only increase by 21%. You're left with far more fault-free silicon, giving a lot more flexibility in binning and product segmentation. If 10% of error-free dice hit the top clock speed/voltage bin, that's ~7.8 dice per wafer on the larger die or ~21 dice per wafer on the smaller die. If two of the smaller dice can be combined to work as the equivalent of one of the larger, you then have ~2.7 more usable flagship GPUs per wafer (10.5 vs. 7.8), or a 34% increase in yields for the top bin. There's also far more flexibility for the remainder, since it can be used in either 1- or 2-die configurations, giving a much wider range of possible configurations and thus a higher chance of utilizing faulty dice too (a >50% cut-down SKU of a large die is immensely wasteful and extremely unlikely to happen, after all).
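
As a rough check on the arithmetic above, here is a minimal Python sketch. It takes the dice-per-wafer counts (256 and 116) and the 0.09 defects/cm² figure from the post as given. The choice of Murphy's yield model is an assumption, since the post doesn't say which model the calculator used, and the function names are made up for illustration.

```python
import math

DEFECT_DENSITY = 0.09  # defects per cm², the figure quoted in the post


def murphy_yield(area_mm2, d0_per_cm2=DEFECT_DENSITY):
    """Fraction of defect-free dice under Murphy's model (assumed, not confirmed by the post)."""
    ad = (area_mm2 / 100.0) * d0_per_cm2  # convert mm² to cm², then multiply by defect density
    return ((1.0 - math.exp(-ad)) / ad) ** 2


def top_bin_gpus(dice_per_wafer, area_mm2, dice_per_gpu, top_bin_share=0.10):
    """Flagship GPUs per wafer if top_bin_share of the good dice hit the top bin."""
    good_dice = dice_per_wafer * murphy_yield(area_mm2)
    return good_dice * top_bin_share / dice_per_gpu


# Dice-per-wafer counts are taken from the post, not derived here.
small_defective = 256 * (1.0 - murphy_yield(225))
large_defective = 116 * (1.0 - murphy_yield(450))

small_gpus = top_bin_gpus(256, 225, dice_per_gpu=2)  # two small dice per flagship GPU
large_gpus = top_bin_gpus(116, 450, dice_per_gpu=1)  # one large die per flagship GPU
gain = small_gpus / large_gpus - 1.0

print(f"defective dice: small ~{small_defective:.1f}, large ~{large_defective:.1f}")
print(f"top-bin GPUs per wafer: small ~{small_gpus:.1f}, large ~{large_gpus:.1f}")
print(f"relative gain: {gain:.0%}")
```

Under these assumptions the defective counts land close to the ~46 and ~38 quoted above, and the top-bin gain comes out at roughly a third. A different yield model or a different top-bin share would shift the exact numbers.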

Also, cost goes way down. Assuming a per-wafer cost of $20,000, the smaller die ends up at ~$95/die (not counting defective dice), while the larger die ends up at ~$256 (same). Taking into account the higher likelihood of being able to use the defective dice from the smaller design, that further brings down prices. If 50% of defective small dice can be used vs. 30% of defective large dice, that's ~$86 vs. ~$224.
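
The per-die cost figures can be reproduced with a short Python sketch. It uses only the numbers stated in the post (wafer cost, dice per wafer, approximate defective counts, salvage shares). The helper name `cost_per_usable_die` and its parameters are hypothetical.

```python
WAFER_COST = 20_000  # USD per wafer, the assumption made in the post


def cost_per_usable_die(total_dice, defective_dice, salvage_share=0.0,
                        wafer_cost=WAFER_COST):
    """Wafer cost spread over fault-free dice plus the salvaged share of defective ones."""
    usable = (total_dice - defective_dice) + salvage_share * defective_dice
    return wafer_cost / usable


# Only fault-free dice count as usable.
small_clean = cost_per_usable_die(256, 46)
large_clean = cost_per_usable_die(116, 38)

# Some defective dice are salvaged for cut-down SKUs (50% small, 30% large).
small_salvage = cost_per_usable_die(256, 46, salvage_share=0.5)
large_salvage = cost_per_usable_die(116, 38, salvage_share=0.3)

print(f"fault-free only: small ${small_clean:.0f}, large ${large_clean:.0f}")
print(f"with salvage:    small ${small_salvage:.0f}, large ${large_salvage:.0f}")
```

Keep in mind that a two-die GPU needs two small dice, so the fair per-GPU comparison is twice the small-die cost against one large die, before adding any interposer or packaging cost (which this sketch ignores).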

Nobody is saying monolithic GPU designs are cack, but we're reaching the practical and economical upper limits of monolithic GPU die designs. If we want better performance at even remotely accessible prices in the future, we need new ways of making these chips.
 
Joined
Mar 21, 2016
Messages
2,533 (0.78/day)
This seems to be along the lines of what I said not long ago. AMD doesn't specifically mention any form of Infinity Cache buffer, but at the very least they're implying an Infinity Fabric connection, and I do believe Infinity Cache would be used in tandem as well. Instead of 128MB of Infinity Cache for one monolithic GPU, they could segment it into 16MB per GPU across 4 GPU chips connected with Infinity Fabric, using that as a buffer between them, along with an additional 16MB buffer across the PCIe bus per chip or something akin to that.

"The prospect of 4 discrete GPUs, each with a chunk of Infinity Cache, running to a CPU with a larger chunk of Infinity Cache that it can split amongst them is a very real future, and vastly better than 4-way GTX 980/980 Ti setups were with those older, slower, lower-core-count Intel workstation chips and motherboards. That kind of setup is archaic compared to what we've got now; it may as well be a 486, it looks so dated next to current tech in so many areas."
 