
AMD Radeon VII Detailed Some More: Die-size, Secret-sauce, Ray-tracing, and More

Joined
Nov 3, 2011
Messages
695 (0.14/day)
Location
Australia
System Name Eula
Processor AMD Ryzen 9 7900X PBO
Motherboard ASUS TUF Gaming X670E Plus Wifi
Cooling Corsair H150i Elite LCD XT White
Memory Trident Z5 Neo RGB DDR5-6000 64GB (4x16GB F5-6000J3038F16GX2-TZ5NR) EXPO II, OCCT Tested
Video Card(s) Gigabyte GeForce RTX 4080 GAMING OC
Storage Corsair MP600 XT NVMe 2TB, Samsung 980 Pro NVMe 2TB, Toshiba N300 10TB HDD, Seagate Ironwolf 4T HDD
Display(s) Acer Predator X32FP 32in 160Hz 4K FreeSync/GSync DP, LG 32UL950 32in 4K HDR FreeSync/G-Sync DP
Case Phanteks Eclipse P500A D-RGB White
Audio Device(s) Creative Sound Blaster Z
Power Supply Corsair HX1000 Platinum 1000W
Mouse SteelSeries Prime Pro Gaming Mouse
Keyboard SteelSeries Apex 5
Software MS Windows 11 Pro
I concur; however, I was pointing out that the IMC has fewer consequences in a TBR and L2-connected-ROP design. AMD would certainly be able to clock the GPU higher if they integrated TBR, but most of NVIDIA's advantage comes from read:write amplification through TBR, not frequency alone. They can only write at 616 GB/s, yes, but setup is done against texture reads at 1.5 TB/s.
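For a rough sense of where a ~1.5 TB/s texture read figure could come from, here is a back-of-the-envelope sketch. The TMU count, base clock, and bytes per texel below are my assumptions for an RTX 2080 Ti, not numbers from this thread:

```python
# Back-of-the-envelope texture read rate vs. raw memory write bandwidth.
# All hardware figures here are assumptions for illustration.
tmus = 272                 # assumed RTX 2080 Ti texture units
base_clock_hz = 1.35e9     # assumed ~1350 MHz base clock
bytes_per_texel = 4        # assumed 32-bit texel fetched per TMU per clock

texture_read_bps = tmus * base_clock_hz * bytes_per_texel
memory_write_bps = 616e9   # raw GDDR6 bandwidth cited in the post

print(f"texture read rate: {texture_read_bps / 1e12:.2f} TB/s")  # ~1.47 TB/s
print(f"read:write ratio:  {texture_read_bps / memory_write_bps:.1f}x")
```

With these assumed numbers the on-chip texture read rate lands near the 1.5 TB/s quoted above, roughly 2.4x the raw write bandwidth.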
Running CUDA apps disables delta color compression.
NVIDIA's Maxwell/Pascal/Turing GPUs don't have PowerVR's deferred tile rendering; they use an immediate-mode tiled caching rasterizer instead.

[Image: nvidia tile cache.jpg]


On my GTX 1080 Ti and 980 Ti GPUs, L2 cache bandwidth increases with a core overclock.
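That follows because the L2 sits in the core clock domain, so its bandwidth scales linearly with core frequency. A minimal sketch; the slice count and bytes-per-clock figures are illustrative assumptions, not GP102 specifics:

```python
# L2 cache bandwidth scales linearly with core clock (illustrative numbers).
def l2_bandwidth_gbps(slices, bytes_per_clock_per_slice, core_clock_ghz):
    # total L2 bandwidth = slices x bytes moved per slice per clock x clock
    return slices * bytes_per_clock_per_slice * core_clock_ghz

stock = l2_bandwidth_gbps(slices=24, bytes_per_clock_per_slice=32, core_clock_ghz=1.48)
oc    = l2_bandwidth_gbps(slices=24, bytes_per_clock_per_slice=32, core_clock_ghz=1.90)
print(f"stock: {stock:.0f} GB/s, overclocked: {oc:.0f} GB/s (+{(oc / stock - 1) * 100:.0f}%)")
```

Memory bandwidth stays fixed during a core-only overclock, which is why the on-chip caches gain headroom relative to VRAM.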

Vega 56 at higher clock speeds still gains performance without any increase in memory bandwidth, and Vega's ROPs have a multi-MB L2 cache connection like the ROP designs in Maxwell/Pascal.
The fastest 64-ROP Turing GPU for the VII to rival would be the RTX 2080.

Battlefield-series games are well known for software tiled compute rendering techniques, which get the most out of older AMD GCN parts whose TMUs are connected to the L2 cache.


From AMD's Vega whitepaper (https://radeon.com/_downloads/vega-whitepaper-11.6.17.pdf):

Vega uses a relatively small number of tiles, and it operates on primitive batches of limited size compared with those used in previous tile-based rendering architectures. This setup keeps the costs associated with clipping and sorting manageable for complex scenes while delivering most of the performance and efficiency benefits.


AMD Vega Whitepaper:

The Draw-Stream Binning Rasterizer (DSBR) is an important innovation to highlight. It has been designed to reduce unnecessary processing and data transfer on the GPU, which helps both to boost performance and to reduce power consumption. The idea was to combine the benefits of a technique already widely used in handheld graphics products (tiled rendering) with the benefits of immediate-mode rendering used in high-performance PC graphics.​
Pixel shading can also be deferred until an entire batch has been processed, so that only visible foreground pixels need to be shaded. This deferred step can be disabled selectively for batches that contain polygons with transparency. Deferred shading reduces unnecessary work by reducing overdraw (i.e., cases where pixel shaders are executed multiple times when different polygons overlap a single screen pixel).​
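A toy illustration of the deferred-shading idea in the quoted passage: within a binned batch, fragments are depth-tested first, and only the surviving front-most opaque fragment per pixel is shaded, so overdraw costs no shader work. This is purely illustrative pseudologic, not AMD's implementation:

```python
# Toy deferred shading within a binned batch: shade only the front-most
# opaque fragment per pixel instead of shading every overlapping fragment.
def shade(frag):
    return f"shaded({frag['tri']})"   # stand-in for one pixel shader invocation

def rasterize_batch_deferred(fragments):
    # fragments: list of dicts {pixel: (x, y), depth: float, tri: id}
    front = {}
    for f in fragments:               # depth-test the entire batch first
        p = f["pixel"]
        if p not in front or f["depth"] < front[p]["depth"]:
            front[p] = f
    # one shader invocation per covered pixel, regardless of overdraw depth
    return {p: shade(f) for p, f in front.items()}

batch = [
    {"pixel": (0, 0), "depth": 0.9, "tri": "A"},
    {"pixel": (0, 0), "depth": 0.3, "tri": "B"},  # B occludes A at (0, 0)
    {"pixel": (1, 0), "depth": 0.5, "tri": "A"},
]
result = rasterize_batch_deferred(batch)
print(result)  # {(0, 0): 'shaded(B)', (1, 0): 'shaded(A)'}
```

An immediate-mode rasterizer without deferral would have shaded triangle A at (0, 0) and then overwritten it; here the occluded fragment never reaches the shader.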



PowerVR's deferred tile rendering is heavily patented.
 
Joined
Jun 3, 2010
Messages
2,540 (0.48/day)
Don't be difficult; all I'm saying is that NVIDIA wins on texture 'read' bandwidth. The Vega VII and the 2080 are supposedly evenly matched in this regard (if the 2080 Ti has 1.5 TB/s), so I'm saying it is not 'write' concurrency (which LDS shared space solves) that causes it. AMD does not recommend shared memory if the indices don't end up reading similar registers. I'm not disputing that you can keep up with an RX 580 by overclocking a 570 (since both cards' TMUs are ROP-bound), but the Vega VII has 3x the fillrate at 4x the bandwidth; there is much more headroom before we start hammering the L2.
 
Joined
Nov 3, 2011
Messages
695 (0.14/day)
Location
Australia
Don't be difficult; all I'm saying is that NVIDIA wins on texture 'read' bandwidth. The Vega VII and the 2080 are supposedly evenly matched in this regard (if the 2080 Ti has 1.5 TB/s), so I'm saying it is not 'write' concurrency (which LDS shared space solves) that causes it. AMD does not recommend shared memory if the indices don't end up reading similar registers. I'm not disputing that you can keep up with an RX 580 by overclocking a 570 (since both cards' TMUs are ROP-bound), but the Vega VII has 3x the fillrate at 4x the bandwidth; there is much more headroom before we start hammering the L2.
Why do you say "both cards' TMUs are ROP-bound" when the TMUs are the workaround path for being ROP-bound?

In terms of raw ROP read/write hardware capability:

VII's ROPs (64 ROPs at 1800 MHz, connected to the multi-MB L2 cache) deliver 2.7x the throughput of the RX 580's ROPs (32 ROPs at 1340 MHz, connected to the memory controllers).
VII's 1 TB/s raw memory bandwidth is 4x the RX 580's 256 GB/s raw memory bandwidth, and Vega's DCC is slightly better than Polaris's DCC.
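Those ratios check out arithmetically; a quick sanity check of the figures above:

```python
# Pixel fillrate ratio: ROP count x ROP clock
vii_fill   = 64 * 1800e6   # Radeon VII: 64 ROPs at 1800 MHz
rx580_fill = 32 * 1340e6   # RX 580: 32 ROPs at 1340 MHz
print(f"fillrate ratio:  {vii_fill / rx580_fill:.1f}x")  # ~2.7x

# Raw memory bandwidth ratio: 1 TB/s vs. 256 GB/s
print(f"bandwidth ratio: {1000 / 256:.1f}x")             # ~3.9x, roughly 4x
```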
 
Joined
Jun 3, 2010
Messages
2,540 (0.48/day)
I concur; however, I was pointing out that the IMC has fewer consequences in a TBR and L2-connected-ROP design. AMD would certainly be able to clock the GPU higher if they integrated TBR, but most of NVIDIA's advantage comes from read:write amplification through TBR, not frequency alone. They can only write at 616 GB/s, yes, but setup is done against texture reads at 1.5 TB/s.
Something has been bothering me ever since... R:W is not the same across the bandwidth spectrum. You need to port the accesses for reads in order to get full throughput; it's not an even split but read-biased bandwidth.
In fact, it's horribly slow, because a GCN CU can process 64 float multiply-add instructions per cycle, which is 64×3×4 bytes of input data and 64×4 bytes of output. Across a large chip like Vega 10, that's 48 KiB of data read in a single cycle; at 1.5 GHz, that's 67 TiB/s you'd have to read in.
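Spelling out that arithmetic, using the numbers in the post (64 FMAs per CU per cycle, 64 CUs on Vega 10, 1.5 GHz):

```python
cus, fmas_per_cu = 64, 64          # Vega 10: 64 CUs, 64 FP32 FMAs per CU per cycle
in_bytes  = fmas_per_cu * 3 * 4    # three 4-byte float operands per FMA
out_bytes = fmas_per_cu * 4        # one 4-byte float result per FMA

read_per_cycle = cus * in_bytes    # bytes read chip-wide per cycle
clock_hz = 1.5e9

print(f"{read_per_cycle / 1024:.0f} KiB read per cycle")          # 48 KiB
print(f"{read_per_cycle * clock_hz / 2**40:.0f} TiB/s of reads")  # ~67 TiB/s
```

This is why operands overwhelmingly have to come from registers, LDS, and caches rather than external memory.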
 
Joined
Nov 3, 2011
Messages
695 (0.14/day)
Location
Australia
Something has been bothering me ever since... R:W is not the same across the bandwidth spectrum. You need to port the accesses for reads in order to get full throughput; it's not an even split but read-biased bandwidth.
A real-world shader program is more than a single-line FMA operation, and each CU has a local data share and an L1 cache. A small shader loop should be able to keep its working set within the CU's local storage.

At 1 GHz, the R9 290X's 1 MB L2 cache has 1 TB/s of bandwidth, but this bandwidth was not connected to the ROPs until Vega-era IP. The RX 480 and Fury X have 2 MB of L2 cache, and it's not connected to the ROPs either.

The Xbox One X GPU has 2 MB of L2 cache for the TMUs and a 2 MB render cache for the ROPs (a feature missing from Polaris IP).

For raster graphics, the ROPs are the primary read/write units exposing the GPU's TFLOPS to the L2 cache and external memory bandwidth. AMD has been pushing async compute, which uses the texture units for read/write instead.

NVIDIA's memory compression lead with Pascal:



[Image: Vega color compression BW gain.png]


AMD needs quad-stack HBM2's 1 TB/s of raw memory bandwidth to rival the RTX 2080's effective memory bandwidth.
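A sketch of that comparison. The compression ratios below are hypothetical placeholders purely to show the effective-bandwidth arithmetic; they are not measured values from the chart above:

```python
# Effective bandwidth = raw bandwidth x average compression ratio.
# The ratios here are HYPOTHETICAL, chosen only to illustrate the point
# that stronger compression lets a narrower bus punch above its weight.
def effective_bw(raw_gbps, compression_ratio):
    return raw_gbps * compression_ratio

vii     = effective_bw(1024, 1.2)  # quad-stack HBM2, modest DCC gain (assumed)
rtx2080 = effective_bw(448, 1.6)   # GDDR6 with stronger compression (assumed)
print(f"Radeon VII: ~{vii:.0f} GB/s effective, RTX 2080: ~{rtx2080:.0f} GB/s effective")
```

Under these assumed ratios, the VII needs its large raw-bandwidth advantage just to stay ahead of a card with a 448 GB/s bus.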
 