AMD Radeon VII Detailed Some More: Die-size, Secret-sauce, Ray-tracing, and More

ValenOne · Jan 17, 2019

mtcn77 said:
I concur; however, I was pointing out that the IMC has fewer consequences in a TBR & L2-ROP design. AMD would certainly be able to clock the GPU higher if they integrated TBR, but most of Nvidia's advantage is due to r:w amplification through TBR, not frequency alone. They can only write 616GB/s, yes, but setup occurs with reference to texture reads at 1.5TB/s.

Running CUDA apps disables delta color compression.
NVIDIA Maxwell/Pascal/Turing GPUs don't have PowerVR's "deferred tile render", but they do have immediate-mode tile cache rendering.

For my GTX 1080 Ti and 980 Ti GPUs, I can increase L2 cache bandwidth with an overclock.

Vega 56 still gains performance at higher clock speeds without any increase in memory bandwidth, and Vega's ROPS are connected to a multi-MB L2 cache, like the ROPS in Maxwell/Pascal designs.
The fastest Turing GPU with 64 ROPS is the RTX 2080, so that is the card the VII would be rivalling.

The Battlefield series is well known for software tiled compute rendering techniques, which get the most out of older AMD GCN GPUs, whose TMUs are connected to the L2 cache.

For the Vega architecture, from AMD's white paper (https://radeon.com/_downloads/vega-whitepaper-11.6.17.pdf):

Vega uses a relatively small number of tiles, and it operates on primitive batches of limited size compared with those used in previous tile-based rendering architectures. This setup keeps the costs associated with clipping and sorting manageable for complex scenes while delivering most of the performance and efficiency benefits.

AMD Vega Whitepaper:

The Draw-Stream Binning Rasterizer (DSBR) is an important innovation to highlight. It has been designed to reduce unnecessary processing and data transfer on the GPU, which helps both to boost performance and to reduce power consumption. The idea was to combine the benefits of a technique already widely used in handheld graphics products (tiled rendering) with the benefits of immediate-mode rendering used in high-performance PC graphics.

Pixel shading can also be deferred until an entire batch has been processed, so that only visible foreground pixels need to be shaded. This deferred step can be disabled selectively for batches that contain polygons with transparency. Deferred shading reduces unnecessary work by reducing overdraw (i.e., cases where pixel shaders are executed multiple times when different polygons overlap a single screen pixel).

PowerVR's deferred tile rendering is heavily patented.

mtcn77 · Jan 17, 2019

Don't be difficult, all I'm saying is Nvidia wins by texture 'read' bandwidth; Vega VII and 2080 are supposedly equally matched in this regard (if the 2080 Ti has 1.5TB/s), so I'm saying it is not 'write' concurrency, which LDS shared space solves, that causes it. AMD does not recommend shared memory if the indices don't follow through on reading similar registers. I'm not disputing you can keep up with the RX 580 by overclocking the 570 (since both's TMU's are ROP-bound), but Vega VII has 3x more fillrate at 4x bandwidth, so there is much more to spare before we start L2 bashing.

ValenOne · Jan 18, 2019

mtcn77 said:
Don't be difficult, all I'm saying is Nvidia wins by texture 'read' bandwidth; Vega VII and 2080 are supposedly equally matched in this regard (if the 2080 Ti has 1.5TB/s), so I'm saying it is not 'write' concurrency, which LDS shared space solves, that causes it. AMD does not recommend shared memory if the indices don't follow through on reading similar registers. I'm not disputing you can keep up with the RX 580 by overclocking the 570 (since both's TMU's are ROP-bound), but Vega VII has 3x more fillrate at 4x bandwidth, so there is much more to spare before we start L2 bashing.

What are you talking about with "since both's TMU's are ROP-bound", when the TMUs are the workaround path for being ROPS-bound?

In terms of raw ROPS read/write hardware capabilities:

VII's ROPS throughput (64 ROPS at 1800 MHz, connected to a multi-MB L2 cache) is 2.7X that of the RX 580 (32 ROPS at 1340 MHz, connected to the memory controllers).
VII's 1TB/s raw memory bandwidth is 4X the RX 580's 256GB/s raw memory bandwidth. Vega's DCC is slightly better than Polaris' DCC.
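
As a quick check on the two ratios quoted above, here is a minimal Python sketch. It assumes peak fill rate is simply ROP count times clock (one pixel per ROP per clock) and takes the VII's 1TB/s as 1024GB/s; the function name is made up for illustration.

```python
def peak_fill_rate_gpix(rops: int, clock_mhz: float) -> float:
    """Peak pixel fill rate in Gpixels/s, assuming one pixel per ROP per clock."""
    return rops * clock_mhz / 1000.0

# ROP counts and clocks as quoted in the post above.
vii_fill = peak_fill_rate_gpix(64, 1800)
rx580_fill = peak_fill_rate_gpix(32, 1340)
fill_ratio = vii_fill / rx580_fill

# Raw memory bandwidth in GB/s; 1 TB/s is taken as 1024 GB/s (assumption).
bandwidth_ratio = 1024 / 256

print(f"fill rate ratio: {fill_ratio:.2f}x")
print(f"bandwidth ratio: {bandwidth_ratio:.2f}x")
```

On these numbers the fill-rate ratio lands just under 2.7X, while the bandwidth ratio is exactly 4X.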

mtcn77 · Jan 23, 2019

mtcn77 said:
I concur; however, I was pointing out that the IMC has fewer consequences in a TBR & L2-ROP design. AMD would certainly be able to clock the GPU higher if they integrated TBR, but most of Nvidia's advantage is due to r:w amplification through TBR, not frequency alone. They can only write 616GB/s, yes, but setup occurs with reference to texture reads at 1.5TB/s.

Something has been bothering me ever since... R:W is not the same across the bandwidth spectrum. You need to port accesses for reads in order to gain full throughput - not equal, "read-biased bandwidth".

In fact, it's horribly slow, because a GCN CU can process 64 float multiply-add instructions per cycle, which is 64×3×4 bytes of input data and 64×4 bytes of output. Across a large chip like a Vega10, that's 48 KiB worth of data read in a single cycle; at 1.5 GHz, that's 67 TiB of data you'd have to read in every second.
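
The arithmetic behind those figures can be reproduced with a short Python sketch. The 64-CU count for Vega10 is an assumption added here (the post only says "a large chip"); everything else comes from the numbers above.

```python
LANES_PER_CU = 64        # float FMA operations per CU per cycle (from the post)
OPERANDS_PER_FMA = 3     # a * b + c reads three inputs
BYTES_PER_FLOAT = 4
NUM_CUS = 64             # assumed CU count for a full Vega10
CLOCK_HZ = 1.5e9         # 1.5 GHz, as in the post

# Input and output bytes per CU per cycle if nothing is served from registers.
input_bytes_per_cu = LANES_PER_CU * OPERANDS_PER_FMA * BYTES_PER_FLOAT
output_bytes_per_cu = LANES_PER_CU * BYTES_PER_FLOAT

# Whole-chip read demand per cycle and per second.
input_bytes_per_cycle = input_bytes_per_cu * NUM_CUS
input_tib_per_second = input_bytes_per_cycle * CLOCK_HZ / 2**40

print(f"per CU:   {input_bytes_per_cu} B in, {output_bytes_per_cu} B out per cycle")
print(f"per chip: {input_bytes_per_cycle / 1024:.0f} KiB read per cycle")
print(f"per chip: {input_tib_per_second:.1f} TiB/s read at 1.5 GHz")
```

That demand is far beyond what any memory system could feed, which is why the operands have to come from registers and on-chip storage.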

ValenOne · Feb 15, 2019

mtcn77 said:
Something has been bothering me ever since... R:W is not the same across the bandwidth spectrum. You need to port accesses for reads in order to gain full throughput - not equal, "read-biased bandwidth".

A real-world shader program is more than a single-line FMA operation, and each CU has local data storage and an L1 cache. A smaller shader loop should be able to fit within the CU's local storage.

At 1 GHz, the R9 290X's 1MB L2 cache has 1TB/s of bandwidth, but that bandwidth was not connected to the ROPS until Vega-era IP. The RX 480 and Fury X have a 2MB L2 cache, and it's not connected to the ROPS either.
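
To show why an overclock raises L2 bandwidth, here is a rough Python sketch. It assumes the cache moves a fixed number of bytes per clock, so bandwidth scales linearly with core clock; the bytes-per-clock value is only what the 1TB/s at 1 GHz figure above implies, and the 1.1 GHz clock is a hypothetical example.

```python
def l2_bandwidth_gb_s(bytes_per_clock: float, clock_ghz: float) -> float:
    """L2 bandwidth in GB/s, assuming a fixed number of bytes moved per clock."""
    return bytes_per_clock * clock_ghz

# 1 TB/s (taken as 1000 GB/s) at 1 GHz implies this many bytes per clock.
implied_bytes_per_clock = 1000.0 / 1.0

stock = l2_bandwidth_gb_s(implied_bytes_per_clock, 1.0)
overclocked = l2_bandwidth_gb_s(implied_bytes_per_clock, 1.1)  # hypothetical 10% overclock

print(f"stock:       {stock:.0f} GB/s")
print(f"overclocked: {overclocked:.0f} GB/s")
```

Under that assumption, a 10% core overclock buys 10% more L2 bandwidth even though external memory bandwidth is unchanged.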

The Xbox One X GPU has a 2MB L2 cache for the TMUs and a 2MB render cache for the ROPS (a feature missing in Polaris IP).

For raster graphics, the ROPS are the primary read/write units that expose the GPU's TFLOPS to L2 cache and external memory bandwidth. AMD has been pushing async compute, which uses the texture units as the read/write path.

Then there is Nvidia's memory compression superiority with Pascal.

AMD needs the 1TB/s memory bandwidth of quad-stack HBM v2 to rival the RTX 2080's effective memory bandwidth.

System Name	Eula
Processor	AMD Ryzen 9 7950X
Motherboard	MSI MPG B850 Edge Ti WiFi
Cooling	Corsair H150i Elite LCD XT White
Memory	Trident Z5 Neo RGB DDR5-6000 CL32-38-38-96 1.40V 64GB (2x32GB) AMD EXPO F5-6000J3238G32GX2-TZ5NR
Video Card(s)	Gigabyte GeForce RTX 4080 GAMING OC
Storage	Crucial P3 Plus, 4 TB NVMe, Samsung 980 Pro 2TB NVMe, Toshiba N300 10TB HDD, WDC Red Pro NAS HDD
Display(s)	Acer Predator X32FP 32in 160Hz 4K, Corsair Xeneon 32UHD144 32in 144 hz 4K
Case	Antec Constellation C8 RGB White
Audio Device(s)	Creative Sound Blaster Z
Power Supply	Corsair HX1000 Platinum 1000W
Mouse	SteelSeries Prime Pro Gaming Mouse
Keyboard	SteelSeries Apex 5
Software	MS Windows 11 Pro

