As we mentioned in the introduction, when the Radeon RX 480 turned out to be an upper-mainstream product in June 2016, consumers half-expected the performance-through-enthusiast segment "Vega" to launch by the end of the year. That didn't happen. What we got instead were not one, but two media events detailing the "Vega" architecture. On each occasion, we did a special article covering the press decks given to us by AMD. Our first article, dated early January, focused on the broader points of the architecture. Our second article followed close to seven months later, getting into the finer points of the architecture. On this page, we'll summarize the most important content from both presentations.
The Radeon RX Vega 64 is based on the "Vega 10" GPU. This is a multi-chip module combining a GPU die and memory stacks, much like its logical predecessor, "Fiji" (R9 Fury X, R9 Nano). The GPU die is built on the 14 nm FinFET process and packs over 12 billion transistors. It's wired to two 32 Gbit HBM2 memory stacks over a 2048-bit wide memory interface, half the bus width of "Fiji," which higher memory clocks partly make up for. The resulting bandwidth is still lower than that of the R9 Fury, at 484 GB/s against the Fury's 512 GB/s, but AMD made some sweeping changes to the way the chip addresses memory, which should more than make up for the deficit.
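For reference, here's how those figures fall out of bus width and per-pin data rate; the 1.89 Gbps (945 MHz) HBM2 and 1 Gbps (500 MHz) HBM1 clocks are the published specs rather than numbers from the slides summarized here, so treat this as a quick sanity check:

```python
# Back-of-the-envelope memory-bandwidth math.
def bandwidth_gbps(bus_width_bits: int, data_rate_gbps_per_pin: float) -> float:
    """Peak bandwidth in GB/s = bus width (bits) x per-pin data rate / 8."""
    return bus_width_bits * data_rate_gbps_per_pin / 8

vega64 = bandwidth_gbps(2048, 1.89)   # HBM2 at 945 MHz, double data rate
fiji   = bandwidth_gbps(4096, 1.00)   # HBM1 at 500 MHz, double data rate

print(f"RX Vega 64: {vega64:.0f} GB/s")  # ~484 GB/s
print(f"R9 Fury:    {fiji:.0f} GB/s")    # 512 GB/s
```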
Memory management has traditionally been a problem area for AMD's otherwise stellar Graphics Core Next architecture, and in the past AMD threw raw bus width and memory compression at issues intrinsic to it. With "Vega," AMD is addressing those fundamental issues through a memory concept called the "High Bandwidth Cache," the keyword being "cache." The "Vega" silicon addresses a large virtual-memory space that spans up to 512 TB, of which only a tiny portion is physical, in this case the "cache." This allows for the fine-grained movement of data in and out of that physical cache based on data "heat" (frequency of access).
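To make the "cache" idea concrete, here is a purely conceptual sketch, in no way AMD's actual controller logic, of heat-based residency: a small pool of physical pages backing a much larger virtual space, with the hottest pages kept resident:

```python
# Conceptual illustration only: a tiny physical "high-bandwidth cache"
# backing a much larger virtual address space, managed by access "heat".
from collections import Counter

PHYSICAL_PAGES = 4          # stand-in for the small physical pool
access_heat = Counter()     # how often each virtual page is touched
resident = set()            # virtual pages currently held in the cache

def touch(virtual_page: int) -> None:
    """Record an access and keep the hottest pages resident."""
    access_heat[virtual_page] += 1
    if virtual_page in resident:
        return
    if len(resident) < PHYSICAL_PAGES:
        resident.add(virtual_page)          # free room: page it in
        return
    coldest = min(resident, key=lambda p: access_heat[p])
    if access_heat[virtual_page] > access_heat[coldest]:
        resident.remove(coldest)            # evict the coldest page
        resident.add(virtual_page)          # admit the hotter one

for page in [1, 2, 2, 2, 3, 4, 5, 2, 5, 5]:
    touch(page)
print(sorted(resident))     # the most frequently touched pages end up resident
```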
The essential hierarchy of the GPU doesn't appear to have changed with "Vega"; however, its designers seem to have made big changes to the front and back ends of the rendering pipeline, while incrementally improving the Compute Units, the number-crunching components that do the heavy lifting.
The new-generation programmable geometry pipeline provides double the throughput of the previous generation. The Next-Generation Compute Units (NGCUs) are based on the 5th-generation Graphics Core Next architecture and support not just FP16 operations (introduced with "Polaris"), but also primitive 8-bit operations. Bolstered by Rapid Packed Math, each CU can handle up to 512 8-bit operations per clock cycle, and up to 256 16-bit ones. Quite a few of today's effects can be simplified to 16-bit or 8-bit ops, which frees up much of the CU's resources for other work. Lastly, there is an improved pixel engine featuring the Draw-Stream Binning Rasterizer.
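Those per-clock figures follow directly from the CU's 64 ALUs once you count a fused multiply-add as two operations (our reading of the math, not wording from AMD's deck) and apply the packing factor:

```python
# How the quoted per-CU rates fall out of the ALU count and packing factor.
ALUS_PER_CU = 64        # 64 stream processors per NGCU
OPS_PER_FMA = 2         # a fused multiply-add counted as two operations

def packed_ops_per_clock(packing_factor: int) -> int:
    """Ops per CU per clock (packing factor: 1 = FP32, 2 = FP16, 4 = 8-bit)."""
    return ALUS_PER_CU * OPS_PER_FMA * packing_factor

print(packed_ops_per_clock(1))  # 128 32-bit ops
print(packed_ops_per_clock(2))  # 256 16-bit ops
print(packed_ops_per_clock(4))  # 512  8-bit ops
```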
The "Vega 10" silicon features 64 NGCUs, each with 64 stream processors (SP), indivisible SIMD units which total the chip's SP count at 4,096 on the RX Vega 64 and 3,584 on the RX Vega 56. The chip also features 256 texture memory units (TMUs) and 64 ROPs. Its 2048-bit wide HBM2 memory interface holds 8 GB of memory (or high-bandwidth cache) on the Radeon RX Vega 64 and RX Vega 56.