AMD Ryzen Threadripper 2970WX Review

The Threadripper Concept

Ryzen Threadripper 2970WX is a multi-chip module of four 8-core, 12 nm "Pinnacle Ridge" dies. Each of the four dies has two cores disabled, which leaves us with 24 cores in all. Think of this as a 4P (four-socket) Ryzen 5 six-core machine on a stick. As we explained earlier, only two out of the four dies have their memory controllers wired out to memory slots on the motherboard. Cores in these dies are called "I/O cores" by AMD. The other two dies have no direct access and rely on the Infinity Fabric interconnect to access memory controlled by a neighboring die. Cores from these dies are called "compute cores." The same scheme applies to other I/O, such as PCIe, SATA, USB, audio, etc. In the pre-IMC days, memory controllers were located on a separate chip on the motherboard, called a northbridge. In a way, the compute cores are configured like processors from that era.

Each of the I/O dies controls two DDR4 memory channels and 32 PCIe lanes, for a combined quad-channel DDR4 memory interface and 64 PCIe lanes. Despite the disabled cores, you get the full 16 MB of L3 cache per die and hence 64 MB of total L3 cache for the entire processor. Under the IHS, the die closest to the key corner and the die diagonally opposite to it are the I/O dies. The other two dies are compute dies. AMD decided not to wire out the compute dies on the platform to ensure backwards compatibility with X399 motherboards.
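The totals quoted above follow directly from the per-die figures. A minimal sketch of that arithmetic, using only numbers from the text (the constant and function names are ours, chosen for illustration):

```python
# Per-die figures as stated in the review text.
DIES = 4                    # dies in the multi-chip module
CORES_PER_DIE = 8           # physical cores on each "Pinnacle Ridge" die
DISABLED_PER_DIE = 2        # cores fused off per die on the 2970WX
L3_PER_DIE_MB = 16          # L3 cache stays fully enabled
IO_DIES = 2                 # dies with memory and PCIe wired out
CHANNELS_PER_IO_DIE = 2     # DDR4 channels per I/O die
PCIE_LANES_PER_IO_DIE = 32  # PCIe lanes per I/O die


def package_totals():
    """Return the package-level totals implied by the per-die figures."""
    return {
        "cores": DIES * (CORES_PER_DIE - DISABLED_PER_DIE),
        "l3_mb": DIES * L3_PER_DIE_MB,
        "memory_channels": IO_DIES * CHANNELS_PER_IO_DIE,
        "pcie_lanes": IO_DIES * PCIE_LANES_PER_IO_DIE,
    }


if __name__ == "__main__":
    for name, value in package_totals().items():
        print(f"{name}: {value}")
```

Note that the memory channels and PCIe lanes scale with the number of I/O dies, not the total die count, which is why the 24-core part has the same platform I/O as the two-die Threadrippers.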


Infinity Fabric is AMD's high-bandwidth interconnect introduced alongside "Zen." It connects not just the two quad-core CCXs (core complexes) on each "Pinnacle Ridge" die, but also handles inter-die communication. On the 4-die Threadripper 2970WX, the dies are connected by Infinity Fabric links, each with a bi-directional bandwidth of 25 GB/s when running at 1600 MHz (the actual DRAM frequency, i.e. DDR4-3200). If you run faster or slower memory, Infinity Fabric's bandwidth will scale accordingly. It takes around 105 nanoseconds (ns) for a CPU core to access memory controlled by a neighboring die, and less than 65 ns to access memory controlled by its own die.
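To get a feel for how the link bandwidth moves with memory speed, here is a rough sketch. It assumes the scaling is strictly linear with the DRAM clock, anchored at the 25 GB/s at 1600 MHz figure quoted above; that linearity is our assumption, and the function names are illustrative only.

```python
# Reference point taken from the text: 25 GB/s at a 1600 MHz DRAM clock.
REF_MEMCLK_MHZ = 1600.0
REF_LINK_GBS = 25.0


def memclk_from_ddr_rating(ddr_rating):
    """DDR transfers twice per clock, so DDR4-3200 runs at a 1600 MHz clock."""
    return ddr_rating / 2.0


def link_bandwidth_gbs(memclk_mhz):
    """Estimate link bandwidth, ASSUMING it scales linearly with DRAM clock."""
    return REF_LINK_GBS * memclk_mhz / REF_MEMCLK_MHZ


if __name__ == "__main__":
    for rating in (2133, 2666, 2933, 3200):
        clk = memclk_from_ddr_rating(rating)
        print(f"DDR4-{rating}: ~{link_bandwidth_gbs(clk):.1f} GB/s per link")
```

Treat the output as a ballpark estimate rather than a measured value.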


Unlike Core X processors, which are built with four memory channels wired to a single die, Threadrippers have two dual-channel interfaces that together make up quad-channel. It is possible for an application to spread its memory across all four channels for higher-bandwidth memory access, but at higher latency. Less parallelized applications, such as PC games (which still haven't managed to need more than 16 GB of memory), can benefit from lower latency. AMD gives users and their operating systems control over how memory is allocated by letting them choose between UMA and NUMA operation.


To that end, there are several user-selectable modes in Ryzen Master that reconfigure the processor (a reboot is required for a change to take effect). Memory access mode can be toggled between "Distributed Mode" (default) and "Local Mode." Distributed Mode maximizes memory bandwidth to applications and tries to keep latencies constant (but higher), no matter which core the software is running on. Local Mode, on the other hand, splits the system into two NUMA nodes (think "processor groups"), which tells Windows which cores have the memory interface attached to them, so it can put loads on those cores first and run them with lower memory latency. The second processor group has higher memory latency, which results in applications on those cores running slower. This mode can be useful for low-threaded applications and games. Our performance results have an additional data set for "Local Mode" enabled.

A third configuration option is "Legacy Compatibility Mode", which lets you adjust the exposed processor count. Some older games have difficulty running on systems with more than 16 cores and will crash right at the start. Using that option, you can reduce the core count of Threadripper.

A few weeks ago, just as Intel refreshed its HEDT lineup with the Core X 9000-series, AMD introduced Dynamic Local Mode, a software feature that is part of Ryzen Master and significantly improves performance of the 24-core and 32-core Threadripper WX-series models. It works by running a background process that automatically allocates workloads to dies with local memory access first, and only when those cores are completely saturated does it invoke the cores without local memory access. Since all dies on the 12-core and 16-core models have local memory access, Dynamic Local Mode isn't applicable to them.
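The placement policy described above can be illustrated with a toy model. This is only a sketch of the "local cores first" idea as the text describes it, not AMD's actual implementation; the core numbering (cores 0 to 11 on the I/O dies, 12 to 23 on the compute dies) is hypothetical, and SMT is ignored.

```python
# Hypothetical core numbering for a 24-core part (an assumption for illustration):
# 12 cores sit on the two I/O dies, 12 on the two compute dies.
LOCAL_CORES = list(range(0, 12))    # cores with direct memory access
REMOTE_CORES = list(range(12, 24))  # cores that reach memory over Infinity Fabric


def place_threads(n_threads, local=LOCAL_CORES, remote=REMOTE_CORES):
    """Assign one thread per core, filling local-memory cores before remote ones."""
    order = list(local) + list(remote)
    if n_threads > len(order):
        raise ValueError("more threads than cores in this toy model")
    return order[:n_threads]


def uses_remote_cores(placement, remote=REMOTE_CORES):
    """True once the placement has spilled over onto cores without local memory."""
    remote_set = set(remote)
    return any(core in remote_set for core in placement)


if __name__ == "__main__":
    for n in (4, 12, 16):
        placement = place_threads(n)
        print(n, "threads ->", placement, "| remote used:", uses_remote_cores(placement))
```

In this model a 12-thread load never touches the compute dies, while a 16-thread load spills four threads onto them, which mirrors the behavior the feature is meant to produce.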

Unlike the socket AM4 Ryzen chips, Threadrippers have an unchanged memory controller configuration from AMD's EPYC enterprise processors. The Ryzen Threadripper 2970WX supports up to 2 TB of quad-channel memory with ECC support (on the Intel platform, ECC is restricted to the Xeon brand). Then again, we doubt HEDT users are going to need more than the 128 GB of memory Core X processors support.

The PCI-Express configuration is interesting. The MCM puts out a total of 64 PCIe gen 3.0 lanes. On a typical motherboard, these lanes are wired out as two PCI-Express 3.0 x16 slots that run at x16 bandwidth all the time, two additional x16 slots that run at x8 bandwidth all the time (without eating into the bandwidth of another slot), three M.2 NVMe slots with x4 bandwidth each, and the remaining four lanes serving as the chipset bus.
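The lane split described above adds up exactly to the 64-lane budget. A quick sketch of the tally (the layout list simply restates the typical motherboard wiring from the text; the names are ours):

```python
# Total PCIe gen 3.0 lanes put out by the MCM, per the text.
LANE_BUDGET = 64

# (description, number of devices, lanes per device) for a typical motherboard.
TYPICAL_LAYOUT = [
    ("x16 slot running at x16", 2, 16),
    ("x16 slot running at x8", 2, 8),
    ("M.2 NVMe slot", 3, 4),
    ("chipset bus", 1, 4),
]


def lanes_used(layout):
    """Sum the lanes consumed by every device in the layout."""
    return sum(count * width for _, count, width in layout)


if __name__ == "__main__":
    for name, count, width in TYPICAL_LAYOUT:
        print(f"{count} x {name}: {count * width} lanes")
    print("total:", lanes_used(TYPICAL_LAYOUT), "of", LANE_BUDGET)
```

Because nothing is oversubscribed, none of the slots has to share or switch lanes with another.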

The Zen+ Architecture


Each of the four dies in the Threadripper 2970WX MCM is made out of the new 12 nm "Pinnacle Ridge" silicon by AMD. This chip is based on the new "Zen+" micro-architecture in which the "+" denotes refinement rather than a major architectural change.


AMD summarizes the "+" in "Zen+" as the coming together of the new 12 nm process that enables higher clock speeds, an updated SenseMI feature set, the updated Precision Boost algorithm that sustains boost clocks better under stress, and physical improvements to the cache and memory sub-systems, which add up to an IPC uplift of 3 percent (clock-for-clock) over the first-generation "Zen."

The biggest change of "Pinnacle Ridge" remains its process node. The switch to 12 nm resulted in a 50 mV reduction in Vcore voltage at any given clock speed, enabling AMD to increase clocks by around 0.25 GHz across the board. The switch also enables all-core overclocks well above the 4 GHz mark, to around 4.20 GHz. Last but not least, this increase in power efficiency enabled AMD to release the 32-core Threadripper 2990WX, which wasn't feasible before.

AMD also deployed faster cache SRAM and refined the memory controllers to bring down latencies significantly. L3 cache latency is 16 percent lower, L2 cache latency is a staggering 34 percent lower, L1 latencies are reduced by 13 percent, and DRAM (memory) latencies by 11 percent. This is where almost all of the IPC uplift comes from. AMD also increased the maximum memory clocks. The processor now supports up to DDR4-2933 (JEDEC).


Updates to the chip's on-die SenseMI logic include Precision Boost 2 and Extended Frequency Range (XFR) 2. Precision Boost 2 now switches from arbitrary 2-core and all-core boost targets to a perpetual all-core boosting algorithm that elevates the most stressed cores to the highest boost states in a linear fashion (i.e. boost frequency increases with load). Every core is running above the nominal clock when the processor isn't idling, which contributes to a multi-core performance uplift. Besides load, the algorithm takes into account temperature, current, and Vcore. Granularity is 0.25X base clock (25 MHz).
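The 25 MHz granularity means boost clocks move in quarter-multiplier steps of a 100 MHz base clock. A small sketch of what that quantization looks like; rounding down to the nearest step is our assumption for illustration, as the text does not say how the hardware rounds.

```python
# Granularity from the text: 0.25x of the 100 MHz base clock.
BASE_CLOCK_MHZ = 100
STEP_MHZ = BASE_CLOCK_MHZ // 4  # 25 MHz


def quantize_boost_mhz(target_mhz):
    """Snap a target clock down to the nearest 25 MHz step (rounding is assumed)."""
    return (int(target_mhz) // STEP_MHZ) * STEP_MHZ


def multiplier_for(clock_mhz):
    """Express a clock as a multiple of the 100 MHz base clock."""
    return clock_mhz / BASE_CLOCK_MHZ


if __name__ == "__main__":
    for target in (3990, 4010, 4190):
        snapped = quantize_boost_mhz(target)
        print(f"{target} MHz -> {snapped} MHz ({multiplier_for(snapped):.2f}x)")
```

With steps this fine, the boost algorithm can track load and thermals closely instead of jumping between a handful of fixed boost states.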


Extended Frequency Range 2 (XFR 2) builds on the success of XFR with a new all-core uplift beyond the maximum boost clock. If your cooling is good enough (60°C), XFR 2 will now elevate all cores beyond the boost state as opposed to just the best few cores. AMD claims that with the most ideal cooling, XFR 2 will give you a staggering 7 percent performance uplift without any manual overclocking on your part.

The AMD X399 Chipset

An overwhelming majority of the PCIe I/O is handled directly by the socketed TR4 processor, leaving the chipset with far fewer bandwidth-heavy onboard devices to control. Since the "Pinnacle Ridge" die was designed to be a SoC in its own right, the TR4 processor also puts out some I/O directly from the CPU socket, such as four SATA 6 Gbps ports, eight USB 3.0 ports, and the High Definition Audio bus.


The X399 has a near-identical feature set to the X370. It puts out eight PCI-Express gen 2.0 lanes, which drive onboard network controllers or are wired out to x1 slots. The chipset puts out six additional SATA 6 Gbps ports with AHCI-RAID support, as well as eight USB 3.0 and two USB 3.1 gen 2 ports. The platform supports NVIDIA SLI and AMD CrossFire X multi-GPU technologies.