How GDDR7 works compared to GDDR6

Joined
Jan 23, 2025
Messages
2 (0.67/day)
Processor R5 3600 @ 4.3 GHz
Motherboard B450 Tomahawk Max
Cooling Noctua NH-D14
Memory 2x8 GB DDR4-3800 CL14
Video Card(s) Arc A770 LE
Storage Samsung 960 EVO
Benchmark Scores https://www.3dmark.com/spy/40037915
Hello everyone, I'm making this post because after the RTX 5090 launch I was confused about how exactly GDDR7 operates, especially in the PAM3 encoding mode used in Blackwell. I've done some reading and I believe I'm prepared to explain it for those of you who are interested, but I'm also hoping someone will correct me if I'm wrong! I of course encourage you to read the spec yourself, as it's freely available from JEDEC's website (you just have to make an account). It's 340 pages long, but only a few pages matter here.

Before we begin, we need to define some terms. First is the difference between bit-rate and baud, which becomes an important distinction with GDDR7. Bit-rate is the number of bits transferred per second. Baud, on the other hand, is the number of symbols transferred per second. For GDDR6, which used NRZ encoding (Non-Return to Zero), the bit-rate and the baud are equal, because each NRZ symbol can only communicate one bit's worth of information. However, with GDDR7's PAM3 encoding (Pulse-Amplitude Modulation), each symbol can communicate one of three distinct values, or one ternary digit. This is awkward because the amount of information carried by a PAM3 symbol is log2(3) ≈ 1.58 bits, which is between 1 and 2 bits; there isn't a perfect correspondence.
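To put a number on "between 1 and 2 bits", here is a small sketch of the theoretical information per symbol. The function name is just for illustration, and this is the upper bound from the number of levels, not the mapping the spec actually uses.

```python
import math

def bits_per_symbol(levels: int) -> float:
    """Upper bound on the information one symbol with `levels` levels can carry."""
    return math.log2(levels)

# NRZ has 2 levels, PAM3 has 3.
for name, levels in [("NRZ", 2), ("PAM3", 3)]:
    print(f"{name}: {bits_per_symbol(levels):.3f} bits per symbol")
```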

Secondly, the clock. The clock we normally report for memory speeds is called CK4. CK4 is a clock internal to the memory chip that runs at a quarter the speed of the bus clock. In the case of the 5090, CK4 = 1750 MHz. As a point of comparison, I'll also include the 5700 XT, which uses GDDR6 also at CK4 = 1750 MHz. Because CK4 is equal in both cases, the bus speed of both the 5090 and the 5700 XT is 14 gigabaud (1.75 GHz * 4 * 2). We multiply by four because CK4 is a quarter the bus speed, then again by two because of DDR (Double Data Rate).
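The multiplication above as a quick sketch. The function and constant names are made up for illustration; the factors of 4 and 2 are the ones described in the paragraph.

```python
BUS_CLOCK_PER_CK4 = 4   # bus clock is 4x CK4, as described above
DDR_FACTOR = 2          # two transfers per bus clock cycle

def symbol_rate_gbaud(ck4_mhz: float) -> float:
    """Symbols per second per data line, in gigabaud."""
    return ck4_mhz * BUS_CLOCK_PER_CK4 * DDR_FACTOR / 1000

# Both the 5090 and the 5700 XT run CK4 at 1750 MHz.
print(symbol_rate_gbaud(1750))
```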

This brings me to the fundamental question I was asking myself. The 5700 XT and the 5090 both run at 14 Gbaud, yet the 5700 XT is 14 Gbps and the 5090 is 28 Gbps. How can it be possible that the 5090's bit-rate is double that of the 5700 XT's, when PAM3 encodes less than twice as much data as NRZ? Clearly there's more to PAM3 mode than meets the eye.

This figure from the spec illustrates exactly what I'm talking about:
1737757513659.png


Each memory chip is thought of as being 32 bits wide. That's how a 5700 XT with a 256-bit memory bus can have eight 8 Gb GDDR6 chips for a total of 8 GB, or how a 5090 with a 512-bit bus can have sixteen 16 Gb GDDR7 chips for a total of 32 GB.
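The chip count and capacity arithmetic can be checked with a short sketch (names are illustrative, and it assumes one 32-bit chip per 32 bits of bus, as in the two cards mentioned):

```python
CHIP_WIDTH_BITS = 32

def chip_count(bus_width_bits: int) -> int:
    """Number of 32-bit chips needed to fill the memory bus."""
    return bus_width_bits // CHIP_WIDTH_BITS

def total_capacity_gb(bus_width_bits: int, chip_density_gbit: int) -> float:
    """Total capacity in gigabytes (8 gigabits per gigabyte)."""
    return chip_count(bus_width_bits) * chip_density_gbit / 8

print(chip_count(256), total_capacity_gb(256, 8))    # 5700 XT
print(chip_count(512), total_capacity_gb(512, 16))   # 5090
```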

In the case of GDDR6, each memory chip is split into two independent 16-bit sub-channels. When you send a read request to a GDDR6 channel, you get back a burst of 256 bits of data: a burst length of 16 across 16 data lines. GDDR7 chips change this; each is split into four 8-bit sub-channels. To maintain a burst size of 256 bits when operating in NRZ mode, the burst length is doubled from 16 to 32 (GDDR7 can operate in NRZ mode, but to my knowledge it doesn't in the 50 series).

Section 2.9.2 of the GDDR7 spec explains how a burst is PAM3 encoded before being transmitted, and this is where the answer lies. The 256 bits of data are encoded as a set of 176 PAM3 symbols; the details of how that happens are in there. The crucial piece is this: "In PAM3 mode GDDR7 SGRAMs transfer a total of 176 symbols per burst access over 11 data lines (burst length 16 * 11 DQs = 176 symbols)". So, to make up for the fact that PAM3 can't quite encode twice as much data as NRZ, they just add three more data lines! This makes the burst length 16 again, which maintains double the speed of NRZ mode.
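A rough sanity check of why 8 lines are not enough at burst length 16. This only uses the theoretical log2(3) bits per PAM3 symbol as an upper bound; it says nothing about the actual encoding in section 2.9.2, and the function names are made up for illustration.

```python
import math

BURST_BITS = 256               # payload per burst, from the post
BITS_PER_PAM3 = math.log2(3)   # theoretical upper bound, about 1.585

def symbols_per_burst(lines: int, burst_length: int) -> int:
    """Total symbols sent in one burst across all data lines."""
    return lines * burst_length

def min_pam3_lines(burst_length: int, payload_bits: int = BURST_BITS) -> int:
    """Fewest lines whose PAM3 symbols could, in theory, hold the payload."""
    return math.ceil(payload_bits / (burst_length * BITS_PER_PAM3))

print(symbols_per_burst(8, 32))    # NRZ mode: 8 lines, burst length 32
print(symbols_per_burst(11, 16))   # PAM3 mode: 11 lines, burst length 16
print(min_pam3_lines(16))          # lower bound on lines at burst length 16
```

Even with a perfect encoding, 8 lines times 16 PAM3 symbols could carry only about 203 bits, so extra lines are unavoidable at that burst length.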

I suppose you could argue that the 5090 actually has a 512 * 11/8 = 704 bit bus at 14 Gbaud, and that saying it's 512-bit at 28 Gbps is a convenient way to maintain comparability with older GDDR standards. Either way it still ends up being 1,792 GB/s in total.
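Both ways of looking at the bus give the same total, which can be checked with a sketch like this. The function name is hypothetical, and the 256/176 figure is simply the payload bits per burst divided by the symbols per burst quoted above.

```python
def bandwidth_gb_per_s(lines: int, gsymbols_per_s: float, bits_per_symbol: float) -> float:
    """Total bandwidth in GB/s for a bus of `lines` data lines."""
    return lines * gsymbols_per_s * bits_per_symbol / 8

# Conventional view: 512 lines, 28 Gbps each.
conventional = bandwidth_gb_per_s(512, 28, 1.0)

# Alternative view: 704 lines at 14 Gbaud, each symbol carrying
# 256/176 payload bits on average.
alternative = bandwidth_gb_per_s(512 * 11 // 8, 14, 256 / 176)

print(f"{conventional:.0f} GB/s, {alternative:.0f} GB/s")
```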

I hope some of you found that interesting. And again, if I got anything wrong please feel free to correct me!
 
Joined
Oct 22, 2020
Messages
50 (0.03/day)
I suppose you could argue that the 5090 actually has a 512 * 11/8 = 704 bit bus
Very clear explanation, completely agree!

I want to add a preliminary analysis (I'm not sure my calculations are right) of random access delays compared to GDDR6/6X, i.e. the delay in getting 1 byte from memory after the GPU knows its address. This is NOT important for 3D rendering and AI, but may matter for other workloads with more random memory access patterns.

One of the major differences in the spec from all earlier GDDR6/6X/5/5X memory variants is that Command+Address for each 8-bit channel is organized as 2 signal lines for column commands and 3 signal lines for row commands.

All typical GDDR7 commands (except some rare bank-related ones) are 8 CA NRZ symbols long.

GDDR7 uses:
1737848917474.png

1737850590848.png

  • Typical CK4 clocks like 1.75 GHz, period ~0.57 ns
  • 2+3 Command&Address (CA) signal bits. While column and row commands are issued in parallel, a column command can only be issued after a row has been activated (a column command issued in parallel targets the previously activated row)
  • CA NRZ symbols run at 4x the CK4 frequency.
  • DATA PAM3 symbols run at 8x the CK4 frequency.
  • "ACTIVATE ROW xxxx" and "READ COLUMN xxxx" commands are 8 CA symbols each, so 1 command per 2 CK4 periods
  • The minimum burst read is 16 PAM3 symbols, so it takes 2 CK4 periods to get the data
GDDR6X used:
1737848653622.png
1737848694411.png

  • Typical CK4 clocks like 2.5 GHz, period ~0.4 ns
  • 10 Command&Address (CA) signal bits + AddressBusInversion bit
  • CA NRZ symbols run at 2x the CK4 frequency.
  • DATA PAM4 symbols run at 4x the CK4 frequency.
  • "ACTIVATE ROW xxxx" and "READ COLUMN xxxx" commands are 2 CA symbols each, so 1 command per CK4 period
  • The minimum burst read is 8 DATA PAM4 symbols, so it takes 2 CK4 periods to get the data
GDDR6 used:
1737848741112.png
1737848781339.png

  • Typical CK4 clocks like 1.75 GHz, period ~0.57 ns
  • 10 Command&Address (CA) signal bits + AddressBusInversion bit
  • CA NRZ symbols run at 2x the CK4 frequency.
  • DATA NRZ symbols run at 8x the CK4 frequency.
  • "ACTIVATE ROW xxxx" and "READ COLUMN xxxx" commands are 2 CA symbols each, so 1 command per CK4 period
  • The minimum burst read is 16 DATA NRZ symbols, so it takes 2 CK4 periods to get the data
So the semi-random access time within a single, already activated row is:
  • GDDR7: "READ COLUMN xxxx" (2 CK4) + timing delay + "Transmitting data" (2 CK4) = 4 CK4 clocks total = 2.28 ns + timing delay
  • GDDR6X: "READ COLUMN xxxx" (1 CK4) + timing delay + "Transmitting data" (2 CK4) = 3 CK4 clocks total = 1.2 ns + timing delay
  • GDDR6: "READ COLUMN xxxx" (1 CK4) + timing delay + "Transmitting data" (2 CK4) = 3 CK4 clocks total = 1.71 ns + timing delay
However, the "timing delay" between issuing a command and its execution is typically much larger in ns than the numbers above, so as far as I understand these differences are not that significant on their own.
But it does show that, assuming roughly the same internal timing between "end of decoding column address" and "start of outputting data", GDDR7 has the larger delay. That is caused by its data transmission being optimized for high throughput, not low latency.
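The cycle counts above can be derived from the per-generation numbers with a sketch like the following. The table layout and function names are made up for illustration, the unknown "timing delay" is left out, and the CK4 values are the "typical" ones from the lists above.

```python
# (name, CK4 in GHz, CA symbols per command, CA symbols per CK4 period,
#  data symbols per burst, data symbols per CK4 period)
GENERATIONS = [
    ("GDDR7",  1.75, 8, 4, 16, 8),
    ("GDDR6X", 2.5,  2, 2, 8,  4),
    ("GDDR6",  1.75, 2, 2, 16, 8),
]

def fixed_cycles(ca_symbols, ca_per_ck4, data_symbols, data_per_ck4):
    """CK4 periods spent sending the command plus transferring the burst."""
    return ca_symbols / ca_per_ck4 + data_symbols / data_per_ck4

def fixed_time_ns(ck4_ghz, cycles):
    """Convert CK4 periods to nanoseconds (excludes the internal timing delay)."""
    return cycles / ck4_ghz

for name, ck4_ghz, *symbol_params in GENERATIONS:
    cycles = fixed_cycles(*symbol_params)
    print(f"{name}: {cycles:.0f} CK4 = {fixed_time_ns(ck4_ghz, cycles):.2f} ns + timing delay")
```

With the exact period instead of the rounded 0.57 ns, the GDDR7 figure comes out a hair higher, at about 2.29 ns.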

From another point of view, having each IC act as 4 independently addressable ICs would allow more data to be randomly accessed in parallel.
 
Joined
Jan 3, 2021
Messages
3,738 (2.52/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
I have yet to chew through all that, so I'll just note that LPDDR has a complicated system of latencies where read latency is in some cases different from write latency. I don't know if this applies to GDDR too, but it might. Pay attention to that when looking for the right numbers in datasheets.

From another point of view, having each IC act as 4 independently addressable ICs would allow more data to be randomly accessed in parallel.
You mean ever narrower channels here? Yeah but the narrower the bus, the longer the burst length. Minimum unit of data transfer is 64 bytes just like in CPUs, right? Or do modern GPUs support smaller units too?
 
Joined
Oct 22, 2020
Messages
50 (0.03/day)
You mean ever narrower channels here?
Having more independently addressable channels is good for massively multi-threaded random access (like typical GPU tasks).

Burst sizes are fixed at 256 bits = 32 bytes per addressable segment.
  • for GDDR7: can get 4x 32 bytes from 4 addressable segments in one random access cycle
  • for GDDR6/6X: can get 2x 32 bytes from 2 addressable segments in one random access cycle
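As a tiny sketch of that comparison (names are illustrative, and it assumes every channel of a chip is hit in the same access cycle):

```python
BURST_BYTES = 256 // 8   # 32 bytes per addressable segment

def bytes_per_access_cycle(channels_per_chip: int) -> int:
    """Bytes one chip can return per random access cycle if every channel is used."""
    return channels_per_chip * BURST_BYTES

print(bytes_per_access_cycle(4))   # GDDR7: four 8-bit channels per chip
print(bytes_per_access_cycle(2))   # GDDR6/6X: two 16-bit channels per chip
```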
 

eidairaman1

The Exiled Airman
Joined
Jul 2, 2007
Messages
43,279 (6.74/day)
Location
Republic of Texas (True Patriot)
System Name PCGOD
Processor AMD FX 8350@ 5.0GHz
Motherboard Asus TUF 990FX Sabertooth R2 2901 Bios
Cooling Scythe Ashura, 2×BitFenix 230mm Spectre Pro LED (Blue,Green), 2x BitFenix 140mm Spectre Pro LED
Memory 16 GB Gskill Ripjaws X 2133 (2400 OC, 10-10-12-20-20, 1T, 1.65V)
Video Card(s) AMD Radeon 290 Sapphire Vapor-X
Storage Samsung 840 Pro 256GB, WD Velociraptor 1TB
Display(s) NEC Multisync LCD 1700V (Display Port Adapter)
Case AeroCool Xpredator Evil Blue Edition
Audio Device(s) Creative Labs Sound Blaster ZxR
Power Supply Seasonic 1250 XM2 Series (XP3)
Mouse Roccat Kone XTD
Keyboard Roccat Ryos MK Pro
Software Windows 7 Pro 64
I knew GDDR5 was quad pumped, given the data rate: 1750 MHz * 4 = 7000, or 7 Gbps. GDDR6 is octal pumped: 1750 MHz * 8 = 14000, or 14 Gbps.
 