How GDDR7 works compared to GDDR6

Eidolon · Jan 25, 2025

Hello everyone, I'm making this post because following the RTX 5090 launch I was confused about how exactly GDDR7 operates, especially in the PAM3 encoding mode which is used in Blackwell. I've done some reading, and I believe I'm prepared to explain for those of you who are interested, but part of the reason I'm doing this is because I'm hoping someone will correct me if I'm wrong! I of course encourage you to read the spec yourself as it's freely available from JEDEC's website (you just have to make an account). I know it's 340 pages long, but there are only a few pages I'm interested in.

Before we begin, we need to define some terms. First of all is the difference between bit-rate and baud, which becomes an important distinction with GDDR7. Bit-rate is the straightforward number of bits per second that are being transferred. Baud, on the other hand, is the number of symbols per second. For GDDR6, which used NRZ encoding (Non-Return to Zero), the bit-rate and the baud are equal, because each symbol in NRZ can only communicate one bit's worth of information. However, with GDDR7's PAM3 encoding (Pulse-Amplitude Modulation), each symbol transferred can actually communicate one of three distinct values, or one ternary digit. This is awkward because the amount of information carried by a PAM3 symbol is between 1 and 2 bits; there isn't a perfect correspondence.
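
To put a number on "between 1 and 2 bits": the information one symbol can carry is log2 of the number of signal levels. A minimal sketch of that calculation (the function name is only for illustration):

```python
import math

def bits_per_symbol(levels: int) -> float:
    """Upper bound on the information carried by one symbol with `levels` levels."""
    return math.log2(levels)

for name, levels in [("NRZ", 2), ("PAM3", 3), ("PAM4", 4)]:
    print(f"{name}: {bits_per_symbol(levels):.3f} bits per symbol")
```

PAM3 comes out at roughly 1.585 bits per symbol, which is why it cannot simply double NRZ's bit-rate at the same baud.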

Secondly, the clock. The clock we normally report for memory speeds is called CK4. CK4 is a clock internal to the memory chip that runs at a quarter the speed of the bus clock. In the case of the 5090, CK4 = 1750 MHz. As a point of comparison, I'll also include the 5700 XT, which uses GDDR6 also at CK4 = 1750 MHz. Because CK4 is equal in both cases, the bus speed of both the 5090 and the 5700 XT is 14 gigabaud (1.75 GHz * 4 * 2). We multiply by four because CK4 is a quarter the bus speed, then again by two because of DDR (Double Data Rate).
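
The clock arithmetic above can be sketched as follows. The function and parameter names are made up for illustration; the factors of 4 and 2 are the ones described in the paragraph.

```python
def bus_gbaud(ck4_mhz: float, bus_clock_ratio: int = 4, transfers_per_clock: int = 2) -> float:
    """Symbol rate per data line in gigabaud, derived from the reported CK4 clock."""
    bus_clock_mhz = ck4_mhz * bus_clock_ratio          # CK4 is a quarter of the bus clock
    return bus_clock_mhz * transfers_per_clock / 1000  # DDR: two symbols per bus clock

# Same CK4 for the 5090 (GDDR7) and the 5700 XT (GDDR6), so the same symbol rate.
print(bus_gbaud(1750))
```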

This brings me to the fundamental question I was asking myself. The 5700 XT and the 5090 both run at 14 Gbaud, yet the 5700 XT is 14 Gbps and the 5090 is 28 Gbps. How can it be possible that the 5090's bit-rate is double that of the 5700 XT's, when PAM3 encodes less than twice as much data as NRZ? Clearly there's more to PAM3 mode than meets the eye.

This figure from the spec illustrates exactly what I'm talking about:

Each memory chip is thought of as being 32 bits wide. That's how a 5700 XT with a 256-bit memory bus can have eight 8 Gb GDDR6 chips for a total of 8 GB, or how a 5090 with a 512-bit bus can have sixteen 16 Gb GDDR7 chips for a total of 32 GB.

In the case of GDDR6, each memory chip is split into two independent 16-bit sub-channels. When you send a read request to a GDDR6 channel, you get back a burst of 256 bits of data: a burst length of 16 across 16 data lines. GDDR7 chips are upgraded in this regard; they're each split into four 8-bit sub-channels. In order to maintain a burst size of 256 bits when operating in NRZ mode, the burst length is doubled from 16 to 32 (GDDR7 can operate in NRZ mode, but to my knowledge it doesn't in the 50 series).

Section 2.9.2 of the GDDR7 spec explains how a burst is PAM3 encoded before being transmitted, and this is where the answer lies. The 256 bits of data are encoded as a set of 176 PAM3 symbols; the details of how that happens are in there. The crucial piece is this: "In PAM3 mode GDDR7 SGRAMs transfer a total of 176 symbols per burst access over 11 data lines (burst length 16 * 11 DQs = 176 symbols)". So, to make up for the fact that PAM3 can't quite encode twice as much data as NRZ, they just add three more data lines! This makes the burst length 16 again, which maintains double the speed of NRZ mode.
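
A quick way to sanity-check the burst accounting is to count symbols and states. This is only a sketch of the arithmetic, not of the actual encoding in section 2.9.2; it considers the 256 payload bits only, and all names are illustrative.

```python
from fractions import Fraction

PAYLOAD_BITS = 256       # data bits per burst
PAM3_LINES = 11          # data lines per channel in PAM3 mode
PAM3_BURST_LENGTH = 16   # symbols per line per burst in PAM3 mode
NRZ_BURST_LENGTH = 32    # symbols per line per burst in NRZ mode (8 lines)

pam3_symbols = PAM3_LINES * PAM3_BURST_LENGTH  # 176 symbols per burst

# 176 ternary symbols offer more distinct states than 256 bits need.
fits = 3 ** pam3_symbols >= 2 ** PAYLOAD_BITS

# Payload bits actually carried per PAM3 symbol: 256/176 = 16/11, about 1.45.
payload_bits_per_symbol = Fraction(PAYLOAD_BITS, pam3_symbols)

def nominal_gbps_per_pin(gbaud, burst_length, nominal_pins=8):
    """Data rate quoted per pin of the nominal 8-bit channel."""
    bursts_per_ns = Fraction(gbaud) / burst_length
    return bursts_per_ns * PAYLOAD_BITS / nominal_pins

print(fits, float(payload_bits_per_symbol))
print(float(nominal_gbps_per_pin(14, PAM3_BURST_LENGTH)))  # PAM3 mode
print(float(nominal_gbps_per_pin(14, NRZ_BURST_LENGTH)))   # NRZ mode
```

At the same 14 Gbaud, halving the burst length from 32 to 16 is what doubles the quoted per-pin rate from 14 to 28 Gbps.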

I suppose you could argue that the 5090 actually has a 512 * 11/8 = 704 bit bus at 14 Gbaud, and that saying it's 512-bit at 28 Gbps is a convenient way to maintain comparability with older GDDR standards. Either way it still ends up being 1,792 GB/s in total.
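
Both ways of counting can be checked against each other. A small sketch, assuming 256 payload bits per 176 symbols as above (helper name is illustrative):

```python
from fractions import Fraction

def gb_per_s(lines, gbit_per_s_per_line):
    """Total bandwidth in GB/s for `lines` lines each carrying the given Gbit/s."""
    return Fraction(lines) * gbit_per_s_per_line / 8

# View 1: 512 nominal data pins at 28 Gbps each.
nominal_view = gb_per_s(512, 28)

# View 2: 704 physical lines at 14 Gbaud, each symbol carrying 256/176 payload bits.
physical_lines = 512 * 11 // 8
physical_view = gb_per_s(physical_lines, 14 * Fraction(256, 176))

print(physical_lines, float(nominal_view), float(physical_view))
```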

I hope some of you found that interesting. And again, if I got anything wrong please feel free to correct me!

StViolenceDay · Jan 26, 2025

Eidolon said:
I suppose you could argue that the 5090 actually has a 512 * 11/8 = 704 bit bus

Very clear explanation, completely agree!

I want to add some preliminary analysis (I'm not sure my calculations are OK) of random access delays compared to GDDR6/6X, i.e. the delay in getting 1 byte from memory after its address becomes known to the GPU. This is NOT important for 3D rendering and AI, but may be important for other calculations with more random memory access patterns.

One of the major differences in the spec from all earlier GDDR6/6X/5/5X memory variants is that Command+Address for each 8-bit channel is organized as 2 signal lines for column commands and 3 signal lines for row commands.

All typical GDDR7 commands (except some rare bank-related ones) are 8 CA NRZ symbols.

GDDR7 uses:

Typical CK4 clocks like 1.75GHz, period ~0.57ns
2+3 Command&Address (CA) signal bits. While column & row commands are issued in parallel, the column command may be issued only after a row is activated (the parallel column command would target the previously selected row)
CA NRZ symbols at 4x the CK4 frequency.
DATA PAM3 symbols at 8x the CK4 frequency.
"ACTIVATE ROW xxxx" and "READ COLUMN xxxx" commands are 8 CA symbols each, so 1 command per 2 CK4 periods
The minimum burst read is 16 DATA PAM3 symbols, so 2 CK4 periods for getting data

GDDR6X used:

Typical CK4 clocks like 2.5GHz, period ~0.4ns
10 Command&Address (CA) signal bits + AddressBusInversion bit
CA NRZ symbols at 2x the CK4 frequency.
DATA PAM4 symbols at 4x the CK4 frequency.
"ACTIVATE ROW xxxx" and "READ COLUMN xxxx" commands are 2 CA symbols each, so 1 command per CK4 period
The minimum burst read is 8 DATA PAM4 symbols, so 2 CK4 periods for getting data

GDDR6 used:

Typical CK4 clocks like 1.75GHz, period ~0.57ns
10 Command&Address (CA) signal bits + AddressBusInversion bit
CA NRZ symbols at 2x the CK4 frequency.
DATA NRZ symbols at 8x the CK4 frequency.
"ACTIVATE ROW xxxx" and "READ COLUMN xxxx" commands are 2 CA symbols each, so 1 command per CK4 period
The minimum burst read is 16 DATA NRZ symbols, so 2 CK4 periods for getting data

So the semi-random access time within a single, already activated row is:

GDDR7: "READ COLUMN xxxx" (2 CK4) + timing delay + "Transmitting data" (2 CK4): 4 CK4 clocks total = 2.28 ns + timing delay
GDDR6X: "READ COLUMN xxxx" (1 CK4) + timing delay + "Transmitting data" (2 CK4): 3 CK4 clocks total = 1.2 ns + timing delay
GDDR6: "READ COLUMN xxxx" (1 CK4) + timing delay + "Transmitting data" (2 CK4): 3 CK4 clocks total = 1.71 ns + timing delay
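
The fixed part of those totals (everything except the "timing delay") is just cycle count divided by CK4 frequency. A minimal sketch using the clocks and cycle counts listed above; the names are illustrative, and the GDDR7 figure differs slightly from 2.28 ns only because the period is not rounded to 0.57 ns first.

```python
def fixed_cycles_ns(ck4_ghz: float, command_ck4: int, data_ck4: int) -> float:
    """Command plus data transfer time in ns, excluding the internal timing delay."""
    return (command_ck4 + data_ck4) / ck4_ghz

# (CK4 in GHz, CK4 periods per command, CK4 periods per minimum burst)
cases = {
    "GDDR7":  (1.75, 2, 2),
    "GDDR6X": (2.5, 1, 2),
    "GDDR6":  (1.75, 1, 2),
}

results = {name: fixed_cycles_ns(*args) for name, args in cases.items()}
for name, ns in results.items():
    print(f"{name}: {ns:.2f} ns + timing delay")
```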

However, the "timing delay" between a command being issued and executed is typically much bigger in ns than the numbers above, so as far as I understand those values by themselves are not a huge factor.
But it shows that, assuming roughly the same internal timing between "end of decoding column address" and "start of outputting data", GDDR7 has the larger delay. That is a consequence of its data transmission being optimized for high throughput, not low latency.

On the other hand, having each IC behave like four independently addressable ICs would allow more data to be randomly accessed in parallel.

Wirko · Jan 26, 2025

I have yet to chew through all that, I'll just note that LPDDR has a complicated system of latencies where read latency is in some cases different than write latency. I don't know if this applies to GDDR too but it might. Pay attention to that when looking for the right numbers in datasheets.

StViolenceDay said:
On the other hand, having each IC behave like four independently addressable ICs would allow more data to be randomly accessed in parallel.

You mean ever narrower channels here? Yeah but the narrower the bus, the longer the burst length. Minimum unit of data transfer is 64 bytes just like in CPUs, right? Or do modern GPUs support smaller units too?

StViolenceDay · Jan 26, 2025

Wirko said:
You mean ever narrower channels here?

Having more independently addressable channels is good for massively multi-threaded random access (like typical GPU tasks).

Burst sizes are fixed at 256 bits = 32 bytes per addressable segment.

For GDDR7: can get 4x 32 bytes from 4 addressable segments in one random access cycle
For GDDR6/6X: can get 2x 32 bytes from 2 addressable segments in one random access cycle
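
Scaled up to a whole card, that looks like the sketch below. It assumes every channel is busy with its own independent request, and uses the chip counts mentioned earlier in the thread (16 GDDR7 chips on the 5090, 8 GDDR6 chips on the 5700 XT); the function name is illustrative.

```python
BURST_BYTES = 256 // 8  # 32 bytes per addressable segment

def bytes_per_access_cycle(chips: int, channels_per_chip: int) -> int:
    """Bytes returned if every independent channel serves one burst at once."""
    return chips * channels_per_chip * BURST_BYTES

per_chip_gddr7 = bytes_per_access_cycle(1, 4)   # four 8-bit channels per chip
per_chip_gddr6 = bytes_per_access_cycle(1, 2)   # two 16-bit channels per chip
rtx_5090 = bytes_per_access_cycle(16, 4)        # 64 independent channels
rx_5700_xt = bytes_per_access_cycle(8, 2)       # 16 independent channels

print(per_chip_gddr7, per_chip_gddr6, rtx_5090, rx_5700_xt)
```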

eidairaman1 · Jan 26, 2025

I knew GDDR5 was quad-pumped, given the data rate: 1750 MHz * 4 = 7000, or 7 Gbps. GDDR6 is octal-pumped: 1750 MHz * 8 = 14000, or 14 Gbps.

Eidolon · Jan 26, 2025

StViolenceDay said:
But it shows that, assuming roughly the same internal timing between "end of decoding column address" and "start of outputting data", GDDR7 has the larger delay.

That's interesting. To be fair, we're talking about a single nanosecond difference, which I assume is pretty insignificant. I wonder how much extra latency GDDR7 has just because of the extra step it needs to encode the data as PAM3 before transmission though.

StViolenceDay said:
Typical CK4 clocks like 2.5GHz, period ~0.4ns

I assumed that GDDR6X is clocked the same as GDDR6, just that NRZ is replaced by PAM4, but it seems like that's not the case? I'd think CK4 for 20 Gbps would be 1.25 GHz; the RTX 4090's 21 Gbps is said to be 1313 MHz after all. Do you know what's going on there?

Wirko · Jan 26, 2025

Eidolon said:
it needs to encode the data as PAM3

A wicked thought... What if the data in the memory cells is actually stored as three voltage levels? Experimental MLC DRAM (2 bits, 4 levels per cell) was done in the past, by Samsung I think.

StViolenceDay · Jan 26, 2025

Eidolon said:
I assumed that GDDR6X is clocked the same as GDDR6, just that NRZ is replaced by PAM4, but it seems like that's not the case? I'd think CK4 for 20 Gbps would be 1.25 GHz; the RTX 4090's 21 Gbps is said to be 1313 MHz after all. Do you know what's going on there?

Due to the switch to PAM4, it has a much lower symbol frequency than GDDR6's NRZ. I think that's due to the difficulty of keeping the "data eye" distinguishable on the receiving end (if they had been able to keep the symbol frequency the same, GDDR6X would be twice as fast as GDDR6, and that's not true; it's only ~1.5 times faster).

I'm not sure about the actual physical frequency of CK4, but according to all Micron presentations like https://my.micron.com/content/dam/m...ing-brief/gddr6x-pam4-2x-speed-tech-brief.pdf, GDDR6X processes 4 PAM4 symbols per 1 CK clock.

As a net result, I think the actual CK4 clock for GDDR6X is 2x the value shown by monitoring tools. This 2x may just be hidden as a GPU implementation detail, so software sees 1.3 GHz instead of 2.6 GHz.
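
The two readings can be compared numerically. This sketch only restates the arithmetic of both interpretations for the 4090's 21 Gbps; which one matches the real hardware clock is exactly the open question, and the function name is illustrative.

```python
def clock_ghz(data_rate_gbps: float, symbols_per_clock: int, bits_per_symbol: int) -> float:
    """Clock frequency implied by a per-pin data rate and the symbols sent per clock."""
    return data_rate_gbps / (symbols_per_clock * bits_per_symbol)

PAM4_BITS = 2

# Reading 1: same clocking as GDDR6 (8 symbols per CK4), with NRZ swapped for PAM4.
same_as_gddr6 = clock_ghz(21, symbols_per_clock=8, bits_per_symbol=PAM4_BITS)

# Reading 2: 4 PAM4 symbols per clock, as in the Micron material.
four_per_clock = clock_ghz(21, symbols_per_clock=4, bits_per_symbol=PAM4_BITS)

print(same_as_gddr6, four_per_clock)
```

The first reading gives about 1.31 GHz, matching what monitoring tools report; the second gives about 2.6 GHz, i.e. the 2x mentioned above.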

Wirko said:
What if the data in the memory cells is actually stored as three voltage levels?

The GDDR7 protocol specifies that every data block is accompanied by a CRC calculated on the binary bits, not on tri-level "trits". So if the values were indeed stored as trits, the VRAM IC would have to decode them into bits to calculate that CRC, and the need for the CRC calculation seems to make such an implementation too complex. But... if the memory IC just stored the CRC received from the GPU as trits and reported it back without any checking, this could work. Even with storage in bits, it seems the memory manufacturer could "cut corners" and just store and report back the CRC instead of checking and calculating it. Even if the CRC became damaged, this would be noticed on the GPU side. And it's really not so important when the data was damaged: during GPU->VRAM transmission or on its way back.

Wirko · Jan 26, 2025

StViolenceDay said:
Even with storage in bits, it seems the memory manufacturer could "cut corners" and just store and report back the CRC instead of checking and calculating it. Even if the CRC became damaged, this would be noticed on the GPU side. And it's really not so important when the data was damaged: during GPU->VRAM transmission or on its way back.

That's normal, not some sort of cost cutting. The MC calculates everything related to ECC (be it parity, CRC, in-band, out-of-band, anything). The memory chips don't calculate anything.

I'll also leave this here, it might be helpful even if it's from 2017 and not related to GDDR7. It's a nicely written and easy to follow discussion of what's new in GDDR6.

GDDR6 Deep Dive

StViolenceDay · Jan 27, 2025

Wirko said:
The memory chips don't calculate anything

That was true. But GDDR7 introduces a CRC used for transmitting data (not storing data), and the spec explicitly suggests that the VRAM IC should check it and report a write error on a miscompare:

The algorithm for getting the 18 CRC bits is also mentioned: "The CRC computation is performed by two CRC blocks where each block applies the same CRC-9 (x^9 + x^7 + x^4 + x^2 + x + 1) polynomial on even and odd bits of the burst separately"
Technically, if transmission errors during writes were extremely rare, the memory IC could skip this and just remember the CRC to report it back.
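
To make the "two CRC-9 blocks on even and odd bits" idea concrete, here is a minimal sketch. Only the polynomial comes from the quoted sentence; the bit ordering, the zero initial value, which bits count as "even" and "odd", and the exact set of bits covered are all assumptions made for illustration, so this will not necessarily reproduce the spec's actual CRC values.

```python
CRC9_POLY = 0x097  # x^7 + x^4 + x^2 + x + 1; the x^9 term is implicit

def crc9(bits):
    """Bit-serial CRC-9, MSB-first with a zero initial value (both assumed)."""
    reg = 0
    for bit in bits:
        feedback = ((reg >> 8) & 1) ^ (bit & 1)
        reg = (reg << 1) & 0x1FF   # keep the register at 9 bits
        if feedback:
            reg ^= CRC9_POLY
    return reg

def burst_crc18(burst_bits):
    """Two 9-bit CRCs: one over even-indexed bits, one over odd-indexed bits."""
    return crc9(burst_bits[0::2]), crc9(burst_bits[1::2])

burst = [(i * 7 + 3) % 5 % 2 for i in range(256)]  # arbitrary 256-bit test pattern
good = burst_crc18(burst)

corrupted = burst.copy()
corrupted[5] ^= 1                  # flip one odd-indexed bit
bad = burst_crc18(corrupted)

print(good, bad, good != bad)
```

A single flipped bit always changes the CRC of the half it belongs to, so the receiver sees a mismatch.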

Btw, while the idea that "it's better to get an explicit error than silently incorrect data" is good, its current implementation has a severe limitation: the data is checked very carefully, but the addresses use much weaker checking.

I performed an experiment injecting noise into the GDDR6 lanes of AMD GPUs. The method was very simple: remove some solder mask to make a copper signal line accessible, then inject noise by touching the line with a ground-connected probe.
The results were as follows:

Touching any of the lines transmitting data (the DQ lines) caused an immediate GPU emergency stop (black screen until reboot, even if it was just displaying the BIOS without any driver).
Touching some of the address lines led to "happily returning incorrect data", i.e. memory errors in tests. From the transmission point of view that is correct data, but from the wrong address.

GDDR7 introduces a parity check on the command+address lines, but such a parity check is very weak compared to the 18-bit CRC check on data. However, command+address uses NRZ at a lower frequency, so it is much less affected by some error factors.
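
Why a single parity bit is weak can be shown in a few lines. This is a generic illustration of parity, not the spec's actual command+address parity layout; the 8-bit word is an arbitrary example.

```python
def parity(bits):
    """Even-parity bit: XOR of all bits."""
    p = 0
    for b in bits:
        p ^= b & 1
    return p

word = [1, 0, 1, 1, 0, 0, 1, 0]   # arbitrary command/address pattern

one_flip = word.copy()
one_flip[2] ^= 1                  # a single-bit error

two_flips = word.copy()
two_flips[2] ^= 1                 # two bit errors in the same word
two_flips[6] ^= 1

print(parity(word), parity(one_flip), parity(two_flips))
```

The single-bit error changes the parity and is caught; the double-bit error leaves the parity unchanged and goes unnoticed.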

Wirko · Jan 27, 2025

I see. It means that the bit error rate on PAM3/PAM4 transmission is relatively high and must be mitigated by adding CRC (and re-transmission if there are errors?), because ECC after the write + read cycle is insufficient.
That must be quite a challenge in manufacturing DRAM chips. They contain a very small amount of logic because the fab process is very much inappropriate for that - but CRC checking seems to require a lot of logic.
