Panmnesia Uses CXL Protocol to Expand GPU Memory with Add-in DRAM Card or Even SSD

AleksandarK · Jul 2, 2024

South Korean startup Panmnesia has unveiled an interesting solution to the memory limitations of modern GPUs. The company has developed a low-latency Compute Express Link (CXL) IP that could help expand GPU memory with an external add-in card. Current GPU-accelerated applications in AI and HPC are constrained by the fixed amount of memory built into GPUs. With data sizes growing by 3x yearly, GPU networks must keep getting larger just to fit the application in local memory, which is what keeps latency low and token generation fast. Panmnesia's proposed fix leverages the CXL protocol to expand GPU memory capacity using PCIe-connected DRAM or even SSDs. The company has overcome significant technical hurdles, including the absence of CXL logic fabric in GPUs and the limitations of existing unified virtual memory (UVM) systems.

At the heart of Panmnesia's solution is a CXL 3.1-compliant root complex with multiple root ports and a host bridge featuring a host-managed device memory (HDM) decoder. This system effectively tricks the GPU's memory subsystem into treating PCIe-connected memory as native system memory. According to Panmnesia's testing, its CXL solution, CXL-Opt, achieved two-digit nanosecond round-trip latency, significantly outperforming both UVM and earlier CXL prototypes. In GPU kernel execution tests, CXL-Opt ran up to 3.22 times faster than UVM. Older CXL memory extenders recorded around 250 nanoseconds of round-trip latency, while CXL-Opt potentially achieves less than 80 nanoseconds. The usual problem with CXL is that memory pools add latency, so performance degrades, and the extenders add cost as well. However, Panmnesia's CXL-Opt could find a use case, and we are waiting to see if anyone adopts it in their infrastructure.
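To illustrate the general idea of address decoding described above, here is a toy Python sketch. It is not Panmnesia's design and not the register layout from the CXL specification: the `HdmRange` class, the `route` function, and the address values are all hypothetical, and real HDM decoders also handle interleaving, which is ignored here. The sketch only shows how an address that falls in a configured window can be routed to the root port that owns the expansion memory.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class HdmRange:
    """Hypothetical decoder entry: one address window mapped to one root port."""
    base: int       # first address of the window
    size: int       # window length in bytes
    root_port: int  # port behind which the expansion memory sits

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


def route(addr: int, ranges: List[HdmRange]) -> Optional[Tuple[int, int]]:
    """Return (root_port, offset_in_device), or None if no window matches."""
    for r in ranges:
        if r.contains(addr):
            return r.root_port, addr - r.base
    return None  # address is not backed by expansion memory


# Two made-up 1 GiB windows placed back to back.
ranges = [
    HdmRange(base=0x1_0000_0000, size=0x4000_0000, root_port=0),
    HdmRange(base=0x1_4000_0000, size=0x4000_0000, root_port=1),
]
hit = route(0x1_4000_1000, ranges)
miss = route(0x1000, ranges)
```

From the requester's point of view, a load or store to an address inside one of these windows looks like any other memory access, which is the sense in which the expansion memory appears "native".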

Below are some benchmarks by Panmnesia, as well as the architecture of the CXL-Opt.


alphaLONE · Jul 2, 2024

that's at most what, 128GB/s on 16x Gen 5 PCIe? really not much for a big GPU, that's even less than what the RX 6500 XT has.
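As a rough check on those figures, the arithmetic can be sketched in Python. The inputs are the commonly published numbers (32 GT/s per lane and 128b/130b encoding for PCIe 5.0, and an 18 Gbps, 64-bit GDDR6 interface for the RX 6500 XT); the function names are made up for this example, and packet and protocol overhead on the link is ignored, so real throughput would be lower.

```python
def pcie_bandwidth_gbs(gt_per_s: float, lanes: int, encoding: float = 128 / 130) -> float:
    """Raw link bandwidth in GB/s for ONE direction (protocol overhead ignored)."""
    return gt_per_s * encoding * lanes / 8


def gddr_bandwidth_gbs(gbps_per_pin: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s from per-pin data rate and bus width."""
    return gbps_per_pin * bus_width_bits / 8


pcie5_x16_one_way = pcie_bandwidth_gbs(32, 16)   # about 63 GB/s per direction
pcie5_x16_both_ways = 2 * pcie5_x16_one_way      # about 126 GB/s, the "128GB/s" headline figure
rx6500xt = gddr_bandwidth_gbs(18, 64)            # 144 GB/s

print(f"PCIe 5.0 x16, one direction:   {pcie5_x16_one_way:.1f} GB/s")
print(f"PCIe 5.0 x16, both directions: {pcie5_x16_both_ways:.1f} GB/s")
print(f"RX 6500 XT GDDR6:              {rx6500xt:.1f} GB/s")
```

Note that the roughly 128 GB/s figure counts both directions at once; a GPU that is mostly reading from the expansion memory would see about half of that.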

ScaLibBDP · Jul 2, 2024

alphaLONE said:
that's at most what, 128GB/s on 16x Gen 5 PCIe? really not much for a big GPU, that's even less than what the RX 6500 XT has.

That is not a huge problem. When it comes to Big Data processing in HPC, AI, etc., if a GPU cluster doesn't support a Unified Memory Architecture (UMA), that is, when CPUs and GPUs do not share the system's RAM, developers try to move as big a chunk of data as possible to GPU memory and then do processing that could take a very long time (seconds, minutes, etc.). That means that, to some degree, memory bandwidth is less important. It is very important to process as big a chunk of data as possible!

Of course, the faster the memory interface, the better.

Yashyyyk · Jul 2, 2024

I think the Phison AI100E / aiDAPTIV+ is more practical for most people, hope to see coverage / testing on that

Ferrum Master · Jul 2, 2024

Everything goes in circles.

It must be over two decades since I socketed additional RAM in my GPU. Not sure if it was Matrox or ATI.

But the idea of an L4-esque RAM pool for GPUs? Killing the premium margin on selling pro GPUs? It will not happen on a large scale. They will not allow it.

Wirko · Jul 2, 2024

But that name ... If someone had asked me yesterday what "panmnesia" means, I'd answer that it's a situation where everyone forgets everything. (Or should that mean everyone except AI?)

ScaLibBDP said:
That is not a huge problem. When it comes to Big Data processing in HPC, AI, etc., if a GPU cluster doesn't support a Unified Memory Architecture (UMA), that is, when CPUs and GPUs do not share the system's RAM, developers try to move as big a chunk of data as possible to GPU memory and then do processing that could take a very long time (seconds, minutes, etc.). That means that, to some degree, memory bandwidth is less important. It is very important to process as big a chunk of data as possible!

Of course, the faster the memory interface, the better.

But this is exactly that, if I understand its purpose well. It's low-latency memory that's shared between nodes, and it becomes part of each GPU's memory space.

Also, do any modern GPU+CPU architectures exist that can actually share memory between nodes, the way a multi-socket CPU system does?

natr0n · Jul 2, 2024

now I can fix this pos 8gb 3070ti and say eat sh!t jensen

LabRat 891 · Jul 3, 2024

Oh sweet summer children... this will never come to the consumer sector. :roll:

Sad, since this would almost be a good excuse all its own for Gen6-> PCI-E in the Consumer Market.

Besides adding resources to GPUs and CPUs, being able to address relatively large amounts of prev.-gen. 'surplus' RAM as NVMe-like (RAMdrive) storage/cache would be useful. [Both in the Enthusiast-Consumer world, and Industry]

Not to mention, if Intel hasn't completely abandoned Optane, they could easily reinvigorate interest.
Offering Intel-licensed Pmem cards (optionally utilizing the once platform-proprietary P-DIMMs) over CXL would greatly broaden the potential market. Esp. w/ the newfound interest in "AI-ing every-thing" :laugh:

Minus Infinity · Jul 3, 2024

Will never be supported for desktop dGPUs so forget it, and it's also not coming any time soon.

GPU makers should add DDR5 memory slots to consumer GPUs so we can expand memory and still have good latency compared to any on-motherboard solution.

ty_ger · Jul 3, 2024

ScaLibBDP said:
That is not a huge problem. When it comes to Big Data processing in HPC, AI, etc., if a GPU cluster doesn't support a Unified Memory Architecture (UMA), that is, when CPUs and GPUs do not share the system's RAM, developers try to move as big a chunk of data as possible to GPU memory and then do processing that could take a very long time (seconds, minutes, etc.). That means that, to some degree, memory bandwidth is less important. It is very important to process as big a chunk of data as possible!

There's a problem with your explanation. You say it's not a big problem to move one big chunk slowly once, because then the data is on the GPU to be processed there. This is different. This is one big chunk next to the GPU, which will then be processed in many small chunks over the slow bus. It's effectively moving the data around on the slow bus constantly, because this is a product designed to be used with GPUs that don't have enough onboard VRAM.
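The difference between the two cases can be put into a very rough Python model. Everything here is an assumption made for illustration: the workload is treated as purely memory-bound, caching and compute/transfer overlap are ignored, and the dataset size, pass count, and bandwidth figures are made up rather than measured.

```python
def one_shot_seconds(data_gb: float, passes: int, link_gbs: float, vram_gbs: float) -> float:
    """Data fits in VRAM: copy it over the link once, then every pass reads VRAM."""
    return data_gb / link_gbs + passes * data_gb / vram_gbs


def streamed_seconds(data_gb: float, passes: int, link_gbs: float) -> float:
    """Data does not fit in VRAM: every pass pulls it over the link again."""
    return passes * data_gb / link_gbs


# Hypothetical numbers: 100 GB of data read 10 times,
# a 63 GB/s link and 1000 GB/s of local VRAM bandwidth.
fits = one_shot_seconds(100, 10, link_gbs=63, vram_gbs=1000)
spills = streamed_seconds(100, 10, link_gbs=63)

print(f"copy once, process locally: {fits:.1f} s")
print(f"stream over the link:       {spills:.1f} s")
```

Under these assumptions the streamed case is bound by the link on every pass, so the gap widens the more often the data is reused. With a single pass the two cases cost about the same, which is where the "move it once" argument holds.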

watzupken · Jul 3, 2024

Minus Infinity said:
Will never be supported for desktop dGPUs so forget it, and it's also not coming any time soon.

GPU makers should add DDR5 memory slots to consumer GPUs so we can expand memory and still have good latency compared to any on-motherboard solution.

I don't think this will ever happen because:
1. The likes of Nvidia will never allow it, and they have an iron reign over these AIBs.
2. Such an option would deprive them of higher revenue/profit margin, since it allows you to buy a lower-end model and increase the RAM.

Minus Infinity · Jul 3, 2024

watzupken said:
I don't think this will ever happen because:
1. The likes of Nvidia will never allow it, and they have an iron reign over these AIBs.
2. Such an option would deprive them of higher revenue/profit margin, since it allows you to buy a lower-end model and increase the RAM.

Oh indeed, but I can dream and it would be a simple option for consumer GPUs. This CXL stuff is for workstation+ class GPU.

Nvidia could of course stop gimping their GPUs and pretending L2 cache is the answer.

enb141 · Jul 3, 2024

Ferrum Master said:
Everything goes in circles.

It must be over two decades since I socketed additional RAM in my GPU. Not sure if it was Matrox or ATI.

But the idea of an L4-esque RAM pool for GPUs? Killing the premium margin on selling pro GPUs? It will not happen on a large scale. They will not allow it.

That was in 1997 (in my case) when I added RAM to my ATI GPU back then.

Dahita · Jul 3, 2024

This is great, we're finally going to be able to play Crysis at over 30fps in 1080p.

TechLurker · Jul 7, 2024

Ferrum Master said:
Everything goes in circles.

It must be over two decades since I socketed additional RAM in my GPU. Not sure if it was Matrox or ATI.

But the idea of an L4-esque RAM pool for GPUs? Killing the premium margin on selling pro GPUs? It will not happen on a large scale. They will not allow it.

For a short period of time, AMD also experimented with their Radeon Pro SSG cards, which included a user-upgradable NVMe drive and provided up to 2TB worth of video card memory.

There were some niche use-cases for it, and there were also attempts by some hardcore enthusiasts to try and access it to install games onto.

Would be interesting if AMD could bring it back for newer datacenter Accelerators as well as even for top-level gaming cards, making full use of PCIe 4.0 bandwidth or even PCIe 5.0 bandwidth to either use the SSDs as extra storage or internally to speed up memory use somehow.
