Tuesday, July 2nd 2024

Panmnesia Uses CXL Protocol to Expand GPU Memory with Add-in DRAM Card or Even SSD

South Korean startup Panmnesia has unveiled an interesting solution to the memory limitations of modern GPUs. The company has developed a low-latency Compute Express Link (CXL) IP that could help expand GPU memory with external add-in cards. Current GPU-accelerated applications in AI and HPC are constrained by the fixed amount of memory built into each GPU. With data sizes growing 3x per year, GPU networks must keep getting larger just to fit the application in local memory, which is what keeps latency low and token generation fast. Panmnesia's proposed fix leverages the CXL protocol to expand GPU memory capacity using PCIe-connected DRAM or even SSDs. To get there, the company had to overcome significant technical hurdles, including the absence of CXL logic fabric in GPUs and the limitations of existing unified virtual memory (UVM) systems.
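To make the 3x-per-year figure concrete, here is a back-of-the-envelope sketch of how many GPUs a workload needs just to hold its data in local memory when per-GPU capacity stays fixed. The 1 TB starting working set and 80 GB per GPU are hypothetical round numbers chosen for illustration; only the 3x yearly growth comes from the article.

```python
import math

# Hypothetical inputs (not from the article): 1 TB of data today, 80 GB per GPU.
DATA_TB_START = 1.0
GPU_MEM_GB = 80
GROWTH_PER_YEAR = 3  # the article's "3x yearly" data growth


def gpus_needed(years: int) -> int:
    """GPUs required just to hold the data in local memory after `years` of growth."""
    data_gb = DATA_TB_START * 1000 * GROWTH_PER_YEAR ** years
    # Round up: a partially filled GPU still counts as a whole GPU.
    return math.ceil(data_gb / GPU_MEM_GB)


for year in range(4):
    print(f"year {year}: {gpus_needed(year)} GPUs")
```

Under these assumptions the count goes from 13 GPUs to 338 in three years purely to fit the data, which is the pressure that external CXL memory is meant to relieve.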

At the heart of Panmnesia's solution is a CXL 3.1-compliant root complex with multiple root ports and a host bridge featuring a host-managed device memory (HDM) decoder. This design effectively tricks the GPU's memory subsystem into treating PCIe-connected memory as native system memory. Panmnesia's testing shows impressive results: its CXL solution, CXL-Opt, achieved two-digit nanosecond round-trip latency, significantly outperforming both UVM and earlier CXL prototypes. Older CXL memory expanders recorded around 250 nanoseconds of round-trip latency, while CXL-Opt can potentially achieve less than 80 nanoseconds. In GPU kernel execution tests, CXL-Opt ran up to 3.22 times faster than UVM. With CXL, the usual problem is that memory pools add latency and degrade performance, and CXL expanders tend to add to the cost as well. Still, Panmnesia's CXL-Opt could find a use case, and we are waiting to see whether anyone adopts it in their infrastructure.
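To illustrate what an HDM decoder does, here is a minimal Python sketch. In CXL, an HDM decoder claims a window of host physical address (HPA) space and routes each access in that window to a target, optionally interleaving consecutive chunks across several targets. The class and field names below are hypothetical, the math is a simplified modulo interleave rather than the bit-level register encoding in the CXL 3.1 specification, and it is not Panmnesia's design. For simplicity it also computes the device-local address each chunk lands at.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HdmDecoder:
    """Simplified, hypothetical model of one HDM decoder's address routing."""
    base: int         # start of the HPA window this decoder claims
    size: int         # length of the window in bytes
    ways: int         # number of interleaved targets
    granularity: int  # bytes sent to one target before moving to the next

    def decode(self, hpa: int) -> tuple[int, int]:
        """Return (target_index, device_local_address) for a host physical address."""
        if not (self.base <= hpa < self.base + self.size):
            raise ValueError(f"HPA {hpa:#x} is outside this decoder's window")
        offset = hpa - self.base
        chunk = offset // self.granularity  # which interleave chunk the address falls in
        target = chunk % self.ways          # chunks rotate round-robin across targets
        # Each target stores every `ways`-th chunk back to back, so its local
        # address space stays dense with no holes.
        local = (chunk // self.ways) * self.granularity + offset % self.granularity
        return target, local


dec = HdmDecoder(base=0x1_0000_0000, size=0x4000_0000, ways=2, granularity=256)
print(dec.decode(0x1_0000_0000))  # first chunk -> target 0
print(dec.decode(0x1_0000_0100))  # next 256-byte chunk -> target 1
print(dec.decode(0x1_0000_0204))  # third chunk wraps back to target 0
```

With two-way interleaving at 256-byte granularity, consecutive chunks alternate between the two targets, which spreads bandwidth across devices while each device still sees a contiguous address range.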
Below are some benchmarks by Panmnesia, as well as the architecture of the CXL-Opt.

Sources: Panmnesia, via Tom's Hardware

14 Comments on Panmnesia Uses CXL Protocol to Expand GPU Memory with Add-in DRAM Card or Even SSD

#1
alphaLONE
That's at most what, 128 GB/s on PCIe Gen 5 x16? Really not much for a big GPU; that's even less than what the RX 6500 XT has.
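For reference, the 128 GB/s figure is the raw PCIe 5.0 x16 rate counted in both directions. The sketch below redoes the arithmetic from public spec values (32 GT/s per lane, 128b/130b line encoding) and compares it with the RX 6500 XT's 18 Gbps GDDR6 on a 64-bit bus. It ignores packet and protocol overhead, so real link throughput would be lower still; the helper names are just for illustration.

```python
def pcie_gb_per_s(gt_per_s: float, lanes: int, payload_bits: int, total_bits: int) -> float:
    """One-direction PCIe bandwidth in GB/s after line encoding (no protocol overhead)."""
    return gt_per_s * lanes * payload_bits / total_bits / 8


def gddr_gb_per_s(gbps_per_pin: float, bus_width_bits: int) -> float:
    """Peak GDDR bandwidth in GB/s: per-pin data rate times bus width."""
    return gbps_per_pin * bus_width_bits / 8


gen5_x16 = pcie_gb_per_s(32, 16, 128, 130)  # PCIe 5.0 x16 with 128b/130b encoding
rx6500xt = gddr_gb_per_s(18, 64)            # RX 6500 XT: 18 Gbps GDDR6, 64-bit bus

print(f"PCIe 5.0 x16: {gen5_x16:.1f} GB/s per direction, {2 * gen5_x16:.1f} GB/s both ways")
print(f"RX 6500 XT VRAM: {rx6500xt:.0f} GB/s")
```

This prints about 63.0 GB/s per direction (126.0 GB/s both ways) against 144 GB/s for the RX 6500 XT, so the comparison holds even before counting protocol overhead.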
#2
ScaLibBDP
alphaLONE wrote: "That's at most what, 128 GB/s on PCIe Gen 5 x16? Really not much for a big GPU; that's even less than what the RX 6500 XT has."
That is not a huge problem. When it comes to big data processing in HPC, AI, etc., if a GPU cluster doesn't support a Unified Memory Architecture (UMA), where CPUs and GPUs do not share system RAM, developers try to move as large a chunk of data as possible to GPU memory and then run processing that can take a very long time (seconds, minutes, etc.). That means memory bandwidth is, to some degree, less important. What matters most is processing as large a chunk of data as possible!

Of course, the faster the memory interface, the better.
#3
Yashyyyk
I think the Phison AI100E / aiDAPTIV+ is more practical for most people, hope to see coverage / testing on that
#4
Ferrum Master
Everything goes in circles.

It must be over two decades since I socketed additional RAM onto my GPU. Not sure if it was a Matrox or an ATI.

But the idea of an L4-esque RAM pool for GPUs? Killing the premium margins on pro GPUs? It will not happen on a large scale. They will not allow it.
#5
Wirko
But that name... If someone had asked me yesterday what "panmnesia" means, I'd have answered that it's a situation where everyone forgets everything. (Or should that mean everyone except AI?)

ScaLibBDP wrote: "That is not a huge problem. When it comes to big data processing in HPC, AI, etc., if a GPU cluster doesn't support a Unified Memory Architecture (UMA), where CPUs and GPUs do not share system RAM, developers try to move as large a chunk of data as possible to GPU memory and then run processing that can take a very long time (seconds, minutes, etc.). That means memory bandwidth is, to some degree, less important. What matters most is processing as large a chunk of data as possible!

Of course, the faster the memory interface, the better."
But this is exactly that, if I understand its purpose correctly. It's low-latency memory that's shared between nodes, and it becomes part of each GPU's memory space.

Also, do any modern GPU+CPU architectures exist that can actually share memory between nodes, the way a multi-socket CPU system does?
#6
natr0n
Now I can fix this POS 8GB 3070 Ti and say eat sh!t, Jensen.
#7
LabRat 891
Oh sweet summer children... this will never come to the consumer sector. :roll:
Sad, since this alone would almost be a good excuse for PCIe Gen 6 and beyond in the consumer market.

Besides adding resources to GPUs and CPUs, being able to address relatively large amounts of previous-generation 'surplus' RAM as NVMe-like (RAM drive) storage or cache would be useful, both in the enthusiast-consumer world and in industry.

Not to mention, if Intel hasn't completely abandoned Optane, it could easily reinvigorate interest.
Offering Intel-licensed PMem cards over CXL (optionally using the once platform-proprietary P-DIMMs) would greatly broaden the potential market, especially with the newfound interest in "AI-ing everything" :laugh:
#8
Minus Infinity
Will never be supported for desktop dGPUs so forget it, and it's also not coming any time soon.

GPU makers should add DDR5 memory slots on consumer GPUs so we can expand memory and still have good latency compared to any motherboard-based solution.
#9
ty_ger
ScaLibBDP wrote: "That is not a huge problem. When it comes to big data processing in HPC, AI, etc., if a GPU cluster doesn't support a Unified Memory Architecture (UMA), where CPUs and GPUs do not share system RAM, developers try to move as large a chunk of data as possible to GPU memory and then run processing that can take a very long time (seconds, minutes, etc.). That means memory bandwidth is, to some degree, less important. What matters most is processing as large a chunk of data as possible!"
There's a problem with your explanation. You say it's not a big problem to move one big chunk slowly once, because then the data is on the GPU to be processed there. This is different: this is one big chunk sitting next to the GPU, which will then be processed in many small chunks over the slow bus. It's effectively moving the data around on the slow bus constantly, because this product is designed for GPUs that don't have enough onboard VRAM.
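The disagreement above comes down to how many times the same data crosses the slow link. A rough model, sketched below, compares copying a working set into VRAM once and then reading it repeatedly against re-reading it across the link on every pass. All inputs are hypothetical round numbers (a 40 GB working set, a link at roughly PCIe 5.0 x16 one-way speed, 1,000 GB/s of local VRAM bandwidth), and the model ignores caching, prefetching, and overlap of compute with transfers.

```python
# Hypothetical inputs for illustration only, not measurements.
SIZE_GB = 40      # working set
LINK_GBS = 63     # roughly PCIe 5.0 x16, one direction
VRAM_GBS = 1000   # assumed local VRAM bandwidth


def copy_once_then_local(size_gb: float, passes: int, link_gbs: float, vram_gbs: float) -> float:
    """Seconds to copy the data over the link once, then read it `passes` times from VRAM."""
    return size_gb / link_gbs + passes * size_gb / vram_gbs


def stream_over_link(size_gb: float, passes: int, link_gbs: float) -> float:
    """Seconds when every pass has to read the data across the link again."""
    return passes * size_gb / link_gbs


for passes in (1, 10, 100):
    local = copy_once_then_local(SIZE_GB, passes, LINK_GBS, VRAM_GBS)
    remote = stream_over_link(SIZE_GB, passes, LINK_GBS)
    print(f"{passes:>3} passes: copy-then-local {local:6.2f} s, stream-over-link {remote:6.2f} s")
```

For a single pass the two cost about the same, which is ScaLibBDP's point; with many passes, streaming over the link falls far behind, which is ty_ger's. Copying once is only an option when the data fits in VRAM, and the case Panmnesia targets is precisely when it does not.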
#10
watzupken
Minus Infinity wrote: "Will never be supported for desktop dGPUs so forget it, and it's also not coming any time soon.

GPU makers should add DDR5 memory slots on consumer GPUs so we can expand memory and still have good latency compared to any motherboard-based solution."
I don't think this will ever happen because:
1. The likes of Nvidia will never allow it, and they have an iron grip over these AIBs.
2. Such an option would deprive them of higher revenue and profit margins, since it allows you to buy a lower-end model and increase the RAM.
#11
Minus Infinity
watzupken wrote: "I don't think this will ever happen because:
1. The likes of Nvidia will never allow it, and they have an iron grip over these AIBs.
2. Such an option would deprive them of higher revenue and profit margins, since it allows you to buy a lower-end model and increase the RAM."
Oh indeed, but I can dream, and it would be a simple option for consumer GPUs. This CXL stuff is for workstation-class GPUs and up.

Nvidia could of course stop gimping their GPUs and pretending L2 cache is the answer.
#12
enb141
Ferrum Master wrote: "Everything goes in circles.

It must be over two decades since I socketed additional RAM onto my GPU. Not sure if it was a Matrox or an ATI.

But the idea of an L4-esque RAM pool for GPUs? Killing the premium margins on pro GPUs? It will not happen on a large scale. They will not allow it."
In my case, that was in 1997, when I added RAM to my ATI GPU.
#13
Dahita
This is great, we're finally going to be able to play Crysis at over 30fps in 1080p.
#14
TechLurker
Ferrum Master wrote: "Everything goes in circles.

It must be over two decades since I socketed additional RAM onto my GPU. Not sure if it was a Matrox or an ATI.

But the idea of an L4-esque RAM pool for GPUs? Killing the premium margins on pro GPUs? It will not happen on a large scale. They will not allow it."
For a short period of time, AMD also experimented with their Radeon Pro SSG cards, which included a user-upgradable NVMe drive and provided up to 2 TB worth of video card memory.

There were some niche use cases for it, and some hardcore enthusiasts also tried to access it to install games on it.

It would be interesting if AMD brought it back for newer datacenter accelerators and even top-end gaming cards, making full use of PCIe 4.0 or even PCIe 5.0 bandwidth to use the SSDs either as extra storage or internally to speed up memory use somehow.