Thursday, February 13th 2025

SanDisk Develops HBM Killer: High-Bandwidth Flash (HBF) Allows 4 TB of VRAM for AI GPUs

During its first post-Western Digital spinoff investor day, SanDisk showcased a technology it has been developing to tackle the AI sector. High-bandwidth flash (HBF) is a new memory architecture that combines 3D NAND flash storage with bandwidth comparable to high-bandwidth memory (HBM). The HBF design stacks 16 3D NAND BiCS8 dies using through-silicon vias, with a logic layer enabling parallel access to memory sub-arrays. This configuration achieves 8 to 16 times greater capacity per stack than current HBM implementations. A system using eight HBF stacks can provide 4 TB of VRAM, enough to store large AI models like GPT-4 directly on GPU hardware. The architecture breaks from conventional NAND design by implementing independently accessible memory sub-arrays, moving beyond traditional multi-plane approaches. While HBF surpasses HBM's capacity specifications, it has higher latency than DRAM, limiting its application to specific workloads.
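As a rough sanity check of those numbers (the HBM3E stack capacity used for comparison is a typical current figure, not a SanDisk disclosure):

```python
# Back-of-the-envelope check of the capacity figures above.
# Assumption: a current HBM3E stack holds roughly 36 GB (12-high stack of 24 Gb dies).
hbf_stacks = 8
total_capacity_gb = 4 * 1024                    # 4 TB spread across eight stacks
per_stack_gb = total_capacity_gb / hbf_stacks   # 512 GB per HBF stack
per_die_gb = per_stack_gb / 16                  # 32 GB per BiCS8 die in a 16-high stack

hbm3e_stack_gb = 36                             # assumed figure, for comparison only
print(f"Per HBF stack: {per_stack_gb:.0f} GB ({per_die_gb:.0f} GB per die)")
print(f"Capacity vs. one HBM3E stack: ~{per_stack_gb / hbm3e_stack_gb:.0f}x")  # ~14x, within the claimed 8-16x
```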

SanDisk has not disclosed its solution for NAND's inherent write endurance limitations, though using pSLC NAND could help balance durability and cost. The bandwidth of HBF is also unknown, as the company hasn't released detailed specifications yet. SanDisk Memory Technology Chief Alper Ilkbahar confirmed the technology targets read-intensive AI inference tasks rather than latency-sensitive applications. The company is developing HBF as an open standard, incorporating mechanical and electrical interfaces similar to HBM to simplify integration. Some challenges remain, including NAND's block-level addressing limitations and write endurance constraints. While these factors make HBF unsuitable for gaming applications, its high capacity and throughput characteristics align with AI model storage and inference requirements. SanDisk has announced plans for three generations of HBF development, indicating a long-term commitment to the technology.
Source: via Tom's Hardware

35 Comments on SanDisk Develops HBM Killer: High-Bandwidth Flash (HBF) Allows 4 TB of VRAM for AI GPUs

#1
Denver
Correct me if I'm wrong... but AI models would rapidly degrade NAND storage, making it impractical for long-term use, since NAND (when used as VRAM) requires continuous high-frequency read and write operations.
Posted on Reply
#2
Wirko
AleksandarKSome challenges remain, including NAND's block-level addressing limitations and writing endurance constraints.
It's actually page-level reading/writing and block-level erasing, where a page is 4 kibibytes and a block is a few megabytes. However, the architecture of HBF seems to be a lot different, and the page size may also be smaller (or larger) if SanDisk thinks it's better for the purpose.
Even DRAM is far from being byte-addressable; the smallest unit of transfer is 64 bytes in DDR, 32 bytes in HBM3, and I think it's the same in HBM3E.
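To put those granularities side by side, here is a toy calculation of how many device transactions it takes to stream 1 GiB (illustrative only; HBF's actual page size hasn't been disclosed):

```python
# Illustrative only: device transactions needed to read 1 GiB at each access granularity.
GIB = 1024 ** 3
granularities = {
    "DDR burst (64 B)": 64,
    "HBM3 burst (32 B)": 32,
    "NAND page (4 KiB)": 4096,
}
for name, unit_bytes in granularities.items():
    print(f"{name:18s}: {GIB // unit_bytes:>12,} transactions per GiB")
```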
Posted on Reply
#3
AnotherReader
WirkoIt's actually page-level reading/writing and block-level erasing, where a page is 4 ki-bi-bytes and a block is a few megabytes. However, the architecture of HBF seems to be a lot different, and the page size may also be smaller (or larger) if Sandisk thinks it's better for the purpose.
Even DRAM is far from being byte-addressable; the smallest unit of transfer is 64 bytes in DDR, 32 bytes in HBM3, and I think it's the same in HBM3E.
It isn't just the granularity of transfer. DRAM has unlimited endurance; this, on the other hand, is unlikely to be much better than SLC.
Posted on Reply
#4
bonehead123
WirkoIt's actually page-level reading/writing and block-level erasing, where a page is 4 ki-bi-bytes and a block is a few megabytes. However, the architecture of HBF seems to be a lot different, and the page size may also be smaller (or larger) if Sandisk thinks it's better for the purpose.
Even DRAM is far from being byte-addressable; the smallest unit of transfer is 64 bytes in DDR, 32 bytes in HBM3, and I think it's the same in HBM3E.
AnotherReaderIt isn't just the granularity of transfer. DRAM has unlimited endurance; this, on the other hand, is unlikely to be much better than SLC.
Soooo...what ya'll are saying is that there won't be any 4TB, $25K GPU's for da gamrz to drool over, at least not for a while anyways ?

Aw so sad :D


n.O.t.....
Posted on Reply
#5
qlum
I assume it is mostly meant for large AI models, which require quite a lot of VRAM to run. Performance as memory won't be great, but if it's performant enough, with DRAM on top, it may very well be good enough.
If so, it's a good development to bring costs down for these.
Posted on Reply
#6
andrehide
Previous attempts to use non-RAM as RAM failed.
The most famous one was Intel/Micron Optane/3D XPoint.
It doesn't seem that this one will do any better.
Posted on Reply
#7
Wirko
bonehead123Soooo...what ya'll are saying is that there won't be any 4TB, $25K GPU's for us gamrz to drool over, at least not for a while anyways ?

Aw so sad :D


n.O.t.....
The trend is obviously towards 8GB, $25K GPUs, but the frog tastes best if cooked slowly.
DenverCorrect me if I'm wrong... but AI models would rapidly degrade NAND storage, making it impractical for long-term use, since NAND ( when used as VRAM) requires continuous high-frequency read and write operations.
The models would only be updated occasionally, that's the idea, so writing wouldn't be much of a problem. But limited read endurance is also sometimes hinted at. I don't know how much research has been done around read degradation, and whether it's relevant. Anyway, processing needs RAM too, a couple hundred MB of static RAM cache can't suffice for that, so inevitably some HBM will be part of the system too.
Posted on Reply
#8
LabRat 891
This HBF looks 'useful', but not on its own.
Inb4 tiered memory standards for Compute/Graphics?

Top: L1-3 caches
Upper: HBM
Lower: HBF
Bottom: NAND

Stack it 'till it's cheap :laugh:
Posted on Reply
#9
Assimilator
Yeah, no. NAND flash is not RAM, it is designed for entirely different usage patterns, and the notion that it could be used as a replacement for RAM is nonsensical. Considering GPUs already have effectively direct access to storage via APIs like DirectStorage, I see no use-case for this technology.
Posted on Reply
#10
evernessince
DenverCorrect me if I'm wrong... but AI models would rapidly degrade NAND storage, making it impractical for long-term use, since NAND ( when used as VRAM) requires continuous high-frequency read and write operations.
It depends, is the AI model being constantly loaded or can it simply stay in memory?

If they are using flash here it may be non-volatile which could make it quite flexible.
Posted on Reply
#11
LabRat 891
evernessinceIt depends, is the AI model being constantly loaded or can it simply stay in memory?

If they are using flash here it may be non-volatile which could make it quite flexible.
Looking towards 'applications' in a given 'product',
maybe parts of a model can better utilize different kinds of storage?

I'm thinking:
"Working memory" Cache, HBM, RAM.
"Short-Term Memory" HBF, XLflash, phase-change memory, massively parallelized (p)SLC NAND.
"Long-Term Memory" TLC and QLC NAND.
"Archival Memory" HDDs and Magnetic Tape.
Posted on Reply
#12
ScaLibBDP
DenverCorrect me if I'm wrong... but AI models would rapidly degrade NAND storage, making it impractical for long-term use, since NAND ( when used as VRAM) requires continuous high-frequency read and write operations.
There are two cases: training (write ops) and inference (read ops, where they intend to use HBF). Its overall endurance depends on Terabytes Written (TBW).
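A rough, hedged endurance estimate (every figure below is an assumption; none come from SanDisk):

```python
# Assumptions: 512 GB pSLC-class stack, ~30,000 P/E cycles, one full rewrite
# of the stored weights per day. None of these figures come from SanDisk.
stack_gb = 512
pe_cycles = 30_000
tbw = stack_gb * pe_cycles / 1024               # ~15,000 TB written per stack

gb_written_per_day = stack_gb * 1               # one full rewrite per day
years = (tbw * 1024) / gb_written_per_day / 365
print(f"~{tbw:,.0f} TBW per stack, ~{years:.0f} years at one full rewrite per day")
```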

Also, that new technology could affect the progress of CXL memory expanders (very expensive stuff right now). 4 TB inside a GPU is a lot of memory for processing!
Posted on Reply
#13
Wirko
evernessinceIt depends, is the AI model being constantly loaded or can it simply stay in memory?

If they are using flash here it may be non-volatile which could make it quite flexible.
Non-volatility means little, or nothing, in this kind of application. The processors will crunch vectors and matrices without interruption until they're too old and can't make enough money anymore. (Well, low-power and sleep states probably exist too, since not all processors can be fully loaded all of the time.)
ScaLibBDPAlso, that new technology could affect progress of CXL Memory Expanders ( very expensive stuff right now ).
I don't see a close connection. CXL is PCIe, which is up to 16 lanes of Gen 6 (maybe soon in AI datacenters) or Gen 7 (a few years out). That's infinitely slower than several stacks of on-package HBM/HBF optimised for maximum bandwidth and maximum cost.
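A rough comparison of the raw per-direction link bandwidths (protocol overhead ignored; the HBM pin rate and bus width are typical HBM3E values, not HBF figures):

```python
# Raw per-direction bandwidth; PAM4/FLIT encoding and protocol overhead ignored.
def pcie_gb_s(gt_per_lane, lanes=16):
    return gt_per_lane * lanes / 8              # GT/s per lane -> GB/s for the whole link

def hbm_gb_s(pin_gbps, width_bits=1024):
    return pin_gbps * width_bits / 8            # per-pin data rate -> GB/s per stack

print(f"PCIe 6.0 x16 (CXL): ~{pcie_gb_s(64):.0f} GB/s")     # ~128 GB/s
print(f"PCIe 7.0 x16 (CXL): ~{pcie_gb_s(128):.0f} GB/s")    # ~256 GB/s
print(f"One HBM3E stack   : ~{hbm_gb_s(9.6):.0f} GB/s")     # ~1.2 TB/s
```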
Posted on Reply
#14
evernessince
WirkoNon-volatility means little, or nothing, in this kind of applications. The processors will crunch vectors and matrices without interruption until they're too old and can't make enough money anymore. (Well, low power and sleep states probably exist too, since all processors can't be fully loaded all of the time.)
Non-volatile memory yields power and cost savings. There are dozens of articles on the topic: www.embedded.com/the-benefit-of-non-volatile-memory-nvm-for-edge-ai/

www.forbes.com/sites/tomcoughlin/2023/09/06/emerging-non-volatile-memories-enable-iot-and-ai-growth/

It allows you to take fetches that would otherwise go to main system memory or mass storage and put them right on the chip. This lowers latency and power consumption. In addition, flash doesn't need to be constantly refreshed when not actively in use, so you can power-tune it very aggressively. This is simply not possible with volatile memory that needs to be refreshed to maintain data.

I believe LabRat 891 put it perfectly, it makes sense as another layer in the memory subsystem designed to hold a specific set of data and the overall workload will see a very nice benefit as a result.
Posted on Reply
#15
dont whant to set it"'
It can operate at what bandwidth, dear sir? Bandwidth is in its name after all.
1 bit per second, arbitrary value.
Posted on Reply
#16
InVasMani
If it isn't readily serviceable and replaceable, the NAND seems like a serious e-waste concern for the rest of the hardware if it degrades too quickly. It might be acceptable for AI depending on longevity, but probably not so much otherwise.
Posted on Reply
#17
kondamin
Another reminder it was dumb to kill off XPoint; it would have been as in demand as HBM for the last 2 years
Posted on Reply
#18
Wirko
evernessinceNon-volatile memory yields power and cost savings. There are dozens of articles on the topic: www.embedded.com/the-benefit-of-non-volatile-memory-nvm-for-edge-ai/

www.forbes.com/sites/tomcoughlin/2023/09/06/emerging-non-volatile-memories-enable-iot-and-ai-growth/

It allows you to take fetches that would otherwise go to main system memory or mass storage and put them right on the chip. This lowers latency and power consumption. In addition, flash doesn't need to be constantly refreshed when not actively in use, so you can power-tune it very aggressively. This is simply not possible with volatile memory that needs to be refreshed to maintain data.

I believe LabRat 891 put it perfectly, it makes sense as another layer in the memory subsystem designed to hold a specific set of data and the overall workload will see a very nice benefit as a result.
I don't disagree; NAND does have some advantages, but non-volatility by itself is not important unless and until power goes out. A theoretical volatile NAND with extremely low idle power (similar to SRAM) would do this job just as well; that's my point.
kondaminAnother reminder it was dumb to kill off XPoint; it would have been as in demand as HBM for the last 2 years
We can't be sure it's dead. The development continues somewhere deep under the ground and will continue until all patents expire. TI (or whoever) may succeed in developing a method to expand those 4 layers to 100+ ... but it's not a given.
dont whant to set it'It can operate at what bandwidth dear sir?Bandwidth is in its name after all.
1bit per second, arbitrary value.
HBM sends the data around at about 6400 MT/s, and NAND does it at 3200 MT/s. So, as a quick estimate, half of HBM's bandwidth would be possible with the technology we already have.
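Putting that quick estimate into numbers, assuming an HBM-like 1024-bit stack interface (the width is my assumption, not a SanDisk figure):

```python
# Assumes an HBM-style 1024-bit stack interface; only the per-pin transfer rate changes.
width_bits = 1024
for name, rate_mts in [("HBM @ 6400 MT/s", 6400), ("NAND I/O @ 3200 MT/s", 3200)]:
    gb_s = width_bits * rate_mts / 8 / 1000
    print(f"{name:20s}: ~{gb_s:.0f} GB/s per stack")
# 6400 MT/s -> ~819 GB/s; 3200 MT/s -> ~410 GB/s, i.e. roughly half of HBM.
```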
Posted on Reply
#19
lexluthermiester
AleksandarKSanDisk Develops HBM Killer: High-Bandwidth Flash (HBF)
Wait, what?...
AleksandarKWhile HBF surpasses HBM's capacity specifications, it maintains higher latency than DRAM, limiting its application to specific workloads.
So this is NAND... HBM is DRAM...

NOT an HBM killer. That was a very click-bait headline. Come on peeps, TPU is better than that crap..
Posted on Reply
#20
AleksandarK
News Editor
lexluthermiesterWait, what?...

So this is NAND... HBM is DRAM...

NOT an HBM killer. That was a very click-bait headline. Come on peeps, TPU is better than that crap..
For AI workloads it's an HBM killer (despite not being the same tech fundamentally). Imagine you load an entire model on a single GPU. You don't need top-tier low latency.
Posted on Reply
#21
InVasMani
If they can somehow do this on an M.2 module attached to a GPU and get similar results, it would be great. If it's just replacing volatile VRAM with NAND, with questionable endurance, it's maybe not as exciting. From a business standpoint it could still make a lot of sense, though, if the economics work out in terms of profitability.
Posted on Reply
#22
lexluthermiester
AleksandarKFor AI workloads it's an HBM killer (despite not being the same tech fundamentally). Imagine you load an entire model on a single GPU. You don't need top-tier low latency.
While that's a fair point, I was referring to the durability factor. NAND wears out, and under these kinds of loads it would wear out swiftly. This is fact and cannot be argued. DRAM does not wear out.

That was my point. For that reason alone, HBF is NOT an HBM killer. Until we have a major breakthrough in NAND flash durability, it will not change. All SanDisk has done is create mildly and temporarily useful e-waste.
Posted on Reply
#23
Solid State Brain
They're already saying it's for AI inference, i.e. mostly read-centric workloads where most of the bandwidth utilization is reading model weights (in the hundreds of gigabytes to few terabytes range). Nothing prohibits hardware manufacturers from putting VRAM or HBM alongside the HBF for memory content that needs to be frequently modified (mainly the key-value cache during token generation).
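As a purely illustrative sketch (not any real allocator API), a runtime could route buffers like this:

```python
# Purely illustrative: route read-only weights to HBF, frequently rewritten buffers to HBM.
from dataclasses import dataclass

@dataclass
class Buffer:
    name: str
    size_gb: float
    mutable: bool              # rewritten during token generation?

def place(buffers):
    return {b.name: ("HBM" if b.mutable else "HBF") for b in buffers}

llm = [
    Buffer("weights", 800.0, mutable=False),    # static model weights, read-only
    Buffer("kv_cache", 40.0, mutable=True),     # grows and is updated every generated token
    Buffer("activations", 2.0, mutable=True),
]
print(place(llm))   # {'weights': 'HBF', 'kv_cache': 'HBM', 'activations': 'HBM'}
```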
Posted on Reply
#24
lexluthermiester
Solid State BrainThey're already saying it's for AI inference, i.e. mostly read-centric workloads where most of the bandwidth utilization is reading model weights (in the hundreds of gigabytes to few terabytes range).
While that is a reasonable point, NAND flash simply doesn't have the durability to be useful long term in such a way.
Solid State BrainNothing prohibits hardware manufacturers from putting VRAM or HBM alongside the HBF for memory content that needs to be frequently modified (mainly the key-value cache during token generation).
Another reasonable point, however, that was not the claim made in the above article.
Posted on Reply
#25
Solid State Brain
lexluthermiesterWhile that is a reasonable point, NAND flash simply doesn't have the durability to be useful long term in such a way.
What makes you think so? LLM weights (at least as of now) are static and once loaded in memory they won't need to be modified unless you need to replace them entirely with something else. Since datacenter GPUs will basically never be turned off and the HBF isn't going to store irreplaceable data anyway (the weights will likely be first read from slower long-term storage devices), data retention doesn't need to be very long, and this will increase the number of write/erase cycles allowed.
lexluthermiesterAnother reasonable point, however, that was not the claim made in the above article.
The linked original presentation from SanDisk shows one such configuration on page 99:
documents.sandisk.com/content/dam/asset-library/en_us/assets/public/sandisk/corporate/Sandisk-Investor-Day_2025.pdf
Posted on Reply