
SanDisk Develops HBM Killer: High-Bandwidth Flash (HBF) Allows 4 TB of VRAM for AI GPUs

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,798 (1.02/day)
During its first post-Western Digital spinoff investor day, SanDisk showed something it has been working on to tackle the AI sector. High-bandwidth flash (HBF) is a new memory architecture that combines 3D NAND flash storage with bandwidth capabilities comparable to high-bandwidth memory (HBM). The HBF design stacks 16 3D NAND BiCS8 dies using through-silicon vias, with a logic layer enabling parallel access to memory sub-arrays. This configuration achieves 8 to 16 times greater capacity per stack than current HBM implementations. A system using eight HBF stacks can provide 4 TB of VRAM to store large AI models like GPT-4 directly on GPU hardware. The architecture breaks from conventional NAND design by implementing independently accessible memory sub-arrays, moving beyond traditional multi-plane approaches. While HBF surpasses HBM's capacity specifications, it maintains higher latency than DRAM, limiting its application to specific workloads.
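The capacity figures above are internally consistent, as a back-of-envelope sketch shows (the per-die capacity is inferred from the stated totals, not something SanDisk has confirmed):

```python
# Sanity-check of SanDisk's stated HBF capacity figures.
# Per-die capacity is inferred from the totals, not officially stated.
dies_per_stack = 16        # BiCS8 3D NAND dies per HBF stack
stacks_per_gpu = 8         # stacks in the 4 TB example system
total_vram_gb = 4 * 1024   # 4 TB total

per_stack_gb = total_vram_gb / stacks_per_gpu   # 512 GB per stack
per_die_gb = per_stack_gb / dies_per_stack      # 32 GB per die

# Compare against a high-end HBM3E stack (36 GB today)
hbm3e_stack_gb = 36
print(per_stack_gb, per_die_gb, per_stack_gb / hbm3e_stack_gb)
```

That works out to roughly 14x the capacity of a 36 GB HBM3E stack, inside the claimed 8-16x range.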

SanDisk has not disclosed its solution for NAND's inherent write endurance limitations, though using pSLC NAND makes it possible to balance durability and cost. The bandwidth of HBF is also unknown, as the company hasn't released details yet. SanDisk Memory Technology Chief Alper Ilkbahar confirmed the technology targets read-intensive AI inference tasks rather than latency-sensitive applications. The company is developing HBF as an open standard, incorporating mechanical and electrical interfaces similar to HBM to simplify integration. Some challenges remain, including NAND's block-level addressing limitations and write endurance constraints. While these factors make HBF unsuitable for gaming applications, the technology's high capacity and throughput characteristics align with AI model storage and inference requirements. SanDisk has announced plans for three generations of HBF development, indicating a long-term commitment to the technology.



View at TechPowerUp Main Site | Source
 
Joined
Jan 3, 2021
Messages
3,851 (2.55/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
Some challenges remain, including NAND's block-level addressing limitations and write endurance constraints.
It's actually page-level reading/writing and block-level erasing, where a page is 4 kibibytes and a block is a few megabytes. However, the architecture of HBF seems to be a lot different, and the page size may also be smaller (or larger) if SanDisk thinks it's better for the purpose.
Even DRAM is far from being byte-addressable; the smallest unit of transfer is 64 bytes in DDR, 32 bytes in HBM3, and I think it's the same in HBM3E.
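For illustration, here's what those access granularities mean in transfer counts (a toy calculation, not tied to any real controller):

```python
# Minimum-size transfers needed to read 1 MiB at the burst
# granularities mentioned above (64 B for DDR, 32 B for HBM3).
granularity_bytes = {"DDR": 64, "HBM3": 32}
payload = 1 << 20  # 1 MiB

transfers = {mem: payload // g for mem, g in granularity_bytes.items()}
print(transfers)  # HBM3 issues twice as many, smaller transfers
```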
 
Joined
Nov 26, 2021
Messages
1,824 (1.54/day)
Location
Mississauga, Canada
Processor Ryzen 7 5700X
Motherboard ASUS TUF Gaming X570-PRO (WiFi 6)
Cooling Noctua NH-C14S (two fans)
Memory 2x16GB DDR4 3200
Video Card(s) Reference Vega 64
Storage Intel 665p 1TB, WD Black SN850X 2TB, Crucial MX300 1TB SATA, Samsung 830 256 GB SATA
Display(s) Nixeus NX-EDG27, and Samsung S23A700
Case Fractal Design R5
Power Supply Seasonic PRIME TITANIUM 850W
Mouse Logitech
VR HMD Oculus Rift
Software Windows 11 Pro, and Ubuntu 20.04
It's actually page-level reading/writing and block-level erasing, where a page is 4 kibibytes and a block is a few megabytes. However, the architecture of HBF seems to be a lot different, and the page size may also be smaller (or larger) if SanDisk thinks it's better for the purpose.
Even DRAM is far from being byte-addressable; the smallest unit of transfer is 64 bytes in DDR, 32 bytes in HBM3, and I think it's the same in HBM3E.
It isn't just the granularity of transfer. DRAM has unlimited endurance; this, on the other hand, is unlikely to be much better than SLC.
 
Joined
Oct 18, 2013
Messages
6,429 (1.55/day)
Location
So close that even your shadow can't see me !
System Name The Little One
Processor i5-11320H @4.4GHZ
Motherboard AZW SEI
Cooling Fan w/heat pipes + side & rear vents
Memory 64GB Crucial DDR4-3200 (2x 32GB)
Video Card(s) Iris XE
Storage WD Black SN850X 8TB m.2, Seagate 2TB SSD + SN850 8TB x2 in an external enclosure
Display(s) 2x Samsung 43" & 2x 32"
Case Practically identical to a mac mini, just purrtier in slate blue, & with 3x usb ports on the front !
Audio Device(s) Yamaha ATS-1060 Bluetooth Soundbar & Subwoofer
Power Supply 65w brick
Mouse Logitech MX Master 2
Keyboard Logitech G613 mechanical wireless
VR HMD Whahdatiz ???
Software Windows 10 pro, with all the unnecessary background shitzu turned OFF !
Benchmark Scores PDQ
It's actually page-level reading/writing and block-level erasing, where a page is 4 kibibytes and a block is a few megabytes. However, the architecture of HBF seems to be a lot different, and the page size may also be smaller (or larger) if SanDisk thinks it's better for the purpose.
Even DRAM is far from being byte-addressable; the smallest unit of transfer is 64 bytes in DDR, 32 bytes in HBM3, and I think it's the same in HBM3E.
It isn't just the granularity of transfer. DRAM has unlimited endurance; this, on the other hand, is unlikely to be much better than SLC.
Soooo...what ya'll are saying is that there won't be any 4TB, $25K GPU's for da gamrz to drool over, at least not for a while anyways ?

Aw so sad :D


n.O.t.....
 
Joined
Dec 29, 2020
Messages
228 (0.15/day)
I assume it's mostly meant for large AI models, which require quite a lot of VRAM to run. Performance as memory won't be great, but if it's performant enough, with the DRAM on top, it may very well be good enough.
If so, it's a good development to bring costs down for these.
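A rough sketch of why large models need this much memory (the parameter counts and precisions below are illustrative assumptions, not any vendor's published figures):

```python
# Rough VRAM needed just to hold model weights at a given precision.
# Parameter counts here are illustrative assumptions.
def weights_gib(params_billion: float, bytes_per_param: int) -> float:
    """GiB required to store the weights alone (no activations, no KV cache)."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

for params in (70, 405, 1800):              # 70B-class up to GPT-4-scale rumors
    for label, bpp in (("FP16", 2), ("INT8", 1)):
        print(f"{params}B @ {label}: {weights_gib(params, bpp):.0f} GiB")
```

Even a rumored GPT-4-scale model at FP16 lands around 3.3 TiB, which is why a 4 TB pool on the GPU is an interesting number.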
 
Joined
Jul 28, 2020
Messages
8 (0.00/day)
Previous attempts to use non-RAM as RAM failed.
The most famous one was Intel/Micron Optane/3D XPoint.
It doesn't seem that this one will do any better.
 
Joined
Jan 3, 2021
Messages
3,851 (2.55/day)
Location
Slovenia
Soooo...what ya'll are saying is that there won't be any 4TB, $25K GPU's for us gamrz to drool over, at least not for a while anyways ?

Aw so sad :D


n.O.t.....
The trend is obviously towards 8GB, $25K GPUs, but the frog tastes best if cooked slowly.

Correct me if I'm wrong... but AI models would rapidly degrade NAND storage, making it impractical for long-term use, since NAND ( when used as VRAM) requires continuous high-frequency read and write operations.
The models would only be updated occasionally, that's the idea, so writing wouldn't be much of a problem. But limited read endurance is also sometimes hinted at. I don't know how much research has been done around read degradation, and whether it's relevant. Anyway, processing needs RAM too, a couple hundred MB of static RAM cache can't suffice for that, so inevitably some HBM will be part of the system too.
 
Joined
Apr 18, 2019
Messages
2,510 (1.18/day)
Location
Olympia, WA
System Name Sleepy Painter
Processor AMD Ryzen 5 3600
Motherboard Asus TuF Gaming X570-PLUS/WIFI
Cooling FSP Windale 6 - Passive
Memory 2x16GB F4-3600C16-16GVKC @ 16-19-21-36-58-1T
Video Card(s) MSI RX580 8GB
Storage 2x Samsung PM963 960GB nVME RAID0, Crucial BX500 1TB SATA, WD Blue 3D 2TB SATA
Display(s) Microboard 32" Curved 1080P 144hz VA w/ Freesync
Case NZXT Gamma Classic Black
Audio Device(s) Asus Xonar D1
Power Supply Rosewill 1KW on 240V@60hz
Mouse Logitech MX518 Legend
Keyboard Red Dragon K552
Software Windows 10 Enterprise 2019 LTSC 1809 17763.1757
This HBF looks 'useful', but not on its own.
Inb4 tiered memory standards for Compute/Graphics?

Top: L1-3 caches
Upper: HBM
Lower: HBF
Bottom: NAND

Stack it 'till it's cheap :laugh:
 
Joined
Feb 18, 2005
Messages
6,119 (0.84/day)
Location
Ikenai borderline!
System Name Firelance.
Processor Threadripper 3960X
Motherboard ROG Strix TRX40-E Gaming
Cooling IceGem 360 + 6x Arctic Cooling P12
Memory 8x 16GB Patriot Viper DDR4-3200 CL16
Video Card(s) MSI GeForce RTX 4060 Ti Ventus 2X OC
Storage 2TB WD SN850X (boot), 4TB Crucial P3 (data)
Display(s) Dell S3221QS(A) (32" 38x21 60Hz) + 2x AOC Q32E2N (32" 25x14 75Hz)
Case Enthoo Pro II Server Edition (Closed Panel) + 6 fans
Power Supply Fractal Design Ion+ 2 Platinum 760W
Mouse Logitech G604
Keyboard Razer Pro Type Ultra
Software Windows 10 Professional x64
Yeah, no. NAND flash is not RAM, it is designed for entirely different usage patterns, and the notion that it could be used as a replacement for RAM is nonsensical. Considering GPUs already have effectively direct access to storage via APIs like DirectStorage, I see no use-case for this technology.
 
Joined
Jul 13, 2016
Messages
3,505 (1.12/day)
Processor Ryzen 7800X3D
Motherboard ASRock X670E Taichi
Cooling Noctua NH-D15 Chromax
Memory 32GB DDR5 6000 CL30
Video Card(s) MSI RTX 4090 Trio
Storage P5800X 1.6TB 4x 15.36TB Micron 9300 Pro 4x WD Black 8TB M.2
Display(s) Acer Predator XB3 27" 240 Hz
Case Thermaltake Core X9
Audio Device(s) JDS Element IV, DCA Aeon II
Power Supply Seasonic Prime Titanium 850w
Mouse PMM P-305
Keyboard Wooting HE60
VR HMD Valve Index
Software Win 10
Correct me if I'm wrong... but AI models would rapidly degrade NAND storage, making it impractical for long-term use, since NAND ( when used as VRAM) requires continuous high-frequency read and write operations.

It depends, is the AI model being constantly loaded or can it simply stay in memory?

If they are using flash here it may be non-volatile which could make it quite flexible.
 
Joined
Apr 18, 2019
Messages
2,510 (1.18/day)
Location
Olympia, WA
It depends, is the AI model being constantly loaded or can it simply stay in memory?

If they are using flash here it may be non-volatile which could make it quite flexible.
Looking towards 'applications' in a given 'product':
maybe parts of a model can better utilize different kinds of storage?

I'm thinking:
"Working memory" Cache, HBM, RAM.
"Short-Term Memory" HBF, XLflash, phase-change memory, massively parallelized (p)SLC NAND.
"Long-Term Memory" TLC and QLC NAND.
"Archival Memory" HDDs and Magnetic Tape.
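The tiering above can be sketched as a simple lookup policy (the latency figures are order-of-magnitude guesses for illustration, not vendor numbers):

```python
# Toy model of the proposed tiers: a read is served from the tier
# the data lives in. Latencies are order-of-magnitude guesses only.
TIERS = [
    ("cache", 1e-8),   # SRAM caches, ~10 ns
    ("HBM",   1e-7),   # ~100 ns
    ("HBF",   1e-5),   # NAND-class read, ~10 us (assumed)
    ("NAND",  1e-4),   # SSD-class, ~100 us
]

def read_latency(resident_in: str) -> float:
    """Seconds to serve a read from the named tier."""
    for name, lat in TIERS:
        if name == resident_in:
            return lat
    raise KeyError(resident_in)

print(read_latency("HBF") / read_latency("HBM"))  # ~100x slower than HBM
```

Under these assumptions HBF sits a couple of orders of magnitude behind HBM on latency, which is exactly why it only suits data that is read in bulk, like model weights.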
 
Joined
Jan 2, 2019
Messages
189 (0.08/day)
Correct me if I'm wrong... but AI models would rapidly degrade NAND storage, making it impractical for long-term use, since NAND ( when used as VRAM) requires continuous high-frequency read and write operations.

There are two cases: training (write ops) and inference (read ops, where they intend to use HBF). Its overall endurance depends on Terabytes Written (TBW).

Also, that new technology could affect the progress of CXL Memory Expanders (very expensive stuff right now). 4 TB inside a GPU is a lot of memory for processing!
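A back-of-envelope TBW check makes the read-vs-write distinction concrete (the pSLC P/E cycle count and write rates are assumptions for illustration):

```python
# Back-of-envelope endurance: how long NAND lasts at a given write rate.
# The P/E cycle figure for pSLC is an assumed ballpark, not a spec.
def lifetime_years(capacity_tb: float, pe_cycles: int, writes_tb_per_day: float) -> float:
    tbw = capacity_tb * pe_cycles          # total terabytes-written rating
    return tbw / writes_tb_per_day / 365

# Inference: weights rewritten rarely, say one full 4 TB model refresh per week
print(lifetime_years(4, 50_000, 4 / 7))    # centuries; endurance is a non-issue
# A training-style constant 100 TB/day of writes is a different story
print(lifetime_years(4, 50_000, 100))      # single-digit years
```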
 
Joined
Jan 3, 2021
Messages
3,851 (2.55/day)
Location
Slovenia
It depends, is the AI model being constantly loaded or can it simply stay in memory?

If they are using flash here it may be non-volatile which could make it quite flexible.
Non-volatility means little, or nothing, in this kind of application. The processors will crunch vectors and matrices without interruption until they're too old and can't make enough money anymore. (Well, low power and sleep states probably exist too, since all processors can't be fully loaded all of the time.)

Also, that new technology could affect progress of CXL Memory Expanders ( very expensive stuff right now ).
I don't see a close connection. CXL is PCIe, which is up to 16 lanes of Gen 6 (maybe soon in AI datacenters) or Gen 7 (a few years out). That's infinitely slower than several stacks of on-package HBM/HBF optimised for maximum bandwidth and maximum cost.
 
Joined
Jul 13, 2016
Messages
3,505 (1.12/day)
Non-volatility means little, or nothing, in this kind of application. The processors will crunch vectors and matrices without interruption until they're too old and can't make enough money anymore. (Well, low power and sleep states probably exist too, since all processors can't be fully loaded all of the time.)

Non-volatile memory yields power and cost savings. There are dozens of articles on the topic: https://www.embedded.com/the-benefit-of-non-volatile-memory-nvm-for-edge-ai/


It allows you to take fetches that would otherwise go to main system memory or mass storage and put them right on the chip. This lowers latency and power consumption. In addition, flash doesn't need to be constantly refreshed when not actively in use, so you can very aggressively power tune it. This is simply not possible with volatile memory that needs to be refreshed to maintain data.

I believe LabRat 891 put it perfectly, it makes sense as another layer in the memory subsystem designed to hold a specific set of data and the overall workload will see a very nice benefit as a result.
 
Joined
Oct 8, 2015
Messages
783 (0.23/day)
Location
Earth's Troposphere
System Name 3 "rigs"-gaming/spare pc/cruncher
Processor R7-5800X3D/i7-7700K/R9-7950X
Motherboard Asus ROG Crosshair VI Extreme/Asus Ranger Z170/Asus ROG Crosshair X670E-GENE
Cooling Bitspower monoblock ,custom open loop,both passive and active/air tower cooler/air tower cooler
Memory 32GB DDR4/32GB DDR4/64GB DDR5
Video Card(s) Gigabyte RX6900XT Alphacooled/AMD RX5700XT 50th Aniv./SOC(onboard)
Storage mix of sata ssds/m.2 ssds/mix of sata ssds+an m.2 ssd
Display(s) Dell UltraSharp U2410 , HP 24x
Case mb box/Silverstone Raven RV-05/CoolerMaster Q300L
Audio Device(s) onboard/onboard/onboard
Power Supply 3 Seasonics, a DeltaElectronics, a FractalDesing
Mouse various/various/various
Keyboard various wired and wireless
VR HMD -
Software W10.someting or another,all 3
It can operate at what bandwidth, dear sir? Bandwidth is in its name, after all.
1 bit per second, arbitrary value.
 
Joined
Mar 21, 2016
Messages
2,586 (0.79/day)
If it isn't readily serviceable and replaceable, the NAND seems like a serious e-waste concern for the rest of the hardware if it degrades too quickly. It might be acceptable for AI depending on longevity, but probably not so much otherwise.
 
Joined
Jan 3, 2021
Messages
3,851 (2.55/day)
Location
Slovenia
Non-volatile memory yields power and cost savings. There are dozens of articles on the topic: https://www.embedded.com/the-benefit-of-non-volatile-memory-nvm-for-edge-ai/


It allows you to take fetches that would otherwise go to main system memory or mass storage and put them right on the chip. This lowers latency and power consumption. In addition, flash doesn't need to be constantly refreshed when not actively in use, so you can very aggressively power tune it. This is simply not possible with volatile memory that needs to be refreshed to maintain data.

I believe LabRat 891 put it perfectly, it makes sense as another layer in the memory subsystem designed to hold a specific set of data and the overall workload will see a very nice benefit as a result.
I don't disagree, NAND does have some advantages, but non-volatility by itself is not important unless and until power goes out. A theoretical volatile NAND with extremely low idle power (similar to SRAM) would do this job just as well, that's my point.

Another reminder that it was dumb to kill off XPoint; it would have been as in demand as HBM for the last two years
We can't be sure it's dead. The development continues somewhere deep under the ground and will continue until all patents expire. TI (or whoever) may succeed in developing a method to expand those 4 layers to 100+ ... but it's not a given.

It can operate at what bandwidth, dear sir? Bandwidth is in its name, after all.
1 bit per second, arbitrary value.
HBM sends the data around at about 6400 MT/s, and NAND does it at 3200 MT/s. So, as a quick estimate, half of HBM's bandwidth would be possible with the technology we already have.
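That estimate can be checked with simple per-pin math (the bus width is an assumption carried over from HBM; SanDisk hasn't published HBF interface figures):

```python
# Per-stack bandwidth = transfer rate x interface width.
# HBF's interface width is unannounced; assume an HBM-like 1024 bits.
def stack_bw_gbs(mt_per_s: int, bus_bits: int) -> float:
    """Peak bandwidth in GB/s for one stack."""
    return mt_per_s * 1e6 * (bus_bits / 8) / 1e9

hbm = stack_bw_gbs(6400, 1024)   # ~819 GB/s, close to real HBM3 stack figures
hbf = stack_bw_gbs(3200, 1024)   # half that, per the estimate above
print(hbm, hbf)
```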
 
Joined
Jul 5, 2013
Messages
29,237 (6.88/day)
SanDisk Develops HBM Killer: High-Bandwidth Flash (HBF)
Wait, what?...
While HBF surpasses HBM's capacity specifications, it maintains higher latency than DRAM, limiting its application to specific workloads.
So this is NAND... HBM is DRAM...

NOT an HBM killer. That was a very click-bait headline. Come on peeps, TPU is better than that crap..
 

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,798 (1.02/day)
Wait, what?...

So this is NAND... HBM is DRAM...

NOT an HBM killer. That was a very click-bait headline. Come on peeps, TPU is better than that crap..
For AI workloads it's an HBM killer (despite not being the same tech fundamentally). Imagine you load an entire model on a single GPU. You don't need top-tier low latency.
 
Joined
Mar 21, 2016
Messages
2,586 (0.79/day)
If they can somehow do this on an M.2 attached to a GPU and get similar results it would be great. If it's just replacing volatile VRAM with NAND, with questionable endurance, maybe not as exciting. From a business standpoint it could still make a lot of sense if the economics work out profitably.
 
Joined
Jul 5, 2013
Messages
29,237 (6.88/day)
For AI workloads it's an HBM killer (despite not being the same tech fundamentally). Imagine you load an entire model on a single GPU. You don't need top-tier low latency.
While that's a fair point, I was referring to the durability factor. NAND wears out, and under these kinds of loads, would wear out swiftly. This is fact and cannot be argued. DRAM does not wear out.

That was my point. For that reason alone, HBF is NOT an HBM killer. Until we have a major breakthrough in NAND flash durability it will not change. All SanDisk has done is create mildly and temporarily useful e-waste.
 
Joined
Jun 22, 2012
Messages
322 (0.07/day)
Processor Intel i7-12700K
Motherboard MSI PRO Z690-A WIFI
Cooling Noctua NH-D15S
Memory Corsair Vengeance 4x16 GB (64GB) DDR4-3600 C18
Video Card(s) MSI GeForce RTX 3090 GAMING X TRIO 24G
Storage Samsung 980 Pro 1TB, SK hynix Platinum P41 2TB
Case Fractal Define C
Power Supply Corsair RM850x
Mouse Logitech G203
Software openSUSE Tumbleweed
They're already saying it's for AI inference, i.e. mostly read-centric workloads where most of the bandwidth utilization is reading model weights (in the hundreds of gigabytes to few terabytes range). Nothing prohibits hardware manufacturers from putting VRAM or HBM alongside the HBF for memory content that needs to be frequently modified (mainly the key-value cache during token generation).
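The split that post describes — static weights in HBF, the frequently rewritten key-value cache in DRAM/HBM — can be sized roughly (the model dimensions below are illustrative, not any specific model's):

```python
# Rough KV-cache size for one sequence:
# 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes/element.
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per: int = 2) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per / 1024**3

# Illustrative 70B-class shape with grouped-query attention, 128k context
print(kv_cache_gib(layers=80, kv_heads=8, head_dim=128, seq_len=128_000))
```

Tens of GiB of constantly rewritten cache next to terabytes of read-mostly weights is exactly the split where a small HBM pool plus a large HBF pool would make sense.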
 

Joined
Jul 5, 2013
Messages
29,237 (6.88/day)
They're already saying it's for AI inference, i.e. mostly read-centric workloads where most of the bandwidth utilization is reading model weights (in the hundreds of gigabytes to few terabytes range).
While that is a reasonable point, NAND flash simply doesn't have the durability to be useful long term in such a way.
Nothing prohibits hardware manufacturers from putting VRAM or HBM alongside the HBF for memory content that needs to be frequently modified (mainly the key-value cache during token generation).
Another reasonable point, however, that was not the claim made in the above article.
 