
SanDisk Develops HBM Killer: High-Bandwidth Flash (HBF) Allows 4 TB of VRAM for AI GPUs

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,786 (1.02/day)
During its first post-Western Digital spinoff investor day, SanDisk showed something it has been working on to tackle the AI sector. High-bandwidth flash (HBF) is a new memory architecture that combines 3D NAND flash storage with bandwidth capabilities comparable to high-bandwidth memory (HBM). The HBF design stacks 16 3D NAND BiCS8 dies using through-silicon vias, with a logic layer enabling parallel access to memory sub-arrays. This configuration achieves 8 to 16 times greater capacity per stack than current HBM implementations. A system using eight HBF stacks can provide 4 TB of VRAM to store large AI models like GPT-4 directly on GPU hardware. The architecture breaks from conventional NAND design by implementing independently accessible memory sub-arrays, moving beyond traditional multi-plane approaches. While HBF surpasses HBM's capacity specifications, it maintains higher latency than DRAM, limiting its application to specific workloads.
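The headline figures can be sanity-checked with quick arithmetic; note the implied HBM baseline below is an inference from the article's numbers, not a SanDisk statement:

```python
# Sanity check of the article's capacity figures: 8 HBF stacks -> 4 TB of VRAM.
stacks = 8
total_tb = 4
per_stack_gb = total_tb * 1024 / stacks        # 512 GB per HBF stack
# The claimed 8-16x capacity advantage then implies an HBM baseline of:
hbm_low_gb = per_stack_gb / 16                 # 32 GB per HBM stack
hbm_high_gb = per_stack_gb / 8                 # 64 GB per HBM stack
print(per_stack_gb, hbm_low_gb, hbm_high_gb)   # 512.0 32.0 64.0
```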

SanDisk has not disclosed its solution for NAND's inherent write endurance limitations, though using pSLC (pseudo-SLC) NAND makes it possible to balance durability and cost. The bandwidth of HBF is also unknown, as the company has not yet released details. SanDisk Memory Technology Chief Alper Ilkbahar confirmed the technology targets read-intensive AI inference tasks rather than latency-sensitive applications. The company is developing HBF as an open standard, incorporating mechanical and electrical interfaces similar to HBM to simplify integration. Some challenges remain, including NAND's block-level addressing limitations and write endurance constraints. While these factors make HBF unsuitable for gaming applications, the technology's high capacity and throughput characteristics align with AI model storage and inference requirements. SanDisk has announced plans for three generations of HBF development, indicating a long-term commitment to the technology.



View at TechPowerUp Main Site | Source
 
Joined
Jan 3, 2021
Messages
3,824 (2.54/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
Some challenges remain, including NAND's block-level addressing limitations and write endurance constraints.
It's actually page-level reading/writing and block-level erasing, where a page is typically 4 kibibytes (KiB) and a block is a few megabytes. However, the architecture of HBF seems to be a lot different, and the page size may also be smaller (or larger) if SanDisk thinks it's better for the purpose.
Even DRAM is far from being byte-addressable; the smallest unit of transfer is 64 bytes in DDR, 32 bytes in HBM3, and I think it's the same in HBM3E.
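The granularity point can be made concrete with a small sketch. The 64 B (DDR burst) and 32 B (HBM3) figures are from the post above; the 4 KiB page is the conventional NAND number, not an HBF spec:

```python
import math

def bytes_transferred(request_bytes: int, granule: int) -> int:
    """Round a request up to whole minimum-access units."""
    return math.ceil(request_bytes / granule) * granule

request = 100  # bytes the processor actually wants
for name, granule in [("DDR burst", 64), ("HBM3 access", 32), ("NAND page", 4096)]:
    moved = bytes_transferred(request, granule)
    print(f"{name:12s}: {moved:5d} B moved ({moved / request:.1f}x amplification)")
```

A 100-byte request moves 128 B over DDR or HBM3, but a full 4,096 B from a conventional NAND page, which is why sub-array access granularity matters so much for HBF.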
 
Joined
Nov 26, 2021
Messages
1,824 (1.55/day)
Location
Mississauga, Canada
Processor Ryzen 7 5700X
Motherboard ASUS TUF Gaming X570-PRO (WiFi 6)
Cooling Noctua NH-C14S (two fans)
Memory 2x16GB DDR4 3200
Video Card(s) Reference Vega 64
Storage Intel 665p 1TB, WD Black SN850X 2TB, Crucial MX300 1TB SATA, Samsung 830 256 GB SATA
Display(s) Nixeus NX-EDG27, and Samsung S23A700
Case Fractal Design R5
Power Supply Seasonic PRIME TITANIUM 850W
Mouse Logitech
VR HMD Oculus Rift
Software Windows 11 Pro, and Ubuntu 20.04
It's actually page-level reading/writing and block-level erasing, where a page is typically 4 kibibytes (KiB) and a block is a few megabytes. However, the architecture of HBF seems to be a lot different, and the page size may also be smaller (or larger) if SanDisk thinks it's better for the purpose.
Even DRAM is far from being byte-addressable; the smallest unit of transfer is 64 bytes in DDR, 32 bytes in HBM3, and I think it's the same in HBM3E.
It isn't just the granularity of transfer. DRAM has unlimited endurance; this, on the other hand, is unlikely to be much better than SLC.
 
Joined
Oct 18, 2013
Messages
6,416 (1.55/day)
Location
So close that even your shadow can't see me !
System Name The Little One
Processor i5-11320H @4.4GHZ
Motherboard AZW SEI
Cooling Fan w/heat pipes + side & rear vents
Memory 64GB Crucial DDR4-3200 (2x 32GB)
Video Card(s) Iris XE
Storage WD Black SN850X 8TB m.2, Seagate 2TB SSD + SN850 8TB x2 in an external enclosure
Display(s) 2x Samsung 43" & 2x 32"
Case Practically identical to a mac mini, just purrtier in slate blue, & with 3x usb ports on the front !
Audio Device(s) Yamaha ATS-1060 Bluetooth Soundbar & Subwoofer
Power Supply 65w brick
Mouse Logitech MX Master 2
Keyboard Logitech G613 mechanical wireless
VR HMD Whahdatiz ???
Software Windows 10 pro, with all the unnecessary background shitzu turned OFF !
Benchmark Scores PDQ
It's actually page-level reading/writing and block-level erasing, where a page is typically 4 kibibytes (KiB) and a block is a few megabytes. However, the architecture of HBF seems to be a lot different, and the page size may also be smaller (or larger) if SanDisk thinks it's better for the purpose.
Even DRAM is far from being byte-addressable; the smallest unit of transfer is 64 bytes in DDR, 32 bytes in HBM3, and I think it's the same in HBM3E.
It isn't just the granularity of transfer. DRAM has unlimited endurance; this, on the other hand, is unlikely to be much better than SLC.
Soooo...what ya'll are saying is that there won't be any 4TB, $25K GPU's for us gamrz to drool over, at least not for a while anyways ?

Aw so sad :D


n.O.t.....
 
Joined
Dec 29, 2020
Messages
228 (0.15/day)
I assume it is mostly meant for large AI models, which require quite a lot of VRAM to run. Performance as memory will not be great, but if it's performant enough, with the DRAM on top, it may very well be good enough.
If so, it's a good development to bring costs down for these.
 
Joined
Jul 28, 2020
Messages
8 (0.00/day)
Previous attempts to use non-RAM as RAM failed.
The most famous one was Intel/Micron Optane/3D XPoint.
It doesn't seem that this one will do any better.
 
Joined
Jan 3, 2021
Messages
3,824 (2.54/day)
Location
Slovenia
Soooo...what ya'll are saying is that there won't be any 4TB, $25K GPU's for us gamrz to drool over, at least not for a while anyways ?

Aw so sad :D


n.O.t.....
The trend is obviously towards 8GB, $25K GPUs, but the frog tastes best if cooked slowly.

Correct me if I'm wrong... but AI models would rapidly degrade NAND storage, making it impractical for long-term use, since NAND ( when used as VRAM) requires continuous high-frequency read and write operations.
The models would only be updated occasionally, that's the idea, so writing wouldn't be much of a problem. But limited read endurance is also sometimes hinted at; I don't know how much research has been done around read degradation (read disturb), or whether it's relevant here. Anyway, processing needs RAM too, and a couple hundred MB of static RAM cache can't suffice for that, so inevitably some HBM will be part of the system too.
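Putting rough numbers on "updated occasionally": the 50,000 P/E figure below is a generic pSLC-class ballpark, not a SanDisk number, so treat this as an order-of-magnitude sketch only.

```python
# Rough write-endurance estimate for occasional model updates.
# Assumption (not a SanDisk spec): pSLC-class NAND endures ~50,000 P/E cycles.
PE_CYCLES = 50_000
UPDATES_PER_DAY = 1          # one full-model rewrite per day
years = PE_CYCLES / (UPDATES_PER_DAY * 365)
print(f"~{years:.0f} years of daily full rewrites")   # ~137 years
```

Even daily full rewrites wouldn't come close to exhausting pSLC-class endurance, which supports the "write occasionally, read constantly" usage model.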
 
Joined
Apr 18, 2019
Messages
2,484 (1.17/day)
Location
Olympia, WA
System Name Sleepy Painter
Processor AMD Ryzen 5 3600
Motherboard Asus TuF Gaming X570-PLUS/WIFI
Cooling FSP Windale 6 - Passive
Memory 2x16GB F4-3600C16-16GVKC @ 16-19-21-36-58-1T
Video Card(s) MSI RX580 8GB
Storage 2x Samsung PM963 960GB nVME RAID0, Crucial BX500 1TB SATA, WD Blue 3D 2TB SATA
Display(s) Microboard 32" Curved 1080P 144hz VA w/ Freesync
Case NZXT Gamma Classic Black
Audio Device(s) Asus Xonar D1
Power Supply Rosewill 1KW on 240V@60hz
Mouse Logitech MX518 Legend
Keyboard Red Dragon K552
Software Windows 10 Enterprise 2019 LTSC 1809 17763.1757
This HBF looks 'useful', but not on its own.
Inb4 tiered memory standards for Compute/Graphics?

Top: L1-3 caches
Upper: HBM
Lower: HBF
Bottom: NAND

Stack it 'till it's cheap :laugh:
 
Joined
Feb 18, 2005
Messages
6,088 (0.83/day)
Location
Ikenai borderline!
System Name Firelance.
Processor Threadripper 3960X
Motherboard ROG Strix TRX40-E Gaming
Cooling IceGem 360 + 6x Arctic Cooling P12
Memory 8x 16GB Patriot Viper DDR4-3200 CL16
Video Card(s) MSI GeForce RTX 4060 Ti Ventus 2X OC
Storage 2TB WD SN850X (boot), 4TB Crucial P3 (data)
Display(s) Dell S3221QS(A) (32" 38x21 60Hz) + 2x AOC Q32E2N (32" 25x14 75Hz)
Case Enthoo Pro II Server Edition (Closed Panel) + 6 fans
Power Supply Fractal Design Ion+ 2 Platinum 760W
Mouse Logitech G604
Keyboard Razer Pro Type Ultra
Software Windows 10 Professional x64
Yeah, no. NAND flash is not RAM, it is designed for entirely different usage patterns, and the notion that it could be used as a replacement for RAM is nonsensical. Considering GPUs already have effectively direct access to storage via APIs like DirectStorage, I see no use-case for this technology.
 
Joined
Jul 13, 2016
Messages
3,495 (1.11/day)
Processor Ryzen 7800X3D
Motherboard ASRock X670E Taichi
Cooling Noctua NH-D15 Chromax
Memory 32GB DDR5 6000 CL30
Video Card(s) MSI RTX 4090 Trio
Storage P5800X 1.6TB 4x 15.36TB Micron 9300 Pro 4x WD Black 8TB M.2
Display(s) Acer Predator XB3 27" 240 Hz
Case Thermaltake Core X9
Audio Device(s) JDS Element IV, DCA Aeon II
Power Supply Seasonic Prime Titanium 850w
Mouse PMM P-305
Keyboard Wooting HE60
VR HMD Valve Index
Software Win 10
Correct me if I'm wrong... but AI models would rapidly degrade NAND storage, making it impractical for long-term use, since NAND ( when used as VRAM) requires continuous high-frequency read and write operations.

It depends, is the AI model being constantly loaded or can it simply stay in memory?

If they are using flash here it may be non-volatile which could make it quite flexible.
 
Joined
Apr 18, 2019
Messages
2,484 (1.17/day)
Location
Olympia, WA
It depends, is the AI model being constantly loaded or can it simply stay in memory?

If they are using flash here it may be non-volatile which could make it quite flexible.
Looking towards 'applications' in a given 'product',
maybe, parts of a Model can better utilize different kinds of storage?

I'm thinking:
"Working memory" Cache, RAM, HBM.
"Short-Term Memory" HBF, XLflash, phase-change memory, massively parallelized (p)SLC NAND.
"Long-Term Memory" TLC and QLC NAND.
"Archival Memory" HDDs and Magnetic Tape.
 
Joined
Jan 2, 2019
Messages
187 (0.08/day)
Correct me if I'm wrong... but AI models would rapidly degrade NAND storage, making it impractical for long-term use, since NAND ( when used as VRAM) requires continuous high-frequency read and write operations.

There are two cases: training (write ops) and inference (read ops, where they intend to use HBF). Its overall endurance depends on terabytes written (TBW).

Also, that new technology could affect the progress of CXL memory expanders (very expensive stuff right now). 4 TB inside a GPU is a lot of memory for processing!
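The TBW arithmetic is just capacity times rated P/E cycles. The figures below are illustrative only: a 512 GB stack (the article's 4 TB over 8 stacks) and generic P/E ratings, not vendor specs:

```python
# TBW = capacity x rated P/E cycles. Illustrative figures, not vendor specs.
capacity_tb = 0.512   # one 512 GB HBF stack
pe_ratings = {"TLC-class": 3_000, "pSLC-class": 50_000}
tbw = {name: capacity_tb * pe for name, pe in pe_ratings.items()}
for name, tb in tbw.items():
    print(f"{name}: ~{tb:,.0f} TB written over rated life")
```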
 
Joined
Jan 3, 2021
Messages
3,824 (2.54/day)
Location
Slovenia
It depends, is the AI model being constantly loaded or can it simply stay in memory?

If they are using flash here it may be non-volatile which could make it quite flexible.
Non-volatility means little, or nothing, in this kind of application. The processors will crunch vectors and matrices without interruption until they're too old and can't make enough money anymore. (Well, low-power and sleep states probably exist too, since not all processors can be fully loaded all of the time.)

Also, that new technology could affect progress of CXL Memory Expanders ( very expensive stuff right now ).
I don't see a close connection. CXL runs over PCIe, which is at most 16 lanes of Gen 6 (maybe soon in AI datacenters) or Gen 7 (a few years out). That's orders of magnitude slower than several stacks of on-package HBM/HBF optimised for maximum bandwidth and maximum cost.
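The gap can be ballparked from raw link rates, ignoring encoding and protocol overhead. HBF bandwidth is unpublished, so HBM3E stands in for the on-package side here; treat all of this as rough public figures, not measured numbers:

```python
# Raw bandwidth ballpark, ignoring encoding/protocol overhead.
pcie6_x16_bps = 64e9 * 16 / 8       # PCIe 6.0: 64 GT/s x 16 lanes ~ 128 GB/s per direction
hbm3e_stack_bps = 9.6e9 * 1024 / 8  # HBM3E: 9.6 Gb/s pins x 1024-bit bus ~ 1.23 TB/s
stacks = 8
ratio = hbm3e_stack_bps * stacks / pcie6_x16_bps
print(f"PCIe 6.0 x16  : {pcie6_x16_bps / 1e9:.0f} GB/s")
print(f"8 HBM3E stacks: {hbm3e_stack_bps * stacks / 1e12:.1f} TB/s")
print(f"ratio         : ~{ratio:.0f}x")
```

Roughly two orders of magnitude, which is why on-package stacks and CXL expanders serve different tiers rather than competing directly.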
 