
Apple Wants to Store LLMs on Flash Memory to Bring AI to Smartphones and Laptops

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,725 (1.01/day)
Apple has been experimenting with the Large Language Models (LLMs) that power most of today's AI applications. The company wants to serve these LLMs to users efficiently, which is difficult because they require a lot of resources, including compute and memory. Traditionally, LLMs have required AI accelerators paired with large DRAM capacity to hold the model weights. However, Apple has published a paper that aims to bring LLMs to devices with limited memory capacity. The method stores the LLM on NAND flash memory (regular storage) and constructs an inference cost model that matches flash memory behavior, guiding optimization in two critical areas: reducing the volume of data transferred from flash, and reading data in larger, more contiguous chunks. Instead of keeping all the model weights in DRAM, Apple wants to keep them in flash and pull them into DRAM only on demand, when they are needed.

Two principal techniques are introduced within this flash memory-informed framework: "windowing" and "row-column bundling." Together, these methods enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches on CPU and GPU, respectively. Integrating sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for practical LLM inference on devices with limited memory, such as SoCs with 8/16/32 GB of available DRAM. Especially with DRAM costing more per gigabyte than NAND flash, setups such as smartphone configurations could store and run inference on multi-billion-parameter LLMs even if the available DRAM isn't sufficient to hold them. For a more technical deep dive, read the paper on arXiv.
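
As a rough illustration of the windowing idea (not Apple's implementation), the Python sketch below keeps in DRAM only the neurons that were active for the last few tokens and reads from flash only the ones that are newly needed. The class name `WindowedNeuronCache`, the `window` parameter, and the integer neuron ids are all hypothetical, chosen just for this example.

```python
from collections import deque


class WindowedNeuronCache:
    """Toy model of "windowing" (assumed names, not the paper's code).

    DRAM holds the union of the neuron sets that were active for the
    last `window` tokens. Only neurons missing from DRAM are "read"
    from flash when a new token arrives.
    """

    def __init__(self, window):
        self.window = window
        self.history = deque()  # active-neuron sets for the most recent tokens
        self.resident = set()   # neuron ids currently held in DRAM

    def step(self, active):
        """Process one token and return the neuron ids to read from flash."""
        active = set(active)
        to_load = active - self.resident  # everything else is already in DRAM
        self.history.append(active)
        if len(self.history) > self.window:
            self.history.popleft()        # the oldest token leaves the window
        # DRAM now holds exactly the neurons used inside the window.
        self.resident = set().union(*self.history)
        return to_load


cache = WindowedNeuronCache(window=2)
for token_neurons in ({1, 2, 3}, {2, 3, 4}, {3, 4, 5}):
    print(sorted(cache.step(token_neurons)))
```

Because consecutive tokens tend to activate overlapping sets of neurons, each step after the first loads only the difference, which is the effect the article describes as reducing the volume of data transferred from flash.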



 
Joined
Nov 13, 2007
Messages
10,919 (1.74/day)
Location
Austin Texas
System Name stress-less
Processor 9800X3D @ 5.42GHZ
Motherboard MSI PRO B650M-A Wifi
Cooling Thermalright Phantom Spirit EVO
Memory 64GB DDR5 6000 1:1 CL30-36-36-96 FCLK 2000
Video Card(s) RTX 4090 FE
Storage 2TB WD SN850, 4TB WD SN850X
Display(s) Alienware 32" 4k 240hz OLED
Case Jonsbo Z20
Audio Device(s) Yes
Power Supply RIP Corsair SF750... Waiting for SF1000
Mouse DeathadderV2 X Hyperspeed
Keyboard 65% HE Keyboard
Software Windows 11
Benchmark Scores They're pretty good, nothing crazy.
A local company here already does something like this, and it seems to work fairly well. Flash cells can hold a range of charge values, which makes them good for storing pre-baked AI neurons.

Power-efficient analog compute for edge AI - Mythic

Basically, they're using the flash as an analog computer.
 

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,725 (1.01/day)
What Mythic does is processing in memory. Apple just uses flash to store the weights and only pulls them into DRAM when needed. It is not processing in flash.
 
Joined
Nov 13, 2007
Messages
10,919 (1.74/day)
I think it's the same thing; they just call it "analog computing," but they still have a digital processor to interpret the data from the model and make it usable.

"Each tile has a large Analog Compute Engine (Mythic ACE™) to store bulky neural network weights, local SRAM memory for data being passed between the neural network nodes, a single-instruction multiple-data (SIMD) unit for processing operations not handled by the ACE"
 

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,725 (1.01/day)
It is not the same thing as ACE, which actually processes the data in memory. What Apple is proposing still keeps the processing and memory elements separate.
 
Joined
Nov 13, 2007
Messages
10,919 (1.74/day)
But if you read what they're saying it does... it actually doesn't process anything: "Analog Compute Engine (Mythic ACE™) to store bulky neural network weights," then it passes to SRAM, then to the SIMD. I'm sure it's more complicated than that, and there is data processing and in-memory computing functionality in the architecture... but it seems like the concept of storing neural weights in NAND is something that makes sense for AI at the edge.
 

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,725 (1.01/day)
"Our analog compute takes compute-in-memory to an extreme, where we compute directly inside the memory array itself. This is possible by using the memory elements as tunable resistors, supplying the inputs as voltages, and collecting the outputs as currents. We use analog computing for our core neural network matrix operations, where we are multiplying an input vector by a weight matrix"
It actually performs matrix multiply-accumulate (MAC) operations, which are integral to the neural network operation flow. The resulting data is then pushed further. :)
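
One way to write down what the quoted passage describes, assuming ideal resistive cells: treat each memory cell as a conductance, apply the inputs as voltages, and sum the currents on each shared output line. The symbols $G_{ij}$ (cell conductance, i.e. the stored weight), $V_i$ (input voltage), and $I_j$ (output current) are my own notation, not Mythic's.

```latex
% Ohm's law gives the current through one cell; Kirchhoff's current law
% sums the cell currents on output line j.
I_j = \sum_{i} G_{ij} \, V_i
% In matrix form this is a vector-matrix product, i.e. one MAC per cell:
\mathbf{I} = G^{\top} \mathbf{V}
```

Under that idealization the multiply and the accumulate both happen inside the array, which is the difference from simply reading weights out of flash into DRAM.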
 
Joined
Nov 13, 2007
Messages
10,919 (1.74/day)
Fascinating :toast:.
 

bug

Joined
May 22, 2015
Messages
13,942 (3.95/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
Keep in mind that it's training LLMs that eats a ton of resources, not running the LLMs themselves. It's no different from "ordinary" neural networks.
 

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,725 (1.01/day)
Those are cloud resources; all we care about is local resources at our fingertips! ;)
 

bug

Joined
May 22, 2015
Messages
13,942 (3.95/day)
I was just making the distinction, lest we get people excited about how Apple is able to put a whole cloud onto a magic Apple stick...
 
Joined
Nov 26, 2021
Messages
1,730 (1.50/day)
Location
Mississauga, Canada
Processor Ryzen 7 5700X
Motherboard ASUS TUF Gaming X570-PRO (WiFi 6)
Cooling Noctua NH-C14S (two fans)
Memory 2x16GB DDR4 3200
Video Card(s) Reference Vega 64
Storage Intel 665p 1TB, WD Black SN850X 2TB, Crucial MX300 1TB SATA, Samsung 830 256 GB SATA
Display(s) Nixeus NX-EDG27, and Samsung S23A700
Case Fractal Design R5
Power Supply Seasonic PRIME TITANIUM 850W
Mouse Logitech
VR HMD Oculus Rift
Software Windows 11 Pro, and Ubuntu 20.04
But if you read what they're saying it does... it actually doesn't process anything: "Analog Compute Engine (Mythic ACE™) to store bulky neural network weights," then it passes to SRAM, then to the SIMD. I'm sure it's more complicated than that, and there is data processing and in-memory computing functionality in the architecture... but it seems like the concept of storing neural weights in NAND is something that makes sense for AI at the edge.
After reading the paper, this is a clever approach to minimize the amount of data transferred from NAND to DRAM by taking advantage of the sparsity of common LLMs such as GPT-3 and OPT.

Keep in mind that it's training LLMs that eats a ton of resources, not running the LLMs themselves. It's no different from "ordinary" neural networks.
While training is the most resource-intensive part by definition, using the model for inference hundreds of millions or even billions of times will eventually exceed the cost of training, since training is a one-time cost. Therefore, reducing the cost of inference is beneficial too. It also allows more inference to be done on the client side.

Note: The cost of training GPT-3 is estimated at 3.1 × 10^23 FLOPs, while the cost of a single inference operation is estimated at 740 TFLOPs (7.4 × 10^14 floating-point operations). After roughly 420 million inferences, the cumulative cost of inference will surpass that of training.
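
For anyone who wants to check the arithmetic, the break-even count is just the ratio of the two estimates above. The symbol names are mine; the figures are the ones quoted in the note.

```latex
N_{\text{break-even}}
  = \frac{C_{\text{train}}}{C_{\text{inference}}}
  = \frac{3.1 \times 10^{23}}{7.4 \times 10^{14}}
  \approx 4.2 \times 10^{8}
```

That is about 420 million inferences, under the assumption that every inference costs the same 740 TFLOPs.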
 

bug

Joined
May 22, 2015
Messages
13,942 (3.95/day)
I'm not sure about LLMs, but for NNs, inference is almost zero cost. A bunch of additions and multiplications that will barely register on any recent CPU.
The biggest part in this would be moving it from the cloud to the client. Even then, it's not that different from what JS did when it started offloading processing to the client (browser). It's a welcome option, but really nothing to rave about.
 
Joined
Nov 26, 2021
Messages
1,730 (1.50/day)
In isolation, the cost of a single inference is minuscule compared to the training cost. However, the volume of queries per day (10 million for ChatGPT in early 2023) is enough to ensure that the cumulative inference cost for a trained GPT-3 model will have surpassed the training cost in 42 days.
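
A quick sketch of where the 42 days comes from, using the estimates quoted earlier in the thread (3.1 × 10^23 FLOPs for training, 740 TFLOPs per inference) and the 10 million queries per day figure. The constant and function names are hypothetical, and the model assumes every query costs the same.

```python
# Estimates quoted in this thread; treat them as rough figures, not measurements.
TRAINING_FLOP = 3.1e23        # total floating-point operations to train GPT-3
FLOP_PER_INFERENCE = 740e12   # floating-point operations per inference
QUERIES_PER_DAY = 10_000_000  # ChatGPT volume in early 2023, per the post above


def break_even_days(training_flop, flop_per_inference, queries_per_day):
    """Days until cumulative inference cost equals the one-time training cost."""
    daily_inference_flop = flop_per_inference * queries_per_day
    return training_flop / daily_inference_flop


days = break_even_days(TRAINING_FLOP, FLOP_PER_INFERENCE, QUERIES_PER_DAY)
print(f"Break-even after about {days:.0f} days")
```

Changing any of the three inputs shifts the break-even point proportionally, so the result is only as good as the estimates that go in.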
 
Joined
Nov 9, 2022
Messages
39 (0.05/day)
This is where I insert the (kidding) comment that this is the first time Apple wanted to sell devices with more memory. ;^)

(Disclaimer: I use an iPhone, and I have no personal issues with Macs).
 
Joined
Apr 13, 2022
Messages
1,245 (1.23/day)
Oh they want to sell devices with more memory! It's what they want to charge for it that's been the issue ;)
 

bug

Joined
May 22, 2015
Messages
13,942 (3.95/day)
In isolation, the cost of a single inference is minuscule compared to the training cost. However, the volume of queries per day (10 million for ChatGPT in early 2023) is enough to ensure that the cumulative inference cost for a trained GPT-3 model will have surpassed the training cost in 42 days.
Right. And doubling the inference speed would net you what? 84 days before you surpass it? That's why I said the big thing here is moving inference to the client. In my mind, that takes precedence over the speed of the inference itself.

Naive is not the right term author want to use I guess ;)
It really is. That's the common term in software engineering for an implementation that aims for nothing more than proving that some concept works.
Designing data structures around a particular set of problems is nothing new either.
 
Joined
Jan 3, 2021
Messages
3,712 (2.51/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
The most surprising part of this story is how much technical information Apple is willing to share with the world.
 
Joined
Aug 21, 2013
Messages
1,980 (0.47/day)
The company that uses soldered, low-grade NAND that is not properly parallelized wants to put write-intensive stuff on NAND? Yeah, I'm sure that will work out well...
 
Joined
Jan 11, 2022
Messages
1,008 (0.91/day)
Should have bought into Intel and Micron's 3D XPoint.
 
Joined
Jan 10, 2011
Messages
1,462 (0.29/day)
Location
[Formerly] Khartoum, Sudan.
System Name 192.168.1.1~192.168.1.100
Processor AMD Ryzen5 5600G.
Motherboard Gigabyte B550m DS3H.
Cooling AMD Wraith Stealth.
Memory 16GB Crucial DDR4.
Video Card(s) Gigabyte GTX 1080 OC (Underclocked, underpowered).
Storage Samsung 980 NVME 500GB && Assortment of SSDs.
Display(s) ViewSonic VA2406-MH 75Hz
Case Bitfenix Nova Midi
Audio Device(s) On-Board.
Power Supply SeaSonic CORE GM-650.
Mouse Logitech G300s
Keyboard Kingston HyperX Alloy FPS.
VR HMD A pair of OP spectacles.
Software Ubuntu 24.04 LTS.
Benchmark Scores Me no know English. What bench mean? Bench like one sit on?
Apple discovers asset streaming...
 

bug

Joined
May 22, 2015
Messages
13,942 (3.95/day)
The company that uses soldered, low-grade NAND that is not properly parallelized wants to put write-intensive stuff on NAND? Yeah, I'm sure that will work out well...
LLMs are not write intensive. At least not past their training stage. They may add your conversations to their existing knowledge base, but other than that, they're pretty much read-only.

But again, offloading server processing to clients and finding more efficient ways to represent data is done routinely throughout the industry. It only becomes remarkable when Apple does it... And they're not even the first to do it; afaik, Mozilla was the first to announce they want to let you use AI on your local machine (among other things).
 
Joined
Jan 3, 2021
Messages
3,712 (2.51/day)
Apple is actually right: SSDs are three times cheaper per terabyte than even the cheapest peasant DDR4-2400. I think that A.666 form factor is to blame for the cheapness.

 
Joined
May 26, 2023
Messages
109 (0.18/day)
It really is. That's the common term in software engineering for an implementation that aims for nothing more than proving that some concept works.
Designing data structures around a particular set of problems is nothing new either.
I'm not sure you're commenting on the right term, really. "Native" fits the context imho, but "naive" does not.
 