Thursday, December 21st 2023

Apple Wants to Store LLMs on Flash Memory to Bring AI to Smartphones and Laptops

Apple has been experimenting with the Large Language Models (LLMs) that power most of today's AI applications. The company wants these LLMs to serve users well and run efficiently, which is difficult because they require a lot of resources, including compute and memory. Traditionally, LLMs have required AI accelerators paired with large DRAM capacity to hold the model weights. However, Apple has published a paper that aims to bring LLMs to devices with limited memory capacity. The method stores the LLM on NAND flash memory (regular storage) and constructs an inference cost model that matches flash memory behavior, guiding optimization in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Instead of keeping all of the model weights in DRAM, Apple wants to store them in flash and pull them into DRAM only on demand, when they are needed.
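To make the on-demand idea concrete, here is a minimal sketch, not Apple's implementation. It assumes a toy feed-forward layer whose per-neuron weights sit in a file standing in for flash, with the two weight vectors belonging to one neuron stored back to back so that a single contiguous read fetches both. The file layout, the sizes, and the names `write_bundled` and `load_neurons` are all hypothetical.

```python
import os
import struct
import tempfile

D_MODEL = 4      # hidden size (toy value, assumption)
N_NEURONS = 6    # number of feed-forward neurons (toy value, assumption)
# One record per neuron: its up-projection vector followed by its down-projection vector.
RECORD = struct.Struct(f"<{2 * D_MODEL}f")


def write_bundled(path, up_cols, down_rows):
    """Store each neuron's two weight vectors back to back in one file."""
    with open(path, "wb") as f:
        for up, down in zip(up_cols, down_rows):
            f.write(RECORD.pack(*up, *down))


def load_neurons(path, neuron_ids):
    """Read only the requested neurons; each one costs a single contiguous read."""
    loaded = {}
    with open(path, "rb") as f:
        for i in sorted(neuron_ids):
            f.seek(i * RECORD.size)
            values = RECORD.unpack(f.read(RECORD.size))
            loaded[i] = (values[:D_MODEL], values[D_MODEL:])
    return loaded


# Toy weights: neuron i holds +i in its up-projection and -i in its down-projection.
up_cols = [[float(i)] * D_MODEL for i in range(N_NEURONS)]
down_rows = [[float(-i)] * D_MODEL for i in range(N_NEURONS)]

with tempfile.TemporaryDirectory() as tmp:
    weights_path = os.path.join(tmp, "ffn_weights.bin")
    write_bundled(weights_path, up_cols, down_rows)
    in_dram = load_neurons(weights_path, {1, 4})   # only two neurons are needed
    bytes_read = len(in_dram) * RECORD.size
    bytes_total = N_NEURONS * RECORD.size
```

In this toy setup only two of the six records are read, so a third of the file ever reaches memory. A real system would also have to deal with flash page sizes and parallel reads, which the sketch ignores.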

Two principal techniques are introduced within this flash memory-informed framework: "windowing" and "row-column bundling." These methods collectively enable running models up to twice the size of the available DRAM, with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches on CPU and GPU, respectively. Integrating sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for practical inference of LLMs on devices with limited memory, such as SoCs with 8/16/32 GB of available DRAM. Especially with DRAM being more expensive than NAND flash, setups such as smartphone configurations could store and run inference on LLMs with multiple billions of parameters, even if the available DRAM isn't sufficient to hold the whole model. For a more technical deep dive, read the paper on arXiv.
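The windowing idea can be illustrated with a small sketch, assuming it works roughly as follows: the neurons that were active for the last few tokens stay cached in DRAM, and for each new token only the neurons not already cached are read from flash. The class name `NeuronWindow`, its interface, and the neuron IDs are hypothetical and chosen for illustration only.

```python
from collections import deque


class NeuronWindow:
    """Track which neurons stay cached in DRAM across a sliding window of tokens."""

    def __init__(self, window_size):
        self.recent = deque(maxlen=window_size)  # active-neuron sets, newest last
        self.cached = set()                      # neurons currently held in DRAM

    def step(self, active_now):
        """Return (to_load, to_evict) for the next token."""
        self.recent.append(set(active_now))      # the oldest set drops out automatically
        needed = set().union(*self.recent)
        to_load = needed - self.cached           # only these are read from flash
        to_evict = self.cached - needed          # unused by every token in the window
        self.cached = needed
        return to_load, to_evict


window = NeuronWindow(window_size=2)
step1 = window.step({1, 2, 3})   # cold start: everything must be loaded
step2 = window.step({2, 3, 4})   # overlaps with the previous token
step3 = window.step({3, 4, 5})   # token 1 leaves the window
```

Because consecutive tokens tend to activate overlapping sets of neurons, the second and third steps each load a single new neuron instead of three, which is the reduction in flash traffic the technique is after.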
Source: Apple (arXiv Paper)
24 Comments on Apple Wants to Store LLMs on Flash Memory to Bring AI to Smartphones and Laptops

#2
AleksandarK
News Editor
phanbuey wrote: This is done by a company here locally that seems to work fairly well. The flash cells can hold various charge values and are good for storing pre-baked AI neurons.

Power-efficient analog compute for edge AI - Mythic

basically they're using the flash as an analog computer.
That is processing in memory. Apple just uses flash to store the weights and only pulls them into DRAM if needed. It is not processing in flash.
#3
phanbuey
AleksandarK wrote: That is processing in memory. Apple just uses flash to store the weights and only pulls them into DRAM if needed. It is not processing in flash.
I think it's the same thing; they just call it "analog computing", but they still have a digital processor to interpret the data from the model and make it usable.

"Each tile has a large Analog Compute Engine (Mythic ACE™) to store bulky neural network weights, local SRAM memory for data being passed between the neural network nodes, a single-instruction multiple-data (SIMD) unit for processing operations not handled by the ACE"
#4
AleksandarK
News Editor
phanbuey wrote: I think it's the same thing; they just call it "analog computing", but they still have a digital processor to interpret the data from the model and make it usable.

"Each tile has a large Analog Compute Engine (Mythic ACE™) to store bulky neural network weights, local SRAM memory for data being passed between the neural network nodes, a single-instruction multiple-data (SIMD) unit for processing operations not handled by the ACE"
It is not the same thing as ACE, which actually processes the data in memory. What Apple is proposing is still separate processing and memory elements.
#5
phanbuey
AleksandarK wrote: It is not the same thing as ACE, which actually processes the data in memory. What Apple is proposing is still separate processing and memory elements.
But if you read what they're saying it does... it actually doesn't process anything: "Analog Compute Engine (Mythic ACE™) to store bulky neural network weights", then it passes to SRAM, then to the SIMD. I'm sure it's more complicated than that and there is data processing and in-memory computing functionality in the architecture... but it seems like the concept of storing neural weights in NAND is something that makes sense for AI at the edge.
#6
AleksandarK
News Editor
phanbuey wrote: But if you read what they're saying it does... it actually doesn't process anything: "Analog Compute Engine (Mythic ACE™) to store bulky neural network weights", then it passes to SRAM, then to the SIMD. I'm sure it's more complicated than that and there is data processing and in-memory computing functionality in the architecture... but it seems like the concept of storing neural weights in NAND is something that makes sense for AI at the edge.
"Our analog compute takes compute-in-memory to an extreme, where we compute directly inside the memory array itself. This is possible by using the memory elements as tunable resistors, supplying the inputs as voltages, and collecting the outputs as currents. We use analog computing for our core neural network matrix operations, where we are multiplying an input vector by a weight matrix"
It actually performs matrix multiply-accumulate (MAC) operations, which are integral to the neural network operation flow. The results are then pushed further. :)
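As a rough digital stand-in for what the quote describes, the sketch below multiplies an input vector by a weight matrix one multiply-accumulate at a time. In the analog picture, each product corresponds to a voltage applied across a tunable resistor (current = voltage × conductance), and the accumulation corresponds to currents summing on a shared output line. The function name and the numbers are made up for illustration and are not taken from Mythic's documentation.

```python
def analog_matvec(input_voltages, conductances):
    """Multiply an input vector by a weight matrix using explicit MAC steps.

    conductances[i][j] is the weight connecting input i to output column j.
    """
    n_outputs = len(conductances[0])
    output_currents = [0.0] * n_outputs
    for i, voltage in enumerate(input_voltages):
        for j in range(n_outputs):
            # One multiply-accumulate: this cell adds voltage * conductance to column j.
            output_currents[j] += voltage * conductances[i][j]
    return output_currents


currents = analog_matvec([1.0, 2.0], [[0.5, 1.0], [0.25, 0.0]])
```

The point of doing this inside the memory array is that the weights never have to move; only the input vector and the summed outputs cross the boundary.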
#7
phanbuey
AleksandarK wrote: "Our analog compute takes compute-in-memory to an extreme, where we compute directly inside the memory array itself. This is possible by using the memory elements as tunable resistors, supplying the inputs as voltages, and collecting the outputs as currents. We use analog computing for our core neural network matrix operations, where we are multiplying an input vector by a weight matrix"
It actually performs matrix multiply-accumulate (MAC) operations, which are integral to the neural network operation flow. The results are then pushed further. :)
Fascinating :toast:.
#8
bug
Keep in mind it's training the LLMs that eats a ton of resources, not the LLMs themselves. It's no different from "ordinary" neural networks.
#9
AleksandarK
News Editor
bug wrote: Keep in mind it's training the LLMs that eats a ton of resources, not the LLMs themselves. It's no different from "ordinary" neural networks.
Those are cloud resources; all we care about is local resources at our fingertips! ;)
#10
bug
AleksandarK wrote: Those are cloud resources; all we care about is local resources at our fingertips! ;)
I was just making the distinction, lest we get people excited about how Apple is able to put a whole cloud onto a magic Apple stick...
#11
AnotherReader
phanbuey wrote: But if you read what they're saying it does... it actually doesn't process anything: "Analog Compute Engine (Mythic ACE™) to store bulky neural network weights", then it passes to SRAM, then to the SIMD. I'm sure it's more complicated than that and there is data processing and in-memory computing functionality in the architecture... but it seems like the concept of storing neural weights in NAND is something that makes sense for AI at the edge.
After reading the paper, this is a clever approach to minimize the amount of data transferred from NAND to DRAM by taking advantage of the sparsity of common LLMs such as GPT-3 and OPT.
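A toy sketch of why sparsity helps, under the assumption of a ReLU feed-forward layer: any neuron whose activation is zero contributes nothing to the output, so its down-projection weights never need to leave flash. The function names and numbers are hypothetical. Note that the sketch computes each activation directly, whereas the paper, as far as I can tell, predicts which neurons will be active so that the inactive ones are not touched at all.

```python
def relu(v):
    return v if v > 0.0 else 0.0


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ffn_dense(x, up_cols, down_rows):
    """Reference version: touches every neuron's down-projection row."""
    y = [0.0] * len(down_rows[0])
    for up, down in zip(up_cols, down_rows):
        h = relu(dot(x, up))
        for k, w in enumerate(down):
            y[k] += h * w
    return y


def ffn_sparse(x, up_cols, down_rows):
    """Skips neurons with zero activation and counts the rows actually used."""
    y = [0.0] * len(down_rows[0])
    rows_used = 0
    for up, down in zip(up_cols, down_rows):
        h = relu(dot(x, up))
        if h == 0.0:
            continue  # this neuron's down-projection row is never needed
        rows_used += 1
        for k, w in enumerate(down):
            y[k] += h * w
    return y, rows_used


x = [1.0, -1.0]
up_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
down_rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

dense_out = ffn_dense(x, up_cols, down_rows)
sparse_out, rows_used = ffn_sparse(x, up_cols, down_rows)
```

Both versions produce the same output here, but the sparse one needs only one of the three down-projection rows.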
bug wrote: Keep in mind it's training the LLMs that eats a ton of resources, not the LLMs themselves. It's no different from "ordinary" neural networks.
While training is the most resource-intensive part by definition, using the model for inference hundreds of millions or even billions of times will eventually exceed the cost of training, as that is a one-time cost. Therefore, improving the cost of inference is beneficial too. It also allows more inference to be done on the client side.

Note: The cost of training GPT-3 is estimated to be 3.1 * 10^23 FLOPs, while the cost of a single inference operation is estimated to be 740 TFLOPs (7.4 * 10^14 FLOPs). After about 420 million inferences, the cost of inference will surpass that of training.
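Written out, the break-even count is simply the ratio of the two estimates above (the symbol names are mine, not from any source):

```latex
\[
N_{\text{break-even}}
  = \frac{C_{\text{train}}}{C_{\text{inference}}}
  = \frac{3.1 \times 10^{23}\ \text{FLOPs}}{7.4 \times 10^{14}\ \text{FLOPs}}
  \approx 4.2 \times 10^{8}\ \text{inferences}
\]
```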
#12
bug
AnotherReader wrote: While training is the most resource-intensive part by definition, using the model for inference hundreds of millions or even billions of times will eventually exceed the cost of training, as that is a one-time cost. Therefore, improving the cost of inference is beneficial too. It also allows more inference to be done on the client side.
I'm not sure about LLMs, but for NNs, inference is almost zero cost. A bunch of additions and multiplications that will barely register on any recent CPU.
The biggest part in this would be moving it from the cloud to the client. Even then, it's not that different from what JS did when it started offloading processing to the client (browser). It's a welcome option, but really nothing to rave about.
#13
pk67
with a 4-5x and 20-25x increase in inference speed compared to naive loading approaches on CPU and GPU
Naive is not the term the author wanted to use, I guess ;)
#14
AnotherReader
bug wrote: I'm not sure about LLMs, but for NNs, inference is almost zero cost. A bunch of additions and multiplications that will barely register on any recent CPU.
The biggest part in this would be moving it from the cloud to the client. Even then, it's not that different from what JS did when it started offloading processing to the client (browser). It's a welcome option, but really nothing to rave about.
In isolation, the cost of one single inference is minuscule compared to the training cost. However, the volume of queries per day (10 million for ChatGPT in early 2023) is enough to ensure that inference cost for a GPT-3 trained model will have surpassed training cost in about 42 days.
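The 42-day figure can be checked with a few lines of arithmetic. The function below is a hypothetical helper; its inputs are the estimates quoted in this thread (3.1 * 10^23 FLOPs for training, 740 TFLOPs per inference, 10 million queries per day), not measured values.

```python
def days_to_break_even(train_flops, flops_per_inference, queries_per_day):
    """Days until cumulative inference compute equals the one-time training compute."""
    return train_flops / (flops_per_inference * queries_per_day)


days = days_to_break_even(
    train_flops=3.1e23,          # estimated GPT-3 training cost
    flops_per_inference=740e12,  # estimated cost of one inference
    queries_per_day=10e6,        # quoted ChatGPT volume, early 2023
)
```

Since the per-inference cost and the daily volume are both parameters, the same helper shows how the break-even point moves if either estimate changes.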
#15
FrostWolf
This is where I insert the (kidding) comment that this is the first time Apple wanted to sell devices with more memory. ;^)

(Disclaimer: I use an iPhone, and I have no personal issues with Macs).
#16
SOAREVERSOR
FrostWolf wrote: This is where I insert the (kidding) comment that this is the first time Apple wanted to sell devices with more memory. ;^)

(Disclaimer: I use an iPhone, and I have no personal issues with Macs).
Oh they want to sell devices with more memory! It's what they want to charge for it that's been the issue ;)
#17
bug
AnotherReader wrote: In isolation, the cost of one single inference is minuscule compared to the training cost. However, the volume of queries per day (10 million for ChatGPT in early 2023) is enough to ensure that inference cost for a GPT-3 trained model will have surpassed training cost in about 42 days.
Right. And doubling the inference speed would net you what? 84 days before you surpass it? That's why I said the big thing here is moving inference to the client. In my mind that takes precedence over the speed of the inference itself.
pk67 wrote: Naive is not the term the author wanted to use, I guess ;)
It really is. That's the common term in software engineering for an implementation that aims for nothing more than proving that some concept works.
Designing data structures around a particular set of problems is nothing new either.
#18
Wirko
The most surprising part of this story is how much technical information Apple is willing to share with the world.
#19
Tomorrow
The company that uses soldered, low-grade NAND that is not properly parallelized wants to put write-intensive stuff on NAND? Yeah, I'm sure that will work out well...
#20
kondamin
Tomorrow wrote: The company that uses soldered, low-grade NAND that is not properly parallelized wants to put write-intensive stuff on NAND? Yeah, I'm sure that will work out well...
Should have bought into Intel and Micron's 3D XPoint.
#21
Shihab
Apple discovers asset streaming...
#22
bug
Tomorrow wrote: The company that uses soldered, low-grade NAND that is not properly parallelized wants to put write-intensive stuff on NAND? Yeah, I'm sure that will work out well...
LLMs are not write intensive. At least not past their training stage. They may add your conversations to their existing knowledge base, but other than that, they're pretty much read-only.

But again, offloading server processing to clients and finding more efficient ways to represent data is something done as a routine throughout the industry. It only becomes remarkable when Apple does it... And they're not even the first to do it, afaik Mozilla was the first to announce they want to let you use AI on your local machine (among other things).
#23
Wirko
Apple is actually right, SSDs are thrice cheaper per terabyte than even the cheapest peasant DDR4-2400. I think it's that A.666 form factor to blame for cheapness.
#24
pk67
bug wrote: It really is. That's the common term in software engineering for an implementation that aims for nothing more than proving that some concept works.
Designing data structures around a particular set of problems is nothing new either.
I'm not sure you are commenting on the right term, really. "Native" fits the context IMHO, but "naive" does not.