Thursday, December 21st 2023
Apple Wants to Store LLMs on Flash Memory to Bring AI to Smartphones and Laptops
Apple has been experimenting with the Large Language Models (LLMs) that power most of today's AI applications. The company wants to serve these models to users efficiently, which is a difficult task because they require a lot of resources, including compute and memory. Traditionally, LLM inference has required AI accelerators combined with enough DRAM capacity to hold the model weights. However, Apple has published a paper that aims to bring LLMs to devices with limited memory capacity. The method stores the model on NAND flash memory (regular storage) and constructs an inference cost model that harmonizes with flash memory behavior, guiding optimization in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks. Instead of keeping all model weights in DRAM, Apple wants to utilize flash memory to store the weights and pull them into DRAM on demand, only when they are needed.
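To make the idea concrete, here is a minimal sketch of on-demand weight loading, written as an illustration rather than taken from Apple's paper; the file name, layer shape, and NumPy memory-mapping approach are all assumptions.

import numpy as np

ROWS, COLS = 4_096, 1_024                     # hypothetical layer shape
PATH = "layer0.bin"                           # hypothetical weight file on flash

# For a runnable demo, write a dummy weight file standing in for flash storage.
np.zeros((ROWS, COLS), dtype=np.float16).tofile(PATH)

# np.memmap keeps the data on disk; nothing is read until it is indexed.
weights_on_flash = np.memmap(PATH, dtype=np.float16, mode="r", shape=(ROWS, COLS))

def fetch_rows(row_ids):
    """Copy only the requested weight rows from flash into DRAM."""
    # Sorting the indices turns scattered lookups into a more sequential
    # access pattern, which flash serves far better than small random reads.
    row_ids = np.sort(np.asarray(row_ids))
    return np.asarray(weights_on_flash[row_ids])  # materializes just these rows

rows_in_dram = fetch_rows([17, 42, 1024])     # e.g. rows for currently active neurons
print(rows_in_dram.shape)                     # (3, 1024)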
Two principal techniques are introduced within this flash-memory-informed framework: "windowing" and "row-column bundling." Together, these methods enable running models up to twice the size of the available DRAM, with a 4-5x increase in inference speed over naive loading approaches on the CPU and a 20-25x increase on the GPU. Integrating sparsity awareness, context-adaptive loading, and a hardware-oriented design paves the way for practical inference of LLMs on devices with limited memory, such as SoCs with 8/16/32 GB of available DRAM. Especially with DRAM costing considerably more per gigabyte than NAND flash, configurations such as smartphones could store and run inference on multi-billion-parameter LLMs even when the available DRAM alone isn't sufficient to hold them. For a more technical deep dive, read the paper on arXiv, linked in the source below.
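As a rough sketch of how those two techniques read (this is illustrative logic, not Apple's code; the window length, record layout, and helper names are assumptions): row-column bundling stores a neuron's up-projection row and down-projection column adjacently so one contiguous flash read fetches both, while windowing keeps in DRAM only the neurons activated by the last few tokens, loading newly needed ones and evicting the rest.

def bundle_offsets(neuron_id, up_row_bytes, down_col_bytes):
    """Byte range of a neuron's bundled (up-row + down-column) record in flash."""
    record = up_row_bytes + down_col_bytes
    start = neuron_id * record
    return start, start + record              # one contiguous read per neuron

def update_window(in_dram, active_now, window, window_len=2):
    """Sliding-window cache of recently active neurons (window_len is assumed)."""
    window.append(set(active_now))
    if len(window) > window_len:
        window.pop(0)
    needed = set().union(*window)
    to_load = needed - in_dram                # incremental flash reads only
    to_evict = in_dram - needed               # free DRAM for dropped neurons
    return needed, to_load, to_evict

# Example: per-token active-neuron sets drive incremental loads and evictions.
cache, window = set(), []
for token_active in ([1, 2, 3], [2, 3, 7], [3, 7, 9]):
    cache, loads, evictions = update_window(cache, token_active, window)
    print(sorted(cache), "load:", sorted(loads), "evict:", sorted(evictions))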
Source:
Apple (arXiv Paper)
24 Comments on Apple Wants to Store LLMs on Flash Memory to Bring AI to Smartphones and Laptops
Power-efficient analog compute for edge AI - Mythic
basically they're using the flash as an analog computer.
" Each tile has a large Analog Compute Engine (Mythic ACE™) to store bulky neural network weights, local SRAM memory for data being passed between the neural network nodes, a single-instruction multiple-data (SIMD) unit for processing operations not handled by the ACE"
It actually processes matrix multiply-accumulate (MAC) operations, which are integral to the neural network's operation flow. The results are then pushed further down the pipeline. :)
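For anyone curious what that looks like, here is a simplified digital model of the MAC an analog flash array performs; this is my own illustration, not Mythic's design, and the shapes and function name are made up: weights sit in the cells as conductances, inputs arrive as voltages, and the per-column currents sum into dot products.

import numpy as np

def analog_tile_mac(weight_conductances, input_voltages):
    """Ohm's law per cell (I = G * V), Kirchhoff summation per column."""
    currents = weight_conductances * input_voltages[:, None]   # I = G * V
    return currents.sum(axis=0)                                # column currents add up

G = np.random.rand(4, 3)        # 4x3 weight tile stored in the cells
V = np.random.rand(4)           # 4 input activations applied as voltages
print(analog_tile_mac(G, V))    # equals V @ G, computed "in memory"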
Note: The cost of training GPT-3 is estimated at 3.1 * 10^23 FLOPs, while a single inference is estimated at 740 TFLOPs (7.4 * 10^14 FLOPs). After roughly 420 million inferences, the cumulative cost of inference surpasses that of training.
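A quick back-of-the-envelope check of that break-even figure, using only the estimates quoted above:

training_flops = 3.1e23         # estimated total FLOPs to train GPT-3
inference_flops = 740e12        # estimated FLOPs per single inference (740 TFLOPs)
break_even = training_flops / inference_flops
print(f"{break_even:,.0f}")     # ~418,918,919, i.e. roughly 420 million inferences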
The biggest part of this would be moving inference from the cloud to the client. Even then, it's not that different from what JS did when it started offloading processing to the client (the browser). It's a welcome option, but really nothing to rave about.
(Disclaimer: I use an iPhone, and I have no personal issues with Macs).
Designing data structures around a particular set of problems is nothing new either.
But again, offloading server processing to clients and finding more efficient ways to represent data is something done routinely throughout the industry. It only becomes remarkable when Apple does it... And they're not even the first to do it; afaik Mozilla was the first to announce they want to let you use AI on your local machine (among other things).