Google: CPUs are Leading AI Inference Workloads, Not GPUs

AleksandarK · Mar 4, 2024

The AI infrastructure of today is mostly fueled by the expansion that relies on GPU-accelerated servers. Google, one of the world's largest hyperscalers, has noted that CPUs are still a leading compute for AI/ML workloads, recorded on their Google Cloud Services cloud internal analysis. During the TechFieldDay event, a speech by Brandon Royal, product manager at Google Cloud, explained the position of CPUs in today's AI game. The AI lifecycle is divided into two parts: training and inference. During training, massive compute capacity is needed, along with enormous memory capacity, to fit ever-expanding AI models into memory. The latest models, like GPT-4 and Gemini, contain billions of parameters and require thousands of GPUs or other accelerators working in parallel to train efficiently.

On the other hand, inference requires less compute intensity but still benefits from acceleration. The pre-trained model is optimized and deployed during inference to make predictions on new data. While less compute is needed than training, latency and throughput are essential for real-time inference. Google found out that, while GPUs are ideal for the training phase, models are often optimized and run inference on CPUs. This means that there are customers who choose CPUs as their medium of AI inference for a wide variety of reasons.

It can be a matter of cost and availability. CPUs tend to be cheaper and more readily available than high-end GPUs or specialized AI accelerators. For many applications, a CPU provides sufficient performance for inference at a lower cost. CPUs also offer flexibility. Since most systems already have CPUs, they provide an easy deployment path for smaller AI models. GPUs often require specialized libraries and drivers, while CPU-based inference can leverage existing infrastructure. This makes it simpler to integrate AI into existing products and workflows. Latency and throughput tradeoffs also come into play. GPUs excel at massively parallel throughput for inference. But CPUs can often provide lower latency for real-time requests. CPU inferencing may be preferred for applications like online recommendations that require sub-second response.

Additionally, CPU optimization for inferencing is progressing rapidly. Performance keeps improving, driven by faster clocks, more cores, and new instructions like Intel AVX-512 and AMX, AI workloads can be run smoothly on CPUs alone and are especially good if the server is configured with more than a single socket, meaning that more AI engines are present and the server can efficiently process AI models with multi-billion parameters in size. Generally, Intel notes that models up to 20 billion parameters work fine on a CPU, whereas anything larger must go to a specialized accelerator.

AI models like GPT-4, Claude, and Gemini are huge models that can reach more than a trillion parameter sizes. However, they are multi-modal, meaning they process text and video. A real-world enterprise workload may be an AI model inferencing a company's local documents to answer customer support questions. Running a model like GPT-4 would be an overkill for that solution. In contrast, significantly smaller models like LLAMA 2 or Mistral can serve similar purposes exceptionally well without requiring third-party API access and instead run on a local or cloud server with a few CPUs. This translates into lower total cost of ownership (TCO) and simplified AI pipelines.

View at TechPowerUp Main Site | Source

nguyen · Mar 4, 2024

Well I guess Google Gemini has been running on CPUs LOL.

The article saying GPU is better at training ML, would it just make more sense to use newer GPU to train and outdated GPU for inferencing instead of GPU for training and CPU for inferencing?

ExcuseMeWtf · Mar 4, 2024

Actually, and predictably, ASICs do:

AleksandarK · Mar 4, 2024

nguyen said:
Well I guess Google Gemini has been running on CPUs LOL.

The article saying GPU is better at training ML, would it just make more sense to use newer GPU to train and outdated GPU for inferencing instead of GPU for training and CPU for inferencing?

Gemini runs on Google TPUs.

The article states that models up to 20 billion parameters are fine on CPU, where anything larger goes to GPU/TPU/xPU.

Models from Mistral and Meta in 7-13B range are ideal for CPU inference, which is a real world use case. Not everyone needs a GPT4 API for simpler tasks.

nguyen · Mar 4, 2024

AleksandarK said:
Gemini runs on Google TPUs.

The article states that models up to 20 billion parameters are fine on CPU, where anything larger goes to GPU/TPU/xPU.

Models from Mistral and Meta in 7-13B range are ideal for CPU inference, which is a real world use case. Not everyone needs a GPT4 API for simpler tasks.

But after you use GPU to train the model, it's just cheaper to use those same GPU for inferencing task instead of buying CPU?

Why would any non-tech company need to keep on training AI for simple tasks?

AleksandarK · Mar 4, 2024

nguyen said:
But after you use GPU to train the model, it's just cheaper to use those same GPU for inferencing task instead of buying CPU?

Why would any non-tech company need to keep on training AI for simple tasks?

The training infra is never idle and is always training new models. Plus, not everyone has access to those GPUs. Training is exclusive to a few, inference is available to many.

Firedrops · Mar 4, 2024

Google: Our GPU instances cost about 50 billion times more than CPU instances or vs our competitors' GPU instances.

Also Google: WOW CPU IS ALL YOU NEED!

System Name	The de-ploughminator Mk-III
Processor	9800X3D
Motherboard	Gigabyte X870E Aorus Master
Cooling	DeepCool AK620
Memory	2x32GB G.SKill 6400MT Cas32
Video Card(s)	Asus RTX4090 TUF
Storage	4TB Samsung 990 Pro
Display(s)	48" LG OLED C4
Case	Corsair 5000D Air
Audio Device(s)	KEF LSX II LT speakers + KEF KC62 Subwoofer
Power Supply	Corsair HX1200
Mouse	Razor Death Adder v3
Keyboard	Razor Huntsman V3 Pro TKL
Software	win11

System Name	The de-ploughminator Mk-III
Processor	9800X3D
Motherboard	Gigabyte X870E Aorus Master
Cooling	DeepCool AK620
Memory	2x32GB G.SKill 6400MT Cas32
Video Card(s)	Asus RTX4090 TUF
Storage	4TB Samsung 990 Pro
Display(s)	48" LG OLED C4
Case	Corsair 5000D Air
Audio Device(s)	KEF LSX II LT speakers + KEF KC62 Subwoofer
Power Supply	Corsair HX1200
Mouse	Razor Death Adder v3
Keyboard	Razor Huntsman V3 Pro TKL
Software	win11

Google: CPUs are Leading AI Inference Workloads, Not GPUs

AleksandarK

News Editor

nguyen

ExcuseMeWtf

AleksandarK

News Editor

nguyen

AleksandarK

News Editor

Firedrops