Monday, March 4th 2024

Google: CPUs are Leading AI Inference Workloads, Not GPUs

The AI infrastructure of today is mostly fueled by the expansion that relies on GPU-accelerated servers. Google, one of the world's largest hyperscalers, has noted that CPUs are still a leading compute for AI/ML workloads, recorded on their Google Cloud Services cloud internal analysis. During the TechFieldDay event, a speech by Brandon Royal, product manager at Google Cloud, explained the position of CPUs in today's AI game. The AI lifecycle is divided into two parts: training and inference. During training, massive compute capacity is needed, along with enormous memory capacity, to fit ever-expanding AI models into memory. The latest models, like GPT-4 and Gemini, contain billions of parameters and require thousands of GPUs or other accelerators working in parallel to train efficiently.

On the other hand, inference requires less compute intensity but still benefits from acceleration. The pre-trained model is optimized and deployed during inference to make predictions on new data. While less compute is needed than training, latency and throughput are essential for real-time inference. Google found out that, while GPUs are ideal for the training phase, models are often optimized and run inference on CPUs. This means that there are customers who choose CPUs as their medium of AI inference for a wide variety of reasons.
It can be a matter of cost and availability. CPUs tend to be cheaper and more readily available than high-end GPUs or specialized AI accelerators. For many applications, a CPU provides sufficient performance for inference at a lower cost. CPUs also offer flexibility. Since most systems already have CPUs, they provide an easy deployment path for smaller AI models. GPUs often require specialized libraries and drivers, while CPU-based inference can leverage existing infrastructure. This makes it simpler to integrate AI into existing products and workflows. Latency and throughput tradeoffs also come into play. GPUs excel at massively parallel throughput for inference. But CPUs can often provide lower latency for real-time requests. CPU inferencing may be preferred for applications like online recommendations that require sub-second response.

Additionally, CPU optimization for inferencing is progressing rapidly. Performance keeps improving, driven by faster clocks, more cores, and new instructions like Intel AVX-512 and AMX, AI workloads can be run smoothly on CPUs alone and are especially good if the server is configured with more than a single socket, meaning that more AI engines are present and the server can efficiently process AI models with multi-billion parameters in size. Generally, Intel notes that models up to 20 billion parameters work fine on a CPU, whereas anything larger must go to a specialized accelerator.

AI models like GPT-4, Claude, and Gemini are huge models that can reach more than a trillion parameter sizes. However, they are multi-modal, meaning they process text and video. A real-world enterprise workload may be an AI model inferencing a company's local documents to answer customer support questions. Running a model like GPT-4 would be an overkill for that solution. In contrast, significantly smaller models like LLAMA 2 or Mistral can serve similar purposes exceptionally well without requiring third-party API access and instead run on a local or cloud server with a few CPUs. This translates into lower total cost of ownership (TCO) and simplified AI pipelines.
Sources: TechFieldDay (YouTube), Keith Townsend (X/Twitter), Gelstalt IT
Add your own comment

6 Comments on Google: CPUs are Leading AI Inference Workloads, Not GPUs

#1
nguyen
Well I guess Google Gemini has been running on CPUs LOL.

The article saying GPU is better at training ML, would it just make more sense to use newer GPU to train and outdated GPU for inferencing instead of GPU for training and CPU for inferencing?
Posted on Reply
#3
AleksandarK
News Editor
nguyenWell I guess Google Gemini has been running on CPUs LOL.

The article saying GPU is better at training ML, would it just make more sense to use newer GPU to train and outdated GPU for inferencing instead of GPU for training and CPU for inferencing?
Gemini runs on Google TPUs.

The article states that models up to 20 billion parameters are fine on CPU, where anything larger goes to GPU/TPU/xPU.

Models from Mistral and Meta in 7-13B range are ideal for CPU inference, which is a real world use case. Not everyone needs a GPT4 API for simpler tasks.
Posted on Reply
#4
nguyen
AleksandarKGemini runs on Google TPUs.

The article states that models up to 20 billion parameters are fine on CPU, where anything larger goes to GPU/TPU/xPU.

Models from Mistral and Meta in 7-13B range are ideal for CPU inference, which is a real world use case. Not everyone needs a GPT4 API for simpler tasks.
But after you use GPU to train the model, it's just cheaper to use those same GPU for inferencing task instead of buying CPU?

Why would any non-tech company need to keep on training AI for simple tasks?
Posted on Reply
#5
AleksandarK
News Editor
nguyenBut after you use GPU to train the model, it's just cheaper to use those same GPU for inferencing task instead of buying CPU?

Why would any non-tech company need to keep on training AI for simple tasks?
The training infra is never idle and is always training new models. Plus, not everyone has access to those GPUs. Training is exclusive to a few, inference is available to many.
Posted on Reply
#6
Firedrops
Google: Our GPU instances cost about 50 billion times more than CPU instances or vs our competitors' GPU instances.

Also Google: WOW CPU IS ALL YOU NEED!
Posted on Reply
Dec 3rd, 2024 13:02 EST change timezone

New Forum Posts

Popular Reviews

Controversial News Posts