- Joined
- Aug 19, 2017
- Messages
- 2,626 (0.98/day)
The AI infrastructure of today is mostly fueled by the expansion that relies on GPU-accelerated servers. Google, one of the world's largest hyperscalers, has noted that CPUs are still a leading compute for AI/ML workloads, recorded on their Google Cloud Services cloud internal analysis. During the TechFieldDay event, a speech by Brandon Royal, product manager at Google Cloud, explained the position of CPUs in today's AI game. The AI lifecycle is divided into two parts: training and inference. During training, massive compute capacity is needed, along with enormous memory capacity, to fit ever-expanding AI models into memory. The latest models, like GPT-4 and Gemini, contain billions of parameters and require thousands of GPUs or other accelerators working in parallel to train efficiently.
On the other hand, inference requires less compute intensity but still benefits from acceleration. The pre-trained model is optimized and deployed during inference to make predictions on new data. While less compute is needed than training, latency and throughput are essential for real-time inference. Google found out that, while GPUs are ideal for the training phase, models are often optimized and run inference on CPUs. This means that there are customers who choose CPUs as their medium of AI inference for a wide variety of reasons.
It can be a matter of cost and availability. CPUs tend to be cheaper and more readily available than high-end GPUs or specialized AI accelerators. For many applications, a CPU provides sufficient performance for inference at a lower cost. CPUs also offer flexibility. Since most systems already have CPUs, they provide an easy deployment path for smaller AI models. GPUs often require specialized libraries and drivers, while CPU-based inference can leverage existing infrastructure. This makes it simpler to integrate AI into existing products and workflows. Latency and throughput tradeoffs also come into play. GPUs excel at massively parallel throughput for inference. But CPUs can often provide lower latency for real-time requests. CPU inferencing may be preferred for applications like online recommendations that require sub-second response.
Additionally, CPU optimization for inferencing is progressing rapidly. Performance keeps improving, driven by faster clocks, more cores, and new instructions like Intel AVX-512 and AMX, AI workloads can be run smoothly on CPUs alone and are especially good if the server is configured with more than a single socket, meaning that more AI engines are present and the server can efficiently process AI models with multi-billion parameters in size. Generally, Intel notes that models up to 20 billion parameters work fine on a CPU, whereas anything larger must go to a specialized accelerator.
AI models like GPT-4, Claude, and Gemini are huge models that can reach more than a trillion parameter sizes. However, they are multi-modal, meaning they process text and video. A real-world enterprise workload may be an AI model inferencing a company's local documents to answer customer support questions. Running a model like GPT-4 would be an overkill for that solution. In contrast, significantly smaller models like LLAMA 2 or Mistral can serve similar purposes exceptionally well without requiring third-party API access and instead run on a local or cloud server with a few CPUs. This translates into lower total cost of ownership (TCO) and simplified AI pipelines.
View at TechPowerUp Main Site | Source
On the other hand, inference requires less compute intensity but still benefits from acceleration. The pre-trained model is optimized and deployed during inference to make predictions on new data. While less compute is needed than training, latency and throughput are essential for real-time inference. Google found out that, while GPUs are ideal for the training phase, models are often optimized and run inference on CPUs. This means that there are customers who choose CPUs as their medium of AI inference for a wide variety of reasons.
It can be a matter of cost and availability. CPUs tend to be cheaper and more readily available than high-end GPUs or specialized AI accelerators. For many applications, a CPU provides sufficient performance for inference at a lower cost. CPUs also offer flexibility. Since most systems already have CPUs, they provide an easy deployment path for smaller AI models. GPUs often require specialized libraries and drivers, while CPU-based inference can leverage existing infrastructure. This makes it simpler to integrate AI into existing products and workflows. Latency and throughput tradeoffs also come into play. GPUs excel at massively parallel throughput for inference. But CPUs can often provide lower latency for real-time requests. CPU inferencing may be preferred for applications like online recommendations that require sub-second response.
Additionally, CPU optimization for inferencing is progressing rapidly. Performance keeps improving, driven by faster clocks, more cores, and new instructions like Intel AVX-512 and AMX, AI workloads can be run smoothly on CPUs alone and are especially good if the server is configured with more than a single socket, meaning that more AI engines are present and the server can efficiently process AI models with multi-billion parameters in size. Generally, Intel notes that models up to 20 billion parameters work fine on a CPU, whereas anything larger must go to a specialized accelerator.
AI models like GPT-4, Claude, and Gemini are huge models that can reach more than a trillion parameter sizes. However, they are multi-modal, meaning they process text and video. A real-world enterprise workload may be an AI model inferencing a company's local documents to answer customer support questions. Running a model like GPT-4 would be an overkill for that solution. In contrast, significantly smaller models like LLAMA 2 or Mistral can serve similar purposes exceptionally well without requiring third-party API access and instead run on a local or cloud server with a few CPUs. This translates into lower total cost of ownership (TCO) and simplified AI pipelines.
View at TechPowerUp Main Site | Source