Thursday, November 10th 2022

Intel Delivers Leading AI Performance Results on MLPerf v2.1 Industry Benchmark for DL Training

Today, MLCommons published results of its industry AI performance benchmark in which both the 4th Generation Intel Xeon Scalable processor (code-named Sapphire Rapids) and Habana Gaudi 2 dedicated deep learning accelerator logged impressive training results.


"I'm proud of our team's continued progress since we last submitted leadership results on MLPerf in June. Intel's 4th gen Xeon Scalable processor and Gaudi 2 AI accelerator support a wide array of AI functions and deliver leadership performance for customers who require deep learning training and large-scale workloads."
- Sandra Rivera, Intel executive vice president and general manager of the Datacenter and AI Group
Why It Matters
In many data center use cases, deep learning (DL) is part of a complex pipeline of machine learning (ML) and data analytics running on Xeon-based servers that are also used to run other applications and are adaptable to workload demands changing over time. It is in these use cases that Xeon Scalable delivers the best total cost of ownership (TCO) and year-round utilization.

The 4th Generation Intel Xeon Scalable processor with Intel Advanced Matrix Extensions (AMX), a new built-in AI accelerator, allows customers to extend the general-purpose Xeon server platform to cover even more DL use cases, including DL training and fine tuning. AMX is a dedicated matrix multiplication engine built into every core of 4th Gen Intel Xeon Scalable processors. This dedicated AI engine is optimized to deliver up to 6x higher gen-to-gen DL training model performance using industry standard frameworks.

In cases where the server or cluster of servers is predominantly used for DL training and inference compute, the Habana Gaudi2 accelerator is the optimal choice. It is purpose-designed to deliver the best DL performance and TCO for these dedicated use cases.

About the Results for Xeon
Intel submitted MLPerf Training v2.1 results on the 4th Gen Intel Xeon Scalable processor product line across a range of workloads. Intel Xeon Scalable Processor was the only CPU submitted for MLPerf v2.1, once again demonstrating it is the best server CPU for AI training, which enables customers to use their shared infrastructure to train anywhere, anytime. The 4th Gen Intel Xeon Scalable processors with Intel AMX deliver this performance out of the box across multiple industry standard frameworks and integrated with end-to-end data science tools and a broad ecosystem of smart solutions from partners. Developers only need to use the latest framework releases of TensorFlow and PyTorch to unleash this performance. Intel Xeon Scalable can now run any AI workload.

Intel's results show that 4th Gen Intel Xeon Scalable processors are expanding the reach of general-purpose CPUs for AI training, so customers can do more with Xeons that are already running their businesses. This is especially true for training small to medium models or for transfer learning (also known as fine tuning). The DLRM results are a good example: the model was trained in less than 30 minutes (26.73 minutes) with only four server nodes. Even for mid-sized and larger models, 4th Gen Xeon processors could train BERT and ResNet-50 models in less than 50 minutes (47.26 minutes) and less than 90 minutes (89.01 minutes), respectively. Developers can now train small DL models over a coffee break, train mid-sized models over lunch, and use those same servers, connected to data storage systems, for other analytics techniques such as classical machine learning in the afternoon. This allows the enterprise to conserve deep learning processors, like Gaudi2, for the largest, most demanding models.
About the Results for Habana Gaudi2
Gaudi2, Habana's second-generation DL processor, launched in May and submitted leadership results on MLPerf v2.0 training 10 days later. Gaudi2, produced on a 7-nanometer process and featuring 24 tensor processor cores, 96 GB of on-board HBM2e memory and 24 integrated 100 Gigabit Ethernet ports, has again shown leading eight-card server performance on the benchmark compared to Nvidia's A100.

As shown here, Gaudi2 improved its time-to-train (TTT) in TensorFlow by 10% for both BERT and ResNet-50, and reported PyTorch results that achieved 4% and 6% TTT advantages for BERT and ResNet-50, respectively, over the May Gaudi2 submission. Both sets of results were submitted in the closed and available categories.

These rapid advances underscore the uniqueness of the Gaudi2 purpose-built DL architecture, the increasing maturity of Gaudi2 software and expansion of the Habana SynapseAI software stack, optimized for deep learning model development and deployment.

As further evidence of the strength of the results, Gaudi2 continued to outperform the Nvidia A100 for both BERT and ResNet-50, as it did in the May submission and as shown here. In addition, it's notable that Nvidia's H100 ResNet-50 TTT is only 11% faster than the Gaudi2 performance. And though the H100 is 59% faster than Gaudi2 on BERT, it is worth noting that Nvidia reported BERT TTT in the FP8 data type, while Gaudi2 TTT is on the standard, verified BF16 data type (with FP8 enablement in the software plans for Gaudi2). Gaudi2 offers meaningful price-performance improvement versus both A100 and H100.

The Intel and Habana teams look forward to their next MLPerf submissions for Intel AI portfolio solutions.
Source: Intel

3 Comments on Intel Delivers Leading AI Performance Results on MLPerf v2.1 Industry Benchmark for DL Training

#1
DeathtoGnomes
TheLostSwede: Habana Gaudi 2
replace this with Humma Kavula.
Thats Gaudy enough, right?

It should be noted that MLPerf scores are kind of averaged, it discards the highest and lowest scores. Results for Xeon looks promising but still likely wont beat Epyc on its own.
To account for the substantial variance in ML training times, final results are obtained by measuring the benchmark a benchmark-specific number of times, discarding the lowest and highest results, and averaging the remaining results. Even the multiple result average is not sufficient to eliminate all variance. Imaging benchmark results are very roughly +/- 2.5% and other benchmarks are very roughly +/- 5%.
mlcommons.org/en/training-normal-21/
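The scoring rule quoted above (run several times, drop the lowest and highest result, average the rest) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `mlperf_style_score` and the run times are made up for the example, and the real number of runs is benchmark-specific per MLCommons.

```python
from statistics import mean


def mlperf_style_score(run_minutes):
    """Drop the single lowest and single highest result, then average the rest.

    Hypothetical helper for illustration; not MLCommons' actual tooling.
    """
    if len(run_minutes) < 3:
        raise ValueError("need at least three runs to drop the extremes")
    ordered = sorted(run_minutes)
    # Slice off the first (lowest) and last (highest) entries.
    return mean(ordered[1:-1])


# Hypothetical time-to-train values in minutes, not real submission data.
example_runs = [27.9, 26.1, 26.8, 25.4, 27.0]
print(f"{mlperf_style_score(example_runs):.2f} minutes")
```

Even with the extremes dropped, the remaining runs still vary, which is why the quoted text gives rough error bands of +/- 2.5% for imaging benchmarks and +/- 5% for the others.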
#2
Minus Infinity
DeathtoGnomes: replace this with Humma Kavula.
Thats Gaudy enough, right?

It should be noted that MLPerf scores are kind of averaged, it discards the highest and lowest scores. Results for Xeon looks promising but still likely wont beat Epyc on its own.

mlcommons.org/en/training-normal-21/
AI accelerators can have a huge uplift vs core counts. If Epyc has no accelerators, their core counts won't save them in specific tasks that Intel has accelerated. I believe Turin will have a lot of accelerators but Genoa doesn't seem to mention much in that way.
#3
First Strike
DeathtoGnomes: Results for Xeon looks promising but still likely wont beat Epyc on its own.
I would be surprised if Epyc can somehow beat Sapphire Rapids in BF16 mixed precision ML benchmarks with AMX on, because Intel did put an extremely oversized AMX in it along with HBM to feed the AMX units.