Friday, April 28th 2023
NVIDIA H100 Compared to A100 for Training GPT Large Language Models
NVIDIA's H100 has recently become available to use via Cloud Service Providers (CSPs), and it was only a matter of time before someone decided to benchmark its performance and compare it to the previous generation's A100 GPU. Today, thanks to the benchmarks of MosaicML, a startup company led by the ex-CEO of Nervana and GM of Artificial Intelligence (AI) at Intel, Naveen Rao, we have some comparison between these two GPUs with a fascinating insight about the cost factor. Firstly, MosaicML has taken Generative Pre-trained Transformer (GPT) models of various sizes and trained them using bfloat16 and FP8 Floating Point precision formats. All training occurred on CoreWeave cloud GPU instances.
Regarding performance, the NVIDIA H100 GPU achieved anywhere from 2.2x to 3.3x speedup. However, an interesting finding emerges when comparing the cost of running these GPUs in the cloud. CoreWeave prices the H100 SXM GPUs at $4.76/hr/GPU, while the A100 80 GB SXM gets $2.21/hr/GPU pricing. While the H100 is 2.2x more expensive, the performance makes it up, resulting in less time to train a model and a lower price for the training process. This inherently makes H100 more attractive for researchers and companies wanting to train Large Language Models (LLMs) and makes choosing the newer GPU more viable, despite the increased cost. Below, you can see tables of comparison between two GPUs in training time, speedup, and cost of training.
Source:
MosaicML
Regarding performance, the NVIDIA H100 GPU achieved anywhere from 2.2x to 3.3x speedup. However, an interesting finding emerges when comparing the cost of running these GPUs in the cloud. CoreWeave prices the H100 SXM GPUs at $4.76/hr/GPU, while the A100 80 GB SXM gets $2.21/hr/GPU pricing. While the H100 is 2.2x more expensive, the performance makes it up, resulting in less time to train a model and a lower price for the training process. This inherently makes H100 more attractive for researchers and companies wanting to train Large Language Models (LLMs) and makes choosing the newer GPU more viable, despite the increased cost. Below, you can see tables of comparison between two GPUs in training time, speedup, and cost of training.
2 Comments on NVIDIA H100 Compared to A100 for Training GPT Large Language Models
Having your own little data center is expensive.
Also forgot to mention the real estate.