Friday, February 16th 2024
NVIDIA Unveils "Eos" to Public - a Top Ten Supercomputer
Providing a peek at the architecture powering advanced AI factories, NVIDIA released a video that offers the first public look at Eos, its latest data-center-scale supercomputer. An extremely large-scale NVIDIA DGX SuperPOD, Eos is where NVIDIA developers create their AI breakthroughs using accelerated computing infrastructure and fully optimized software. Eos is built with 576 NVIDIA DGX H100 systems, NVIDIA Quantum-2 InfiniBand networking and software, providing a total of 18.4 exaflops of FP8 AI performance. Revealed in November at the Supercomputing 2023 trade show, Eos—named for the Greek goddess said to open the gates of dawn each day—reflects NVIDIA's commitment to advancing AI technology.
Eos Supercomputer Fuels Innovation
Each DGX H100 system is equipped with eight NVIDIA H100 Tensor Core GPUs, for a total of 4,608 H100 GPUs across Eos. As a result, Eos can handle the largest AI workloads, from training large language models to recommender systems, quantum simulations and more. It's a showcase of what NVIDIA's technologies can do when working at scale.

Eos is arriving at the perfect time. People are changing the world with generative AI, from drug discovery to chatbots to autonomous machines and beyond. To achieve these breakthroughs, they need more than AI expertise and development skills. They need an AI factory: a purpose-built AI engine that's always available and can help them ramp their capacity to build AI models at scale. Eos delivers.

Ranked No. 9 on the TOP500 list of the world's fastest supercomputers, Eos pushes the boundaries of AI technology and infrastructure. It combines NVIDIA's advanced accelerated computing and networking with sophisticated software offerings such as NVIDIA Base Command and NVIDIA AI Enterprise.
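As a quick sanity check on those figures, here is a back-of-the-envelope sketch; the assumption that the 18.4 exaflops number includes sparsity (as NVIDIA's published FP8 peak figures for the H100 usually do) is ours, not stated in the article:

```python
# Sanity check of the published Eos figures.
systems = 576          # DGX H100 systems in Eos
gpus_per_system = 8    # H100 GPUs per DGX H100 system
total_gpus = systems * gpus_per_system
print(total_gpus)      # 4608, matching the quoted GPU count

total_fp8_eflops = 18.4                          # quoted aggregate FP8 performance
per_gpu_pflops = total_fp8_eflops * 1000 / total_gpus
print(round(per_gpu_pflops, 2))                  # ~3.99 PFLOPS per GPU
# ~3.99 PFLOPS per GPU lines up with the H100's ~3.96 PFLOPS FP8 peak
# with sparsity, which suggests the 18.4 EF figure is a with-sparsity number.
```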
Eos's architecture is optimized for AI workloads demanding ultra-low-latency and high-throughput interconnectivity across a large cluster of accelerated computing nodes, making it an ideal solution for enterprises looking to scale their AI capabilities. Based on NVIDIA Quantum-2 InfiniBand with In-Network Computing technology, its network architecture supports data transfer speeds of up to 400 Gb/s, facilitating the rapid movement of large datasets essential for training complex AI models.
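For a rough sense of what 400 Gb/s means in practice, the sketch below computes the transfer time for a dataset over a single saturated link. The dataset size and the single-link assumption are illustrative choices of ours, not figures from the article:

```python
# Illustrative transfer-time calculation at the quoted per-link rate.
link_gbps = 400                       # NVIDIA Quantum-2 InfiniBand link rate
dataset_tb = 10                       # hypothetical training dataset, in terabytes
dataset_bits = dataset_tb * 1e12 * 8  # terabytes -> bits
seconds = dataset_bits / (link_gbps * 1e9)
print(f"{seconds:.0f} s")             # 200 s to move 10 TB over one 400 Gb/s link
```

In a real cluster, many links run in parallel across the fabric, so aggregate movement of training data is far faster than this single-link figure.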
At the heart of Eos lies the groundbreaking DGX SuperPOD architecture powered by NVIDIA's DGX H100 systems. The architecture is built to provide the AI and computing fields with tightly integrated full-stack systems capable of computing at an enormous scale. As enterprises and developers worldwide seek to harness the power of AI, Eos stands as a pivotal resource, promising to accelerate the journey towards AI-infused applications that fuel every organization.
Sources:
NVIDIA Blog, ServeTheHome
20 Comments on NVIDIA Unveils "Eos" to Public - a Top Ten Supercomputer
iirc DGX SuperPOD used AMD Rome EPYC CPUs, do they still?
interesting top 10 nonetheless
www.nvidia.com/en-us/data-center/grace-cpu/
The A100 DGX was AMD-based; rumor is AMD wouldn't give them a discount this time around, so they went with Intel, who would.
My guess is they need to try some mid-core-count Genoa parts with higher clocks, but it might be architectural and scheduler issues. And divide by 60 to get the FP64 rating.
how long does it take to build a supercomputer - from tech spec to fully commissioned and operational?
But if Nvidia decides to make Eos a sellable physical product, equal to this first Eos for the most part, then it shouldn't take more than a few months. Large companies might be interested, they would get a field-tested system with a predictable performance and a relatively short delivery time.
4,000 GPUs (MI300X) × 5.2 PFLOPS = 20.8 Exaflops FP8
4,608 MI300X = 23.96 Exaflops FP8
:cool:
The Top500 run was in May of 2022. The first design talks for an exascale supercomputer started at the beginning of the 2010s, and the primary concern at the time was whether or not an exascale computer could be built while consuming 25 MW of electricity or less. This was a constraint imposed by the US Department of Energy due to the government not wanting to spend a buttload of money on energy costs. Cost for Frontier was around 500 to 600 million USD. The cost of the actual Exascale Computing Project (updating many large software and application products to use CPU/GPUs at these large scales) is 1.8 billion USD.
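For context on that 25 MW constraint, a one-line calculation shows the machine-level efficiency it implies (our arithmetic, not from the presentation):

```python
# Sustaining 1 exaflop (1e18 FLOPS) within the DOE's 25 MW power cap
# implies a minimum energy efficiency for the whole machine.
flops = 1e18                  # one exaflop, sustained
watts = 25e6                  # 25 MW power budget
print(flops / watts / 1e9)    # 40.0 -> at least 40 GFLOPS per watt
```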
Source: Al Geist's (corporate fellow, ORNL) presentation talk at the Exascale Computing Project's 2023 Independent Project Review
Source2: I work in the project office for the ECP