Monday, May 13th 2024
Intel-powered Aurora Supercomputer Ranks Fastest for AI
At ISC High Performance 2024, Intel announced in collaboration with Argonne National Laboratory and Hewlett Packard Enterprise (HPE) that the Aurora supercomputer has broken the exascale barrier at 1.012 exaflops and is the fastest AI system in the world dedicated to AI for open science, achieving 10.6 AI exaflops. Intel will also detail the crucial role of open ecosystems in driving AI-accelerated high-performance computing (HPC). "The Aurora supercomputer surpassing exascale will allow it to pave the road to tomorrow's discoveries. From understanding climate patterns to unraveling the mysteries of the universe, supercomputers serve as a compass guiding us toward solving truly difficult scientific challenges that may improve humanity," said Ogi Brkic, Intel vice president and general manager of Data Center AI Solutions.
Designed as an AI-centric system from its inception, Aurora will allow researchers to harness generative AI models to accelerate scientific discovery. Significant progress has been made in Argonne's early AI-driven research. Success stories include mapping the human brain's 80 billion neurons, high-energy particle physics enhanced by deep learning, and drug design and discovery accelerated by machine learning, among others. The Aurora supercomputer is an expansive system with 166 racks, 10,624 compute blades, 21,248 Intel Xeon CPU Max Series processors, and 63,744 Intel Data Center GPU Max Series units, making it one of the world's largest GPU clusters. Aurora also includes the largest open, Ethernet-based supercomputing interconnect on a single system, with 84,992 HPE Slingshot fabric endpoints. The Aurora supercomputer came in second on the High-Performance LINPACK (HPL) benchmark but broke the exascale barrier at 1.012 exaflops utilizing 9,234 nodes, only 87% of the system. It also secured the third spot on the High-Performance Conjugate Gradient (HPCG) benchmark at 5,612 teraflops (TF/s) with 39% of the machine. HPCG aims to assess more realistic scenarios, providing insights into communication and memory access patterns, which are important factors in real-world HPC applications. It complements benchmarks like LINPACK by offering a more comprehensive view of a system's capabilities.
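The system figures above can be cross-checked with some simple arithmetic. The Python sketch below is illustrative only: the variable and function names are made up for this example, and it assumes that one "node" in the HPL result corresponds to one compute blade, which the article does not state explicitly.

```python
# Figures quoted in the article.
RACKS = 166
BLADES = 10_624
CPUS = 21_248
GPUS = 63_744
ENDPOINTS = 84_992
HPL_NODES = 9_234  # nodes used for the HPL run


def per_unit(total: int, units: int) -> int:
    """Return total / units, insisting that the division is exact."""
    quotient, remainder = divmod(total, units)
    if remainder:
        raise ValueError(f"{total} is not a multiple of {units}")
    return quotient


blades_per_rack = per_unit(BLADES, RACKS)
cpus_per_blade = per_unit(CPUS, BLADES)
gpus_per_blade = per_unit(GPUS, BLADES)
endpoints_per_blade = per_unit(ENDPOINTS, BLADES)

# Assumption: one HPL "node" is one compute blade.
hpl_fraction = HPL_NODES / BLADES

print(f"blades per rack:     {blades_per_rack}")
print(f"CPUs per blade:      {cpus_per_blade}")
print(f"GPUs per blade:      {gpus_per_blade}")
print(f"endpoints per blade: {endpoints_per_blade}")
print(f"HPL run used {hpl_fraction:.1%} of the blades")
```

Under that assumption, the quoted totals divide evenly into 64 blades per rack, with two CPUs, six GPUs, and eight fabric endpoints per blade, and the 9,234-node HPL run works out to roughly 87% of the system, matching the figure in the text.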
At the heart of the Aurora supercomputer is the Intel Data Center GPU Max Series. The Intel Xe GPU architecture is foundational to the Max Series, featuring specialized hardware like matrix and vector compute blocks optimized for both AI and HPC tasks. The compute performance delivered by the Xe architecture's design is the reason the Aurora supercomputer secured the top spot in the High-Performance LINPACK mixed-precision (HPL-MxP) benchmark, which best highlights the importance of AI workloads in HPC.
The Xe architecture's parallel processing capabilities excel in managing the intricate matrix-vector operations inherent in neural network AI computation. Its compute cores are pivotal in accelerating the matrix operations crucial for deep learning models. Complemented by Intel's suite of software tools, including the Intel oneAPI DPC++/C++ Compiler, a rich set of performance libraries, and optimized AI frameworks and tools, the Xe architecture fosters an open ecosystem for developers that is characterized by flexibility and scalability across various devices and form factors.
In his special session at ISC 2024, on Tuesday, May 14 at 6:45 p.m. (GMT+2) in Hall 4, Congress Center Hamburg, Germany, CEO Andrew Richards of Codeplay, an Intel company, will address the growing demand for accelerated computing and software in HPC and AI. He will highlight the importance of oneAPI, which offers a unified programming model across diverse architectures. Built on open standards, oneAPI empowers developers to craft code that runs on different hardware platforms without extensive modifications or vendor lock-in. This is also the goal of the Linux Foundation's Unified Acceleration Foundation (UXL), in which Arm, Google, Intel, Qualcomm and others are developing an open ecosystem for all accelerators and unified heterogeneous compute on open standards to break proprietary lock-in. The UXL Foundation is adding more members to its growing coalition.
Meanwhile, Intel Tiber Developer Cloud is expanding its compute capacity with new state-of-the-art hardware platforms and new service capabilities allowing enterprises and developers to evaluate the latest Intel architecture, to innovate and optimize AI models and workloads quickly, and then to deploy AI models at scale. New hardware includes previews of Intel Xeon 6 E-core and P-core systems for select customers, and large-scale Intel Gaudi 2-based and Intel Data Center GPU Max Series-based clusters. New capabilities include Intel Kubernetes Service for cloud-native AI training and inference workloads and multiuser accounts.
New supercomputers being deployed with Intel Xeon CPU Max Series and Intel Data Center GPU Max Series technologies underscore Intel's goal to advance HPC and AI. Systems include the Euro-Mediterranean Centre on Climate Change's (CMCC) Cassandra, to accelerate climate change modeling; the Italian National Agency for New Technologies, Energy and Sustainable Economic Development's (ENEA) CRESCO 8, to enable breakthroughs in fusion energy; the Texas Advanced Computing Center (TACC) system, which is in full production and enables work ranging from data analysis in biology to supersonic turbulence flows and atomistic simulations on a wide range of materials; and the United Kingdom Atomic Energy Authority (UKAEA) system, to solve memory-bound problems that underpin the design of future fusion power plants.
The result from the mixed-precision AI benchmark will be foundational for Intel's next-generation GPU for AI and HPC, code-named Falcon Shores. Falcon Shores will leverage the next-generation Intel Xe architecture with the best of Intel Gaudi. This integration enables a unified programming interface.
Early performance results on Intel Xeon 6 with P-cores and Multiplexer Combined Ranks (MCR) memory at 8800 megatransfers per second (MT/s) deliver up to 2.3x performance improvement for real-world HPC applications, like Nucleus for European Modeling of the Ocean (NEMO), when compared to the previous generation, setting a strong foundation as the preferred host CPU choice for HPC solutions.
35 Comments on Intel-powered Aurora Supercomputer Ranks Fastest for AI
While AMD-based Frontier (currently #1 in absolute performance) is at 52.927 GFLOPS/watt, and one of the newest NVIDIA GH200-based systems, JEDI, is at 72.733 GFLOPS/watt.
source
In the end Intel will be slower than Nvidia in AI, slower in everything else compared to AMD and a joke in efficiency compared to either AMD or Nvidia.
It’s time the general market realizes that the corporate structure at Intel is no longer able to run the company. It can only run the company into the ground.
Edit: “Designed as an AI-centric system from its inception”. Funny, nothing about AI was mentioned in the 2015 announcement.
www.intc.com/news-events/press-releases/detail/344/intel-selected-by-u-s-department-of-energy-to-deliver
Not to mention that if Intel was thinking AI in 2015, it should have been the main competitor to Nvidia today.
Edit: For co-accelerators: Intel -5- AMD -14- Nvidia -A WHOLE LOT MORE!-
Intel keeps bragging about their process node roadmap, but I don't see this paying off in real products; TSMC is handing Intel's a$$ to it. If Arrow Lake is a disappointment then they are in catastrophic trouble. I don't see how removing hyperthreading and adding more E-cores will change anything. I'd rather have 8 really powerful cores, 3D V-Cache, and less power consumption. For a server, I don't want a heterogeneous architecture, it screws things up badly. And where's AVX-512? Why can't the consumer space have more than just enough lanes to run a GPU and an M.2 NVMe drive? Stuff is stagnating here, boys.
A "supercomputer" like this is just an optimized network of individual computers.
Intel stock up as $11B Apollo deal nears completion: WSJ
We can laugh at him constantly for the next 1-2-3 years, but if his plan works, if he builds those fabs and they are competitive with or even better than TSMC's, Intel will again become a successful behemoth and, most of all, its fate wouldn't depend on x86's success/survival against ARM or whatever else comes in the future.
Your PC has a CPU and a GPU, usually with their own separate physical memory. They can't interface with each other directly; they can through software, same as in a supercomputer. Does your PC count as a cluster? A cluster is made out of computers, but a cluster is not a computer even though they can all access each other's memory, which is the point of having a cluster? Man, it's crazy to think that a couple of years ago the roles would have been completely reversed but the argument would have been the same. :roll:
In most HPC clusters, even with technologies like RDMA, there is no uniform physical address space so that's not possible. There will be an explicit translation layer from physical address space of one node to physical address space of another. It can be made to look like it's uniform with PGAS, but it's not at the hardware level.
I wrote "most" because there are some specialized designs that do have uniform physical address space across multiple nodes like IBM Power10 with PowerAXON, and that is done at the hardware level. Sure they can interface directly. That's one of the ways GPU drivers communicate with hardware - via PCI BARs (previously in up to 256MB windows, but with ReBAR this can be exceeded). This mechanism can also be used to facilitate direct communication between PCIe devices like network cards, an example of which is NVIDIA GPUDirect RDMA.
You can't run normal software on a cluster node and expect it to magically be able to utilize memory on every node of it. Why do you switch from physical to virtual in this argument? Not easily.
What is, in your opinion, the difference between this supercomputer on one hand and a couple of racks of "normal" machines with very fast networking?
I mean from a software/programming standpoint.
Clusters with proper memory sharing are rare.