Monday, May 13th 2024

Intel-powered Aurora Supercomputer Ranks Fastest for AI

At ISC High Performance 2024, Intel announced in collaboration with Argonne National Laboratory and Hewlett Packard Enterprise (HPE) that the Aurora supercomputer has broken the exascale barrier at 1.012 exaflops and is the fastest AI system in the world dedicated to AI for open science, achieving 10.6 AI exaflops. Intel will also detail the crucial role of open ecosystems in driving AI-accelerated high-performance computing (HPC). "The Aurora supercomputer surpassing exascale will allow it to pave the road to tomorrow's discoveries. From understanding climate patterns to unraveling the mysteries of the universe, supercomputers serve as a compass guiding us toward solving truly difficult scientific challenges that may improve humanity," said Ogi Brkic, Intel vice president and general manager of Data Center AI Solutions.

Designed as an AI-centric system from its inception, Aurora will allow researchers to harness generative AI models to accelerate scientific discovery. Significant progress has been made in Argonne's early AI-driven research. Success stories include mapping the human brain's 80 billion neurons, high-energy particle physics enhanced by deep learning, and drug design and discovery accelerated by machine learning, among others. The Aurora supercomputer is an expansive system with 166 racks, 10,624 compute blades, 21,248 Intel Xeon CPU Max Series processors, and 63,744 Intel Data Center GPU Max Series units, making it one of the world's largest GPU clusters.
Aurora also includes the largest open, Ethernet-based supercomputing interconnect on a single system, with 84,992 HPE Slingshot fabric endpoints. The Aurora supercomputer came in second on the high-performance LINPACK (HPL) benchmark but broke the exascale barrier at 1.012 exaflops utilizing 9,234 nodes, only 87% of the system. It also secured the third spot on the high-performance conjugate gradient (HPCG) benchmark at 5,612 teraflops (TF/s) with 39% of the machine. HPCG aims to assess more realistic scenarios, providing insights into communication and memory access patterns, which are important factors in real-world HPC applications. It complements benchmarks like LINPACK by offering a more comprehensive view of a system's capabilities.
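The system figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers from the article; the variable names are illustrative, and it assumes that one compute blade corresponds to one node, which the article does not state explicitly.

```python
# Figures quoted in the article.
RACKS = 166
BLADES = 10_624
CPUS = 21_248
GPUS = 63_744
ENDPOINTS = 84_992
HPL_EXAFLOPS = 1.012
HPL_NODES = 9_234

# Per-blade ratios implied by the totals.
cpus_per_blade = CPUS / BLADES
gpus_per_blade = GPUS / BLADES
endpoints_per_blade = ENDPOINTS / BLADES
blades_per_rack = BLADES / RACKS

# Share of the machine used for the HPL run (assumes 1 blade = 1 node).
hpl_fraction = HPL_NODES / BLADES

# Average HPL throughput per participating node, in teraflops.
tf_per_node = HPL_EXAFLOPS * 1e6 / HPL_NODES

print(f"CPUs per blade:      {cpus_per_blade:.0f}")
print(f"GPUs per blade:      {gpus_per_blade:.0f}")
print(f"Endpoints per blade: {endpoints_per_blade:.0f}")
print(f"Blades per rack:     {blades_per_rack:.0f}")
print(f"HPL share of system: {hpl_fraction:.1%}")
print(f"HPL TF per node:     {tf_per_node:.1f}")
```

Under that assumption, the totals are consistent with two CPUs, six GPUs, and eight fabric endpoints per blade, and the 9,234-node run matches the quoted 87% of the system.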

At the heart of the Aurora supercomputer is the Intel Data Center GPU Max Series. The Intel Xe GPU architecture is foundational to the Max Series, featuring specialized hardware like matrix and vector compute blocks optimized for both AI and HPC tasks. The Xe architecture's design, which delivers unparalleled compute performance, is the reason the Aurora supercomputer secured the top spot in the high-performance LINPACK mixed-precision (HPL-MxP) benchmark, the benchmark that best highlights the importance of AI workloads in HPC.

The Xe architecture's parallel processing capabilities excel in managing the intricate matrix-vector operations inherent in neural network AI computation. These compute cores are pivotal in accelerating matrix operations crucial for deep learning models. Complemented by Intel's suite of software tools, including Intel oneAPI DPC++/C++ Compiler, a rich set of performance libraries, and optimized AI frameworks and tools, the Xe architecture fosters an open ecosystem for developers that is characterized by flexibility and scalability across various devices and form factors.

In his special session at ISC 2024 on Tuesday, May 14 at 6:45 p.m. (GMT+2) in Hall 4, Congress Center Hamburg, Germany, Andrew Richards, CEO of Codeplay, an Intel company, will address the growing demand for accelerated computing and software in HPC and AI. He will highlight the importance of oneAPI, which offers a unified programming model across diverse architectures. Built on open standards, oneAPI empowers developers to craft code that seamlessly runs on different hardware platforms without extensive modifications or vendor lock-in. This is also the goal of the Linux Foundation's Unified Acceleration Foundation (UXL), in which Arm, Google, Intel, Qualcomm and others are developing an open ecosystem for all accelerators and unified heterogeneous compute on open standards to break proprietary lock-in. The UXL Foundation is adding more members to its growing coalition.


Meanwhile, Intel Tiber Developer Cloud is expanding its compute capacity with new state-of-the-art hardware platforms and new service capabilities allowing enterprises and developers to evaluate the latest Intel architecture, to innovate and optimize AI models and workloads quickly, and then to deploy AI models at scale. New hardware includes previews of Intel Xeon 6 E-core and P-core systems for select customers, and large-scale Intel Gaudi 2-based and Intel Data Center GPU Max Series-based clusters. New capabilities include Intel Kubernetes Service for cloud-native AI training and inference workloads and multiuser accounts.

New supercomputers being deployed with Intel Xeon CPU Max Series and Intel Data Center GPU Max Series technologies underscore Intel's goal to advance HPC and AI. Systems include Euro-Mediterranean Centre on Climate Change's (CMCC) Cassandra to accelerate climate change modeling; Italian National Agency for New Technologies, Energy and Sustainable Economic Development's (ENEA) CRESCO 8 to enable breakthroughs in fusion energy; Texas Advanced Computing Center (TACC), which is in full production to enable work ranging from data analysis in biology to supersonic turbulence flows and atomistic simulations on a wide range of materials; as well as United Kingdom Atomic Energy Authority (UKAEA) to solve memory-bound problems that underpin the design of future fusion power plants.

The result from the mixed-precision AI benchmark will be foundational for Intel's next-generation GPU for AI and HPC, code-named Falcon Shores. Falcon Shores will leverage the next-generation Intel Xe architecture with the best of Intel Gaudi. This integration enables a unified programming interface.

Early performance results on Intel Xeon 6 with P-cores and Multiplexer Combined Ranks (MCR) memory at 8800 megatransfers per second (MT/s) deliver up to 2.3x performance improvement for real-world HPC applications, like Nucleus for European Modeling of the Ocean (NEMO), when compared to the previous generation, setting a strong foundation as the preferred host CPU choice for HPC solutions.

35 Comments on Intel-powered Aurora Supercomputer Ranks Fastest for AI

#1
ncrs
Too bad they only marginally increased the power efficiency of Aurora, from 23.711 GFlops/watt in November 2023 to 26.151.
Meanwhile, AMD-based Frontier (currently #1 in absolute performance) is at 52.927 GFlops/watt, and JEDI, one of the newest NVIDIA GH200-based systems, is at 72.733 GFlops/watt.
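To put those efficiency figures side by side, here is a small sketch that computes the ratios. It uses only the numbers quoted in this comment and in the article; the implied power draw assumes the 26.151 GFlops/watt figure applies to the same 1.012 exaflops HPL run, which is an assumption rather than something stated in the thread.

```python
# Efficiency figures quoted in the comment, in GFlops per watt.
AURORA_NOV_2023 = 23.711
AURORA_NOW = 26.151
FRONTIER = 52.927
JEDI = 72.733

# HPL result quoted in the article, converted to GFlops.
AURORA_HPL_GFLOPS = 1.012e9

# Relative improvement of Aurora between the two lists.
aurora_gain = AURORA_NOW / AURORA_NOV_2023 - 1

# How many times more efficient the other systems are than Aurora.
frontier_vs_aurora = FRONTIER / AURORA_NOW
jedi_vs_aurora = JEDI / AURORA_NOW

# Implied power draw in megawatts (assumes efficiency was measured on the HPL run).
implied_mw = AURORA_HPL_GFLOPS / AURORA_NOW / 1e6

print(f"Aurora efficiency gain: {aurora_gain:.1%}")
print(f"Frontier vs Aurora:     {frontier_vs_aurora:.2f}x")
print(f"JEDI vs Aurora:         {jedi_vs_aurora:.2f}x")
print(f"Implied Aurora power:   {implied_mw:.1f} MW")
```

On these numbers Aurora's efficiency improved by roughly 10%, while Frontier delivers about twice and JEDI nearly three times the work per watt.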
#2
john_
Intel managed to score one win with over 2 times the power consumption, while Nvidia is getting ready to annihilate everything in AI benchmarks with its latest chips.
In the end Intel will be slower than Nvidia in AI, slower in everything else compared to AMD, and a joke in efficiency compared to either AMD or Nvidia.
#3
Daven
I hate to pile on, but Aurora was a joke from start to finish. A cautionary tale of how not to deploy a supercomputer. Delayed multiple times, it launched with only half its nodes back in Nov 2023. Target performance was supposed to be over 2 exaflops.

It’s time the general market realizes that the corporate structure at Intel is no longer able to run the company. It can only run the company into the ground.

Edit: “Designed as an AI-centric system from its inception”. Funny, nothing about AI was mentioned in the 2015 announcement.

www.intc.com/news-events/press-releases/detail/344/intel-selected-by-u-s-department-of-energy-to-deliver
#4
unwind-protect
If it doesn't have shared memory it is not "a computer". It is a cluster.
#5
Vya Domus
unwind-protect said: If it doesn't have shared memory it is not "a computer". It is a cluster.
Strange take, it obviously is a computer, I don't know what shared memory is even supposed to mean in this context.
#6
john_
Daven said: Edit: “Designed as an AI-centric system from its inception”. Funny, nothing about AI was mentioned in the 2015 announcement.
The only one who could say they were thinking AI in 2015 and I'd believe him is Huang. Everybody else was either sleeping or didn't have the means, financial or hardware, to set such a goal.
Not to mention that if Intel had been thinking AI in 2015, it should have been the main competitor to Nvidia today.
#8
Daven
In related supercomputer news, AMD has increased its number of systems to 157 from a low of 2 systems in June 2019, an almost 8000% increase in 5 years. Arm-based systems are at 16 now (Fujitsu and Nvidia). Intel continues to drop and has no hope of reversing its downward trajectory on the Top 500 list.

Edit: For co-accelerators: Intel -5- AMD -14- Nvidia -A WHOLE LOT MORE!-
#9
TheinsanegamerN
Daven said: I hate to pile on, but Aurora was a joke from start to finish. A cautionary tale of how not to deploy a supercomputer. Delayed multiple times, it launched with only half its nodes back in Nov 2023. Target performance was supposed to be over 2 exaflops.

It’s time the general market realizes that the corporate structure at Intel is no longer able to run the company. It can only run the company into the ground.

Edit: “Designed as an AI-centric system from its inception”. Funny, nothing about AI was mentioned in the 2015 announcement.

www.intc.com/news-events/press-releases/detail/344/intel-selected-by-u-s-department-of-energy-to-deliver
I wonder how much longer Pat will last. He's gotta be able to deliver things on time or, if they are late, blow expectations away. He's currently doing neither.
#10
Dr_b_
TheinsanegamerN said: I wonder how much longer Pat will last. He's gotta be able to deliver things on time or, if they are late, blow expectations away. He's currently doing neither.
Yeah, not really seeing anything positive out of Intel. The products, if anything, are getting worse or staying bad. There is insane product segmentation and naming. Too much power consumption. Too much heat. Not enough features.

Intel keeps bragging about their process node roadmap, but I don't see this paying off in real products; TSMC is handing Intel's a$$ to it. If Arrow Lake is a disappointment then they are in catastrophic trouble. I don't see how removing hyperthreading and adding more E-cores will change anything. I'd rather have 8 really powerful cores, 3D V-Cache, and less power consumption. For a server, I don't want a heterogeneous architecture, it screws things up badly. And where's AVX-512? Why can't the consumer space have more than just enough lanes to run a GPU and an M.2 NVMe drive? Stuff is stagnating here, boys.
#11
Steevo
Remember everyone, Intel didn’t lie, they just candy coated the kickback truth for some.
#12
Konceptz
john_ said: slower in everything else compared to AMD and a joke in efficiency compared to either AMD or Nvidia.
Other than games...what is AMD faster than Intel at?
#13
unwind-protect
Vya Domus said: Strange take, it obviously is a computer, I don't know what shared memory is even supposed to mean in this context.
All cores in the computer can reach all RAM locations, so you can use threads or processes for parallelism without having to do inter-machine communication (aka networking).

A "supercomputer" like this is just an optimized network of individual computers.
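The distinction being drawn here can be sketched in a few lines. The example below is a single-process simulation, not real cluster code: `shared_memory_sum` lets every worker thread index the same list directly, while `message_passing_sum` gives each simulated "node" a private copy of its slice and lets it report back only through an explicit message. The function names and the `queue.Queue` standing in for the network are assumptions made for illustration.

```python
import queue
import threading


def shared_memory_sum(data, n_workers=4):
    """Threads in one process: every worker indexes the same list directly."""
    partial = [0] * n_workers

    def worker(rank):
        # Direct read of the shared list; no copy and no message is needed.
        partial[rank] = sum(data[rank::n_workers])

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def message_passing_sum(data, n_workers=4):
    """Simulated cluster: each node owns a private slice and must send results."""
    link = queue.Queue()  # stand-in for the network between nodes

    def node(rank, local_data):
        # The node sees only local_data, never the full data set.
        link.put((rank, sum(local_data)))

    threads = [
        threading.Thread(target=node, args=(r, list(data[r::n_workers])))
        for r in range(n_workers)
    ]
    for t in threads:
        t.start()

    total = 0
    for _ in range(n_workers):
        _rank, value = link.get()  # explicit receive, one message per node
        total += value
    for t in threads:
        t.join()
    return total


values = list(range(100))
print(shared_memory_sum(values), message_passing_sum(values))
```

Both functions return the same total; what differs is that the second one has to distribute the data up front and collect the results explicitly, which is the extra work that real message-passing code takes on.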
#14
john_
Konceptz said: Other than games...what is AMD faster than Intel at?
You do realize that supercomputers are not intended for gaming, right?
TheinsanegamerN said: I wonder how much longer Pat will last. He's gotta be able to deliver things on time or, if they are late, blow expectations away. He's currently doing neither.
He is focused on fabs and how many billions he can get to build them.
Intel stock up as $11B Apollo deal nears completion: WSJ
We can laugh at him constantly for the next 1-2-3 years, but if his plan works, if he builds those fabs and they are competitive with or even better than TSMC's, Intel will again become a successful behemoth and, most of all, its fate wouldn't depend on x86's success/survival against ARM or whatever else comes in the future.
#15
AnarchoPrimitiv
I've got a strong feeling that Intel practically gave this hardware away....why else would anyone base a supercomputer on it when AMD and Nvidia are both objectively better options?
#16
Vya Domus
unwind-protect said: All cores in the computer can reach all RAM locations.
This actually isn't even true all of the time. Depending on the topology, some cores might not in fact have direct access to RAM and have to interface with what's effectively an on-chip network controller to talk to an available memory controller, which might not even be on the same chip. Your notion of a "computer" is very outdated and I don't think a computer was ever defined how you think it is. How the memory system works is nothing more than an implementation detail; everything that's Turing-complete is a computer, and it makes no sense to say that it's not a computer just because it's comprised of multiple nodes. In a cluster each node can in fact access memory locations from other nodes, that's necessary.

Your PC has a CPU and a GPU, usually with their own separate physical memory. They can't interface with each other directly, but they can through software, same as in a supercomputer. Does your PC count as a cluster? A cluster is made out of computers, but a cluster is not a computer even though they all can access each other's memory, which is the point of having a cluster?
Konceptz said: Other than games...what is AMD faster than Intel at?
Man, it's crazy to think that a couple of years ago the roles would have been completely reversed but the argument would have been the same. :roll:
#17
unwind-protect
Vya Domus said: This actually isn't even true all of the time. Depending on the topology, some cores might not in fact have direct access to RAM and have to interface with what's effectively an on-chip network controller to talk to an available memory controller, which might not even be on the same chip.
I actually have systems with 4 NUMA banks. The point is that virtual addresses on all cores in all CPUs are mapped to the physical RAM somewhere in the machine. So while the hardware might access a given piece of RAM through some other core's memory controller, that is transparent to the software I am running. That is no longer the case with networked "computers" such as this one.
#18
ncrs
Vya Domus said: This actually isn't even true all of the time. Depending on the topology, some cores might not in fact have direct access to RAM and have to interface with what's effectively an on-chip network controller to talk to an available memory controller, which might not even be on the same chip. Your notion of a "computer" is very outdated and I don't think a computer was ever defined how you think it is. How the memory system works is nothing more than an implementation detail; everything that's Turing-complete is a computer, and it makes no sense to say that it's not a computer just because it's comprised of multiple nodes.
Even in NUMA designs the physical address space is still uniform. What I mean by that is that a core in one module/chiplet/socket can access every memory location by physical address, regardless of how it's achieved and how long it takes.
In most HPC clusters, even with technologies like RDMA, there is no uniform physical address space so that's not possible. There will be an explicit translation layer from physical address space of one node to physical address space of another. It can be made to look like it's uniform with PGAS, but it's not at the hardware level.
I wrote "most" because there are some specialized designs that do have uniform physical address space across multiple nodes like IBM Power10 with PowerAXON, and that is done at the hardware level.
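The "explicit translation layer" that a PGAS-style model hides can be illustrated with a toy example. The class below is a hypothetical sketch, not the API of any real PGAS library: it presents one global index space to the caller while every access is translated into a node rank and an offset into that node's private memory. The class name, the block distribution, and the in-process lists standing in for per-node RAM are all assumptions.

```python
class PartitionedArray:
    """Toy global array split into equal blocks, one block per simulated node."""

    def __init__(self, n_nodes, block_size):
        self.block_size = block_size
        # Each inner list stands in for the private memory of one node.
        self.node_memory = [[0] * block_size for _ in range(n_nodes)]

    def locate(self, global_index):
        """Translate a global index into (node rank, local offset)."""
        return divmod(global_index, self.block_size)

    def read(self, global_index):
        rank, offset = self.locate(global_index)
        # On a real cluster this step would be a network transfer if rank is remote.
        return self.node_memory[rank][offset]

    def write(self, global_index, value):
        rank, offset = self.locate(global_index)
        self.node_memory[rank][offset] = value


array = PartitionedArray(n_nodes=4, block_size=8)
array.write(19, 7)
print(array.locate(19), array.read(19))
```

The caller sees a flat index, yet the storage remains split per node, so the uniformity exists only in the software layer and not in the hardware address space.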
Vya Domus said: Your PC has a CPU and a GPU, usually with their own separate physical memory. They can't interface with each other directly, but they can through software, same as in a supercomputer. Does your PC count as a cluster?
Sure they can interface directly. That's one of the ways GPU drivers communicate with hardware - via PCI BARs (previously in up to 256MB windows, but with ReBAR this can be exceeded). This mechanism can also be used to facilitate direct communication between PCIe devices like network cards, an example of which is NVIDIA GPUDirect RDMA.
#19
Vya Domus
ncrs said: I wrote "most" because there are some specialized designs that do have uniform physical address space across multiple nodes like IBM Power10 with PowerAXON, and that is done at the hardware level.
I don't understand the relevance of whether they do have a uniform physical address space at the hardware level or not, they still share memory.
unwind-protect said: The point is that virtual addresses on all cores in all CPUs are mapped to the physical RAM somewhere in the machine.
I just don't see how that could ever mean something is not a computer. Also, that virtual address space can map to anything: it can be system memory, disk or memory from an entirely different machine for that matter. You can absolutely have the same virtual memory space across however many nodes you want.
#20
unwind-protect
Vya Domus said: You can absolutely have the same virtual memory space across however many nodes you want.
That requires very extensive software support that is practically not used in high performance computing. Most people use MPI, which is explicit networking and destroys the simple (for software) model.
#21
ncrs
Vya Domus said: I don't understand the relevance of whether they do have a uniform physical address space at the hardware level or not, they still share memory.
I just don't see how that could ever mean something is not a computer.
Only systems with uniform physical address spaces truly share memory. Systems without that are clusters of computers at most, and not singular computers.
You can't run normal software on a cluster node and expect it to magically be able to utilize memory on every node of it.
Vya Domus said: Also, that virtual address space can map to anything, [...]
Why do you switch from physical to virtual in this argument?
Vya Domus said: [...] or memory from an entirely different machine for that matter. You can absolutely have the same virtual memory space across however many nodes you want.
Not easily.
#22
Vya Domus
unwind-protect said: That requires very extensive software support that is practically not used
In practice this is always the case, but it applies to literally everything: you always optimize to reduce communication between parts of a system. That doesn't mean the system can't act as a whole.
ncrs said: Why do you switch from physical to virtual in this argument?
I didn't, the other guy brought it up.
ncrs said: Not easily.
What does it matter if it's easy or not in this context? The argument boils down to "this isn't a real computer because memory can't be shared" but of course it can, that's the point of having a cluster.
#23
unwind-protect
Then let me ask you this:

What is, in your opinion, the difference between this supercomputer on one hand and a couple of racks of "normal" machines with very fast networking?

I mean from a software/programming standpoint.
#24
ncrs
Vya Domus said: What does it matter if it's easy or not in this context?
It matters because it's not being done routinely. I know of only one modern production implementation and that's IBM Power10. Software-based emulation has heavy drawbacks, and would require extreme network performance in both bandwidth and latency.
Vya Domus said: The argument boils down to "this isn't a real computer because memory can't be shared" but of course it can, that's the point of having a cluster.
That was never the argument, please re-read the first post you replied to. It was about being a singular computer vs. a cluster of computers. Programming for a cluster is way harder than a singular computer, just like MT programming is more difficult than ST.
Clusters with proper memory sharing are rare.
#25
unwind-protect
ncrs said: It matters because it's not being done routinely. I know of only one modern production implementation and that's IBM Power10. Software-based emulation has heavy drawbacks, and would require extreme network performance in both bandwidth and latency.
It isn't just that. The bigger problem with virtual memory pages that are actually remote is that the local software can't tell which pages are fast and which ones are orders of magnitude slower. That is another reason why such schemes are unpopular. People would rather use MPI when they can't have a single computer than MacGyver something together with segfault handlers or userfaultfd.