Thursday, June 22nd 2023

Intel & HPE Declare Aurora Supercomputer Blade Installation Complete

What's New: The Aurora supercomputer at Argonne National Laboratory is now fully equipped with all 10,624 compute blades, boasting 63,744 Intel Data Center GPU Max Series and 21,248 Intel Xeon CPU Max Series processors.

"Aurora is the first deployment of Intel's Max Series GPU, the biggest Xeon Max CPU-based system, and the largest GPU cluster in the world. We're proud to be part of this historic system and excited for the groundbreaking AI, science and engineering Aurora will enable," said Jeff McVeigh, Intel corporate vice president and general manager of the Super Compute Group.

What Aurora Is: A collaboration of Intel, Hewlett Packard Enterprise (HPE) and the Department of Energy (DOE), the Aurora supercomputer is designed to unlock the potential of the three pillars of high performance computing (HPC): simulations, data analytics and artificial intelligence (AI) on an extremely large scale. The system incorporates more than 1,024 storage nodes (using DAOS, Intel's distributed asynchronous object storage), providing 220 terabytes (TB) of capacity at 31TBs of total bandwidth, and leverages the HPE Slingshot high-performance fabric. Later this year, Aurora is expected to be the world's first supercomputer to achieve a theoretical peak performance of more than 2 exaflops (an exaflop is 10^18, or a billion billion, operations per second) when it enters the TOP500 list.
Aurora will harness the full power of the Intel Max Series GPU and CPU product family. Designed to meet the demands of dynamic and emerging HPC and AI workloads, early results with the Max Series GPUs demonstrate leading performance on real-world science and engineering workloads, showcasing up to 2 times the performance of AMD MI250X GPUs on OpenMC, and near linear scaling up to hundreds of nodes. The Intel Xeon Max Series CPU drives a 40% performance advantage over the competition in many real-world HPC workloads, such as earth systems modeling, energy and manufacturing.

Why It Matters: From tackling climate change to finding cures for deadly diseases, researchers face monumental challenges that demand advanced computing technologies at scale. Aurora is poised to address the needs of the HPC and AI communities, providing the necessary tools to push the boundaries of scientific exploration. "While we work toward acceptance testing, we're going to be using Aurora to train some large-scale open source generative AI models for science," said Rick Stevens, Argonne National Laboratory associate laboratory director. "Aurora, with over 60,000 Intel Max GPUs, a very fast I/O system, and an all-solid-state mass storage system, is the perfect environment to train these models."

How It Works: At the heart of this state-of-the-art system are Aurora's sleek rectangular blades, housing processors, memory, networking and cooling technologies. Each blade consists of two Intel Xeon Max Series CPUs and six Intel Max Series GPUs. The Xeon Max Series product family is already demonstrating great early performance on Sunspot, the test bed and development system with the same architecture as Aurora. Developers are utilizing oneAPI and AI tools to accelerate HPC and AI workloads and enhance code portability across multiple architectures.


The installation of these blades has been a delicate operation, with each 70-pound blade requiring specialized machinery to be vertically integrated into Aurora's refrigerator-sized racks. The system's 166 racks accommodate 64 blades each and span eight rows, occupying a space equivalent to two professional basketball courts in the Argonne Leadership Computing Facility (ALCF) data center.
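
The rack, blade and processor counts quoted above can be cross-checked with simple arithmetic. The sketch below is only a consistency check on the article's own figures; the variable names are illustrative and not taken from any Intel or ALCF tooling.

```python
# Figures quoted in the article (names are illustrative)
RACKS = 166
BLADES_PER_RACK = 64
GPUS_PER_BLADE = 6
CPUS_PER_BLADE = 2

# Derive the system-wide totals from the per-rack and per-blade figures
blades = RACKS * BLADES_PER_RACK
gpus = blades * GPUS_PER_BLADE
cpus = blades * CPUS_PER_BLADE

print(f"blades: {blades:,}")
print(f"GPUs:   {gpus:,}")
print(f"CPUs:   {cpus:,}")
```

The derived totals match the 10,624 blades, 63,744 GPUs and 21,248 CPUs stated at the top of the article.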

Researchers from the ALCF's Aurora Early Science Program (ESP) and DOE's Exascale Computing Project will migrate their work from the Sunspot test bed to the fully installed Aurora. This transition will allow them to scale their applications on the full system. Early users will stress test the supercomputer and identify potential bugs that need to be resolved before deployment. This includes efforts to develop generative AI models for science, recently announced at the ISC'23 conference.
Source: Intel News Events / PR

14 Comments on Intel & HPE Declare Aurora Supercomputer Blade Installation Complete

#2
PerfectWave
hope it will not catch fire because of the heat LUL!
#3
AnarchoPrimitiv
I'm willing to bet that Intel either sold the hardware at cost or even cheaper....can you think of ANY other reason why someone would go with an all Intel Supercomputer? I'm seriously asking...
#4
TumbleGeorge
AnarchoPrimitiv said: I'm willing to bet that Intel either sold the hardware at cost or even cheaper
Because nobody here knows the true BOM numbers.
#5
phraide
providing 220 terabytes (TB) of capacity
not impressive :D 220PB or 220TB per storage node maybe?
#6
Leiesoldat
lazy gamer & woodworker
AnarchoPrimitiv said: I'm willing to bet that Intel either sold the hardware at cost or even cheaper....can you think of ANY other reason why someone would go with an all Intel Supercomputer? I'm seriously asking...
This was a stipulation set by the Department of Energy that the multiple supercomputers could not all be from the same vendor. This is also just the delivery of the computer cabinets themselves and not the actual acceptance testing.
#7
Wirko
A bunch of neatly arranged boxes with neatly arranged piping ... that's fine, but it doesn't look all that impressive. Now show us the cooling system, Intel! With a few humans for scale.
#8
Patriot
AnarchoPrimitiv said: I'm willing to bet that Intel either sold the hardware at cost or even cheaper....can you think of ANY other reason why someone would go with an all Intel Supercomputer? I'm seriously asking...
The last time they changed the spec, Intel took a write-off of 300M that quarter.
So yes, probably not making money on it.

AMD_Stock/comments/oq0odw
Congratulations! 2 Exaflops! It just took ten years.
Daven
www.energy.gov/sites/default/files/2013/09/f2/20130913-SEAB-DOE-Exascale-Initiative.pdf

AMD and HPE did it in three.

www.hpe.com/us/en/newsroom/press-release/2020/03/hpe-and-amd-power-complex-scientific-discovery-in-worlds-fastest-supercomputer-for-us-department-of-energys-doe-national-nuclear-security-administration-nnsa.html
It technically hasn't been benchmarked yet.
And El Capitan isn't finished being deployed yet.
#9
Solaris17
Super Dainty Moderator
phraide said: not impressive :D 220PB or 220TB per storage node maybe?
The compute side and storage side are different. The storage side will grow and expand as research requirements need it; the compute side (and its configuration) is the big spend.
TumbleGeorge said: Because nobody here knows the true BOM numbers.
For this? No. Probably not. There are plenty of real engineers on the forums, though, who deal with this kind of thing every day. You have to speak to your audience though. Higher compute or tech in general is easier to make a troll comment on than actually discuss. It's hardly worth the effort since most users want higher Fortnite frame rates instead of actually learning.
#10
phraide
Solaris17 said: The compute side and storage side are different. The storage side will grow and expand as research requirements need it; the compute side (and its configuration) is the big spend.
www.alcf.anl.gov/aurora : storage specs "230 PB, 31 TB/s, 1024 Nodes (DAOS)"
It could not be only 220TB as the article says (or the way I read and understand the article's sentence).
#11
Wirko
Solaris17 said: The compute side and storage side are different. The storage side will grow and expand as research requirements need it
That's hot, fast, write-intensive storage (according to some older presentation, it also contains some Optane). It's physically close to compute nodes, that's why it's decentralised into 1024 nodes. It's probably not destined to grow but can be complemented by colder, larger(?), less exciting and expandable storage, possibly spinning rust.
#12
Solaris17
Super Dainty Moderator
Wirko said: That's hot, fast, write-intensive storage (according to some older presentation, it also contains some Optane). It's physically close to compute nodes, that's why it's decentralised into 1024 nodes. It's probably not destined to grow but can be complemented by colder, larger(?), less exciting and expandable storage, possibly spinning rust.
Most of the time this is InfiniBand to NVMe, which then bleeds off to a larger array of SSD-cached spinning rust.
#13
Wirko
phraide said: www.alcf.anl.gov/aurora : storage specs "230 PB, 31 TB/s, 1024 Nodes (DAOS)"
It could not be only 220TB as the article says (or the way I read and understand the article's sentence).
Well, someone at Intel didn't properly understand what they are selling. The 220 TB figure can be found at multiple web sites that didn't care to check Intel's press release, along with the "TBs" unit.

Also, total storage capacity divided by total speed amounts to two hours. If the capacity is fully used for input data and/or output data, the system spends at least two hours of precious supercomputer time transferring data to storage before processing, or from storage after processing, or both.
#14
TumbleGeorge
phraide said: www.alcf.anl.gov/aurora : storage specs "230 PB, 31 TB/s, 1024 Nodes (DAOS)"
It could not be only 220TB as the article says (or the way I read and understand the article's sentence).
220TB per node.
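
The storage figures debated in the comments can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not an official calculation: it takes the "230 PB, 31 TB/s, 1024 Nodes" specs quoted from the ALCF page, treats the "220TB per node" reading as a hypothesis from the thread, and assumes decimal units (1 PB = 1,000 TB).

```python
# Specs quoted in the thread from the ALCF page
CAPACITY_PB = 230
BANDWIDTH_TB_PER_S = 31
NODES = 1024

# Hypothesis raised in the comments, not a confirmed figure
PER_NODE_TB = 220

# Assumption: decimal units, 1 PB = 1,000 TB
TB_PER_PB = 1000

# Time to read or write the full capacity once at full bandwidth
drain_seconds = CAPACITY_PB * TB_PER_PB / BANDWIDTH_TB_PER_S
drain_hours = drain_seconds / 3600

# Total capacity implied by 220 TB on each of the 1,024 nodes
implied_total_pb = PER_NODE_TB * NODES / TB_PER_PB

print(f"full-capacity transfer time: {drain_hours:.2f} hours")
print(f"220 TB x {NODES} nodes = {implied_total_pb:.2f} PB")
```

Under these assumptions the full-capacity transfer takes a little over two hours, in line with the estimate in comment #13, and the per-node reading gives roughly 225 PB, which is close to the 230 PB listed by ALCF.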