Thursday, July 6th 2023
Two-ExaFLOP El Capitan Supercomputer Starts Installation Process with AMD Instinct MI300A
When Lawrence Livermore National Laboratory (LLNL) announced the creation of a two-ExaFLOP supercomputer named El Capitan, we heard that AMD would power it with its Instinct MI300 accelerator. Today, LLNL published a tweet that states, "We've begun receiving & installing components for El Capitan, @NNSANews' first #exascale #supercomputer. While we're still a ways from deploying it for national security purposes in 2024, it's exciting to see years of work becoming reality." As the published images show, HPE racks filled with AMD Instinct MI300 accelerators are now arriving at LLNL's facility, and the supercomputer is expected to become operational in 2024. This likely means that the November 2023 TOP500 list update won't feature El Capitan, as full system enablement would be very hard to achieve in the four months until then.
The El Capitan supercomputer is expected to run on the AMD Instinct MI300A accelerator, which combines 24 Zen 4 CPU cores, the CDNA 3 GPU architecture, and 128 GB of HBM3 memory on a single package. Four of these accelerators go inside each HPE node, which also receives water-cooling treatment. While we don't have many further details on the memory and storage configuration of El Capitan, we know that the system will exceed two ExaFLOPS at peak and will consume close to 40 MW of power.
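Taken at face value, the headline figures above imply a power efficiency of roughly 50 GFLOPS per watt. A quick back-of-the-envelope check, using only the peak and power numbers quoted in the article (node count and real-world sustained efficiency are not given):

```python
# Sanity check: implied efficiency from the figures quoted above.
peak_flops = 2e18    # 2 ExaFLOPS peak, per the article
power_watts = 40e6   # ~40 MW, per the article

gflops_per_watt = peak_flops / power_watts / 1e9
print(f"{gflops_per_watt:.0f} GFLOPS/W")  # -> 50 GFLOPS/W
```

Actual measured efficiency (e.g., on the HPL benchmark) would be lower, since sustained throughput never reaches theoretical peak.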
Source:
LLNL (Twitter)
29 Comments on Two-ExaFLOP El Capitan Supercomputer Starts Installation Process with AMD Instinct MI300A
Clean clothes, worn knees from... gasp, kneeling to get to the bottom servers.
What an odd take...
This is aiming for the same >2 ExaFLOPS that Aurora is targeting, but at 40 MW instead of 70 MW.
Curious to see how far off both systems will be. The Slingshot interconnect doesn't seem to scale as well as expected (Frontier came in a bit lower than expected), but it's also groundbreaking, and factors of scale not previously encountered are sure to pop up.
They won't and can't just pop in the MI300X for the reasons stated; these are purpose-built for El Capitan, and the on-die CPU is supposed to help with scaling. The 128 GB vs. 192 GB doesn't matter when you scale to this node count... keeping scaling as linear as possible does.
The MI250X is showing 70-80% of A100 performance in AI, and absolutely obliterates it in FP64/traditional HPC work, so the claimed 8x AI improvement the MI300A is bringing should make it very competitive against the H100.
AMD's data center event was clearly too technical for investors to grasp; the 55B-parameter model on one GPU was absolutely insane.
We all know that this Instinct MI300A is superior to Nvidia.
We will have to see what functions the great CAPTAIN will do
What's also interesting is that the frame looks like a ... socket! Strange but apparently AMD is planning to also release socketed variants of the chip, or else they wouldn't have made this illustration.
www.opencompute.org/documents/ocp-accelerator-module-design-specification-v1p5-final-20220223-docx-1-pdf
See page 10.
Edit: Nevermind, it looks more like an SP6 socket.
So, lots of simulations for nuclear and other weapons, their impact and development, but also environmental disasters, etc. Such simulations and calculations need a lot of horsepower, both CPU and GPU... the MI300 is a perfect tool for this job.

That depends. We do not know the exact structure of the system. It might be APUs only, as they do not run LLMs but complex simulations with hundreds of variables, so they need both CPU and GPU power.

No air cooling. Too loud, too dusty.

Nvidia has a few too.
www.eenewseurope.com/en/nvidia-launches-first-commercial-exascale-supercomputer/
Weather modeling.
Nuclear research (shhhhhhhh, that one's on the hush-hush, except everyone knows the Department of Energy is the USA's nuke experts. And given that a lot of these supercomputers are top-secret, we can only assume what's going on...)
Like, what do a bunch of nuclear scientists want with a top-secret supercomputer that they aren't allowed to tell us the details of? Hmmm, I wonder... Fortunately, these strategic supercomputers have plenty of downtime from their main mission, so the rest of the scientific community can run on their spare cycles. I've heard of obscure mathematical theories being tested on these supercomputers, Ph.D. theses being written on data discovered on them, etc. So it's still to the benefit of the general USA's scientific community (at least when it's not doing whatever nuclear research is going on...)
In the same manner, it is impossible for 256 or 1024 Grace superchips to be anywhere near an ExaFLOP, as they are ~67 TFLOPS FP32 a pop, which is how supercomputers are measured.
They could at most hit... 67 PetaFLOPS with that announced but undeployed Euro cluster.
If we apply Nvidia's metrics to AMD's MI300A-based El Capitan, it should measure >64 ExaFLOPS, but Nvidia isn't specifying their metric (whether that's FP16, bfloat16, FP8, or even INT8/INT4), since they say "exascale" rather than ExaFLOP...
My numbers are based on scaling FP32 performance from the MI250X's 220 CUs to the MI300A's 228 CUs, plus AMD's claimed 8x AI improvement, but like Nvidia's claim, we don't know what precision that is in.
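The aggregate-throughput arithmetic in the comments above can be sketched as follows. The 67 TFLOPS FP32 per Grace superchip figure is the commenter's own; treat it as an assumption rather than an official spec:

```python
# Aggregate peak throughput for a cluster of identical chips,
# using the commenter's assumed ~67 TFLOPS FP32 per Grace superchip.
tflops_per_chip = 67  # commenter's figure, not an official spec

for chips in (256, 1024):
    total_pflops = chips * tflops_per_chip / 1000  # TFLOPS -> PFLOPS
    print(f"{chips} chips -> {total_pflops:.1f} PFLOPS peak")
# 256 chips  -> ~17.2 PFLOPS
# 1024 chips -> ~68.6 PFLOPS
```

Even at 1024 chips this lands around 0.07 ExaFLOPS, three orders of magnitude below one ExaFLOP in FP64/FP32 terms, which is the point the commenter is making about "exascale" marketing versus measured FLOPS.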