Monday, October 10th 2022

AMD-Powered Frontier Supercomputer Faces Difficulties, Can't Operate a Day without Issues

When AMD announced that it would deliver the world's fastest supercomputer, Frontier, the company also took on a massive task: providing a machine capable of one exaFLOP of sustained computing performance. While the system is finally up and running, making a machine of that size run properly is challenging. In the world of High-Performance Computing (HPC), acquiring the hardware is only one part of running an HPC center. In an interview with InsideHPC, Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), provided insight into what it is like to run the world's fastest supercomputer and what kinds of issues it is facing.

The Frontier system is powered by AMD EPYC 7A53s "Trento" 64-core 2.0 GHz CPUs and Instinct MI250X GPUs. Interconnecting everything is the HPE (Cray) Slingshot 64-port switch, which is responsible for moving data in and out of the compute blades. The recent interview points out a rather interesting finding: it is precisely the AMD Instinct MI250X GPUs and the Slingshot interconnect that are causing hardware trouble for Frontier. "It's mostly issues of scale coupled with the breadth of applications, so the issues we're encountering mostly relate to running very, very large jobs using the entire system … and getting all the hardware to work in concert to do that," says Justin Whitt. In addition to the limits of scale, "The issues span lots of different categories, the GPUs are just one. A lot of challenges are focused around those, but that's not the majority of the challenges that we're seeing," he said. "It's a pretty good spread among common culprits of parts failures that have been a big part of it. I don't think that at this point that we have a lot of concern over the AMD products. We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary."
Many applications cannot run on hardware of that size as they are, so unique tuning is needed. With the hardware issues involving the AMD GPUs, it is a bit harder to have an operational system on time. However, the Oak Ridge team is confident in its expertise and has no trouble meeting deadlines. For more information, read the InsideHPC interview.
Source: InsideHPC

48 Comments on AMD-Powered Frontier Supercomputer Faces Difficulties, Can't Operate a Day without Issues

#1
Dirt Chip
Sucks to be an early adopter of a multi-hundred-million-dollar product
:)
#2
Bwaze
Those pesky thousands of vacuum tubes that constantly need replacement...
#3
pavle
We'll see what kind of leadership they're offering and yeah, it's easy to spend so much money if it's not yours. Perhaps they'll compute their way to heaven. :)
#4
Crackong
60 million parts...
Even a 0.001% chance of malfunction would mean 100% at this scale.
There is always more than one component malfunctioning at any given time of operation.
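
The arithmetic behind that claim can be sketched roughly as follows. This is only an illustration: it assumes the 0.001% figure is a per-part failure probability over some fixed time window and that parts fail independently, neither of which the comment states. The function names are made up for this sketch.

```python
import math


def expected_failures(parts: int, p_fail: float) -> float:
    """Expected number of failed parts, assuming independent failures."""
    return parts * p_fail


def prob_at_least_one_failure(parts: int, p_fail: float) -> float:
    """P(at least one part fails) = 1 - (1 - p)^n.

    Computed with log1p/expm1 so tiny probabilities do not lose precision.
    """
    return -math.expm1(parts * math.log1p(-p_fail))


parts = 60_000_000        # part count quoted in the comment
p_fail = 0.001 / 100      # 0.001% per part (assumed to be per time window)

print("expected failed parts:", expected_failures(parts, p_fail))
print("P(at least one failure):", prob_at_least_one_failure(parts, p_fail))
```

Under these assumptions the expected number of failed parts is in the hundreds, so the chance of at least one failure is effectively certain.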
#5
Vayra86
Pic says it all, too many wires!
#6
nguyen
So, basically AMD FineWine
#7
ARF
This is HPE's fault in its Slingshot Switch.
#8
Chomiq
Nothing burger:
We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary.
#9
ratirt
They will get it working properly. Normal stuff.
#10
Wirko
We don't know what the consequences of parts failing are. Are the Instincts and the switches redundant, so if a few of them fail, computing continues? How many can fail at the same time? Also, are they hot-swappable?
#11
Dirt Chip
I would kill the quick poll, nothing good will come out of it.
#12
ZoneDymo
"he said. "It's a pretty good spread among common culprits of parts failures that have been a big part of it. I don't think that at this point that we have a lot of concern over the AMD products. We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary.""

What is this reasonable level-headedness?!?
I need outrage! panic! outright insanity!
#13
Count von Schwalbe
ZoneDymo wrote: "What is this reasonable level-headedness?!? I need outrage! panic! outright insanity!"
To be found in the title/headline...
#14
Camm
Put your hand up if you've ever been involved with HPC clusters?

The reality is these things always take time to bed in, even if you are buying a cluster based off a pre-existing solution. I'm in no way surprised they are having issues with the interconnects; it's ALWAYS the fucking interconnects, lol.
#15
docnorth
I know it's wrong to project my experience with desktop PCs to supercomputers, but still couldn't resist and voted that Aurora "will come with fewer hiccups". Not guessing faster or slower, just smoother.

#16
Leiesoldat
lazy gamer & woodworker
docnorth wrote: "I know it's wrong to project my experience with desktop PCs to supercomputers, but still couldn't resist and voted that Aurora 'will come with fewer hiccups'. Not guessing faster or slower, just smoother."

Hahaha that's a laugh. Aurora is still delayed and as far as most of the teams can tell Intel has delivered hardly any of the cabinets to Argonne National Laboratory. Aurora is anything but smooth. As far as I can tell, Intel is more worried about their new fab in Ohio than delivering Aurora at all. Plus Aurora was never meant to be an exascale machine, but rather a bridge between Summit and Frontier.

Camm has it right that scale and interconnect are the issues; Slingshot has been a long running problem.
#17
bug
Fine wine, I guess. Give it 5 years or so and it will work :P
#18
phanbuey
Chomiq wrote: "Nothing burger:"
This.

The issues stem from the actual software and system setup, scheduling jobs etc, not necessarily faults with the hardware.
#19
thegnome
AMD FineWine™ strikes again! Except this is an insanely complex, super power server, there's bound to be issues. Especially with cutting edge tech...
#20
ThomasK
ARF wrote: "This is HPE's fault in its Slingshot Switch."
AMD is the one providing the solution to the customer. It doesn't matter whose switch is being used; AMD is taking the blame.
#21
bug
thegnome wrote: "AMD FineWine™ strikes again! Except this is an insanely complex, super power server, there's bound to be issues. Especially with cutting edge tech..."
That depends. Issues are normal up to and including the acceptance test period. After that, it's supposed to work for the most part. There will always be bugs, but they are supposed to be few and far between. A delivered system is supposed to be usable.
#22
Punkenjoy
I have built clusters, render farms and other types of supercomputers in one of my previous jobs.

Like they said, this is indeed expected. You have all kinds of issues: bad cables, bad memory, etc. If you have a 1% defect rate and you build a 1000-node system, that means 10 nodes will have a defect.
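
As a rough illustration of that estimate, the sketch below treats each node as independently defective with probability 1% (an assumption; real defects can be correlated, for example by a bad batch of cables). The helper name `binom_pmf` is made up for this example.

```python
import math


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k defective nodes out of n (binomial model)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


nodes = 1000
defect_rate = 0.01  # the 1% figure from the comment

expected_defective = nodes * defect_rate
p_no_defects = binom_pmf(0, nodes, defect_rate)
# Chance that the build turns out noticeably worse than the average case
p_15_or_more = 1 - sum(binom_pmf(k, nodes, defect_rate) for k in range(15))

print("expected defective nodes:", expected_defective)
print("P(no defective node at all):", p_no_defects)
print("P(15 or more defective nodes):", p_15_or_more)
```

The 10 defective nodes are an average: under this model a completely clean 1000-node build is extremely unlikely, and a noticeably worse build is not rare.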

After that the fun starts: trying to find the source of the problem and isolate it. It takes time and effort, and the larger the cluster is, the harder it can be.

Render farms are most of the time easier, since they just use the network and a node will crash by itself. A cluster also has the interconnect, which can fail. You run code on multiple nodes and it's not always clear where it fails. Sometimes one node will crash because it received corrupted data from another node. Sometimes it's the switch, the storage, etc. There are way more parts to fail than in a regular PC, and trying to pinpoint a failure can sometimes be a real pain in the ass and take days.


So to me, this article is more something to please the AMD-bashing communities than anything else. I built both AMD and Intel systems, and it was not really the CPU vendor that affected defect rates. Larger clusters required more time to settle.
#23
Operandi
ThomasK wrote: "AMD is the one providing the solution to the customer. It doesn't matter whose switch is being used; AMD is taking the blame."
Yeah, it does, because in the context of a supercomputer it's probably the most custom hardware in the system, and that's all Cray.
#24
P4-630
Did we ever hear of an Intel/Nvidia supercomputer that had startup issues?.... Not that I know...
#25
Count von Schwalbe
P4-630 wrote: "Did we ever hear of an Intel/Nvidia supercomputer that had startup issues?.... Not that I know..."
Intel/Intel (Aurora):
The Intel supercomputer, which was repeatedly delayed and reworked, is now expected to be "comfortably over 2 exaflops" in peak compute performance, thanks to Intel's new GPUs performing better than expected.
Also, the "Summit" supercomputer uses IBM CPUs and Nvidia Tesla GPUs, so ORNL would be having a fit if "Frontier" was much worse.