Monday, October 10th 2022
AMD-Powered Frontier Supercomputer Faces Difficulties, Can't Operate a Day without Issues
When AMD announced that it would deliver the world's fastest supercomputer, Frontier, the company also took on the massive task of providing a machine capable of one ExaFLOP of sustained compute performance. While the system is finally up and running, making a machine of that size run properly is challenging. In the world of High-Performance Computing, acquiring the hardware is only one part of running an HPC center. In an interview with InsideHPC, Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), provided insight into what it is like to run the world's fastest supercomputer and what kinds of issues it is facing.
The Frontier system is powered by AMD EPYC 7A53s "Trento" 64-core 2.0 GHz CPUs and Instinct MI250X GPUs. Interconnecting everything is the HPE (Cray) Slingshot 64-port switch, which is responsible for sending data in and out of compute blades. The recent interview points out a rather interesting finding: specifically, the AMD Instinct MI250X GPUs and the Slingshot interconnect are causing hardware troubles for Frontier. "It's mostly issues of scale coupled with the breadth of applications, so the issues we're encountering mostly relate to running very, very large jobs using the entire system … and getting all the hardware to work in concert to do that," says Justin Whitt. In addition to the limits of scale, "The issues span lots of different categories, the GPUs are just one. A lot of challenges are focused around those, but that's not the majority of the challenges that we're seeing," he said. "It's a pretty good spread among common culprits of parts failures that have been a big part of it. I don't think that at this point that we have a lot of concern over the AMD products. We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary."
Many applications cannot run on hardware of that size without modification, so unique tuning is needed. The hardware issues with the AMD GPUs make it a bit harder to have an operational system on time. However, the Oak Ridge team is confident in their expertise and has no trouble meeting deadlines. For more information, read the InsideHPC interview.
Source: InsideHPC
48 Comments on AMD-Powered Frontier Supercomputer Faces Difficulties, Can't Operate a Day without Issues
You can use a white pitchfork for this one instead of a red one.
"We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary."
It doesn't get any more clickbait than that.
Inside picture of Hilary Clinton.
this is just clickbait garbage. humans bore me. i guess i need to start drinking now
Exascale, Exaproblems. Well yeah, but it's also how you spot the people who lack the ability to think and leap on answers that fit an existing worldview
The reality is that this is absolutely normal.
But this is my escapism, so please don't judge me for worse ;)
Gathered in another forum's comment section.
It does explain some people's purchasing and hardware/brand preferences, if they can't get past the headlines to actually read the content.