
AMD-Powered Frontier Supercomputer Faces Difficulties, Can't Operate a Day without Issues

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,579 (0.97/day)
When AMD announced that it would deliver the world's fastest supercomputer, Frontier, the company also took on the massive task of providing a machine capable of one exaFLOP of sustained computing performance. While the system is finally up and running, making a machine of that size run properly is challenging. In the world of High-Performance Computing, acquiring the hardware is only one part of running an HPC center. In an interview with InsideHPC, Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), provided insight into what it is like to run the world's fastest supercomputer and what kinds of issues it is facing.

The Frontier system is powered by AMD EPYC 7A53 "Trento" 64-core 2.0 GHz CPUs and Instinct MI250X GPUs. Interconnecting everything is the HPE (Cray) Slingshot 64-port switch, which is responsible for moving data in and out of the compute blades. The recent interview points out a rather interesting finding: it is specifically the AMD Instinct MI250X GPUs and the Slingshot interconnect that are causing hardware troubles for Frontier. "It's mostly issues of scale coupled with the breadth of applications, so the issues we're encountering mostly relate to running very, very large jobs using the entire system … and getting all the hardware to work in concert to do that," says Justin Whitt. In addition to the limits of scale, "the issues span lots of different categories, the GPUs are just one. A lot of challenges are focused around those, but that's not the majority of the challenges that we're seeing," he said. "It's a pretty good spread among common culprits of parts failures that have been a big part of it. I don't think that at this point that we have a lot of concern over the AMD products. We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary."



Many applications cannot run on hardware of that size without modification, so application-specific tuning is needed. With the hardware issues attributed to the AMD GPUs, it is a bit harder to have an operational system on time. However, the Oak Ridge team is confident in its expertise and has no trouble meeting deadlines. For more information, read the InsideHPC interview.

 
Joined
Jul 15, 2020
Messages
1,021 (0.64/day)
System Name Dirt Sheep | Silent Sheep
Processor i5-2400 | 13900K (-0.02mV offset)
Motherboard Asus P8H67-M LE | Gigabyte AERO Z690-G, bios F29e Intel baseline
Cooling Scythe Katana Type 1 | Noctua NH-U12A chromax.black
Memory G-skill 2*8GB DDR3 | Corsair Vengeance 4*32GB DDR5 5200Mhz C40 @4000MHz
Video Card(s) Gigabyte 970GTX Mini | NV 1080TI FE (cap at 50%, 800mV)
Storage 2*SN850 1TB, 230S 4TB, 840EVO 128GB, WD green 2TB HDD, IronWolf 6TB, 2*HC550 18TB in RAID1
Display(s) LG 21` FHD W2261VP | Lenovo 27` 4K Qreator 27
Case Thermaltake V3 Black|Define 7 Solid, stock 3*14 fans+ 2*12 front&buttom+ out 1*8 (on expansion slot)
Audio Device(s) Beyerdynamic DT 990 (or the screen speakers when I'm too lazy)
Power Supply Enermax Pro82+ 525W | Corsair RM650x (2021)
Mouse Logitech Master 3
Keyboard Roccat Isku FX
VR HMD Nop.
Software WIN 10 | WIN 11
Benchmark Scores CB23 SC: i5-2400=641 | i9-13900k=2325-2281 MC: i5-2400=i9 13900k SC | i9-13900k=37240-35500
Sucks to be an early adopter of a multi-hundred-million-dollar product
:)
 
Joined
May 20, 2020
Messages
1,370 (0.83/day)
We'll see what kind of leadership they're offering and yeah, it's easy to spend so much money if it's not yours. Perhaps they'll compute their way to heaven. :)
 
Joined
Feb 15, 2019
Messages
1,658 (0.79/day)
System Name Personal Gaming Rig
Processor Ryzen 7800X3D
Motherboard MSI X670E Carbon
Cooling MO-RA 3 420
Memory 32GB 6000MHz
Video Card(s) RTX 4090 ICHILL FROSTBITE ULTRA
Storage 4x 2TB Nvme
Display(s) Samsung G8 OLED
Case Silverstone FT04
60 million parts...
Even a 0.001% chance of malfunction per part would mean a failure somewhere is practically 100% certain at this scale.
There is always more than one component malfunctioning at any given time of operation.
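
To put rough numbers on that, here is a minimal sketch in Python. It assumes the 0.001% figure is an independent per-part failure probability over some fixed time window, which is a simplification (real failures are correlated and depend on the part type). The function names are made up for this illustration.

```python
import math


def expected_failures(parts: int, p_fail: float) -> float:
    """Expected number of failed parts, assuming each fails with probability p_fail."""
    return parts * p_fail


def prob_at_least_one_failure(parts: int, p_fail: float) -> float:
    """P(at least one failure) = 1 - (1 - p)^n, assuming independent failures.

    log1p/expm1 keep the result accurate when p is tiny and n is huge.
    """
    return -math.expm1(parts * math.log1p(-p_fail))


if __name__ == "__main__":
    parts = 60_000_000      # part count quoted in the post
    p_fail = 0.001 / 100    # 0.001% expressed as a probability (assumed per part)
    print("expected failed parts:", expected_failures(parts, p_fail))
    print("P(at least one failure):", prob_at_least_one_failure(parts, p_fail))
```

Under those assumptions the expected count is in the hundreds, not one or two, which is why "something is always broken somewhere" is the normal state for a machine this size.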
 
Joined
Sep 17, 2014
Messages
22,431 (6.03/day)
Location
The Washing Machine
Processor 7800X3D
Motherboard MSI MAG Mortar b650m wifi
Cooling Thermalright Peerless Assassin
Memory 32GB Corsair Vengeance 30CL6000
Video Card(s) ASRock RX7900XT Phantom Gaming
Storage Lexar NM790 4TB + Samsung 850 EVO 1TB + Samsung 980 1TB + Crucial BX100 250GB
Display(s) Gigabyte G34QWC (3440x1440)
Case Lian Li A3 mATX White
Audio Device(s) Harman Kardon AVR137 + 2.1
Power Supply EVGA Supernova G2 750W
Mouse Steelseries Aerox 5
Keyboard Lenovo Thinkpad Trackpoint II
Software W11 IoT Enterprise LTSC
Benchmark Scores Over 9000
Pic says it all, too many wires!
 
Joined
Nov 11, 2016
Messages
3,403 (1.16/day)
System Name The de-ploughminator Mk-III
Processor 9800X3D
Motherboard Gigabyte X870E Aorus Master
Cooling DeepCool AK620
Memory 2x32GB G.SKill 6400MT Cas32
Video Card(s) Asus RTX4090 TUF
Storage 4TB Samsung 990 Pro
Display(s) 48" LG OLED C4
Case Corsair 5000D Air
Audio Device(s) KEF LSX II LT speakers + KEF KC62 Subwoofer
Power Supply Corsair HX850
Mouse Razor Death Adder v3
Keyboard Razor Huntsman V3 Pro TKL
Software win11
So, basically AMD FineWine
 

ARF

Joined
Jan 28, 2020
Messages
4,670 (2.65/day)
Location
Ex-usa | slava the trolls
This is HPE's fault in its Slingshot Switch.
 
Joined
Feb 23, 2019
Messages
6,061 (2.89/day)
Location
Poland
Processor Ryzen 7 5800X3D
Motherboard Gigabyte X570 Aorus Elite
Cooling Thermalright Phantom Spirit 120 SE
Memory 2x16 GB Crucial Ballistix 3600 CL16 Rev E @ 3800 CL16
Video Card(s) RTX3080 Ti FE
Storage SX8200 Pro 1 TB, Plextor M6Pro 256 GB, WD Blue 2TB
Display(s) LG 34GN850P-B
Case SilverStone Primera PM01 RGB
Audio Device(s) SoundBlaster G6 | Fidelio X2 | Sennheiser 6XX
Power Supply SeaSonic Focus Plus Gold 750W
Mouse Endgame Gear XM1R
Keyboard Wooting Two HE
Nothing burger:
We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary.
 
Joined
May 31, 2016
Messages
4,437 (1.43/day)
Location
Currently Norway
System Name Bro2
Processor Ryzen 5800X
Motherboard Gigabyte X570 Aorus Elite
Cooling Corsair h115i pro rgb
Memory 32GB G.Skill Flare X 3200 CL14 @3800Mhz CL16
Video Card(s) Powercolor 6900 XT Red Devil 1.1v@2400Mhz
Storage M.2 Samsung 970 Evo Plus 500MB/ Samsung 860 Evo 1TB
Display(s) LG 27UD69 UHD / LG 27GN950
Case Fractal Design G
Audio Device(s) Realtec 5.1
Power Supply Seasonic 750W GOLD
Mouse Logitech G402
Keyboard Logitech slim
Software Windows 10 64 bit
They will get it working properly. Normal stuff.
 
Joined
Jan 3, 2021
Messages
3,484 (2.46/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
We don't know what the consequences of parts failing are. Are the Instincts and the switches redundant, so if a few of them fail, computing continues? How many can fail at the same time? Also, are they hot swappable?
 
Joined
Jul 15, 2020
Messages
1,021 (0.64/day)
System Name Dirt Sheep | Silent Sheep
Processor i5-2400 | 13900K (-0.02mV offset)
Motherboard Asus P8H67-M LE | Gigabyte AERO Z690-G, bios F29e Intel baseline
Cooling Scythe Katana Type 1 | Noctua NH-U12A chromax.black
Memory G-skill 2*8GB DDR3 | Corsair Vengeance 4*32GB DDR5 5200Mhz C40 @4000MHz
Video Card(s) Gigabyte 970GTX Mini | NV 1080TI FE (cap at 50%, 800mV)
Storage 2*SN850 1TB, 230S 4TB, 840EVO 128GB, WD green 2TB HDD, IronWolf 6TB, 2*HC550 18TB in RAID1
Display(s) LG 21` FHD W2261VP | Lenovo 27` 4K Qreator 27
Case Thermaltake V3 Black|Define 7 Solid, stock 3*14 fans+ 2*12 front&buttom+ out 1*8 (on expansion slot)
Audio Device(s) Beyerdynamic DT 990 (or the screen speakers when I'm too lazy)
Power Supply Enermax Pro82+ 525W | Corsair RM650x (2021)
Mouse Logitech Master 3
Keyboard Roccat Isku FX
VR HMD Nop.
Software WIN 10 | WIN 11
Benchmark Scores CB23 SC: i5-2400=641 | i9-13900k=2325-2281 MC: i5-2400=i9 13900k SC | i9-13900k=37240-35500
I would kill the quick poll, nothing good will come out of it.
 
Joined
Feb 11, 2009
Messages
5,547 (0.96/day)
System Name Cyberline
Processor Intel Core i7 2600k -> 12600k
Motherboard Asus P8P67 LE Rev 3.0 -> Gigabyte Z690 Auros Elite DDR4
Cooling Tuniq Tower 120 -> Custom Watercoolingloop
Memory Corsair (4x2) 8gb 1600mhz -> Crucial (8x2) 16gb 3600mhz
Video Card(s) AMD RX480 -> RX7800XT
Storage Samsung 750 Evo 250gb SSD + WD 1tb x 2 + WD 2tb -> 2tb MVMe SSD
Display(s) Philips 32inch LPF5605H (television) -> Dell S3220DGF
Case antec 600 -> Thermaltake Tenor HTCP case
Audio Device(s) Focusrite 2i4 (USB)
Power Supply Seasonic 620watt 80+ Platinum
Mouse Elecom EX-G
Keyboard Rapoo V700
Software Windows 10 Pro 64bit
"It's a pretty good spread among common culprits of parts failures that have been a big part of it. I don't think that at this point that we have a lot of concern over the AMD products. We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary."

What is this reasonable level-headedness?!?
I need outrage! panic! outright insanity!
 

Count von Schwalbe

Moderator
Staff member
Joined
Nov 15, 2021
Messages
3,065 (2.78/day)
Location
Knoxville, TN, USA
System Name Work Computer | Unfinished Computer
Processor Core i7-6700 | Ryzen 5 5600X
Motherboard Dell Q170 | Gigabyte Aorus Elite Wi-Fi
Cooling A fan? | Truly Custom Loop
Memory 4x4GB Crucial 2133 C17 | 4x8GB Corsair Vengeance RGB 3600 C26
Video Card(s) Dell Radeon R7 450 | RTX 2080 Ti FE
Storage Crucial BX500 2TB | TBD
Display(s) 3x LG QHD 32" GSM5B96 | TBD
Case Dell | Heavily Modified Phanteks P400
Power Supply Dell TFX Non-standard | EVGA BQ 650W
Mouse Monster No-Name $7 Gaming Mouse| TBD
What is this reasonable level-headedness?!?
I need outrage! panic! outright insanity!
To be found in the title/headline...
 
Joined
Sep 26, 2012
Messages
871 (0.20/day)
Location
Australia
System Name ATHENA
Processor AMD 7950X
Motherboard ASUS Crosshair X670E Extreme
Cooling ASUS ROG Ryujin III 360, 13 x Lian Li P28
Memory 2x32GB Trident Z RGB 6000Mhz CL30
Video Card(s) ASUS 4090 STRIX
Storage 3 x Kingston Fury 4TB, 4 x Samsung 870 QVO
Display(s) Acer X38S, Wacom Cintiq Pro 15
Case Lian Li O11 Dynamic EVO
Audio Device(s) Topping DX9, Fluid FPX7 Fader Pro, Beyerdynamic T1 G2, Beyerdynamic MMX300
Power Supply Seasonic PRIME TX-1600
Mouse Xtrfy MZ1 - Zy' Rail, Logitech MX Vertical, Logitech MX Master 3
Keyboard Logitech G915 TKL
VR HMD Oculus Quest 2
Software Windows 11 + Universal Blue
Put your hand up if you've ever been involved with HPC clusters?

The reality is these things always take time to bed in, even if you are buying a cluster based off a pre-existing solution. In no way surprised they are having issues with the interconnects, it's ALWAYS the fucking interconnects, lol.
 
Joined
Sep 14, 2020
Messages
568 (0.37/day)
Location
Greece
System Name Office / HP Prodesk 490 G3 MT (ex-office)
Processor Intel 13700 (90° limit) / Intel i7-6700
Motherboard Asus TUF Gaming H770 Pro / HP 805F H170
Cooling Noctua NH-U14S / Stock
Memory G. Skill Trident XMP 2x16gb DDR5 6400MHz cl32 / Samsung 2x8gb 2133MHz DDR4
Video Card(s) Asus RTX 3060 Ti Dual OC GDDR6X / Zotac GTX 1650 GDDR6 OC
Storage Samsung 2tb 980 PRO MZ / Samsung SSD 1TB 860 EVO + WD blue HDD 1TB (WD10EZEX)
Display(s) Eizo FlexScan EV2455 - 1920x1200 / Panasonic TX-32LS490E 32'' LED 1920x1080
Case Nanoxia Deep Silence 8 Pro / HP microtower
Audio Device(s) On board
Power Supply Seasonic Prime PX750 / OEM 300W bronze
Mouse MS cheap wired / Logitech cheap wired m90
Keyboard MS cheap wired / HP cheap wired
Software W11 / W7 Pro ->10 Pro
I know it's wrong to project my experience with desktop PCs onto supercomputers, but I still couldn't resist and voted that Aurora "will come with fewer hiccups". Not guessing faster or slower, just smoother.

 

Leiesoldat

lazy gamer & woodworker
Supporter
Joined
Jun 29, 2021
Messages
122 (0.10/day)
System Name Arda
Processor AMD Ryzen 5800X3D
Motherboard Gigabyte X570-I AORUS Pro WiFi
Cooling Custom Loop - Aquacomputer, Optimus, EK, Bykski
Memory GSkill Trident Z RGB 32 GB (2x16) DDR4-3200
Video Card(s) Gigabyte Gaming OC RX 6800XT
Storage SK Hynix P41 1TB
Display(s) VIOTEK 3440 x 1440 144 Hz Curved
Case XTIA Proto-XL
Audio Device(s) Schiit Modius + Schiit Jotunheim
Power Supply Seasonic Prime 850W Titanium
Mouse Xtrfy MZ1 Zy's Rail Wireless
Keyboard Rainkeebs Yasui - Custom 40% Ortholinear
Software Windows 11 Pro
I know it's wrong to project my experience with desktop PCs onto supercomputers, but I still couldn't resist and voted that Aurora "will come with fewer hiccups". Not guessing faster or slower, just smoother.


Hahaha that's a laugh. Aurora is still delayed and as far as most of the teams can tell Intel has delivered hardly any of the cabinets to Argonne National Laboratory. Aurora is anything but smooth. As far as I can tell, Intel is more worried about their new fab in Ohio than delivering Aurora at all. Plus Aurora was never meant to be an exascale machine, but rather a bridge between Summit and Frontier.

Camm has it right that scale and interconnect are the issues; Slingshot has been a long-running problem.
 

bug

Joined
May 22, 2015
Messages
13,755 (3.96/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
Fine wine, I guess. Give it 5 years or so and it will work :p
 
Joined
Nov 13, 2007
Messages
10,748 (1.73/day)
Location
Austin Texas
System Name stress-less
Processor 9800X3D @ 5.42GHZ
Motherboard MSI PRO B650M-A Wifi
Cooling Thermalright Phantom Spirit EVO
Memory 64GB DDR5 6400 CL30 / 2133 fclk
Video Card(s) RTX 4090 FE
Storage 2TB WD SN850, 4TB WD SN850X
Display(s) Alienware 32" 4k 240hz OLED
Case Jonsbo Z20
Audio Device(s) Yes
Power Supply Corsair SF750
Mouse DeathadderV2 X Hyperspeed
Keyboard 65% HE Keyboard
Software Windows 11
Benchmark Scores They're pretty good, nothing crazy.
Nothing burger:

This.

The issues stem from the actual software and system setup, scheduling jobs etc, not necessarily faults with the hardware.
 
Joined
Dec 26, 2020
Messages
382 (0.27/day)
System Name Incomplete thing 1.0
Processor Ryzen 2600
Motherboard B450 Aorus Elite
Cooling Gelid Phantom Black
Memory HyperX Fury RGB 3200 CL16 16GB
Video Card(s) Gigabyte 2060 Gaming OC PRO
Storage Dual 1TB 970evo
Display(s) AOC G2U 1440p 144hz, HP e232
Case CM mb511 RGB
Audio Device(s) Reloop ADM-4
Power Supply Sharkoon WPM-600
Mouse G502 Hero
Keyboard Sharkoon SGK3 Blue
Software W10 Pro
Benchmark Scores 2-5% over stock scores
AMD FineWine™ strikes again! Except this is an insanely complex, super power server, there's bound to be issues. Especially with cutting edge tech...
 
Joined
Aug 12, 2010
Messages
127 (0.02/day)
Location
Brazil
Processor Ryzen 7 7800X3D
Motherboard ASRock B650M PG Riptide
Cooling Wraith Max + 2x Noctua Redux NF-P12
Memory 2x16GB ADATA XPG Lancer Blade DDR5-6000 CL30
Video Card(s) Powercolor RX 7800 XT Fighter OC
Storage ADATA Legend 970 2TB PCIe 5.0
Display(s) Dell 32" S3222DGM - 1440P 165Hz + P2422H
Case HYTE Y40
Audio Device(s) Microsoft Xbox TLL-00008
Power Supply Cooler Master MWE 750 V2
Mouse Alienware AW320M
Keyboard Alienware AW510K
Software Windows 11 Pro
This is HPE's fault in its Slingshot Switch.

AMD is the one providing the solution to the customer. It doesn't matter whose switch is being used, AMD is taking the blame.
 

bug

Joined
May 22, 2015
Messages
13,755 (3.96/day)
Processor Intel i5-12600k
Motherboard Asus H670 TUF
Cooling Arctic Freezer 34
Memory 2x16GB DDR4 3600 G.Skill Ripjaws V
Video Card(s) EVGA GTX 1060 SC
Storage 500GB Samsung 970 EVO, 500GB Samsung 850 EVO, 1TB Crucial MX300 and 2TB Crucial MX500
Display(s) Dell U3219Q + HP ZR24w
Case Raijintek Thetis
Audio Device(s) Audioquest Dragonfly Red :D
Power Supply Seasonic 620W M12
Mouse Logitech G502 Proteus Core
Keyboard G.Skill KM780R
Software Arch Linux + Win10
AMD FineWine™ strikes again! Except this is an insanely complex, super power server, there's bound to be issues. Especially with cutting edge tech...
That depends. Issues are normal up to and including the acceptance test period. After that, it's supposed to work for the most part. There will always be bugs, but they are supposed to be few and far between. A delivered system is supposed to be usable.
 
Joined
Oct 12, 2005
Messages
707 (0.10/day)
I have built clusters, render farms and other types of supercomputers in one of my previous jobs.

Like they said, this is indeed expected. You get all kinds of issues: bad cables, bad memory, etc. If you have a 1% defect rate and you build a 1000-node system, that means about 10 nodes will have a defect.

After that the fun starts: trying to find the source of the problem and isolate it. It takes time and effort, and the larger the cluster is, the harder it can be.

Render farms are easier most of the time, since the nodes only use the network and a bad node will crash by itself. A cluster also has the interconnect, which can fail. You run code on multiple nodes and it's not always clear where it fails. Sometimes one node will crash because it received corrupted data from another node. Sometimes it's the switch, the storage, etc. There are way more parts to fail than in a regular PC, and trying to pinpoint a failure can sometimes be a real pain in the ass and take days.
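
One common way to narrow this down is to rerun a test job on half of the suspect nodes at a time. The sketch below is only an idealized illustration of that bisection idea, not a procedure described in this thread: it assumes exactly one faulty node and a test job that fails every time it includes that node, and `job_passes` is a hypothetical callback standing in for "submit a test job to these nodes and check the result".

```python
from typing import Callable, List, Sequence


def find_bad_node(nodes: Sequence[str],
                  job_passes: Callable[[Sequence[str]], bool]) -> str:
    """Bisect a node list down to the single node that makes the test job fail.

    Assumptions (idealized): exactly one node is bad, and job_passes(subset)
    returns False whenever the bad node is in the subset.
    """
    if not nodes:
        raise ValueError("need at least one node to test")
    candidates: List[str] = list(nodes)
    while len(candidates) > 1:
        mid = len(candidates) // 2
        left = candidates[:mid]
        if job_passes(left):
            # The left half is clean, so the bad node is in the right half.
            candidates = candidates[mid:]
        else:
            candidates = left
    return candidates[0]


if __name__ == "__main__":
    # Simulated 1000-node cluster with one (made-up) bad node.
    cluster = [f"node{i:04d}" for i in range(1000)]
    bad = "node0421"
    print(find_bad_node(cluster, lambda subset: bad not in subset))
```

In the ideal case this needs roughly log2(N) test runs. Real clusters break both assumptions: failures are intermittent, there may be several bad parts at once, and the culprit may be a switch or the storage, not a node, which is why it can take days.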


So to me, this article is more something to please the AMD-bashing communities than anything else. I built both AMD and Intel systems, and it was not really the CPU vendor that affected defect rates. Larger clusters required more time to settle.
 
Joined
Jan 28, 2021
Messages
851 (0.61/day)
AMD is the one providing the solution to the customer. It doesn't matter whose switch is being used, AMD is taking the blame.
Yeah, it does, because in the context of a supercomputer it's probably the most custom hardware in the system and that's all Cray.
 
Joined
Jan 5, 2006
Messages
18,584 (2.69/day)
System Name AlderLake
Processor Intel i7 12700K P-Cores @ 5Ghz
Motherboard Gigabyte Z690 Aorus Master
Cooling Noctua NH-U12A 2 fans + Thermal Grizzly Kryonaut Extreme + 5 case fans
Memory 32GB DDR5 Corsair Dominator Platinum RGB 6000MT/s CL36
Video Card(s) MSI RTX 2070 Super Gaming X Trio
Storage Samsung 980 Pro 1TB + 970 Evo 500GB + 850 Pro 512GB + 860 Evo 1TB x2
Display(s) 23.8" Dell S2417DG 165Hz G-Sync 1440p
Case Be quiet! Silent Base 600 - Window
Audio Device(s) Panasonic SA-PMX94 / Realtek onboard + B&O speaker system / Harman Kardon Go + Play / Logitech G533
Power Supply Seasonic Focus Plus Gold 750W
Mouse Logitech MX Anywhere 2 Laser wireless
Keyboard RAPOO E9270P Black 5GHz wireless
Software Windows 11
Benchmark Scores Cinebench R23 (Single Core) 1936 @ stock Cinebench R23 (Multi Core) 23006 @ stock
Did we ever hear of an Intel/NVIDIA supercomputer that had startup issues? Not that I know of...
 