AMD-Powered Frontier Supercomputer Faces Difficulties, Can't Operate a Day without Issues

AleksandarK · Oct 10, 2022

When AMD announced that it would deliver the world's fastest supercomputer, Frontier, the company also took on the massive task of providing a machine capable of one exaFLOP of sustained compute performance. While the system is finally up and running, making a machine of that size run properly is challenging. In the world of High-Performance Computing, acquiring the hardware is only one part of running an HPC center. In an interview with InsideHPC, Justin Whitt, program director for the Oak Ridge Leadership Computing Facility (OLCF), provided insight into what it is like to run the world's fastest supercomputer and what kinds of issues it is facing.

The Frontier system is powered by AMD EPYC 7A53 "Trento" 64-core 2.0 GHz CPUs and Instinct MI250X GPUs. Interconnecting everything is the HPE (Cray) Slingshot 64-port switch, which is responsible for sending data in and out of the compute blades. The recent interview points out a rather interesting finding: it is precisely the AMD Instinct MI250X GPUs and the Slingshot interconnect that cause hardware troubles for Frontier. "It's mostly issues of scale coupled with the breadth of applications, so the issues we're encountering mostly relate to running very, very large jobs using the entire system … and getting all the hardware to work in concert to do that," says Justin Whitt. In addition to the limits of scale, "The issues span lots of different categories, the GPUs are just one. A lot of challenges are focused around those, but that's not the majority of the challenges that we're seeing," he said. "It's a pretty good spread among common culprits of parts failures that have been a big part of it. I don't think that at this point that we have a lot of concern over the AMD products. We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary."

Many applications cannot run on hardware of that size, so unique tuning is needed. The hardware issues with the AMD GPUs make it a bit harder to have an operational system on time. However, the Oak Ridge team is confident in its expertise and has no trouble meeting deadlines. For more information, read the InsideHPC interview.


Dirt Chip · Oct 10, 2022

Sucks to be an early adopter of a multi-hundred-million-dollar product

Bwaze · Oct 10, 2022

Those pesky thousands of vacuum tubes that constantly need replacement...

pavle · Oct 10, 2022

We'll see what kind of leadership they're offering and yeah, it's easy to spend so much money if it's not yours. Perhaps they'll compute their way to heaven.

Crackong · Oct 10, 2022

60 million parts...
Even a 0.001% chance of malfunction would mean 100% at this scale.
There is always more than one component malfunctioning at any given time of operation.
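
A rough way to check that arithmetic is sketched below. It assumes every part fails independently with the same probability, which is a simplification; the 60 million parts and 0.001% figures are the ones quoted in the post above, and the function names are made up for illustration.

```python
import math


def expected_failures(parts: int, p_fail: float) -> float:
    """Expected number of failed parts, assuming independent, identical parts."""
    return parts * p_fail


def prob_at_least_one_failure(parts: int, p_fail: float) -> float:
    """P(at least one failure) = 1 - (1 - p)^n, for 0 <= p_fail < 1."""
    # log1p/expm1 keep precision when p_fail is tiny and parts is huge.
    return -math.expm1(parts * math.log1p(-p_fail))


if __name__ == "__main__":
    parts = 60_000_000      # figure quoted in the post (not verified here)
    p_fail = 0.001 / 100    # 0.001% expressed as a probability

    print("expected failed parts:", expected_failures(parts, p_fail))
    print("P(at least one failure):", prob_at_least_one_failure(parts, p_fail))
```

Under those assumptions the expected count is in the hundreds, so "always more than one" is a fair summary. Real failure rates differ per component type and over time, so treat this as an order-of-magnitude illustration only.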

Vayra86 · Oct 10, 2022

Pic says it all, too many wires!

nguyen · Oct 10, 2022

So, basically AMD FineWine

ARF · Oct 10, 2022

This is HPE's fault in its Slingshot Switch.

Chomiq · Oct 10, 2022

Nothing burger:

We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary.

ratirt · Oct 10, 2022

They will get it working properly. Normal stuff.

Wirko · Oct 10, 2022

We don't know what the consequences of parts failing are. Are the Instincts and the switches redundant, so if a few of them fail, computing continues? How many can fail at the same time? Also, are they hot swappable?

Dirt Chip · Oct 10, 2022

I would kill the quick poll, nothing good will come out of it.

ZoneDymo · Oct 10, 2022

"he said. "It's a pretty good spread among common culprits of parts failures that have been a big part of it. I don't think that at this point that we have a lot of concern over the AMD products. We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary.""

What is this reasonable level-headedness?!?
I need outrage! panic! outright insanity!

Count von Schwalbe · Oct 10, 2022

ZoneDymo said:
What is this reasonable level-headedness?!?
I need outrage! panic! outright insanity!

To be found in the title/headline...

Camm · Oct 10, 2022

Put your hand up if you've ever been involved with HPC clusters?

The reality is these things always take time to bed in, even if you are buying a cluster based off a pre-existing solution. In no way surprised they are having issues with the interconnects, it's ALWAYS the fucking interconnects, lol.

docnorth · Oct 10, 2022

I know it's wrong to project my experience with desktop PC's to supercomputers, but still couldn't resist and voted that Aurora "will come with fewer hiccups". Not guessing faster or slower, just smoother.

Leiesoldat · Oct 10, 2022

docnorth said:
I know it's wrong to project my experience with desktop PC's to supercomputers, but still couldn't resist and voted that Aurora "will come with fewer hiccups". Not guessing faster or slower, just smoother.

Hahaha that's a laugh. Aurora is still delayed and as far as most of the teams can tell Intel has delivered hardly any of the cabinets to Argonne National Laboratory. Aurora is anything but smooth. As far as I can tell, Intel is more worried about their new fab in Ohio than delivering Aurora at all. Plus Aurora was never meant to be an exascale machine, but rather a bridge between Summit and Frontier.

Camm has it right that scale and interconnect are the issues; Slingshot has been a long running problem.

bug · Oct 10, 2022

Fine wine, I guess. Give it 5 years or so and it will work

phanbuey · Oct 10, 2022

Chomiq said:
Nothing burger:

This.

The issues stem from the actual software and system setup, scheduling jobs etc, not necessarily faults with the hardware.

thegnome · Oct 10, 2022

AMD FineWine™ strikes again! Except this is an insanely complex, super power server, there's bound to be issues. Especially with cutting edge tech...

ThomasK · Oct 10, 2022

ARF said:
This is HPE's fault in its Slingshot Switch.

AMD is the one providing the solution to the customer, doesn't matter whose switch is being used, AMD is taking the blame.

bug · Oct 10, 2022

thegnome said:
AMD FineWine™ strikes again! Except this is an insanely complex, super power server, there's bound to be issues. Especially with cutting edge tech...

That depends. Issues are normal up to and including the acceptance test period. After that, it's supposed to work for the most part. There will always be bugs, they are supposed to be rare and far apart. A delivered system is supposed to be usable.

Punkenjoy · Oct 10, 2022

I have built clusters, render farms and other types of supercomputers in one of my previous jobs.

Like they said, this is indeed expected. You get all kinds of issues: bad cables, bad memory, etc. If you have a 1% defect rate and you build a 1000-node system, that means about 10 nodes will have a defect.

After that the fun starts: trying to find the source of the problem and isolate it. It takes time and effort, and the larger the cluster is, the harder it can be.

Render farms are most of the time easier since they just use the network, and a bad node will crash by itself. A cluster also has the interconnect, which can fail. You run code on multiple nodes and it's not always clear where it fails. Sometimes one node will crash because it received corrupted data from another node. Sometimes it's the switch, the storage, etc. There are way more parts to fail than in a regular PC, and trying to pinpoint a failure can sometimes be a real pain in the ass and take days.

So to me, this article is more something to please the AMD-bashing communities than anything else. I built both AMD and Intel systems, and it was not really the CPU vendor that affected defect rates. Larger clusters required more time to settle.

Operandi · Oct 10, 2022

ThomasK said:
AMD is the one providing the solution to the customer, doesn't matter whose switch is being used, AMD is taking the blame.

Yeah, it does, because in the context of a supercomputer it's probably the most custom hardware in the system, and that's all Cray.

P4-630 · Oct 10, 2022

Did we ever hear of an Intel/NVIDIA supercomputer that had startup issues? Not that I know of...
