AMD-Powered Frontier Supercomputer Faces Difficulties, Can't Operate a Day without Issues

Count von Schwalbe · Oct 10, 2022

P4-630 said:
Did we ever hear of an Intel/Nvidia supercomputer that had startup issues? Not that I know...

Intel/Intel (Aurora):

The Intel supercomputer, which was repeatedly delayed and reworked, is now expected to be "comfortably over 2 exaflops" in peak compute performance, thanks to Intel's new GPUs performing better than expected.

Also, the "Summit" supercomputer uses IBM CPUs and Nvidia Tesla GPUs, so ORNL would be having a fit if "Frontier" was much worse.

Oberon · Oct 10, 2022

ThomasK said:
AMD is the one providing the solution to the customer. It doesn't matter whose switch is being used; AMD is taking the blame.

Tell me you have no experience outside of consumer hardware without telling me you have no experience outside of consumer hardware.

mechtech · Oct 10, 2022

Punkenjoy said:
I have built clusters, render farms and other types of supercomputers in one of my previous jobs.

Like they said, this is indeed expected. You get all kinds of issues: bad cables, bad memory, etc. If you have a 1% defect rate and you build a 1000-node system, that means about 10 nodes will have a defect.

After that the fun starts: trying to find the source of the problem and isolate it. It takes time and effort, and the larger the cluster is, the harder it can be.

Render farms are easier most of the time, since they just use the network and a bad node will crash by itself. A cluster also has the interconnect, which can fail. You run code on multiple nodes and it's not always clear where it fails. Sometimes one node will crash because it received corrupted data from another node. Sometimes it's the switch, the storage, etc. There are way more parts to fail than in a regular PC, and trying to pinpoint a failure can sometimes be a real pain in the ass and take days.

So to me, this article is more something to please the AMD-bashing communities than anything else. I have built both AMD and Intel systems, and it was not really the CPU vendor that affected defect rates. Larger clusters required more time to settle.

Must be a PITA troubleshooting and correcting that lol
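
The arithmetic in the quoted post (1% defect rate, 1000 nodes, about 10 bad nodes) can be sketched as below. This is only an illustration: it assumes every node is defective independently with the same probability, which the post does not state, and the function name `expected_defects` is made up for the example.

```python
import math

def expected_defects(nodes: int, defect_rate: float) -> tuple[float, float]:
    """Return (mean, standard deviation) of the defective-node count.

    Assumes each node is defective independently with the same
    probability (a binomial model). This is a simplification.
    """
    mean = nodes * defect_rate
    std = math.sqrt(nodes * defect_rate * (1.0 - defect_rate))
    return mean, std

# Figures taken from the quoted post: 1000 nodes, 1% defect rate.
mean, std = expected_defects(1000, 0.01)
print(f"expected defective nodes: {mean:.0f} (+/- {std:.1f})")
```

Under that assumption the count of bad nodes is not exactly 10 every time; it scatters by a few nodes either way, which fits the point that larger builds need time to settle.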

Oberon · Oct 10, 2022

Punkenjoy said:
So to me, this article is more something to please the AMD-bashing communities than anything else.

Lots of that going on here lately.

R-T-B · Oct 10, 2022

Dirt Chip said:
Sucks to be an early adopter of a multi-hundred-million-dollar product

Sir, that's always what you are on a multimillion-dollar build. You think they poop these out daily?

Chomiq said:
Nothing burger:

Basically, yeah.

Oberon said:
Lots of that going on here lately.

Clickbait gets clicks, sadly.

r9 · Oct 10, 2022

Has to do with dual-sided DIMMs

Vayra86 · Oct 10, 2022

Punkenjoy said:
I have built clusters, render farms and other types of supercomputers in one of my previous jobs.

Like they said, this is indeed expected. You get all kinds of issues: bad cables, bad memory, etc. If you have a 1% defect rate and you build a 1000-node system, that means about 10 nodes will have a defect.

After that the fun starts: trying to find the source of the problem and isolate it. It takes time and effort, and the larger the cluster is, the harder it can be.

Render farms are easier most of the time, since they just use the network and a bad node will crash by itself. A cluster also has the interconnect, which can fail. You run code on multiple nodes and it's not always clear where it fails. Sometimes one node will crash because it received corrupted data from another node. Sometimes it's the switch, the storage, etc. There are way more parts to fail than in a regular PC, and trying to pinpoint a failure can sometimes be a real pain in the ass and take days.

So to me, this article is more something to please the AMD-bashing communities than anything else. I have built both AMD and Intel systems, and it was not really the CPU vendor that affected defect rates. Larger clusters required more time to settle.

Definition of CLICKBAIT

something (such as a headline) designed to make readers want to click on a hyperlink especially when the link leads to content of dubious value or interest…

www.merriam-webster.com

You can use a white pitchfork for this one instead of a red one.

r9 · Oct 10, 2022

From "Can't Operate a Day without Issues" to
"We're dealing with a lot of the early-life kind of things we've seen with other machines that we've deployed, so it's nothing too out of the ordinary."
It doesn't get any more clickbait than that.

thesmokingman · Oct 10, 2022

Lmao, TPU added that trollish bit. That's really poor form man.

r9 · Oct 10, 2022

thesmokingman said:
Lmao, TPU added that trollish bit. That's really poor form man.

News title: "Politician caught wearing women's clothes in public"
Inside: a picture of Hillary Clinton.

Space Lynx · Oct 10, 2022

Dirt Chip said:
Sucks to be an early adopter of a multi-hundred-million-dollar product

If you actually read the original article, it states that these kinds of obstacles are the norm for something of this size.

This is just clickbait garbage. Humans bore me. I guess I need to start drinking now.

N3utro · Oct 10, 2022

Did they try to turn it off and on again?

PapaTaipei · Oct 10, 2022

All that compute power to spy on EVERYTHING and EVERYONE, EVERYWHERE.

Mussels · Oct 11, 2022

Crackong said:
60 million parts...
Even a 0.001% chance of malfunction would mean close to 100% at this scale.
There is always more than one component malfunctioning at any given time of operation.

This.

Exascale, Exaproblems.
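
As a rough check on the numbers in the quoted post: if each of 60 million parts independently had a 0.001% chance of being faulty, about 600 faulty parts would be expected, and the chance of at least one is effectively 100%. The sketch below works that out. The independence assumption and the function name `prob_any_failure` are illustrative only and do not come from the thread or the article.

```python
import math

def prob_any_failure(parts: int, p_fail: float) -> float:
    """Probability that at least one of `parts` components is faulty.

    Assumes independent, identical failure probabilities (illustrative).
    Uses log1p/expm1 so very small probabilities keep their precision.
    """
    # P(no failure) = (1 - p)^n = exp(n * log(1 - p))
    log_none = parts * math.log1p(-p_fail)
    # P(at least one) = 1 - P(no failure)
    return -math.expm1(log_none)

parts = 60_000_000       # "60 million parts" from the quoted post
p_fail = 0.001 / 100     # 0.001% expressed as a fraction

print("expected faulty parts:", parts * p_fail)
print("P(at least one faulty):", prob_any_failure(parts, p_fail))
```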

CallandorWoT said:
If you actually read the original article, it states that these kinds of obstacles are the norm for something of this size.

This is just clickbait garbage. Humans bore me. I guess I need to start drinking now.

Well yeah, but it's also how you spot the people who lack the ability to think and leap on answers that fit an existing worldview

AlwaysHope · Oct 11, 2022

They cheaped out on the cables... suck it! :laugh:

R-T-B · Oct 11, 2022

PapaTaipei said:
All that compute power to spy on EVERYTHING and EVERYONE, EVERYWHERE.

Is it conspiracy hour already?

Count von Schwalbe · Oct 11, 2022

R-T-B said:
Is it conspiracy hour already?

#popcorn

mkppo · Oct 11, 2022

Terrible title..

The reality is that this is absolutely normal.

Dirt Chip · Oct 11, 2022

CallandorWoT said:
If you actually read the original article, it states that these kinds of obstacles are the norm for something of this size.

This is just clickbait garbage. Humans bore me. I guess I need to start drinking now.

You are right, it is garbage, and the pull is not helping and should be killed.
But this is my escapism, so please don't judge me for worse

bug · Oct 11, 2022

P4-630 said:
Did we ever hear of an Intel/Nvidia supercomputer that had startup issues? Not that I know...

It's still not clear what kind of issues we are talking about here. If this is about the acceptance phase, yes, there will be issues galore, nothing to write home about. If it's post-acceptance issues, that could be a problem. We also don't know how much is hardware related and how much is software related (it's possible this is what they're trying to figure out right now).

HenrySomeone · Oct 11, 2022

Not surprising, you always get what you pay for...

ThomasK · Oct 11, 2022

All of a sudden, lots of people here in the comment section seem to have a vast experience with supercomputers...

Gathered in another forum's comment section.

thesmokingman · Oct 11, 2022

ThomasK said:
All of a sudden, lots of people here in the comment section seem to have a vast experience with supercomputers...

Gathered in another forum's comment section.

Nah, clickbaity titles like this are troll and shill magnets, as Mussels alluded to. You can tell which is which.

Mussels · Oct 13, 2022

thesmokingman said:
Nah, clickbaity titles like this are troll and shill magnets, as Mussels alluded to. You can tell which is which.

It's pretty funny.

It does explain some people's purchasing and hardware/brand preferences, if they can't get past the headlines to actually read the content.
