Tuesday, May 7th 2019

AMD Collaborates with US DOE to Deliver the Frontier Supercomputer

The U.S. Department of Energy today announced a contract with Cray Inc. to build the Frontier supercomputer at Oak Ridge National Laboratory, which is anticipated to debut in 2021 as the world's most powerful computer with a performance of greater than 1.5 exaflops.

Scheduled for delivery in 2021, Frontier will accelerate innovation in science and technology and maintain U.S. leadership in high-performance computing and artificial intelligence. The total contract award is valued at more than $600 million for the system and technology development. The system will be based on Cray's new Shasta architecture and Slingshot interconnect and will feature high-performance AMD EPYC CPU and AMD Radeon Instinct GPU technology.
By solving calculations up to 50 times faster than today's top supercomputers (exceeding a quintillion, or 10^18, calculations per second), Frontier will enable researchers to deliver breakthroughs in scientific discovery, energy assurance, economic competitiveness, and national security. As a second-generation AI system (following the world-leading Summit system deployed at ORNL in 2018), Frontier will provide new capabilities for deep learning, machine learning and data analytics for applications ranging from manufacturing to human health.
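For scale: 1.5 exaflops is 1.5 × 10^18 floating-point operations per second; relative to ORNL's 200-petaflop (2 × 10^17 FLOP/s) Summit specifically, that is about a 7.5x step up in peak performance.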

"Frontier's record-breaking performance will ensure our country's ability to lead the world in science that improves the lives and economic prosperity of all Americans and the entire world," said U.S. Secretary of Energy Rick Perry. "Frontier will accelerate innovation in AI by giving American researchers world-class data and computing resources to ensure the next great inventions are made in the United States."

Since 2005, Oak Ridge National Laboratory has deployed Jaguar, Titan, and Summit, each the world's fastest computer in its time. The combination of traditional processors with graphics processing units to accelerate the performance of leadership-class scientific supercomputers is an approach pioneered by ORNL and its partners and successfully demonstrated through ORNL's No. 1-ranked Titan and Summit supercomputers.

"ORNL's vision is to sustain the nation's preeminence in science and technology by developing and deploying leadership computing for research and innovation at an unprecedented scale," said ORNL Director Thomas Zacharia. "Frontier follows the well-established computing path charted by ORNL and its partners that will provide the research community with an exascale system ready for science on day one."

Researchers with DOE's Exascale Computing Project are developing exascale scientific applications today on ORNL's 200-petaflop Summit system and will seamlessly transition their scientific applications to Frontier in 2021. In addition, the lab's Center for Accelerated Application Readiness is now accepting proposals from scientists to prepare their codes to run on Frontier.
Researchers will harness Frontier's powerful architecture to advance science in such applications as systems biology, materials science, energy production, additive manufacturing and health data science. Visit the Frontier website to learn more about what researchers plan to accomplish in these and other scientific fields.

Frontier will offer best-in-class traditional scientific modeling and simulation capabilities while also leading the world in artificial intelligence and data analytics. Closely integrating artificial intelligence with data analytics and modeling and simulation will drastically reduce the time to discovery by automatically recognizing patterns in data and guiding simulations beyond the limits of traditional approaches.

"We are honored to be part of this historic moment as we embark on supporting extreme-scale scientific endeavors to deliver the next U.S. exascale supercomputer to the Department of Energy and ORNL," said Peter Ungaro, president and CEO of Cray. "Frontier will incorporate foundational new technologies from Cray and AMD that will enable the new exascale era-characterized by data-intensive workloads and the convergence of modeling, simulation, analytics, and AI for scientific discovery, engineering and digital transformation."

Frontier will incorporate several novel technologies co-designed specifically to deliver a balanced scientific capability for the user community. The system will be composed of more than 100 Cray Shasta cabinets with high-density compute blades powered by HPC- and AI-optimized AMD EPYC processors and Radeon Instinct GPU accelerators purpose-built for the needs of exascale computing. The new accelerator-centric compute blades will support a 4:1 GPU-to-CPU ratio, with high-speed AMD Infinity Fabric links and coherent memory between them within the node. Each node will have one Cray Slingshot interconnect network port for every GPU, with streamlined communication between the GPUs and the network to enable optimal performance for high-performance computing and AI workloads at exascale.

To make this performance seamless to consume by developers, Cray and AMD are co-designing and developing enhanced GPU programming tools optimized for performance, productivity and portability. This will include new capabilities in the Cray Programming Environment and AMD's ROCm open compute platform that will be integrated together into the Cray Shasta software stack for Frontier.
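The announcement doesn't detail what those co-designed tools will look like, but today's ROCm HIP API gives a flavor of the programming model. Below is a minimal, illustrative sketch using standard HIP as it exists now, not Frontier's actual environment:

```cpp
// Minimal HIP (ROCm) example: element-wise vector add on the GPU.
// Illustrative only; Frontier's co-designed toolchain is not public.
#include <hip/hip_runtime.h>

__global__ void vadd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    hipMalloc(&a, bytes);  // device buffers (left uninitialized in this sketch)
    hipMalloc(&b, bytes);
    hipMalloc(&c, bytes);
    // 256-thread blocks, enough blocks to cover all n elements.
    hipLaunchKernelGGL(vadd, dim3((n + 255) / 256), dim3(256), 0, 0, a, b, c, n);
    hipDeviceSynchronize();  // wait for the kernel to finish
    hipFree(a);
    hipFree(b);
    hipFree(c);
    return 0;
}
```

The same source compiles with hipcc for AMD GPUs or, via HIP's CUDA back end, for Nvidia GPUs, which is the portability angle AMD pitches for ROCm.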

"AMD is proud to be working with Cray, Oak Ridge National Laboratory and the Department of Energy to push the boundaries of high performance computing with Frontier," said Lisa Su, AMD president and CEO. "Today's announcement represents the power of collaboration between private industry and public research institutions to deliver groundbreaking innovations that scientists can use to solve some of the world's biggest problems."

Frontier leverages a decade of exascale technology investments by DOE. The contract award includes technology development funding, a center of excellence, several early-delivery systems, the main Frontier system, and multi-year systems support. The Frontier system is expected to be delivered in 2021, and acceptance is anticipated in 2022.

Frontier will be part of the Oak Ridge Leadership Computing Facility, a DOE Office of Science User Facility. ORNL is managed by UT-Battelle for DOE's Office of Science, the single largest supporter of basic research in the physical sciences in the United States. DOE's Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit DOE's website.

47 Comments on AMD Collaborates with US DOE to Deliver the Frontier Supercomputer

#26
notb
R-T-B: And if the driver is a closed binary unavailable for your platform?

Least resistance. It is arguably easier to port a driver than pay for a new closed one to integrate with a moving target (OSS kernel).
I'm not sure why porting a driver would be easier than paying for a new closed one. Porting takes time and costs money, and you'll likely outsource it either way.

Also, we're talking about an HPC system. Cray delivers the whole package: configured and ready to run.
Posted on Reply
#27
notb
R-T-B: Because you'll need to pay for a port update every time the open source kernel you are integrating with updates. Open source integrates easier with open source and is inherently cheaper to maintain.
You buy a server with service. Cray (or any other OEM) provides support - including drivers.
It's way more cost effective as well.
Posted on Reply
#28
medi01
Perhaps that's why Threadripper 2 has disappeared.
Posted on Reply
#29
Aquinus
Resident Wat-man
notb: Well, the reality is that an Nvidia cluster can run CUDA and this one can't.
The reality is that nVidia owns the copyrights for CUDA like all of their other IP, and everyone else doesn't. If it's not nVidia, it's not going to do CUDA. If your software uses CUDA, that's called vendor lock-in.
Posted on Reply
#30
Frick
Fishfaced Nincompoop
I live in a time when there are three exaflop supercomputers in the works. Crazy.
Posted on Reply
#31
lexluthermiester
AleksandarK: supercomputers-exceeding a quintillion, or 1018, calculations per second
?!? A quintillion is much higher than a mere 1018.
Posted on Reply
#32
AleksandarK
News Editor
lexluthermiester: ?!? A quintillion is much higher than a mere 1018.
Thanks for pointing that out. It should have been 10^18.
Posted on Reply
#33
Patriot
AMD made ROCm and HIP for a reason. HIP can convert CUDA code... and often it is faster on Nvidia cards after the conversion than before...
ROCm has been lagging 2-3 versions behind in feature support, but they have been catching up quite well to CUDA's feature set.

If you want to do workloads that rely heavily on tensor ops, then you go with V100s; if you need simple double, single or half precision, then AMD is a solid option.
If you just need inferencing... shiiiit, options are wide open.

Edit: Shit, they have caught up on version support... rocm.github.io/dl.html
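For the curious, the conversion is largely a mechanical rename of the CUDA runtime API. A rough sketch of the line-by-line renames a hipify-style pass produces (an illustrative subset, not the full mapping; buf, host, and kernel are hypothetical names):

```cpp
// Before (CUDA)                      // After (hipify-style translation)
cudaMalloc(&buf, bytes);              // hipMalloc(&buf, bytes);
cudaMemcpy(buf, host, bytes,          // hipMemcpy(buf, host, bytes,
           cudaMemcpyHostToDevice);   //           hipMemcpyHostToDevice);
kernel<<<blocks, threads>>>(buf);     // hipLaunchKernelGGL(kernel, blocks,
                                      //                    threads, 0, 0, buf);
cudaDeviceSynchronize();              // hipDeviceSynchronize();
```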
Posted on Reply
#34
R-T-B
notb: You buy a server with service. Cray (or any other OEM) provides support - including drivers.
It's way more cost effective as well.
And Cray has an interest in reducing the price of the service they provide?

As I said, I can't be certain. But I think their parts choices point down that road...
mtcn77: You're naive if you think you have an argument.
And you're naive if you think calling me naive constitutes an argument.
mtcn77: AMD sold the idea to DoE in 2011. Nvidia & Intel followed.
And the Wright brothers invented the airplane. Who cares who follows whom in relation to market share? Furthermore, call me when their system is even in the TOP100 charts.
Posted on Reply
#35
Unregistered
Starting with open solutions is the quickest way to program.
Patriot: AMD made ROCm and HIP for a reason. HIP can convert CUDA code... and often it is faster on Nvidia cards after the conversion than before...
ROCm has been lagging 2-3 versions behind in feature support, but they have been catching up quite well to CUDA's feature set.

If you want to do workloads that rely heavily on tensor ops, then you go with V100s; if you need simple double, single or half precision, then AMD is a solid option.
If you just need inferencing... shiiiit, options are wide open.

Edit: Shit, they have caught up on version support... rocm.github.io/dl.html
Interesting to see the need for, and always good to have, an exit ramp from vendor lock-in.
#36
prtskg
It's written in the article that the ROCm software stack will be used. It's open source, and with version 2.4 it's quite competitive with CUDA feature-wise. Though I wouldn't be surprised if the next supercomputer win belongs to an IBM+Nvidia combo. Let's wait and watch.
Posted on Reply
#37
Patriot
R-T-B: Furthermore, call me when their system is even in the TOP100 charts.
Titan is still #9 and AMD-based... There is a Zen-based Chinese supercomputer with a small node count that is #38.
This will be #1: about 7x faster than the current #1 and 50% faster than Intel's system, which may or may not get finished first.
I think this is the first time I have seen an AMD CPU+GPU combination in the top 10; everything before has been AMD CPU/Nvidia GPU.
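Those ratios are consistent with the published figures: 1.5 exaflops is about 7.5x Summit's 200 petaflops (2 × 10^17 FLOP/s), and about 1.5x the roughly 1-exaflop target reported for Intel's Aurora system.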
Posted on Reply
#38
HD64G
R-T-B: Interesting you accuse me of not reading, because I already addressed this.

Frontier is based on Zen, which is a complete ground-up redesign since Bulldozer-based Fusion. They are about as related as an apple and a potato, or Netburst and Sandy Bridge.

Pretending you are smarter than everyone is not making you look great here.

Or maybe I am confused... what is your argument here, exactly?

If you are just doing a generic "AMD IS DAH BEST," you at least aren't exclusively wrong. Consider the following:

Zen is better suited for this than Intel for certain. It's their GPU choice I find intriguing, and only because they could've likely been more power efficient (important in supercomputer clusters) with Nvidia's line.
For servers, GPUs aren't rated by their gaming performance but by TFLOPS, i.e. raw compute throughput. And in that, AMD has efficient GPUs, as the Vega arch is mainly a compute-targeted one for data centers. And on 7nm, AMD GPUs are much more efficient: Radeon VII is ~30% faster than Vega 64 while using 10% less power.
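Taking those two figures at face value, the implied gain is about 1.3 / 0.9 ≈ 1.44x performance per watt.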
Posted on Reply
#39
Caring1
R-T-B: And the Wright brothers invented the airplane.
And here I was thinking they were the first to actually do a documented flight, and they invented planes too....
Posted on Reply
#40
Prima.Vera
SoNic67: 31 MW of power. Talking about "global warming"? :laugh:
We are in 2019. Four wind turbines like those can provide more than adequate power. Also, they can install additional solar panels on the roof and have 100% eco energy. ;)
Posted on Reply
#41
mtcn77
Prima.Vera: We are in 2019. Four wind turbines like those can provide more than adequate power. Also, they can install additional solar panels on the roof and have 100% eco energy. ;)
31 MW is still 7.4 megacalories per second, or 7.4°C/s per ton of water...
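Checking the arithmetic: 31 MW is 3.1 × 10^7 J/s; at 4.184 J per calorie, that is about 7.4 × 10^6 cal/s, and with water's specific heat of 1 cal/(g·°C), one metric ton (10^6 g) of water would indeed warm by roughly 7.4°C per second.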
Posted on Reply
#42
SoNic67
Prima.Vera: We are in 2019. Four wind turbines like those can provide more than adequate power. Also, they can install additional solar panels on the roof and have 100% eco energy.
And DOE can run calculations only when the wind blows. Good investment.
Seriously, every single one of those wind turbines has to be paired with a quick-reacting gas turbine ready to pick up the load: spinning reserve. Those are expensive to buy and maintain, and they are not fuel-efficient compared to slow-reacting coal or nuclear plants (which cannot be used as spinning reserve because they cannot be turned on and off that fast).

So by using the "free" wind, you just increased the price of electricity...
R-T-B: Did you miss the part where I worked here just a year ago?
What happened, did they finally fire you? :D

Just joking, sorry, could not help it, it was so easy...
Posted on Reply
#43
bogmali
In Orbe Terrum Non Visi
This thread is not for chest beating purposes, you can continue those offline or via PMs. Thread cleansed and reply bans issued.
Posted on Reply
#44
Prima.Vera
SoNic67: Seriously, every single one of those wind turbines has to be paired with a quick-reacting gas turbine ready to pick up the load: spinning reserve.
No need for those. The datacenter will be connected to the main grid anyway, ready to pick up the missing load in case there is no wind or sun from Mother Nature.

But I am also curious about the future upgradability of those server farms. Can the CPUs/GPUs be easily upgraded on the fly in the future? And I mean without changing motherboards and such?
Posted on Reply
#45
SoNic67
Prima.Vera: The datacenter will be connected to the main grid anyway, ready to pick up the missing load in case there is no wind or sun from Mother Nature.
You don't get it. That fabled "main grid" ready to jump in and help... is composed of many individual steam or gas turbines, as I said above.
In every microsecond of the day, 24/7, 365 days a year, the amount of electricity produced has to equal the electricity consumed. If a power generator drops out quickly somewhere in the system, another one, preferably close by, has to pick up that exact deficit of power just as quickly.
Steam turbines (coal-fired) have a spare capacity of a few minutes of steam, and by then the gas turbines have to have started up to pick up the slack.
You can't ramp a nuclear power plant up and down that way. Or... you can, but you might end up with Chernobyl.
Posted on Reply
#46
Caring1
SoNic67: You don't get it. That fabled "main grid" ready to jump in and help... is composed of many individual steam or gas turbines, as I said above.
In every microsecond of the day, 24/7, 365 days a year, the amount of electricity produced has to equal the electricity consumed.
WUT! :kookoo:
Wrong and so off topic, I'm leaving it there.
Posted on Reply
#47
HwGeek

After watching this about Milan and the possibility that it will include 15 chiplets and maybe SMT4, I think (my imagination) I have an idea where AMD is going with its future (maybe custom-design?) HPC EPYC on 7nm+:
1) Each CPU chiplet will be 6C/24T to save space/power while giving similar or better than 8C/16T performance.
2) Adding 4 custom Instinct GPU chiplets.
3) Adding 2 custom AI accelerator (ASIC) chiplets.
4) 1 I/O chiplet with an HBM memory stack.

So the final EPYC Milan(?) can be an HPC beast with (tally below):
  • 48C/192T of Zen CPU cores.
  • 4 custom Instinct GPUs.
  • 2 AI accelerator ASICs.
  • 1 I/O chiplet with HBM 3D stacking.
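Assuming eight such CPU chiplets, the math lines up: 8 × 6C/24T (with SMT4) gives 48C/192T, and 8 CPU + 4 GPU + 2 AI + 1 I/O makes exactly the rumored 15 chiplets.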


EDIT: I see that there was already a great article on such an HPC APU design:
www.overclock.net/forum/225-...lops-200w.html
www.computermachines.org/joe/publications/pdfs/hpca2017_exascale_apu.pdf
You can see the EPYC PCB design in "Figure 2. Exascale Heterogeneous Processor (EHP)".

So after reading some of it I changed my illustration:
No CPU chiplets on top of the I/O die: [took a Vega Pro 20CU image, placed the HBM on top, and shrank it to 7nm+ level]
IMO the GPUs could take ~150 W, plus 75-100 W for the rest of the CPU + I/O, for around 250 W TDP.


Or CPU chiplets on top of the I/O die (it can still be 14nm or 7nm), but it is going to stay a large die anyway to fit the CPU chiplets on top.

And 8 Milans could be installed in Cray’s Shasta 1U with Direct Liquid Cooling:

www.anandtech.com/show/13616...liquid-cooling
Posted on Reply