Wednesday, September 11th 2024

Oracle Offers First Zettascale Cloud Computing Cluster

Oracle today announced the first zettascale cloud computing clusters accelerated by the NVIDIA Blackwell platform. Oracle Cloud Infrastructure (OCI) is now taking orders for the largest AI supercomputer in the cloud—available with up to 131,072 NVIDIA Blackwell GPUs.

"We have one of the broadest AI infrastructure offerings and are supporting customers that are running some of the most demanding AI workloads in the cloud," said Mahesh Thiagarajan, executive vice president, Oracle Cloud Infrastructure. "With Oracle's distributed cloud, customers have the flexibility to deploy cloud and AI services wherever they choose while preserving the highest levels of data and AI sovereignty."
World's first Zettascale computing cluster
OCI is now taking orders for the largest AI supercomputer in the cloud—available with up to 131,072 NVIDIA Blackwell GPUs—delivering an unprecedented 2.4 zettaFLOPS of peak performance. The maximum scale of OCI Supercluster offers more than three times as many GPUs as the Frontier supercomputer and more than six times that of other hyperscalers. OCI Supercluster includes OCI Compute Bare Metal, ultra-low latency RoCEv2 with ConnectX-7 NICs and ConnectX-8 SuperNICs or NVIDIA Quantum-2 InfiniBand-based networks, and a choice of HPC storage.

OCI Superclusters are orderable with OCI Compute powered by either NVIDIA H100 or H200 Tensor Core GPUs or NVIDIA Blackwell GPUs. OCI Superclusters with H100 GPUs can scale up to 16,384 GPUs with up to 65 ExaFLOPS of performance and 13 Pb/s of aggregated network throughput. OCI Superclusters with H200 GPUs will scale to 65,536 GPUs with up to 260 ExaFLOPS of performance and 52 Pb/s of aggregated network throughput and will be available later this year. OCI Superclusters with NVIDIA GB200 NVL72 liquid-cooled bare-metal instances will use NVLink and NVLink Switch to enable up to 72 Blackwell GPUs to communicate with each other at an aggregate bandwidth of 129.6 TB/s in a single NVLink domain. NVIDIA Blackwell GPUs, available in the first half of 2025, with fifth-generation NVLink, NVLink Switch, and cluster networking will enable seamless GPU-GPU communication in a single cluster.
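The cluster-level figures above can be broken down into rough per-GPU numbers. The following is a minimal sketch in Python that uses only the GPU counts, FLOPS totals, and NVLink bandwidth quoted in the announcement; the dictionary layout and the helper name `per_gpu_pflops` are illustrative assumptions, not anything Oracle or NVIDIA publishes.

```python
# Per-GPU figures implied by the cluster-level numbers in the announcement.
# Only the GPU counts, FLOPS totals, and NVLink bandwidth come from the text;
# the data layout and helper name are illustrative.

PETA, EXA, ZETTA = 1e15, 1e18, 1e21

clusters = {
    "H100": {"gpus": 16_384, "peak_flops": 65 * EXA},
    "H200": {"gpus": 65_536, "peak_flops": 260 * EXA},
    "Blackwell": {"gpus": 131_072, "peak_flops": 2.4 * ZETTA},
}


def per_gpu_pflops(cluster):
    """Peak FLOPS divided evenly across the GPUs, expressed in PFLOPS."""
    return cluster["peak_flops"] / cluster["gpus"] / PETA


for name, cluster in clusters.items():
    print(f"{name}: {per_gpu_pflops(cluster):.2f} PFLOPS per GPU")

# 129.6 TB/s of aggregate NVLink bandwidth shared by 72 GPUs in one domain.
nvlink_per_gpu_tbs = 129.6 / 72
print(f"NVLink: {nvlink_per_gpu_tbs:.1f} TB/s per GPU")
```

The announcement does not say which data type the peak figures assume. Per-GPU values of this size (roughly 4 PFLOPS for H100 and H200, roughly 18 PFLOPS for Blackwell) are the kind usually quoted for low-precision tensor operations rather than for FP64.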

"As businesses, researchers and nations race to innovate using AI, access to powerful computing clusters and AI software is critical," said Ian Buck, vice president of Hyperscale and High Performance Computing, NVIDIA. "NVIDIA's full-stack AI computing platform on Oracle's broadly distributed cloud will deliver AI compute capabilities at unprecedented scale to advance AI efforts globally and help organizations everywhere accelerate research, development and deployment."

Customers such as WideLabs and Zoom are leveraging OCI's high-performing AI infrastructure with powerful security and sovereignty controls.

WideLabs trains one of the largest Portuguese LLMs on OCI
WideLabs, an applied AI startup in Brazil, is training one of Brazil's largest LLMs, Amazonia IA, on OCI. The company developed bAIgrapher, an application that uses its LLM to generate biographical content based on data collected from patients with Alzheimer's disease to help them preserve important memories.

WideLabs uses the Oracle Cloud São Paulo Region to run its AI workloads, ensuring that sensitive data remains within country borders. This enables WideLabs to adhere to Brazilian AI sovereignty requirements by being able to control where its AI technology is deployed and operated. WideLabs uses OCI AI infrastructure with NVIDIA H100 GPUs to train its LLMs, as well as Oracle Kubernetes Engine to provision, manage, and operate GPU-accelerated containers across an OCI Supercluster consisting of OCI Compute connected with OCI's RDMA-based cluster networking.

"OCI AI infrastructure offers us the most efficiency for training and running our LLMs," said Nelson Leoni, CEO, WideLabs. "OCI's scale and flexibility is invaluable as we continue to innovate in the healthcare space and other key sectors."

Zoom uses OCI's sovereignty capabilities for its generative AI assistant
Zoom, a leading AI-first collaboration platform, is using OCI to provide inference for Zoom AI Companion, the company's AI personal assistant available at no additional cost. Zoom AI Companion helps users draft emails and chat messages, summarize meetings and chat threads, generate ideas during brainstorms with colleagues, and more. OCI's data and AI sovereignty capabilities will help Zoom keep customer data locally in region and support AI sovereignty requirements in Saudi Arabia, where OCI's solution is being rolled out initially.

"Zoom AI Companion is revolutionizing the way organizations work, with cutting-edge generative AI capabilities available at no additional cost with customers' paid accounts," said Bo Yan, head of AI, Zoom. "By harnessing OCI's AI inference capabilities, Zoom is able to deliver accurate results at low latency, empowering users to collaborate seamlessly, communicate effortlessly, and boost productivity, efficiency, and potential like never before."
Source: Oracle

9 Comments on Oracle Offers First Zettascale Cloud Computing Cluster

#1
TumbleGeorge
Too big numbers. Let's suppose with int4, fp4 precision.
#2
yfn_ratchet
A mind-boggling tier of compute... this must have cost them something on the order of billions just for parts alone, no? Must be a lot of money changing hands for this to be worth it for Oracle.
#4
ScaLibBDP
TumbleGeorge wrote: Too big numbers. Let's suppose with int4, fp4 precision.
The source did not specify which floating-point data type was used to get these estimates. I think this is with half-precision floating-point arithmetic (FP16).

Overall, an Rpeak (the theoretical peak performance) of the system for half-precision floating-point arithmetic (FP16) could be calculated as follows:

Rpeak = 131,072 (total number of GPUs, using NVIDIA GH200 specs) x 1979 TFLOPS (for FP16 Tensor Core) = ~259,391,488 TFLOPS = ~259,391.49 PFLOPS = ~259.39 EFLOPS = ~0.26 ZFLOPS

I would still rate the system as zettascale, since it would take just 4 seconds to perform over 1 zetta floating-point operations (4 s x ~259.39 EFLOPS = ~1.04 x 10^21 operations). FLOPS is a rate, so Rpeak itself stays at ~0.26 ZFLOPS.

NVIDIA GH200 specs
resources.nvidia.com/en-us-grace-cpu/grace-hopper-superchip
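The estimate above is easy to reproduce. This is a small Python sketch under the commenter's own assumption of 1979 TFLOPS per GPU for FP16; the variable names are illustrative, and the 2.4 ZFLOPS figure is the one quoted in the article.

```python
# Reproduces the back-of-the-envelope Rpeak estimate from the comment above.
GPU_COUNT = 131_072
FP16_TFLOPS_PER_GPU = 1979  # the commenter's per-GPU assumption

rpeak_tflops = GPU_COUNT * FP16_TFLOPS_PER_GPU
rpeak_eflops = rpeak_tflops / 1e6
rpeak_zflops = rpeak_tflops / 1e9

# FLOPS is a rate, so multiplying by seconds gives an operation count.
ops_in_4_seconds = rpeak_tflops * 1e12 * 4

# How far the FP16 estimate sits from the peak figure quoted in the article.
ARTICLE_PEAK_ZFLOPS = 2.4
ratio = ARTICLE_PEAK_ZFLOPS / rpeak_zflops

print(f"Rpeak: {rpeak_tflops:,} TFLOPS = {rpeak_eflops:.2f} EFLOPS = {rpeak_zflops:.2f} ZFLOPS")
print(f"Operations in 4 seconds: {ops_in_4_seconds:.3e}")
print(f"Article peak / FP16 estimate: {ratio:.2f}x")
```

The gap of roughly 9x between this estimate and the quoted 2.4 ZFLOPS suggests the article's figure assumes a lower-precision data type, a faster GPU than the one used here, or both.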
#5
TumbleGeorge
ScaLibBDP wrote: The source did not specify which floating-point data type was used to get these estimates. I think this is with half-precision floating-point arithmetic (FP16).
FP16 for "AI"? I think I have to disagree with you. ;)
#6
ScaLibBDP
TumbleGeorge wrote: FP16 for "AI"? I think to disagree with you. ;)
Any floating-point data type could be used for AI, even double precision (FP64).

Here is a hypothetical example. Let's say there is some AI software that could be configured to process with:

- FP64 - it could complete a training task in 4 hours with the best accuracy and the widest dynamic range of values
- FP32 - it could complete a training task in 2 hours with good accuracy and a good dynamic range of values
- FP16 - it could complete a training task in 1 hour with so-so accuracy and a so-so dynamic range of values
- FP8 - it could complete a training task in 0.5 hours with low accuracy and a narrow dynamic range of values
- FP4 - it could complete a training task in 0.25 hours with the lowest accuracy and a very narrow dynamic range of values

There are no strict rules on which floating-point (FP) data type should be used for AI processing, whether for training or inference.

Some AI software systems are adaptive (!) and use different FP data types at different phases of processing.
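The accuracy and dynamic-range trade-off in the list above can be seen directly for the three widths that Python's standard `struct` module can pack (FP64, FP32, and FP16). This is a minimal sketch; the helper names `round_trip` and `fits` are illustrative. FP8 and FP4 are left out because the standard library has no format for them.

```python
import struct

# struct format characters for IEEE 754 double, single, and half precision.
FORMATS = {"FP64": "d", "FP32": "f", "FP16": "e"}


def round_trip(value, fmt):
    """Store value in the given format and read it back as a Python float."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def fits(value, fmt):
    """Return True if value is within the format's representable range."""
    try:
        struct.pack(fmt, value)
        return True
    except OverflowError:
        return False


for name, fmt in FORMATS.items():
    stored = round_trip(0.1, fmt)
    print(
        f"{name}: 0.1 stored as {stored!r}, "
        f"error {abs(stored - 0.1):.2e}, "
        f"can hold 100000.0: {fits(100000.0, fmt)}"
    )
```

The rounding error grows as the format gets narrower, and FP16 cannot represent 100000.0 at all. That is the kind of limit that pushes adaptive software to keep some phases of processing in a wider type.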
#7
TumbleGeorge
However, I don't see how with today's hardware one can build a zetta supercomputer with high precision calculations without being powered by at least a dozen nuclear power plants with large reactors.
#8
yfn_ratchet
TumbleGeorge wrote: However, I don't see how with today's hardware one can build a zetta supercomputer with high precision calculations without being powered by at least a dozen nuclear power plants with large reactors.
Reading off the GPU count in the article and working from the official spec of the GB200 systems, the power draw of the system itself comes to about 180 MW at full bore, plus roughly another 50 MW for the cooling system. I'm being a little loose with it here, and I doubt we'll ever see peak power out of these, but 230 MW for a single maxed-out supercluster translates to (and I'll go with US plants, since I imagine there will be a fair few American clients for these):

- The Agua Caliente Solar Plant in Arizona in suboptimal conditions (Solar)
- 5x NuScale VOYGR SMR devices (On-Premises Nuclear)
- 6.1% of the West County Energy Center in Florida (Natural Gas)
- 35.5% of the Monticello NPP in Minnesota (Nuclear)
- 6.7% of the Monroe Power Plant in Michigan (Coal)

It's a lot, but it's not an apocalyptic level of draw. Peak working hours (the middle of the day) are also when residential power draw is at its daily low, since people are not at home charging an EV, cranking the AC, using electric stoves, and so on. Solar is also running at or near peak output at this time of day, depending on cloud cover.
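The power figures in this comment can be sanity-checked with a few lines of Python. This sketch uses only the commenter's numbers (180 MW system, 50 MW cooling, and the quoted percentage shares) plus the GPU count from the article. The plant capacities it prints are back-calculated from those shares, not taken from official plant data.

```python
# Sanity check of the commenter's power estimate, using only quoted figures.
GPU_COUNT = 131_072
SYSTEM_MW = 180   # commenter's estimate for the compute system
COOLING_MW = 50   # commenter's estimate for cooling
total_mw = SYSTEM_MW + COOLING_MW

# Average draw per GPU, with and without the cooling overhead.
watts_per_gpu_system = SYSTEM_MW * 1e6 / GPU_COUNT
watts_per_gpu_total = total_mw * 1e6 / GPU_COUNT

# Shares of plant output quoted in the comment.
quoted_shares = {
    "West County Energy Center": 0.061,
    "Monticello NPP": 0.355,
    "Monroe Power Plant": 0.067,
}

# Plant capacity implied by each share: total draw divided by the share.
implied_capacity_mw = {
    plant: total_mw / share for plant, share in quoted_shares.items()
}

print(f"Per GPU: {watts_per_gpu_system:.0f} W system, {watts_per_gpu_total:.0f} W with cooling")
for plant, capacity in implied_capacity_mw.items():
    print(f"{plant}: implied capacity of about {capacity:.0f} MW")
```

Each GPU's share of the system draw works out to about 1.4 kW, a figure that also has to cover CPUs, networking, and other overhead under the commenter's assumptions.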
#9
ScaLibBDP
TumbleGeorge wrote: However, I don't see how with today's hardware one can build a zetta supercomputer with high precision calculations without being powered by at least a dozen nuclear power plants with large reactors.
That is a well-known problem, and many experts in our HPC community are talking about it. If you are interested in learning more, take a look at:

www.hpcwire.com/2024/08/27/hpc-debrief-james-walker-ceo-of-nano-nuclear-energy-on-powering-datacenters

There is also another article about xAI Colossus supercomputer at:

www.hpcwire.com/2024/09/05/xai-colossus-the-elon-project