Thursday, May 18th 2023

India Homegrown HPC Processor Arrives to Power Nation's Exascale Supercomputer

With more countries creating initiatives to develop homegrown processors capable of powering powerful supercomputing facilities, India has just presented its development milestone with Aum HPC. Thanks to information from the report by The Next Platform, we learn that India has developed a processor for powering its exascale high-performance computing (HPC) system. Called Aum HPC, the CPU was developed by the National Supercomputing Mission of the Indian government, which funded the Indian Institute of Science, the Department of Science and Technology, the Ministry of Electronics and Information Technology, and C-DAC to design and manufacture the Aum HPC processors and create strong, strong technology independence.

The Aum HPC is based on Armv8.4 CPU ISA and is a chiplet-based processor. Each compute chiplet features 48 Arm Zeus Cores based on Neoverse V1 IP, so with two chiplets, the processor has 96 cores in total. Each core gets 1 MB of level two cache and 1 MB of system cache, for 96 MB L2 cache and 96 MB system cache in total. For memory, the processor uses 16-channel 32-bit DDR5-5200 with a bandwidth of 332.8 GB/s. In addition, there is 64 GB of HBM3 with four controllers capable of achieving a bandwidth of 2.87 TB/s. As for connectivity, the Aum HPC processor has 64 PCIe Gen 5 lanes with CXL enabled. It is manufactured on a 5 nm node from TSMC. With a 3.0 GHz typical and 3.5+ GHz turbo frequency, the Aum HPC processor is rated for a TDP of 300 Watts. It is capable of producing 4.6+ TeraFLOPS per socket. Below are illustrations and tables comparing Aum HPC to Fujitsu A64FX, another Arm HPC-focused design.
Source: The Next Platform
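
The DDR5 bandwidth and cache totals quoted above follow from simple arithmetic. The snippet below is a minimal sketch of that check, not anything published by the Aum HPC designers; the function name `ddr_bandwidth_gbs` is made up for illustration, and it assumes "GB/s" means 10^9 bytes per second.

```python
def ddr_bandwidth_gbs(channels, bus_width_bits, mega_transfers_per_s):
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 10^9 bytes, an assumption)."""
    bytes_per_transfer = bus_width_bits / 8
    # channels * bytes per transfer * MT/s gives MB/s; divide by 1000 for GB/s
    return channels * bytes_per_transfer * mega_transfers_per_s / 1000


# Figures taken from the article text
cores = 2 * 48                 # two chiplets, 48 cores each
l2_total_mb = cores * 1        # 1 MB of L2 per core
system_cache_mb = cores * 1    # 1 MB of system cache per core
ddr5_gbs = ddr_bandwidth_gbs(channels=16, bus_width_bits=32, mega_transfers_per_s=5200)

print(f"cores: {cores}, L2: {l2_total_mb} MB, system cache: {system_cache_mb} MB")
print(f"DDR5-5200, 16 x 32-bit channels: {ddr5_gbs:.1f} GB/s")
```

The same helper could be pointed at other memory configurations, but it only gives the theoretical peak; sustained bandwidth in real workloads is lower.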

4 Comments on India Homegrown HPC Processor Arrives to Power Nation's Exascale Supercomputer

#1
Bwaze
"With a 3.0 GHz typical and 3.5+ GHz turbo frequency, the Aum HPC processor is rated for a TDP of 300 Watts. It is capable of producing 4.6+ TeraFLOPS per socket."

Isn't that a bit much? AMD Epyc Milan (2021) only comes halfway there with a similar TDP: AMD 7763 - 64C/128T 2.45G 280W. Or am I looking at different benchmarks (it isn't specified what either of them is measuring)?


#2
Lianna
I need to temper my expectations as to what is considered to "develop homegrown processors" and "powerful supercomputing":
"design and manufacture the Aum HPC processors and create strong, strong technology independence"
Oh, so they created a new architecture from scratch!
"based on Armv8.4 CPU ISA"
...nope, but they developed a new microarchitecture!
"Arm Zeus Cores based on Neoverse V1 IP"
...nope, they took one ready-made :/
Bwaze wrote:
"With a 3.0 GHz typical and 3.5+ GHz turbo frequency, the Aum HPC processor is rated for a TDP of 300 Watts. It is capable of producing 4.6+ TeraFLOPS per socket."
Isn't that a bit much? AMD Epyc Milan (2021) only comes halfway there with a similar TDP: AMD 7763 - 64C/128T 2.45G 280W. Or am I looking at different benchmarks (it isn't specified what either of them is measuring)?
It probably heavily depends on clocks used in calculation, so it checks out with about 3.1 GHz. Both Aum and Milan (and Genoa) have 2x256-bit SIMD pipes, giving theoretical 8 FMACs per cycle, so 16 FLOPs per cycle (FLOPC).
EPYC 7763, 280 W, 64 cores @2450 MHz base, 3400 turbo, 2.5-3.4 TFLOPS per socket.
For current HPC, probably more fitting comparison is either:
Threadripper Pro 5995WX, 280 W, 64 cores @2700-4500 MHz, 2.7-4.6 TFLOPS
or
Genoa EPYC 9654, 360 W, 96 cores @2400-3700 MHz, 3.6-5.6 TFLOPS per socket.
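
The ranges above can be reproduced with a short calculation. This is a rough sketch under the post's own assumption of 16 FLOPs per cycle per core; the helper name `peak_tflops` and the dictionary layout are illustrative only, and the clocks are the ones listed in the post.

```python
import math

FLOP_PER_CYCLE = 16  # assumption from the post: 2x256-bit pipes -> 8 FMACs -> 16 FLOPs/cycle


def peak_tflops(cores, clock_mhz, flop_per_cycle=FLOP_PER_CYCLE):
    """Theoretical peak in TFLOPS for a given core count and clock in MHz."""
    return cores * clock_mhz * flop_per_cycle / 1e6


# (cores, low clock MHz, high clock MHz) as listed in the post
chips = {
    "EPYC 7763": (64, 2450, 3400),
    "Threadripper Pro 5995WX": (64, 2700, 4500),
    "EPYC 9654": (96, 2400, 3700),
}

for name, (cores, low_mhz, high_mhz) in chips.items():
    low = peak_tflops(cores, low_mhz)
    high = peak_tflops(cores, high_mhz)
    # Truncating to one decimal place reproduces the figures quoted in the post
    print(f"{name}: {math.floor(low * 10) / 10}-{math.floor(high * 10) / 10} TFLOPS")
```

With the same assumption, 96 cores at 3000 MHz give `peak_tflops(96, 3000)` = 4.608, which lines up with the "4.6+" figure for Aum.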

Edit:
www.anandtech.com/print/16640/arm-announces-neoverse-v1-n2-platforms-cpus-cmn700-mesh
I'd probably still bet on Genoa as more flexible with fused 2x256 in AVX-512 for front-end optimization or 4x256 (or 2x512) for separate FADD/FMUL.

Edit 2:
Andrei was then (2021) commenting on ARM's V1 core performance predictions as "extremely optimistic", and that was at 2.7 GHz, so...
#3
ScaLibBDP
@Bwaze
>>...Isn't that a bit much?..

The estimated Peak Processing Power ( PPP ) number, that is ~4.6 TFLOPs Single Precision ( SP ) operations of a Floating Point Unit ( also known as Rpeak ), is correct. It is calculated as follows:

3.0 ( GHz ) * 96 ( Number of Cores ) * 8 ( Vector Length: 256-bit / 32-bit for SP ) * 2 ( Number of FPU operations in 1 CPU clock ) = 4608 GFLOPS ~= 4.6 TFLOPs ( SP )

If India's exascale supercomputer is built without GPU accelerators, it will need ~217014 96-core processors to reach 1.0 EFLOPs SP performance. It is calculated as follows:

1000000 ( TFLOPs ) / 4.608 ( TFLOPs per processor ) ~= 217014.

At 300 W per processor, the total power consumption will be more than 65.1 MW.
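
The arithmetic in this post can be written out as a small script. It is only a sketch of the calculation as the post describes it: the function name `rpeak_gflops` is invented here, the four inputs are the post's own parameters, and the power figure assumes every socket draws exactly its 300 W TDP with nothing else (memory, interconnect, cooling) counted.

```python
import math


def rpeak_gflops(clock_ghz, cores, vector_lanes, ops_per_cycle):
    """Theoretical peak (Rpeak) in GFLOPS from the four parameters used in the post."""
    return clock_ghz * cores * vector_lanes * ops_per_cycle


# Parameters from the post: 3.0 GHz, 96 cores, 256-bit vectors of 32-bit values, 2 ops per clock
socket_gflops = rpeak_gflops(3.0, 96, 256 // 32, 2)

# 1 EFLOPS = 1e9 GFLOPS; round up because a fraction of a processor is not possible
sockets_needed = math.ceil(1e9 / socket_gflops)

# Assumption: each processor draws exactly its 300 W TDP, nothing else is counted
total_power_mw = sockets_needed * 300 / 1e6

print(f"Rpeak per socket: {socket_gflops:.0f} GFLOPS")
print(f"Sockets for 1 EFLOPS: {sockets_needed}")
print(f"CPU power alone: {total_power_mw:.1f} MW")
```

Changing the clock or the operations-per-cycle argument shows how sensitive the socket count is to those assumptions.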

@Lianna
>>...It probably heavily depends on clocks used in calculation...

Peak Processing Power ( PPP ) of a CPU is a function of 4 parameters.

Take a look at my previous post.
#4
Lianna
ScaLibBDP wrote:
@Lianna
>>...It probably heavily depends on clocks used in calculation...

Peak Processing Power ( PPP ) of a CPU is a function of 4 parameters.

Take a look at my previous post.
Was "so it checks out with about 3.1 GHz" too complicated? TLDR?
I was stressing the fact that Aum's info mentions "3.0 to 3.5+" GHz (vs Arm's projected 2.7).
This may be understood as 3.0 base and 3.5 turbo, or as a range of possible base speeds.
Bwaze's quoted chart was showing pessimistic/conservative TFLOPS for EPYCs, counted at base speed, while real-world clocks for e.g. Genoa are higher by a factor of up to ~1.54x.