Tuesday, December 19th 2023

Moore Threads Launches MTT S4000 48 GB GPU for AI Training/Inference and Presents 1000-GPU Cluster

Chinese chipmaker Moore Threads has launched its first domestically produced 1,000-card AI training cluster, dubbed the KUAE Intelligent Computing Center. At the heart of the KUAE cluster is Moore Threads' new MTT S4000 accelerator card, which pairs 48 GB of VRAM and 768 GB/s of memory bandwidth with the company's third-generation MUSA GPU architecture. The card delivers 25 TeraFLOPS in FP32, 50 TeraFLOPS in TF32, 100 TeraFLOPS in FP16/BF16, and 200 TOPS in INT8. The MTT S4000 targets both training and inference, leveraging Moore Threads' high-speed MTLink 1.0 intra-system interconnect to scale across cards for distributed parallel training of models with hundreds of billions of parameters. The card also provides graphics, video encoding/decoding, and 8K display output for graphics workloads.

Moore Threads' KUAE cluster combines the S4000 GPU hardware with RDMA networking, distributed storage, and integrated cluster-management software. The KUAE Platform oversees multi-datacenter resource allocation and monitoring, while KUAE ModelStudio hosts training frameworks and model repositories to streamline development.
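The quoted 768 GB/s figure is consistent with a 384-bit GDDR6 interface running at 16 Gbps per pin; note that the bus width and per-pin data rate are assumptions here, as Moore Threads has not detailed the memory configuration in this announcement. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the quoted 768 GB/s memory bandwidth.
# Assumption: a 384-bit GDDR6 bus at 16 Gbps per pin (not confirmed by
# Moore Threads in this announcement).
bus_width_bits = 384
data_rate_gbps = 16  # assumed per-pin transfer rate

# Bandwidth in GB/s = (bus width in bytes) * (per-pin rate in Gbps)
bandwidth_gb_s = bus_width_bits / 8 * data_rate_gbps
print(bandwidth_gb_s)  # 768.0, matching the quoted spec
```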

With integrated solutions now proven at the thousand-GPU scale, Moore Threads is positioned to power intelligent applications ranging from scientific computing to the metaverse. The KUAE cluster reportedly achieves near-linear scaling with 91% efficiency. Taking a 200-billion-token training run as an example, Zhiyuan Research Institute's 70-billion-parameter Aquila2 can complete training in 33 days, while a 130-billion-parameter model finishes in 56 days on the KUAE cluster. In addition, the Moore Threads KUAE kilocard cluster supports long-term continuous and stable operation, resumable (breakpoint) training, and asynchronous checkpointing that completes in under two minutes. On the software side, Moore Threads also claims full compatibility with NVIDIA's CUDA framework: its MUSIFY tool translates CUDA code to the MUSA GPU architecture at supposedly zero migration cost, i.e., no performance penalty.
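The quoted training times can be sanity-checked with the common approximation that training a dense transformer costs about 6 × parameters × tokens FLOPs, assuming the "200 billion training data" figure refers to tokens. A rough sketch:

```python
# Rough consistency check of the reported KUAE training times, using the
# standard 6 * N * D FLOPs approximation for dense transformer training.
# Assumption: the "200 billion training data" figure means 200e9 tokens.

def effective_pflops(params, tokens, days):
    """Sustained cluster throughput implied by a reported training time."""
    total_flops = 6 * params * tokens
    seconds = days * 86400
    return total_flops / seconds / 1e15  # PFLOPS

# Aquila2-70B in 33 days vs. a 130B model in 56 days, both on 200B tokens.
print(effective_pflops(70e9, 200e9, 33))   # ~29 PFLOPS sustained
print(effective_pflops(130e9, 200e9, 56))  # ~32 PFLOPS sustained
```

Both runs imply a similar sustained throughput of roughly 30 PFLOPS across the 1,000-card cluster, which is at least internally consistent with the near-linear-scaling claim.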
Sources: Moore Threads, via VideoCardz

3 Comments on Moore Threads Launches MTT S4000 48 GB GPU for AI Training/Inference and Presents 1000-GPU Cluster

#1
xrli
Not a lot of FP16 FLOPS by today's standards, and paired with rather slow GDDR6 memory... Huawei's Ascend 910 can do 256T FP16 FLOPS I think, and it uses HBM2. It was also launched 4 years ago. Also, these cards have display outputs? I guess they are identical silicon to their consumer chips. Full compatibility with CUDA sounds almost too good to be true. AMD spent years working on HIP and it has only recently become acceptable. But if it's true, that will be a huge plus for Moore Threads.
#2
john_
MUSA, which is not CUDA, and RDMA, which is not RDNA.

It must be taking them weeks to come up with these names. So original....