News Posts matching #MTT S4000


Moore Threads Teases Excellent Performance of DeepSeek-R1 Model on MTT GPUs

Moore Threads, a Chinese manufacturer of proprietary GPU designs, is reportedly the latest company to jump onto the DeepSeek-R1 bandwagon. Since late January, NVIDIA, Microsoft and AMD have swooped in with their own interpretations and deployments. By global standards, Moore Threads GPUs trail behind Western-developed offerings; early 2024 evaluations showed the firm's MTT S80 dedicated desktop graphics card struggling against AMD's integrated Radeon 760M. The recent emergence of DeepSeek's open source models has signalled a shift away from reliance on extremely powerful and expensive AI-crunching hardware (often accessed via the cloud), and widespread excitement has been generated by DeepSeek solutions being relatively frugal in terms of processing requirements. Tom's Hardware has observed cases of open source AI models running locally on "inexpensive hardware, like the Raspberry Pi."

According to recent Chinese press coverage, Moore Threads has announced a successful deployment of DeepSeek's R1-Distill-Qwen-7B distilled model on the aforementioned MTT S80 GPU. The company also revealed that it had taken similar steps with its MTT S4000 datacenter-oriented graphics hardware. On the subject of adaptation, a Moore Threads spokesperson stated: "based on the Ollama open source framework, Moore Threads completed the deployment of the DeepSeek-R1-Distill-Qwen-7B distillation model and demonstrated excellent performance in a variety of Chinese tasks, verifying the versatility and CUDA compatibility of Moore Threads' self-developed full-featured GPU." Exact performance figures, benchmark results and technical details were not disclosed to the Chinese public, so Moore Threads appears to be teasing the prowess of its MTT GPU designs. ITHome reported that: "users can also perform inference deployment of the DeepSeek-R1 distillation model based on MTT S80 and MTT S4000. Some users have previously completed the practice manually on MTT S80." Moore Threads believes that its: "self-developed high-performance inference engine, combined with software and hardware co-optimization technology, significantly improves the model's computing efficiency and resource utilization through customized operator acceleration and memory management. This engine not only supports the efficient operation of the DeepSeek distillation model, but also provides technical support for the deployment of more large-scale models in the future."
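Moore Threads has not published any deployment details beyond naming Ollama as the framework, but Ollama's standard REST API gives a sense of what "inference deployment" of a distilled model looks like in practice. The sketch below is a minimal, hypothetical example of querying a locally served model; the model tag `deepseek-r1:7b` (Ollama's name for the Qwen-7B distill) and the default port are assumptions, and nothing here is specific to the MUSA backend.

```python
import json
import urllib.request

# Ollama serves a REST API on port 11434 by default. The model tag below is
# an assumption (Ollama's public name for the DeepSeek-R1 Qwen-7B distill);
# the article gives no deployment specifics for Moore Threads hardware.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt: str, model: str = "deepseek-r1:7b") -> urllib.request.Request:
    """Package a single non-streaming generation request for Ollama."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt: str) -> str:
    """Send the request and return the model's reply (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]

# Example, with an Ollama server running locally:
#   print(ask("用一句话解释什么是模型蒸馏。"))  # a Chinese-language task, as in the article
```

The same request shape works regardless of which GPU backend Ollama is compiled against, which is presumably what makes this a convenient demonstration vehicle for a non-NVIDIA stack.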

Moore Threads MTLink Scales Up to 10,000 Home-Grown GPUs in AI Cluster

Chinese GPU manufacturer Moore Threads has announced a significant upgrade to its KUAE data center server. The company can now connect up to 10,000 GPUs in a single cluster, marking a huge leap in its scale-out capabilities for artificial intelligence and high-performance computing applications. The enhanced KUAE server incorporates eight MTT S4000 GPUs, leveraging Moore Threads' proprietary MTLink interconnect technology. These GPUs, based on the MUSA architecture, each feature 128 tensor cores and 48 GB of GDDR6 memory with a bandwidth of 768 GB/s. While the full performance metrics of a 10,000-GPU cluster remain undisclosed, the sheer scale of 1,280,000 tensor cores suggests decent computing potential. Moore Threads' GPUs currently lag behind NVIDIA's offerings in terms of performance. However, the company claims its MTT S4000 remains competitive against certain NVIDIA models, particularly in large language model training and inference tasks.
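The headline core count can be sanity-checked from the per-card specifications alone. A quick back-of-the-envelope tally, using only the figures quoted above:

```python
# Aggregate totals for a 10,000-GPU KUAE cluster, derived purely from the
# quoted per-card specs (128 tensor cores, 48 GB GDDR6, 768 GB/s).
GPUS = 10_000
TENSOR_CORES_PER_GPU = 128
VRAM_GB_PER_GPU = 48
BW_GBS_PER_GPU = 768

total_cores = GPUS * TENSOR_CORES_PER_GPU       # 1,280,000 tensor cores
total_vram_tb = GPUS * VRAM_GB_PER_GPU / 1_000  # 480 TB of GDDR6
total_bw_tbs = GPUS * BW_GBS_PER_GPU / 1_000    # 7,680 TB/s aggregate bandwidth

print(f"{total_cores:,} tensor cores, {total_vram_tb:.0f} TB VRAM, {total_bw_tbs:,.0f} TB/s")
```

The aggregate memory figure is notable in its own right: 480 TB of VRAM across the cluster is ample headroom for sharded training of models in the hundreds-of-billions-of-parameters range.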

The Chinese company is facing significant challenges due to its inclusion on the U.S. Department of Commerce's Entity List, which restricts its access to advanced manufacturing processes. Despite these obstacles, the firm has secured partnerships with major Chinese state-run telecom operators and technology companies, focusing on developing new computing cluster projects. A recent financing round that raised approximately $343.7 million will help fund Moore Threads' ambitious expansion plans. However, limited access to cutting-edge semiconductor fabrication technologies may constrain the company's future growth. Nonetheless, creating a scale-out server infrastructure with up to 10,000 GPUs is vital for LLM training and inference, especially as Chinese AI labs catch up to Western labs in terms of the performance of their AI models.

Moore Threads Launches MTT S4000 48 GB GPU for AI Training/Inference and Presents 1000-GPU Cluster

Chinese chipmaker Moore Threads has launched its first domestically-produced 1000-card AI training cluster, dubbed the KUAE Intelligent Computing Center. A central part of the KUAE cluster is Moore Threads' new MTT S4000 accelerator card, which carries 48 GB of VRAM with 768 GB/s of memory bandwidth and utilizes the company's third-generation MUSA GPU architecture. In FP32, the card can output 25 TeraFLOPS; in TF32, it can achieve 50 TeraFLOPS; and in FP16/BF16, up to 200 TeraFLOPS. Also supported is INT8 at 200 TOPS. The MTT S4000 targets both training and inference, leveraging Moore Threads' high-speed MTLink 1.0 intra-system interconnect to scale cards for distributed parallel training of models with hundreds of billions of parameters. The card also provides graphics, video encoding/decoding, and 8K display capabilities for graphics workloads. Moore Threads' KUAE cluster combines the S4000 GPU hardware with RDMA networking, distributed storage, and integrated cluster management software. The KUAE Platform oversees multi-datacenter resource allocation and monitoring. KUAE ModelStudio hosts training frameworks and model repositories to streamline development.
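Scaling the quoted per-card throughput to one eight-card KUAE server is simple multiplication; the eight-cards-per-server figure comes from Moore Threads' KUAE announcements, and the totals below assume ideal peak rates with no interconnect overhead:

```python
# Peak per-card throughput quoted for the MTT S4000, aggregated to one
# eight-card KUAE server. These are theoretical peaks, not measured numbers.
S4000_TFLOPS = {"FP32": 25, "TF32": 50, "FP16/BF16": 200}
CARDS_PER_NODE = 8

for fmt, tflops in S4000_TFLOPS.items():
    print(f"{fmt:>9}: {tflops:4d} TFLOPS/card -> {tflops * CARDS_PER_NODE:5d} TFLOPS/node")
```

At 1.6 PetaFLOPS of peak FP16/BF16 per node, the 1000-card cluster's nominal ceiling sits around 200 PetaFLOPS in half precision.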

With integrated solutions now proven at thousands of GPUs, Moore Threads is positioned to power ubiquitous intelligent applications - from scientific computing to the metaverse. The KUAE cluster reportedly achieves near-linear 91% scaling. Taking 200 billion training tokens as an example, Zhiyuan Research Institute's 70 billion parameter Aquila2 can complete training in 33 days; a model with 130 billion parameters can complete training in 56 days on the KUAE cluster. In addition, the Moore Threads KUAE kilo-card cluster supports long-term continuous and stable operation, supports resuming training from breakpoints, and completes asynchronous checkpoints in under two minutes. For software, Moore Threads also boasts full compatibility with NVIDIA's CUDA framework, where its MUSIFY tool translates CUDA code to the MUSA GPU architecture at supposedly zero cost of migration, i.e., no performance penalty.
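The quoted 33-day figure can be cross-checked against the hardware's peak throughput using the standard ~6·N·D approximation for dense-transformer training FLOPs. That approximation, and the use of the FP16/BF16 peak as the baseline, are assumptions on our part; Moore Threads discloses no utilization numbers:

```python
# Rough cross-check of the quoted Aquila2-70B training time, using the
# common 6 * params * tokens estimate of training FLOPs (an assumption;
# the article itself reports no utilization figures).
PARAMS = 70e9        # Aquila2, 70 billion parameters
TOKENS = 200e9       # "200 billion training tokens"
GPUS = 1000          # kilo-card KUAE cluster
PEAK_FLOPS = 200e12  # MTT S4000 FP16/BF16 peak, per card
DAYS = 33

train_flops = 6 * PARAMS * TOKENS              # ~8.4e22 FLOPs to train
budget = GPUS * PEAK_FLOPS * DAYS * 86_400     # cluster FLOPs available in 33 days
mfu = train_flops / budget
print(f"implied utilization: {mfu:.1%}")       # roughly 15% of peak
```

An implied utilization in the mid-teens is plausible for a first-generation training cluster, which suggests the quoted schedule is internally consistent rather than marketing fiction.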