News Posts matching #DeepSeek


China's Largest AI Firms Reportedly Forked Out ~$16 Billion Total for NVIDIA H20 GPU Supplies in 2025

Last week, industry reports pointed to evidence of NVIDIA H20 AI GPU shortages in China—supply chain insiders expressed frustration about limited availability and alleged price hikes. Days later, local media outlets disclosed staggering sales figures. Two unnamed sources claim that the likes of Tencent, Alibaba and ByteDance spent roughly US$16 billion on H20 purchases across the first three months of 2025. Back in February, Reuters noted an extraordinary surge in orders for "(Team Green's) H20 model—the most advanced AI processor legally available in China under U.S. export controls—driven by booming demand for Chinese startup DeepSeek's low-cost AI models." The unprecedented rush to secure precious AI-crunching hardware was likely motivated by whispers of tightened restrictions coming from across the Pacific.

Curiously, local government bodies have allegedly "advised" a stoppage of H20 orders—a recent Financial Times article suggested that this message was directed at the nation's largest AI players (mentioned above). A few industry moles believe that NVIDIA's engineering department is working on another Chinese market-exclusive AI chip, although it is not clear whether a new entrant would be designed to conform to recently introduced environmental regulations, reportedly "not very strict" in nature. Anonymous sources have also hinted at an upgraded H20 variant sporting HBM3E memory modules.

AMD Instinct GPUs are Ready to Take on Today's Most Demanding AI Models

Customers evaluating AI infrastructure today rely on a combination of industry-standard benchmarks and real-world model performance metrics—such as those from Llama 3.1 405B, DeepSeek-R1, and other leading open-source models—to guide their GPU purchase decisions. At AMD, we believe that delivering value across both dimensions is essential to driving broader AI adoption and real-world deployment at scale. That's why we take a holistic approach—optimizing performance for rigorous industry benchmarks like MLPerf while also enabling Day 0 support and rapid tuning for the models most widely used in production by our customers.

This strategy helps ensure AMD Instinct GPUs deliver not only strong, standardized performance, but also high-throughput, scalable AI inferencing across the latest generative and language models used by customers. We will explore how AMD's continued investment in benchmarking, open model enablement, software and ecosystem tools helps unlock greater value for customers—from MLPerf Inference 5.0 results to Llama 3.1 405B and DeepSeek-R1 performance, ROCm software advances, and beyond.

NVIDIA H20 AI GPU at Risk in China, Due to Revised Energy-efficiency Guidelines & Supply Problems

NVIDIA's Chinese market-exclusive H20 AI GPU faces an uncertain future, due to recently introduced energy-efficiency guidelines and supply problems. As covered over a year ago, Team Green readied a regional alternative to its "full fat" H800 "Hopper" AI GPU, designed and/or neutered to comply with US sanctions. Despite being less performant than its Western siblings, the H20 model proved highly popular by mid-2024—industry analysis projected "$12 billion in take-home revenue" for NVIDIA. According to a fresh Reuters news piece, demand for the cut-down "Hopper" hardware has surged throughout early 2025. The report cites "a rush to adopt Chinese AI startup DeepSeek's cost-effective AI models" as the main driver behind the increased snap-up rate of H20 chips, with the nation's "big three" AI players—Tencent, Alibaba and ByteDance—accounting for the majority of sales.

The supply of H20 AI GPUs seems to be under threat on several fronts; Reuters points out that "U.S. officials were considering curbs on sales of H20 chips to China" back in January. Returning to the present day, the report sources "unofficial" statements from H3C—one of China's largest server equipment manufacturers and a key OEM partner for NVIDIA. An anonymous company insider outlined a murky outlook: "H20's international supply chain faces significant uncertainties...We were told the chips would be available, but when it came time to actually purchase them, we were informed they had already been sold at higher prices." More (rumored) bad news has arrived in the shape of alleged Chinese government intervention—the Financial Times posits that local regulators have privately advised Tencent, Alibaba and ByteDance not to purchase NVIDIA H20 chips.

LG Develops Custom "EXAONE Deep" State of The Art Reasoning AI Model

LG AI Research (yes, that LG, the consumer electronics maker) has launched EXAONE Deep, a high-performance reasoning AI that demonstrates exceptional capabilities in mathematical logic, scientific concepts, and programming challenges despite its relatively small parameter count. The flagship 32B model achieves performance metrics rivaling substantially larger models like GPT-4o and DeepSeek R1, while the 7.8B and 2.4B variants establish new benchmarks in the lightweight and on-device AI categories. The EXAONE Deep 32B model registered a 94.5 score on the CSAT 2025 Mathematics section and 90.0 on AIME 2024, outperforming competing models while requiring only 5% of the computational resources of larger alternatives like DeepSeek-R1 (671B).

In scientific reasoning, it achieved a 66.1 score on the GPQA Diamond test, which evaluates PhD-level problem-solving across physics, chemistry, and biology. The model's 83.0 score on MMLU establishes it as the highest-performing domestically developed model in South Korea, underscoring LG's approach to creating efficient, high-performance AI systems. Particularly notable is the performance of the smaller variants: the 7.8B model scored 94.8 on MATH-500 and 59.6 on AIME 2025, while the 2.4B model achieved 92.3 on MATH-500 and 47.9 on AIME 2024. These results position EXAONE Deep's smaller models at the top of their respective categories across all major benchmarks, suggesting significant potential for deployment in resource-constrained environments. Topping out at 32 billion parameters, the lineup is well suited to single-GPU deployments, and the smaller models can run on a range of discrete GPUs, laptop GPUs, and edge systems that lack massive computational power.
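For context, a minimal sketch of loading one of the smaller variants with Hugging Face transformers is shown below. The repository ID and chat-template flow are assumptions, so the official EXAONE model card should be treated as the authority on exact usage.

```python
# Minimal sketch: running the 7.8B EXAONE Deep variant locally with
# Hugging Face transformers. The repo ID below is an assumption; check
# LG AI Research's official model card before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LGAI-EXAONE/EXAONE-Deep-7.8B"  # assumed repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # ~16 GB at bf16, fits a single 24 GB GPU
    device_map="auto",
    trust_remote_code=True,  # EXAONE ships a custom model class
)

messages = [{"role": "user", "content": "How many primes are there below 100?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=1024)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```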

Intel China Presentation Slide Indicates Early 2026 Volume Launch of Panther Lake

According to an attendee of a recent Intel AI presentation, company representatives revealed a release timeline for the next-gen Core Ultra "Panther Lake" mobile processor family. Team Blue's China office appears to be courting users of DeepSeek R1, as evidenced by a processor product roadmap shared by meng59739449 (machine translated by VideoCardz). A volume launch of the Core Ultra 300 "Panther Lake-H" series seems to be on the cards for Q1 2026. Earlier this month, an Intel executive insisted that Panther Lake was on track for a second-half 2025 rollout. Lately, industry moles have alleged that a "problematic 18A node" process has caused delays across new-generation product lines.

Team Blue watchdogs reckon that high-volume manufacturing (HVM) of Panther Lake chips will kick off in September. By October, an Early Enablement Program (EEP) is expected to start, with samples sent off to OEMs for full approval. Industry experts believe that Intel will follow a familiar pattern of "announcing the processor in the second half of the prior year, but ramping up mass production in the following year." Previous-generation mobile CPU platforms—Meteor Lake and Lunar Lake—received similar treatment in the recent past. Last week, a Panther Lake-H (PTL-H) sample was on general display at Embedded World 2025; Intel's German office is similarly engaged in hyping up the AI-crunching capabilities of roadmapped products.

Global Top 10 IC Design Houses See 49% YoY Growth in 2024, NVIDIA Commands Half the Market

TrendForce reveals that the combined revenue of the world's top 10 IC design houses reached approximately US$249.8 billion in 2024, marking a 49% YoY increase. The booming AI industry has fueled growth across the semiconductor sector, with NVIDIA leading the charge, posting an astonishing 125% revenue growth, widening its lead over competitors, and solidifying its dominance in the IC industry.

Looking ahead to 2025, advancements in semiconductor manufacturing will further enhance AI computing power, with LLMs continuing to emerge. Open-source models like DeepSeek could lower AI adoption costs, accelerating AI penetration from servers to personal devices. This shift positions edge AI devices as the next major growth driver for the semiconductor industry.

AMD & Nexa AI Reveal NexaQuant's Improvement of DeepSeek R1 Distill 4-bit Capabilities

Nexa AI today announced NexaQuants of two DeepSeek R1 distills: DeepSeek R1 Distill Qwen 1.5B and DeepSeek R1 Distill Llama 8B. Popular quantization methods like llama.cpp's Q4_K_M allow large language models to significantly reduce their memory footprint, typically trading off only a low perplexity loss for dense models. However, even a low perplexity loss can degrade the reasoning capability of (dense or MoE) models that rely on Chain of Thought traces. Nexa AI states that NexaQuants recover this reasoning capability loss (compared to full 16-bit precision) while keeping the 4-bit quantization and retaining its performance advantage. Benchmarks provided by Nexa AI can be seen below.

We can see that the Q4_K_M-quantized DeepSeek R1 distills score slightly lower (except on the AIME24 bench for the Llama 3 8B distill, where the drop is significant) in LLM benchmarks like GPQA and AIME24 compared to their full 16-bit counterparts. Moving to Q6 or Q8 quantization would be one way to fix this problem, but it would make the model slightly slower to run and require more memory. Nexa AI states that NexaQuants use a proprietary quantization method to recover the loss while keeping the quantization at 4 bits, meaning users can theoretically get the best of both worlds: accuracy and speed.
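As an illustration of the workflow these quantized distills slot into, here is a minimal sketch of loading a 4-bit GGUF build with llama-cpp-python. The file path is a placeholder, and this is generic llama.cpp tooling rather than Nexa AI's own method; a NexaQuant GGUF would be loaded the same way.

```python
# Minimal sketch: running a Q4_K_M GGUF build of a DeepSeek R1 distill
# via llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Llama-8B-Q4_K_M.gguf",  # placeholder path
    n_ctx=8192,        # reasoning traces are long; leave context headroom
    n_gpu_layers=-1,   # offload all layers to the GPU if VRAM allows
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Is 221 prime? Think step by step."}],
    max_tokens=2048,
)
print(out["choices"][0]["message"]["content"])
```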

Supplier Production Cuts and AI Demand Expected to Drive NAND Flash Price Recovery in 2H25

TrendForce's latest findings reveal that the NAND Flash market continues to be plagued by oversupply in the first quarter of 2025, leading to sustained price declines and financial strain for suppliers. However, TrendForce anticipates a significant improvement in the market's supply-demand balance in the second half of the year.

Key factors contributing to this shift include proactive production cuts by manufacturers, inventory reductions in the smartphone sector, and growing demand driven by AI and DeepSeek applications. These elements are expected to alleviate oversupply and support a price rebound for NAND Flash.

DeepSeek Reportedly Pursuing Development of Proprietary AI Chip

The notion of designing tailor-made AI-crunching chips is nothing new; several major organizations—with access to big coffers—are engaged in the formulation of proprietary hardware. A new DigiTimes Asia report suggests that DeepSeek is the latest company to jump on the in-house design bandwagon. The publication's insider network believes that the open-source large language model development house has "initiated a major recruitment drive for semiconductor design talent, signaling potential plans to develop its proprietary processors." The recent news cycle has highlighted DeepSeek's heavy reliance on the NVIDIA ecosystem, despite alternative options emerging from local sources.

Industry watchdogs believe that DeepSeek has access to 10,000 sanction-approved Team Green "Hopper" H800 AI chips, plus 10,000 (now banned) H100 AI GPUs. Around late January, Scale AI CEO Alexandr Wang claimed that the organization could utilize up to 50,000 H100 chips for model training purposes. This unsubstantiated declaration raised eyebrows, given current global political tensions. Press outlets have speculated that DeepSeek is in no rush to reveal its full deck of cards, but the firm appears to have a competitive volume of resources when lined up against Western competitors. The DigiTimes news article did not provide any detailed insight into the rumored in-house chip design. DeepSeek faces a major challenge: the Chinese semiconductor industry trails behind market-leading regions. Will local foundries be able to provide an advanced enough node process for the required purposes?

Moore Threads Teases Excellent Performance of DeepSeek-R1 Model on MTT GPUs

Moore Threads, a Chinese manufacturer of proprietary GPU designs, is (reportedly) the latest company to jump onto the DeepSeek-R1 bandwagon. Since late January, NVIDIA, Microsoft and AMD have swooped in with their own interpretations/deployments. By global standards, Moore Threads GPUs trail behind Western-developed offerings—early 2024 evaluations showed the firm's MTT S80 dedicated desktop graphics card struggling against an AMD integrated solution: the Radeon 760M. The recent emergence of DeepSeek's open-source models has signaled a shift away from reliance on extremely powerful and expensive AI-crunching hardware (often accessed via the cloud); widespread excitement has been generated by DeepSeek solutions being relatively frugal in terms of processing requirements. Tom's Hardware has observed cases of open-source AI models running (locally) on "inexpensive hardware, like the Raspberry Pi."

According to recent Chinese press coverage, Moore Threads has announced a successful deployment of DeepSeek's R1-Distill-Qwen-7B distilled model on the aforementioned MTT S80 GPU. The company also revealed that it had taken similar steps with its MTT S4000 datacenter-oriented graphics hardware. On the subject of adaptation, a Moore Threads spokesperson stated: "based on the Ollama open source framework, Moore Threads completed the deployment of the DeepSeek-R1-Distill-Qwen-7B distillation model and demonstrated excellent performance in a variety of Chinese tasks, verifying the versatility and CUDA compatibility of Moore Threads' self-developed full-featured GPU."

Exact performance figures, benchmark results and technical details were not disclosed, so Moore Threads appears to be teasing the prowess of its MTT GPU designs. ITHome reported that "users can also perform inference deployment of the DeepSeek-R1 distillation model based on MTT S80 and MTT S4000. Some users have previously completed the practice manually on MTT S80." Moore Threads believes that its "self-developed high-performance inference engine, combined with software and hardware co-optimization technology, significantly improves the model's computing efficiency and resource utilization through customized operator acceleration and memory management. This engine not only supports the efficient operation of the DeepSeek distillation model, but also provides technical support for the deployment of more large-scale models in the future."
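For reference, the stock Ollama workflow for the same distilled model on commodity hardware looks roughly like the sketch below, using the ollama Python client; the "deepseek-r1:7b" tag is Ollama's label for the Qwen-7B distill. Moore Threads' MTT GPUs would rely on the company's own MUSA-enabled build rather than this vanilla setup.

```python
# Minimal sketch: querying DeepSeek-R1-Distill-Qwen-7B through a running
# Ollama server (pip install ollama; ollama pull deepseek-r1:7b).
import ollama

response = ollama.chat(
    model="deepseek-r1:7b",  # DeepSeek-R1-Distill-Qwen-7B library tag
    messages=[{"role": "user", "content": "用中文解释一下什么是蒸馏模型。"}],
)
print(response["message"]["content"])
```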

Huawei Delivers Record $118 Billion Revenue with 22% Yearly Growth Despite US Sanctions

Huawei Technologies reported a robust 22% year-over-year revenue increase for 2024, reaching 860 billion yuan ($118.27 billion), demonstrating remarkable resilience amid continued US-imposed trade restrictions. The Chinese tech giant's resurgence was primarily driven by its revitalized smartphone division, which captured 16% of China's domestic market share, overtaking Apple in regional sales. This achievement was notably accomplished by deploying domestically produced chipsets, marking a significant milestone for the company: in collaboration with China's SMIC, Huawei pairs in-house silicon with HarmonyOS for complete vertical integration. The company's strategic diversification into automotive technology has emerged as a crucial growth vector, with its smart car solutions unit delivering autonomous driving software and specialized chips to Chinese EV manufacturers.

In parallel, Huawei's Ascend 910B/C platform recently gained compatibility with DeepSeek's R1 large language model, with availability on Chinese AI cloud providers like SiliconFlow. Through a strategic partnership with AI infrastructure startup SiliconFlow, Huawei is enhancing its Ascend cloud service capabilities, further strengthening its competitive position in the global AI hardware market despite ongoing international trade challenges. Even if the company can't compete on raw performance with the latest solutions from NVIDIA and AMD, due to lacking the advanced manufacturing required for cutting-edge AI accelerators, it can compete on cost and deliver a much more attractive price/performance ratio. Huawei's Ascend AI solutions deliver modest performance, but the pricing makes AI model inference very cheap, with API costs of around one yuan per million input tokens and four yuan per million output tokens on DeepSeek R1.

AMD Faces Investor Skepticism as AI Market Moves Toward Custom Chips

AMD is set to share its fourth-quarter results on Tuesday, Feb. 4, facing both opportunities and problems in the fast-changing AI chip market, with investors expected to look closely at AMD's AI strategy. Reuters reports that analysts think AMD's revenue will increase by over 22% to $7.53 billion, with its data center segment making up more than half of total sales at $4.15 billion. Yet investors still worry about how AMD stands in the AI race. TD Cowen analysts and Omdia believe AMD could sell $10 billion worth of AI chips this year, twice the $5 billion AMD itself forecasts. However, the scene is getting more complex, with Big Tech firms like Microsoft, Amazon, and Meta making their own custom chips for AI work. This move to custom silicon, along with NVIDIA's strong market position and its popular CUDA software, makes things tough for AMD. The high costs of switching chipmakers also make it hard for AMD to grow its market share, though the ongoing increase in AI spending by tech giants could help balance out these problems. Investors see "custom silicon and NVIDIA as the AI chip market going forward," said Ryuta Makino, analyst at AMD investor Gabelli Funds.

Supply chain issues complicate AMD's position: TSMC is boosting its advanced packaging capacity to fix bottlenecks, while NVIDIA's production ramp of its new "Blackwell" AI chips might restrict AMD's access to manufacturing resources. Yet there is some good news for AMD's business: its personal computer unit should grow by almost 33% to $1.94 billion, catching up to Intel.

NVIDIA GeForce RTX 50 Series AI PCs Accelerate DeepSeek Reasoning Models

The recently released DeepSeek-R1 model family has brought a new wave of excitement to the AI community, allowing enthusiasts and developers to run state-of-the-art reasoning models with problem-solving, math and code capabilities, all from the privacy of local PCs. With up to 3,352 trillion operations per second of AI horsepower, NVIDIA GeForce RTX 50 Series GPUs can run the DeepSeek family of distilled models faster than anything on the PC market.

A New Class of Models That Reason
Reasoning models are a new class of large language models (LLMs) that spend more time on "thinking" and "reflecting" to work through complex problems, while describing the steps required to solve a task. The fundamental principle is that any problem can be solved with deep thought, reasoning and time, just like how humans tackle problems. By spending more time—and thus compute—on a problem, the LLM can yield better results. This phenomenon is known as test-time scaling, where a model dynamically allocates compute resources during inference to reason through problems. Reasoning models can enhance user experiences on PCs by deeply understanding a user's needs, taking actions on their behalf and allowing them to provide feedback on the model's thought process—unlocking agentic workflows for solving complex, multi-step tasks such as analyzing market research, performing complicated math problems, debugging code and more.
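One concrete test-time scaling technique is self-consistency: sample several independent reasoning chains and take a majority vote over their final answers. The sketch below illustrates the idea with a stubbed-out sampler standing in for any real chat-completion call.

```python
# Minimal sketch of self-consistency, one test-time scaling technique:
# spend more inference compute (more samples) to get a better answer.
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Stand-in for one stochastic reasoning pass (temperature > 0)."""
    return random.choice(["42", "42", "42", "41"])  # toy answer distribution

def self_consistency(question: str, n_samples: int = 8) -> str:
    # More samples = more compute = better odds the majority answer
    # is correct; this is the compute/quality tradeoff in action.
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    answer, count = votes.most_common(1)[0]
    print(f"{count}/{n_samples} passes agreed on {answer!r}")
    return answer

self_consistency("What is 6 * 7?")
```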

Huawei Ascend 910B Accelerators Power Cloud Infrastructure for DeepSeek R1 Inference

When High-Flyer, the hedge fund behind DeepSeek, debuted its flagship model, DeepSeek R1, the tech world was shaken. Few expected a Chinese AI company to produce a high-quality AI model that rivals the best from OpenAI and Anthropic. While there are rumors that DeepSeek has access to 50,000 NVIDIA "Hopper" GPUs, including H100, H800, and H20, it seems that Huawei is ready to power Chinese AI infrastructure with its own AI accelerators. According to the South China Morning Post, Chinese cloud providers like SiliconFlow.cn are offering DeepSeek AI models for inference on Huawei Ascend 910B accelerators. At a price of only one yuan per million input tokens and four yuan per million output tokens, this economic model of AI hosting fundamentally undercuts competition such as US-based cloud providers that offer DeepSeek R1 for $7 per million tokens.

Not only is running on the Huawei Ascend 910B cheaper for cloud providers; we have also reported that it is cheaper for DeepSeek itself, which serves its chat app on the Huawei Ascend 910C. Using domestic accelerators lowers the total cost of ownership, with savings passed down to users. Western clients who prefer AI inference served by Western companies will have to pay a heftier price tag, often underpinned by the high prices of GPUs like the NVIDIA H100, B100, and AMD Instinct MI300X.
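A quick back-of-the-envelope comparison using the figures above makes the gap concrete; the exchange rate and the even input/output split are illustrative assumptions.

```python
# Cost comparison from the article's figures: 1 yuan per million input
# tokens and 4 yuan per million output tokens on Ascend-hosted R1,
# versus roughly $7 per million tokens at US-based cloud providers.
CNY_PER_USD = 7.3                   # assumed exchange rate
tokens_in = tokens_out = 500_000    # 1M tokens total, split evenly (assumption)

ascend_cny = tokens_in / 1e6 * 1.0 + tokens_out / 1e6 * 4.0
ascend_usd = ascend_cny / CNY_PER_USD
us_usd = (tokens_in + tokens_out) / 1e6 * 7.0

print(f"Ascend-hosted: ${ascend_usd:.3f}")            # ~$0.342
print(f"US-hosted:     ${us_usd:.2f}")                # $7.00
print(f"Roughly {us_usd / ascend_usd:.0f}x cheaper")  # ~20x
```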

US Investigates Possible "Singapore" Loophole in China's Access to NVIDIA GPUs

Today, Bloomberg reported that the US government under the Trump administration is probing whether Chinese AI company DeepSeek circumvented export restrictions to acquire advanced NVIDIA GPUs through Singaporean intermediaries. The investigation follows concerns that DeepSeek's AI model, R1—reportedly rivaling leading systems from OpenAI and Google—may have been trained using restricted hardware that is barred from export to China. Singapore's role in NVIDIA's global sales has surged, with the nation accounting for 22% of the chipmaker's revenue in Q3 FY2025, up from 9% in Q3 FY2023. This spike coincides with tightened US export controls on AI chips to China, prompting speculation that Singapore serves as a conduit for Chinese firms to access high-end GPUs like the H100, which cannot be sold directly to China.

DeepSeek has not disclosed hardware details for R1 but revealed its earlier V3 model was trained using 2,048 H800 GPUs (2.8 million GPU hours), achieving efficiency surpassing Meta's Llama 3, which required 30.8 million GPU hours. Analysts suggest R1's performance implies even more powerful infrastructure, potentially involving restricted chips. US authorities, including the White House and FBI, are examining whether third parties in Singapore facilitated the transfer of controlled GPUs to DeepSeek. A well-known semiconductor analyst firm, SemiAnalysis, believes that DeepSeek acquired around 50,000 NVIDIA Hopper GPUs, which includes a mix of H100, H800, and H20. NVIDIA clarified that its reported Singapore revenue reflects "bill to" customer locations, not final destinations, stating most products are routed to the US or Western markets.
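For scale, the V3 figures above can be sanity-checked with simple arithmetic: 2.8 million GPU-hours across 2,048 GPUs implies roughly two months of wall-clock training, assuming near-continuous cluster utilization.

```python
# Sanity-checking the training figures cited above.
gpu_hours = 2_800_000   # reported V3 training budget
num_gpus = 2_048        # reported H800 cluster size

wall_clock_hours = gpu_hours / num_gpus
print(f"{wall_clock_hours:.0f} hours ≈ {wall_clock_hours / 24:.0f} days")
# -> 1367 hours ≈ 57 days of continuous training

# Meta's Llama 3 figure for comparison:
print(f"Llama 3 used {30_800_000 / gpu_hours:.0f}x more GPU-hours")  # ~11x
```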

DeepSeek-R1 Goes Live on NVIDIA NIM

DeepSeek-R1 is an open model with state-of-the-art reasoning capabilities. Instead of offering direct responses, reasoning models like DeepSeek-R1 perform multiple inference passes over a query, conducting chain-of-thought, consensus and search methods to generate the best answer. Performing this sequence of inference passes—using reason to arrive at the best answer—is known as test-time scaling. DeepSeek-R1 is a perfect example of this scaling law, demonstrating why accelerated computing is critical for the demands of agentic AI inference.

As models are allowed to iteratively "think" through the problem, they create more output tokens and longer generation cycles, so model quality continues to scale. Significant test-time compute is critical to enable both real-time inference and higher-quality responses from reasoning models like DeepSeek-R1, requiring larger inference deployments. R1 delivers leading accuracy for tasks demanding logical inference, reasoning, math, coding and language understanding while also delivering high inference efficiency.
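In practice, NIM microservices expose an OpenAI-compatible API. The sketch below shows one plausible way to stream a response from a hosted DeepSeek-R1 endpoint; the base URL and model slug follow NVIDIA's hosted-API conventions but should be treated as assumptions to verify against current documentation, and a self-hosted NIM would expose the same API on a local address.

```python
# Minimal sketch: streaming from a DeepSeek-R1 NIM via its
# OpenAI-compatible endpoint (pip install openai).
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed hosted endpoint
    api_key="nvapi-...",                             # your NVIDIA API key
)

stream = client.chat.completions.create(
    model="deepseek-ai/deepseek-r1",  # assumed model slug
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    temperature=0.6,
    max_tokens=4096,                  # headroom for the long thinking trace
    stream=True,                      # watch the chain of thought arrive
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```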

Microsoft Enables Distilled DeepSeek R1 Models on Copilot+ PCs, Starting with Qualcomm Snapdragon X

Microsoft will roll out "NPU-optimized versions of DeepSeek-R1" directly to Copilot+ PCs—yesterday's announcement revealed that Qualcomm Snapdragon X-equipped systems will be first in line to receive support. Owners of devices with Intel Core Ultra 200V "Lunar Lake" processors will have to wait a little longer, and reports suggest that AMD Ryzen AI 9 HX-based Copilot+ PCs will be third in Microsoft's queue. Interestingly, Team Red has published DeepSeek R1 model-related guides for Radeon RX graphics cards and Ryzen AI processors. Microsoft's first release will be based on DeepSeek-R1-Distill-Qwen-1.5B, made available in AI Toolkit; teased 7B and 14B variants are expected to "arrive soon."

Microsoft reckons that the optimized models will "let developers build and deploy AI-powered applications that run efficiently on-device, taking full advantage of the powerful Neural Processing Units (NPUs) in Copilot+ PCs." The on-board AI-crunching solution is advertised as a tool for empowerment, allowing "developers to tap into powerful reasoning engines to build proactive and sustained experiences. With our work on Phi Silica, we were able to harness highly efficient inferencing—delivering very competitive time to first token and throughput rates, while minimally impacting battery life and consumption of PC resources." Western companies appear to be racing to adopt DeepSeek's open-source model, due to apparent cost benefits. Certain North American organizations have disclosed their own views and reservations, but others will happily pay less for a potent alternative to locally developed systems. In a separate bulletin (also posted on January 29), Microsoft's AI platform team revealed that a cloud-hosted DeepSeek R1 model is available on Azure AI Foundry and GitHub.
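A minimal sketch of calling such a cloud-hosted deployment through the azure-ai-inference client is shown below; the endpoint URL is a placeholder for a real Azure AI Foundry deployment, so Microsoft's Foundry documentation should be consulted for the exact setup.

```python
# Minimal sketch: querying a cloud-hosted DeepSeek R1 deployment on
# Azure AI Foundry (pip install azure-ai-inference).
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<your-deployment>.eastus.models.ai.azure.com",  # placeholder
    credential=AzureKeyCredential("<your-key>"),                      # placeholder
)

response = client.complete(
    messages=[UserMessage(content="Summarize test-time scaling in two sentences.")],
    max_tokens=2048,
)
print(response.choices[0].message.content)
```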

AMD Details DeepSeek R1 Performance on Radeon RX 7900 XTX, Confirms Ryzen AI Max Memory Sizes

AMD today put out detailed guides on how to get DeepSeek R1 distilled reasoning models running on Radeon RX graphics cards and Ryzen AI processors. The guide confirms that the new Ryzen AI Max "Strix Halo" processors come hardwired to LPCAMM2 memory configurations of 32 GB, 64 GB, and 128 GB, and there won't be a 16 GB memory option for notebook manufacturers to cheap out with. The guide goes on to explain that "Strix Halo" will be able to locally accelerate the 70-billion-parameter DeepSeek-R1-Distill-Llama on the 64 GB and 128 GB memory configurations of "Strix Halo"-powered notebooks, while the 32 GB model should be able to run DeepSeek-R1-Distill-Qwen-32B. Ryzen AI "Strix Point" mobile processors should be capable of running DeepSeek-R1-Distill-Qwen-14B and DeepSeek-R1-Distill-Llama-14B on their RDNA 3.5 iGPUs and NPUs, while older-generation "Phoenix Point" and "Hawk Point" chips should be capable of DeepSeek-R1-Distill-Llama-14B. The company recommends running all of the above distills in Q4_K_M quantization.

Switching gears to discrete graphics cards, AMD is only recommending its Radeon RX 7000 series for now, since the RDNA 3 graphics architecture introduced AI accelerators. The flagship Radeon RX 7900 XTX is recommended for the DeepSeek-R1-Distill-Qwen-32B distill, while all SKUs with 12 GB to 20 GB of memory (RX 7600 XT, RX 7700 XT, RX 7800 XT, RX 7900 GRE, and RX 7900 XT) are recommended for models up to DeepSeek-R1-Distill-Qwen-14B. The mainstream RX 7600, with its 8 GB of memory, is only recommended up to DeepSeek-R1-Distill-Llama-8B. You will need LM Studio 0.3.8 or later and Radeon Software Adrenalin 25.1.1 beta or later drivers. AMD put out first-party LM Studio 0.3.8 tokens/second performance numbers for the RX 7900 XTX, comparing it with the NVIDIA GeForce RTX 4080 SUPER and the RTX 4090.
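Once one of the recommended distills is loaded in LM Studio with its local server enabled, it can be queried through LM Studio's OpenAI-compatible endpoint. The sketch below assumes the default port, and the model identifier is a placeholder for whatever name appears in the local model list.

```python
# Minimal sketch: querying a DeepSeek R1 distill served by LM Studio's
# local OpenAI-compatible server (pip install openai).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's default server address
    api_key="lm-studio",                  # any non-empty string works locally
)

response = client.chat.completions.create(
    model="deepseek-r1-distill-qwen-14b",  # placeholder local model name
    messages=[{"role": "user", "content": "Explain Q4_K_M quantization briefly."}],
)
print(response.choices[0].message.content)
```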