News Posts matching #Tensor Cores

NVIDIA Blackwell Sets New Standard for Generative AI in MLPerf Inference Benchmark

As enterprises race to adopt generative AI and bring new services to market, the demands on data center infrastructure have never been greater. Training large language models is one challenge, but delivering LLM-powered real-time services is another. In the latest round of MLPerf industry benchmarks, Inference v4.1, NVIDIA platforms delivered leading performance across all data center tests. The first-ever submission of the upcoming NVIDIA Blackwell platform revealed up to 4x more performance than the NVIDIA H100 Tensor Core GPU on MLPerf's biggest LLM workload, Llama 2 70B, thanks to its use of a second-generation Transformer Engine and FP4 Tensor Cores.

The NVIDIA H200 Tensor Core GPU delivered outstanding results on every benchmark in the data center category - including the latest addition to the benchmark, the Mixtral 8x7B mixture of experts (MoE) LLM, which features a total of 46.7 billion parameters, with 12.9 billion parameters active per token. MoE models have gained popularity as a way to bring more versatility to LLM deployments, as they're capable of answering a wide variety of questions and performing more diverse tasks in a single deployment. They're also more efficient since they only activate a few experts per inference - meaning they deliver results much faster than dense models of a similar size.
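
To make the efficiency argument above concrete, here is a minimal Python sketch of sparse expert routing. The parameter counts come from the excerpt. The top-2-of-8 routing, the toy experts, and the router scores are illustrative assumptions, not details taken from the MLPerf submission.

```python
import math

TOTAL_PARAMS_B = 46.7   # total parameters quoted above, in billions
ACTIVE_PARAMS_B = 12.9  # parameters active per token quoted above, in billions


def active_fraction(active_b, total_b):
    """Share of the model's parameters that take part in one token's forward pass."""
    return active_b / total_b


def softmax(scores):
    """Numerically stable softmax over a list of router scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores, k=2):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_scores, k=2):
    """Evaluate only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_scores, k))


# Hypothetical toy experts: expert i simply scales its input by (i + 1).
experts = [lambda x, s=s: s * x for s in range(1, 9)]
# Hypothetical router scores for a single token.
scores = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

selected = route_top_k(scores)
output = moe_forward(3.0, experts, scores)
```

With these scores only experts 1 and 4 are evaluated; the other six are skipped entirely. By the figures quoted above, roughly 28% of the total parameters are active per token.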

Supermicro Launches Plug-and-Play SuperCluster for NVIDIA Omniverse

Supermicro, Inc., a Total IT Solution Provider for AI, Cloud, Storage, and 5G/Edge, is announcing a new addition to its SuperCluster portfolio of plug-and-play AI infrastructure solutions for the NVIDIA Omniverse platform, designed to deliver high-performance, generative AI-enhanced 3D workflows at enterprise scale. This new SuperCluster features the latest Supermicro NVIDIA OVX systems and allows enterprises to easily scale as workloads increase.

"Supermicro has led the industry in developing GPU-optimized products, traditionally for 3D graphics and application acceleration, and now for AI," said Charles Liang, president and CEO of Supermicro. "With the rise of AI, enterprises are seeking computing infrastructure that combines all these capabilities into a single package. Supermicro's SuperCluster features fully interconnected 4U PCIe GPU NVIDIA-Certified Systems for NVIDIA Omniverse, with up to 256 NVIDIA L40S PCIe GPUs per scalable unit. The system helps deliver high performance across the Omniverse platform, including generative AI integrations. By developing this SuperCluster for Omniverse, we're not just offering a product; we're providing a gateway to the future of application development and innovation."

New Performance Optimizations Supercharge NVIDIA RTX AI PCs for Gamers, Creators and Developers

NVIDIA today announced at Microsoft Build new AI performance optimizations and integrations for Windows that help deliver maximum performance on NVIDIA GeForce RTX AI PCs and NVIDIA RTX workstations. Large language models (LLMs) power some of the most exciting new use cases in generative AI and now run up to 3x faster with ONNX Runtime (ORT) and DirectML using the new NVIDIA R555 Game Ready Driver. ORT and DirectML are high-performance tools used to run AI models locally on Windows PCs.

WebNN, an application programming interface for web developers to deploy AI models, is now accelerated with RTX via DirectML, enabling web apps to incorporate fast, AI-powered capabilities. And PyTorch will support DirectML execution backends, enabling Windows developers to train and infer complex AI models on Windows natively. NVIDIA and Microsoft are collaborating to scale performance on RTX GPUs. These advancements build on NVIDIA's world-leading AI platform, which accelerates more than 500 applications and games on over 100 million RTX AI PCs and workstations worldwide.

NVIDIA Launches the RTX A400 and A1000 Professional Graphics Cards

AI integration across design and productivity applications is becoming the new standard, fueling demand for advanced computing performance. This means professionals and creatives will need to tap into increased compute power, regardless of the scale, complexity or scope of their projects. To meet this growing need, NVIDIA is expanding its RTX professional graphics offerings with two new NVIDIA Ampere architecture-based GPUs for desktops: the NVIDIA RTX A400 and NVIDIA RTX A1000.

They expand access to AI and ray tracing technology, equipping professionals with the tools they need to transform their daily workflows. The RTX A400 GPU introduces accelerated ray tracing and AI to the RTX 400 series GPUs. With 24 Tensor Cores for AI processing, it surpasses traditional CPU-based solutions, enabling professionals to run cutting-edge AI applications, such as intelligent chatbots and copilots, directly on their desktops. The GPU delivers real-time ray tracing, so creators can build vivid, physically accurate 3D renders that push the boundaries of creativity and realism.

Intel Launches Gaudi 3 AI Accelerator: 70% Faster Training, 50% Faster Inference Compared to NVIDIA H100, Promises Better Efficiency Too

During the Vision 2024 event, Intel announced its latest Gaudi 3 AI accelerator, promising significant improvements over its predecessor. Intel claims the Gaudi 3 offers up to 70% improvement in training performance, 50% better inference, and 40% better efficiency than NVIDIA's H100 processors. The new AI accelerator is presented as a PCIe Gen 5 dual-slot add-in card with a 600 W TDP or an OAM module with 900 W. The PCIe version works as a group of four per system, while the OAM HL-325L modules can be run in an eight-accelerator configuration per server. The PCIe card has the same peak 1,835 TeraFLOPS of FP8 performance as the OAM module despite a 300 W lower TDP. The lower TDP will likely result in lower sustained performance, but the matching peak figure confirms that the same silicon is used, just tuned to a lower frequency. Built on TSMC's N5 5 nm node, the AI accelerator features 64 Tensor Cores, delivering double the FP8 and quadruple the FP16 performance of the previous-generation Gaudi 2.

The Gaudi 3 AI chip comes with 128 GB of HBM2E with 3.7 TB/s of bandwidth and twenty-four 200 Gbps Ethernet NICs, with dual 400 Gbps NICs used for scale-out. All of that is laid out on 10 tiles that make up the Gaudi 3 accelerator. There is 96 MB of SRAM split between two compute tiles, which acts as a low-level cache that bridges data communication between Tensor Cores and HBM memory. Intel also announced support for the new performance-boosting standardized MXFP4 data format and is developing an AI NIC ASIC for Ultra Ethernet Consortium-compliant networking. The Gaudi 3 supports clusters of up to 8192 cards, built from 1024 nodes of eight accelerators each. It is on track for volume production in Q3, offering a cost-effective alternative to NVIDIA accelerators with the additional promise of a more open ecosystem. More information and a deeper dive can be found in the Gaudi 3 Whitepaper.
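
The headline figures in this excerpt can be cross-checked with a few lines of Python. This is a rough sketch using only the numbers quoted above. The function names are made up for illustration, and the per-watt values describe peak ratings, not the sustained performance the excerpt says is likely to be lower on the PCIe card.

```python
PEAK_FP8_TFLOPS = 1835   # peak FP8 figure quoted for both form factors
PCIE_TDP_W = 600         # PCIe add-in card TDP
OAM_TDP_W = 900          # OAM module TDP


def peak_tflops_per_watt(tflops, watts):
    """Peak throughput divided by rated TDP (a paper figure, not a measurement)."""
    return tflops / watts


def cluster_accelerators(nodes, accelerators_per_node):
    """Total accelerators in a cluster of identical nodes."""
    return nodes * accelerators_per_node


pcie_eff = peak_tflops_per_watt(PEAK_FP8_TFLOPS, PCIE_TDP_W)
oam_eff = peak_tflops_per_watt(PEAK_FP8_TFLOPS, OAM_TDP_W)
max_cards = cluster_accelerators(1024, 8)

print(f"PCIe: {pcie_eff:.2f} peak TFLOPS/W, OAM: {oam_eff:.2f} peak TFLOPS/W")
print(f"Maximum cluster size: {max_cards} accelerators")
```

On paper, the PCIe card's peak-per-watt rating is 1.5x that of the OAM module, simply because the same peak number is divided by a TDP two-thirds the size. The cluster arithmetic matches the quoted maximum of 8192 cards.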

AAEON BOXER-8653AI & BOXER-8623AI Expand Vertical Market Potential in a More Compact Form

Leading provider of embedded PC solutions, AAEON, is delighted to announce the official launch of two new additions to its rich line of embedded AI systems, the BOXER-8653AI and BOXER-8623AI, which are powered by the NVIDIA Jetson Orin NX and Jetson Orin Nano, respectively. Measuring just 180 mm x 136 mm x 75 mm, both systems are compact and easily wall-mounted for discreet deployment, which AAEON indicates makes them ideal for use in both indoor and outdoor settings such as factories and parking lots. Adding to this is the systems' environmental resilience, with the BOXER-8653AI sporting a wide -15°C to 60°C temperature tolerance and the BOXER-8623AI able to operate between -15°C and 65°C, with both supporting a 12 V ~ 24 V power input range via a 2-pin terminal block.

The BOXER-8653AI benefits from the NVIDIA Jetson Orin NX module, offering up to 70 TOPS of AI inference performance for applications that require extremely fast analysis of vast quantities of data. Meanwhile, the BOXER-8623AI utilizes the more efficient, yet still powerful NVIDIA Jetson Orin Nano module, capable of up to 40 TOPS. Both systems consequently make use of the 1024-core NVIDIA Ampere architecture GPU with 32 Tensor Cores.

NVIDIA AI GPU Customers Reportedly Selling Off Excess Hardware

The NVIDIA H100 Tensor Core GPU was last year's hot item for HPC and AI industry segments—the largest purchasers were reported to have acquired up to 150,000 units each. Demand grew so much that lead times of 36 to 52 weeks became the norm for H100-based server equipment. The latest rumblings indicate that things have stabilized—so much so that some organizations are "offloading chips" as the supply crunch cools off. Apparently it is more cost-effective to rent AI processing sessions through cloud service providers (CSPs)—the big three being Amazon Web Services, Google Cloud, and Microsoft Azure.

According to a mid-February Seeking Alpha report, wait times for the NVIDIA H100 80 GB GPU model have been reduced to around three to four months. The Information believes that some companies have already reduced their order counts, while others have hardware sitting around, completely unused. Maintenance complexity and costs are reportedly cited as main factors in "offloading" unneeded equipment and turning to renting server time from CSPs. Despite improved supply conditions, AI GPU demand is still growing, driven mainly by organizations working with LLMs. A prime example is OpenAI: as pointed out by The Information, insider murmurings have Sam Altman & Co. seeking out alternative solutions and production avenues.

NVIDIA Introduces NVIDIA RTX 2000 Ada Generation GPU

Generative AI is driving change across industries—and to take advantage of its benefits, businesses must select the right hardware to power their workflows. The new NVIDIA RTX 2000 Ada Generation GPU delivers the latest AI, graphics and compute technology to compact workstations, offering up to 1.5x the performance of the previous-generation RTX A2000 12 GB in professional workflows. From crafting stunning 3D environments to streamlining complex design reviews to refining industrial designs, the card's capabilities pave the way for an AI-accelerated future, empowering professionals to achieve more without compromising on performance or capabilities. Modern multi-application workflows, such as AI-powered tools, multi-display setups and high-resolution content, put significant demands on GPU memory. With 16 GB of memory in the RTX 2000 Ada, professionals can tap the latest technologies and tools to work faster and better with their data.

Powered by NVIDIA RTX technology, the new GPU delivers impressive realism in graphics with NVIDIA DLSS, delivering ultra-high-quality, photorealistic ray-traced images more than 3x faster than before. In addition, the RTX 2000 Ada enables an immersive experience for enterprise virtual-reality workflows, such as for product design and engineering design reviews. With its blend of performance, versatility and AI capabilities, the RTX 2000 Ada helps professionals across industries achieve efficiencies. Architects and urban planners can use it to accelerate visualization workflows and structural analysis, enhancing design precision. Product designers and engineers using industrial PCs can iterate rapidly on product designs with fast, photorealistic rendering and AI-powered generative design. Content creators can edit high-resolution videos and images seamlessly, and use AI for realistic visual effects and content creation assistance. And in vital embedded applications and edge computing, the RTX 2000 Ada can power real-time data processing for medical devices, optimize manufacturing processes with predictive maintenance and enable AI-driven intelligence in retail environments.

Mod Unlocks FSR 3 Fluid Motion Frames on Older NVIDIA GeForce RTX 20/30 Series Cards

NVIDIA's latest RTX 40 series graphics cards feature impressive new technologies like DLSS 3 that can significantly enhance performance and image quality in games. However, owners of older 20 and 30 series NVIDIA GeForce RTX cards cannot officially benefit from these cutting-edge advances. DLSS 3's Frame Generation feature, in particular, requires dedicated hardware only found in NVIDIA's brand new Ada Lovelace architecture. But the ingenious modding community has stepped in with a creative workaround solution where NVIDIA has refused to enable frame generation functionality on older generation hardware. A new third-party modification can unofficially activate both upscaling (FSR, DLAA, DLSS or XeSS) and AMD Fluid Motion Frames on older NVIDIA cards equipped with Tensor Cores. Replacing two key DLL files and a small edit to the Windows registry enables the "DLSS 3" option to be activated in games running on older hardware.

In testing conducted by Digital Foundry, this modification delivered up to a 75% FPS boost - on par with the performance uplift official DLSS 3 provides on RTX 40 series cards. Games like Cyberpunk 2077, Spider-Man: Miles Morales, and A Plague Tale: Requiem were used to benchmark performance. However, there can be minor visual flaws, including incorrect UI interpolation or random frame time fluctuations. Ironically, while the FSR 3 tech itself originates from AMD, the mod currently only works on NVIDIA cards. So, while not officially supported, the resourcefulness of the modding community has remarkably managed to bring cutting-edge frame generation to more NVIDIA owners - until AMD RDNA 3 cards can utilize it as well. This shows the incredible potential of community-driven software modification and innovation.

ASUS Announces Dual GeForce RTX 4060 Ti SSD Graphics Card

ASUS today announced the Dual GeForce RTX 4060 Ti SSD, the world's first graphics card equipped with an M.2 slot, allowing for a seamless cooling upgrade for high-performance NVMe drives.

Reimagined M.2 storage
At its core, this card has all of the same amazing features as the ASUS Dual GeForce RTX 4060 Ti 8GB. Third-generation RT Cores and fourth-generation Tensor Cores, now featuring DLSS 3.5 and frame generation, drive incredibly immersive real-time ray tracing experiences, enabling this graphics card to push the limits of how good modern games can look. Housed in a sleek 2.5-slot design that only requires a single 8-pin PCIe power connector, the Dual GeForce RTX 4060 Ti SSD can easily fit into almost any existing build.

NVIDIA Announces up to 5x Faster TensorRT-LLM for Windows, and ChatGPT API-like Interface

Even as CPU vendors work to mainstream accelerated AI for client PCs, and Microsoft sets the pace for more AI in everyday applications with the Windows 11 23H2 Update, NVIDIA is out there reminding you that every GeForce RTX GPU is an AI accelerator. This is thanks to its Tensor cores, and the SIMD muscle of the ubiquitous CUDA cores. NVIDIA has been making these for over 5 years now, and has an install base of over 100 million. The company is hence focusing on bringing generative AI acceleration to more client- and enthusiast-relevant use cases, such as large language models.

NVIDIA at the Microsoft Ignite event announced new optimizations, models, and resources to bring accelerated AI to everyone with an NVIDIA GPU that meets the hardware requirements. To begin with, the company introduced an update to TensorRT-LLM for Windows, a library that leverages NVIDIA RTX architecture for accelerating large language models (LLMs). The new TensorRT-LLM version 0.6.0 will release later this month, and improve LLM inference performance by up to 5 times in terms of tokens per second, when compared to the initial release of TensorRT-LLM from October 2023. In addition, TensorRT-LLM 0.6.0 will introduce support for popular LLMs, including Mistral 7B and Nemotron-3 8B. Accelerating these two will require a GeForce RTX 30-series "Ampere" or 40-series "Ada" GPU with at least 8 GB of main memory.

Striking Performance: LLMs up to 4x Faster on GeForce RTX With TensorRT-LLM

Generative AI is one of the most important trends in the history of personal computing, bringing advancements to gaming, creativity, video, productivity, development and more. And GeForce RTX and NVIDIA RTX GPUs, which are packed with dedicated AI processors called Tensor Cores, are bringing the power of generative AI natively to more than 100 million Windows PCs and workstations.

Today, generative AI on PC is getting up to 4x faster via TensorRT-LLM for Windows, an open-source library that accelerates inference performance for the latest AI large language models, like Llama 2 and Code Llama. This follows the announcement of TensorRT-LLM for data centers last month. NVIDIA has also released tools to help developers accelerate their LLMs, including scripts that optimize custom models with TensorRT-LLM, TensorRT-optimized open-source models and a developer reference project that showcases both the speed and quality of LLM responses.

Dell Technologies Expands Generative AI Portfolio

Dell Technologies expands its Dell Generative AI Solutions portfolio, helping businesses transform how they work along every step of their generative AI (GenAI) journeys. "To maximize AI efforts and support workloads across public clouds, on-premises environments and at the edge, companies need a robust data foundation with the right infrastructure, software and services," said Jeff Boudreau, chief AI officer, Dell Technologies. "That's what we are building with our expanded validated designs, professional services, modern data lakehouse and the world's broadest GenAI solutions portfolio."

Customizing GenAI models to maximize proprietary data
The Dell Validated Design for Generative AI with NVIDIA for Model Customization offers pre-trained models that extract intelligence from data without building models from scratch. This solution provides best practices for customizing and fine-tuning GenAI models based on desired outcomes while helping keep information secure and on-premises. With a scalable blueprint for customization, organizations now have multiple ways to tailor GenAI models to accomplish specific tasks with their proprietary data. Its modular and flexible design supports a wide range of computational requirements and use cases, spanning training diffusion, transfer learning and prompt tuning.

NVIDIA H100 GPUs Now Available on AWS Cloud

AWS users can now access the leading performance demonstrated in industry benchmarks of AI training and inference. The cloud giant officially switched on a new Amazon EC2 P5 instance powered by NVIDIA H100 Tensor Core GPUs. The service lets users scale generative AI, high performance computing (HPC) and other applications with a click from a browser.

The news comes in the wake of AI's iPhone moment. Developers and researchers are using large language models (LLMs) to uncover new applications for AI almost daily. Bringing these new use cases to market requires the efficiency of accelerated computing. The NVIDIA H100 GPU delivers supercomputing-class performance through architectural innovations including fourth-generation Tensor Cores, a new Transformer Engine for accelerating LLMs and the latest NVLink technology that lets GPUs talk to each other at 900 GB/sec.

Chinese Tech Firms Buying Plenty of NVIDIA Enterprise GPUs

TikTok developer ByteDance, and other major Chinese tech firms including Tencent, Alibaba and Baidu are reported (by local media) to be snapping up lots of NVIDIA HPC GPUs, with even more orders placed this year. ByteDance is alleged to have spent enough on new products in 2023 to match the expenditure of the entire Chinese tech market on similar NVIDIA purchases for FY2022. According to news publication Jitwei, ByteDance has placed orders totaling $1 billion so far this year with Team Green—the report suggests that a mix of A100 and H800 GPU shipments have been sent to the company's mainland data centers.

The older Ampere-based A100 units were likely ordered prior to trade sanctions enforced on China post-August 2022, with further wiggle room allowed—meaning that shipments continued until September. The H800 GPU is a cut-down variant of 2022's flagship "Hopper" H100 model, designed specifically for the Chinese enterprise market—with reduced performance in order to meet export restriction standards. The H800 costs around $10,000 (average sale price per accelerator) according to Tom's Hardware, so it must offer some level of potency at that price. ByteDance has ordered roughly 100,000 units—with an unspecified split between H800 and A100 stock. Despite the development of competing HPC products within China, it seems that the nation's top-flight technology companies are heading directly to NVIDIA to acquire the best-of-the-best and highly mature AI processing hardware.

ASUS Announces NVIDIA-Certified Servers and ProArt Studiobook Pro 16 OLED at GTC

ASUS today announced its participation in NVIDIA GTC, a developer conference for the era of AI and the metaverse. ASUS will offer comprehensive NVIDIA-certified server solutions that support the latest NVIDIA L4 Tensor Core GPU, which accelerates real-time video AI and generative AI, as well as the NVIDIA BlueField-3 DPU, igniting unprecedented innovation for supercomputing infrastructure. ASUS will also launch the new ProArt Studiobook Pro 16 OLED laptop with the NVIDIA RTX 3000 Ada Generation Laptop GPU for mobile creative professionals.

Purpose-built GPU servers for generative AI
Generative AI applications enable businesses to develop better products and services, and deliver original content tailored to the unique needs of customers and audiences. ASUS ESC8000 and ESC4000 are fully certified NVIDIA servers that support up to eight NVIDIA L4 Tensor Core GPUs, which deliver universal acceleration and energy efficiency for AI with up to 2.7X more generative AI performance than the previous GPU generation. ASUS ESC and RS series servers are engineered for HPC workloads, with support for the NVIDIA BlueField-3 DPU to transform data center infrastructure, as well as NVIDIA AI Enterprise applications for streamlined AI workflows and deployment.

NVIDIA Redefines Workstations to Power New Era of AI, Design, Industrial Metaverse

NVIDIA today announced six new NVIDIA RTX Ada Lovelace architecture GPUs for laptops and desktops, which enable creators, engineers and data scientists to meet the demands of the new era of AI, design and the metaverse. Using the new NVIDIA RTX GPUs with NVIDIA Omniverse, a platform for building and operating metaverse applications, designers can simulate a concept before making it a reality, planners can visualize an entire factory before it is built and engineers can evaluate their designs in real time.

The NVIDIA RTX 5000, RTX 4000, RTX 3500, RTX 3000 and RTX 2000 Ada Generation laptop GPUs deliver breakthrough performance and up to 2x the efficiency of the previous generation to tackle the most demanding workflows. For the desktop, the NVIDIA RTX 4000 Small Form Factor (SFF) Ada Generation GPU features new RT Cores, Tensor Cores and CUDA cores with 20 GB of graphics memory to deliver incredible performance in a compact card.

NVIDIA's New Ada Lovelace RTX GPU Arrives for Designers and Creators

Opening a new era of neural graphics that marries AI and simulation, NVIDIA today announced the NVIDIA RTX 6000 workstation GPU, based on its new NVIDIA Ada Lovelace architecture. With the new NVIDIA RTX 6000 Ada Generation GPU delivering real-time rendering, graphics and AI, designers and engineers can drive cutting-edge, simulation-based workflows to build and validate more sophisticated designs. Artists can take storytelling to the next level, creating more compelling content and building immersive virtual environments. Scientists, researchers and medical professionals can accelerate the development of life-saving medicines and procedures with supercomputing power on their workstations—all at up to 2-4x the performance of the previous-generation RTX A6000.

Designed for neural graphics and advanced virtual world simulation, the RTX 6000, with Ada generation AI and programmable shader technology, is the ideal platform for creating content and tools for the metaverse with NVIDIA Omniverse Enterprise. Incorporating the latest generations of render, AI and shader technologies and 48 GB of GPU memory, the RTX 6000 enables users to create incredibly detailed content, develop complex simulations and form the building blocks required to construct compelling and engaging virtual worlds.

NVIDIA Delivers Quantum Leap in Performance, Introduces New Era of Neural Rendering With GeForce RTX 40 Series

NVIDIA today unveiled the GeForce RTX 40 Series of GPUs, designed to deliver revolutionary performance for gamers and creators, led by its new flagship, the RTX 4090 GPU, with up to 4x the performance of its predecessor. The world's first GPUs based on the new NVIDIA Ada Lovelace architecture, the RTX 40 Series delivers massive generational leaps in performance and efficiency, and represents a new era of real-time ray tracing and neural rendering, which uses AI to generate pixels.

"The age of RTX ray tracing and neural rendering is in full steam, and our new Ada Lovelace architecture takes it to the next level," said Jensen Huang, NVIDIA's founder and CEO, at the GeForce Beyond: Special Broadcast at GTC. "Ada provides a quantum leap for gamers and paves the way for creators of fully simulated worlds. With up to 4x the performance of the previous generation, Ada is setting a new standard for the industry," he said.

AMD WMMA Instruction is Direct Response to NVIDIA Tensor Cores

AMD's RDNA3 graphics IP is just around the corner, and we are hearing more information about the upcoming architecture. Historically, as GPUs advance, it is not unusual for companies to add dedicated hardware blocks to accelerate a specific task. Today, AMD engineers have updated the backend of the LLVM compiler to include a new instruction called Wave Matrix Multiply-Accumulate (WMMA). This instruction will be present on GFX11, which is the RDNA3 GPU architecture. With WMMA, AMD will offer support for processing 16x16x16 size tensors in FP16 and BF16 precision formats. With these instructions, AMD is adding new arrangements to support the processing of matrix multiply-accumulate operations. This is closely mimicking the work NVIDIA is doing with Tensor Cores.

AMD ROCm 5.2 API update lists the use case for this type of instruction, which you can see below:
rocWMMA provides a C++ API to facilitate breaking down matrix multiply accumulate problems into fragments and using them in block-wise operations that are distributed in parallel across GPU wavefronts. The API is a header library of GPU device code, meaning matrix core acceleration may be compiled directly into your kernel device code. This can benefit from compiler optimization in the generation of kernel assembly and does not incur additional overhead costs of linking to external runtime libraries or having to launch separate kernels.

rocWMMA is released as a header library and includes test and sample projects to validate and illustrate example usages of the C++ API. GEMM matrix multiplication is used as primary validation given the heavy precedent for the library. However, the usage portfolio is growing significantly and demonstrates different ways rocWMMA may be consumed.
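
As a rough illustration of what "breaking down matrix multiply accumulate problems into fragments" means, the sketch below computes D = A x B + C in 16x16x16 tiles using plain Python. It is not the rocWMMA API and does not model FP16/BF16 precision or wavefront scheduling. The function names are hypothetical, and each `mma_tile` call stands in for the work a single WMMA-style instruction would perform on one fragment.

```python
TILE = 16  # fragment size from the excerpt: 16x16x16


def mma_tile(a, b, acc, row0, col0, k0):
    """Accumulate one 16x16x16 fragment:
    acc[row0:row0+16, col0:col0+16] += a[row0:row0+16, k0:k0+16] @ b[k0:k0+16, col0:col0+16]
    """
    for i in range(row0, row0 + TILE):
        for j in range(col0, col0 + TILE):
            s = acc[i][j]
            for k in range(k0, k0 + TILE):
                s += a[i][k] * b[k][j]
            acc[i][j] = s


def tiled_gemm(a, b, c):
    """Return a @ b + c, computed fragment by fragment.

    Assumes every dimension is a multiple of TILE (no edge padding in this sketch).
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    if m % TILE or k_dim % TILE or n % TILE:
        raise ValueError("dimensions must be multiples of 16")
    acc = [row[:] for row in c]  # start from C so the result is A @ B + C
    for row0 in range(0, m, TILE):           # each (row0, col0) output tile is independent,
        for col0 in range(0, n, TILE):       # so a GPU could hand it to a separate wavefront
            for k0 in range(0, k_dim, TILE): # walk the shared dimension, accumulating
                mma_tile(a, b, acc, row0, col0, k0)
    return acc
```

The two outer loops are the part that a GPU would distribute in parallel. The innermost tile loop is the multiply-accumulate chain that dedicated matrix instructions are meant to speed up.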