Businesses across every industry are rolling out AI services this year. For Microsoft, Oracle, Perplexity, Snap and hundreds of other leading companies, using the NVIDIA AI inference platform—a full stack comprising world-class silicon, systems and software—is the key to delivering high-throughput and low-latency inference and enabling great user experiences while lowering cost. NVIDIA's advancements in inference software optimization and the NVIDIA Hopper platform are helping industries serve the latest generative AI models, delivering excellent user experiences while optimizing total cost of ownership. The Hopper platform also helps deliver up to 15x more energy efficiency for inference workloads compared to previous generations.
AI inference is notoriously difficult, as it requires many steps to strike the right balance between throughput and user experience. But the underlying goal is simple: generate more tokens at a lower cost. Tokens represent words (or word fragments) in a large language model (LLM) system, and with AI inference services typically charging for every million tokens generated, cost per token is the most visible measure of return on AI investment and of the energy used per task. Full-stack software optimization is the key to improving AI inference performance and achieving this goal.
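To make that goal concrete: cost per million tokens is simply the hourly cost of the hardware divided by the number of tokens it can generate per hour. A back-of-the-envelope sketch, where both input figures are illustrative assumptions rather than measured or NVIDIA-published numbers:

```python
# Back-of-the-envelope cost-per-token estimate. Both inputs are
# illustrative assumptions, not measured or published figures.
gpu_cost_per_hour = 4.00     # USD: assumed cloud price for one GPU
tokens_per_second = 3_000    # assumed aggregate serving throughput

tokens_per_hour = tokens_per_second * 3600
cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000
print(f"~${cost_per_million_tokens:.2f} per million tokens")  # ~$0.37

# The lever full-stack optimization pulls: doubling throughput on the
# same hardware halves the cost (and energy) per token.
```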
Cost-Effective User Throughput
Businesses are often challenged with balancing the performance and costs of inference workloads. While some customers or use cases may work with an out-of-the-box or hosted model, others may require customization. NVIDIA technologies simplify model deployment while optimizing cost and performance for AI inference workloads. In addition, customers can experience flexibility and customizability with the models they choose to deploy.
NVIDIA NIM microservices, NVIDIA Triton Inference Server and the NVIDIA TensorRT library are among the inference solutions NVIDIA offers to suit users' needs:
- NVIDIA NIM inference microservices are prepackaged and performance-optimized for rapidly deploying AI foundation models on any infrastructure—cloud, data centers, edge or workstations (see the sketch after this list).
- NVIDIA Triton Inference Server, one of the company's most popular open-source projects, allows users to package and serve any model regardless of the AI framework it was trained on.
- NVIDIA TensorRT is a high-performance deep learning inference library that includes runtime and model optimizations to deliver low-latency and high-throughput inference for production applications.
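Because NIM LLM microservices expose an OpenAI-compatible API, querying a deployed model looks like any ordinary chat-completion call. A minimal sketch, assuming a NIM container is already running locally and listening on port 8000; the model identifier is a placeholder:

```python
# Minimal sketch of querying a NIM LLM microservice through its
# OpenAI-compatible endpoint. Assumes a NIM container is already
# serving on localhost:8000; the model name below is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # placeholder model identifier
    messages=[{"role": "user", "content": "In one sentence, what is AI inference?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```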
Available in all major cloud marketplaces, the NVIDIA AI Enterprise software platform includes all these solutions and provides enterprise-grade support, stability, manageability and security.
With the framework-agnostic NVIDIA AI inference platform, companies boost productivity and save on development, infrastructure and setup costs. Using NVIDIA technologies can also lift business revenue by helping companies avoid downtime and fraudulent transactions, increase e-commerce shopping conversion rates and generate new, AI-powered revenue streams.
Cloud-Based LLM Inference
To ease LLM deployment, NVIDIA has collaborated closely with every major cloud service provider to ensure that the NVIDIA inference platform can be seamlessly deployed in the cloud with minimal or no code required. NVIDIA NIM is integrated with cloud-native services such as:
- Amazon SageMaker AI, Amazon Bedrock Marketplace, Amazon Elastic Kubernetes Service
- Google Cloud's Vertex AI, Google Kubernetes Engine
- Microsoft Azure AI Foundry (coming soon), Azure Kubernetes Service
- Oracle Cloud Infrastructure's data science tools, Oracle Cloud Infrastructure Kubernetes Engine
Plus, for customized inference deployments, NVIDIA Triton Inference Server is deeply integrated into all major cloud service providers.
For example, on the OCI Data Science platform, deploying NVIDIA Triton is as simple as flipping a switch in the command-line arguments during model deployment, which instantly launches an NVIDIA Triton inference endpoint.
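However the endpoint is launched, clients talk to a deployed Triton server through the same standard inference protocol. A minimal sketch using the open-source tritonclient Python package, where the server address, model name and tensor names are hypothetical placeholders:

```python
# Minimal sketch of calling a Triton inference endpoint over HTTP.
# Server address, model name and tensor names are placeholders.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Describe and populate the input tensor the (hypothetical) model expects.
inputs = httpclient.InferInput("INPUT0", [1, 4], "FP32")
inputs.set_data_from_numpy(np.array([[0.1, 0.2, 0.3, 0.4]], dtype=np.float32))

result = client.infer(
    model_name="example_model",
    inputs=[inputs],
    outputs=[httpclient.InferRequestedOutput("OUTPUT0")],
)
print(result.as_numpy("OUTPUT0"))
```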
Similarly, with Azure Machine Learning, users can deploy NVIDIA Triton either through no-code deployment in Azure Machine Learning Studio or full-code deployment with the Azure Machine Learning CLI. AWS provides one-click deployment of NVIDIA NIM from SageMaker Marketplace and also offers NVIDIA Triton through its AWS Deep Learning Containers, while Google Cloud provides a one-click deployment option on Google Kubernetes Engine (GKE).
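The full-code path can also be driven programmatically; here is a minimal sketch using the Azure Machine Learning Python SDK v2 (shown in Python rather than the CLI to keep the examples in one language), where the workspace coordinates, endpoint name, model path and GPU SKU are all placeholders:

```python
# Minimal sketch of a full-code Triton deployment with the Azure ML
# Python SDK v2. Workspace coordinates, names and SKU are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import (
    ManagedOnlineDeployment,
    ManagedOnlineEndpoint,
    Model,
)
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Create the endpoint, then attach a Triton-format model to it.
endpoint = ManagedOnlineEndpoint(name="triton-demo-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="triton-demo-endpoint",
    # type=TRITON_MODEL tells Azure ML to serve this directory (laid out
    # as a Triton model repository) with Triton -- no scoring script needed.
    model=Model(path="./models", type=AssetTypes.TRITON_MODEL),
    instance_type="Standard_NC6s_v3",  # placeholder GPU SKU
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()
```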
The NVIDIA AI inference platform also delivers AI predictions over standard protocols such as HTTP/REST and gRPC, automatically scaling to accommodate the growing and changing needs of users within a cloud-based infrastructure.
From accelerating LLMs to enhancing creative workflows and transforming agreement management, NVIDIA's AI inference platform is driving real-world impact across industries. Learn how collaboration and innovation are enabling organizations across these industries to achieve new levels of efficiency and scalability.
The full article can be found here.
Learn more about how NVIDIA is delivering breakthrough inference performance results and stay up to date with the latest AI inference performance updates.
View at TechPowerUp Main Site | Source