
NVIDIA Fine-Tunes Llama3.1 Model to Beat GPT-4o and Claude 3.5 Sonnet with Only 70 Billion Parameters

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,493 (0.95/day)
NVIDIA has officially released its Llama-3.1-Nemotron-70B-Instruct model. Based on Meta's Llama 3.1 70B, the Nemotron model is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses. NVIDIA uses structured fine-tuning data to steer the model and allow it to generate more helpful responses. With only 70 billion parameters, the model is punching far above its weight class. The company claims that the model beats the current top models from leading labs, such as OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet, which are the current leaders across AI benchmarks. In evaluations such as Arena Hard, NVIDIA's Llama 3.1 Nemotron 70B scores 85 points, while GPT-4o and Claude 3.5 Sonnet score 79.3 and 79.2, respectively. In other benchmarks, such as AlpacaEval and MT-Bench, NVIDIA also holds the top spot, with scores of 57.6 and 8.98, respectively; Claude 3.5 Sonnet and GPT-4o reach 52.4 / 8.81 and 57.5 / 8.74, just below Nemotron.

This language model underwent training using reinforcement learning from human feedback (RLHF), specifically employing the REINFORCE algorithm. The process involved a reward model based on a large language model architecture and custom preference prompts designed to guide the model's behavior. The training began with a pre-existing instruction-tuned language model as the starting point: Llama-3.1-70B-Instruct served as the initial policy, and it was trained with the Llama-3.1-Nemotron-70B-Reward reward model on HelpSteer2-Preference prompts. Running the model locally requires either four 40 GB or two 80 GB VRAM GPUs and 150 GB of free disk space. We managed to take it for a spin on NVIDIA's website to say hello to TechPowerUp readers. The model also passes the infamous "strawberry" test, where it has to count the number of occurrences of a specific letter in a word; however, it appears that this test was part of the fine-tuning data, as the model fails the next test, shown in the image below.
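For readers who want to load the model themselves rather than poking at NVIDIA's website, below is a minimal sketch using Hugging Face Transformers; the repository name and the memory figures in the comments are assumptions based on NVIDIA's public model card, not something we verified on this hardware.

```python
# Minimal sketch (not an official NVIDIA example): loading the instruct model with
# Hugging Face Transformers sharded across several GPUs. The repo ID below and the
# memory numbers in the comments are assumptions taken from the public model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 weights alone are ~140 GB, hence 4x 40 GB or 2x 80 GB GPUs
    device_map="auto",           # shards the layers across all visible GPUs
)

messages = [{"role": "user", "content": "Hello, TechPowerUp readers!"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```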




View at TechPowerUp Main Site | Source
 
Joined
May 22, 2024
Messages
400 (2.68/day)
System Name Kuro
Processor AMD Ryzen 7 7800X3D@65W
Motherboard MSI MAG B650 Tomahawk WiFi
Cooling Thermalright Phantom Spirit 120 EVO
Memory Corsair DDR5 6000C30 2x48GB (Hynix M)@6000 30-36-36-76 1.36V
Video Card(s) PNY XLR8 RTX 4070 Ti SUPER 16G@200W
Storage Crucial T500 2TB + WD Blue 8TB
Case Lian Li LANCOOL 216
Power Supply MSI MPG A850G
Software Ubuntu 24.04 LTS + Windows 10 Home Build 19045
Benchmark Scores 17761 C23 Multi@65W
FWIW, you can try the thing on your own PC, quantized, with Ollama/llama.cpp and a reasonably new setup with >=64 GB of RAM. 32 GB of RAM plus 16 GB of VRAM might also work with GPU offload, and a newer NVIDIA video card with a proper CUDA setup can speed up prompt processing to something tolerable, but generation will still be slow unless you have that much memory in pure VRAM.

The 65B-70B model size has been where things start to get really interesting, and it is about the largest model a good home PC can run locally without excessively damaging quantization. They can be fun to play around with, and they put quite a few things into perspective.
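If you want to try that route, here is roughly what it looks like with llama-cpp-python; the GGUF file name and the layer split are placeholders I made up, so adjust them to whatever quant you actually download and whatever fits your VRAM.

```python
# Rough sketch of partial GPU offload with llama-cpp-python. The GGUF file name and
# the layer split are placeholders; use whatever quant you actually downloaded and
# raise/lower n_gpu_layers until it fits your VRAM (the rest stays in system RAM).
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.1-Nemotron-70B-Instruct-Q4_K_M.gguf",  # hypothetical ~40 GB 4-bit quant
    n_ctx=4096,        # context window; larger contexts need more memory
    n_gpu_layers=30,   # offload part of the layers to a 16 GB card; 0 = CPU only
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "How many r's are in 'strawberry'?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```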
 

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,493 (0.95/day)
FWIW, you can try the thing on your own PC, quantized, with Ollama/llama.cpp and a reasonably new setup with >=64 GB of RAM. 32 GB of RAM plus 16 GB of VRAM might also work with GPU offload, and a newer NVIDIA video card with a proper CUDA setup can speed up prompt processing to something tolerable, but generation will still be slow unless you have that much memory in pure VRAM.

The 65B-70B model size has been where things start to get really interesting, and it is about the largest model a good home PC can run locally without excessively damaging quantization. They can be fun to play around with, and they put quite a few things into perspective.
IIRC, 4-bit quants should allow it to run on a (high-end) PC. I'd love to run a model of this size locally one day! :)
 
Joined
Jun 22, 2012
Messages
301 (0.07/day)
Processor Intel i7-12700K
Motherboard MSI PRO Z690-A WIFI
Cooling Noctua NH-D15S
Memory Corsair Vengeance 4x16 GB (64GB) DDR4-3600 C18
Video Card(s) MSI GeForce RTX 3090 GAMING X TRIO 24G
Storage Samsung 980 Pro 1TB, SK hynix Platinum P41 2TB
Case Fractal Define C
Power Supply Corsair RM850x
Mouse Logitech G203
Software openSUSE Tumbleweed
How much of a high-end PC do you mean? To run a 4-bit, 70-billion-parameter LLM at acceptable speeds, you'd still need 48 GB of VRAM, which implies 2×24 GB consumer-tier GPUs.
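For the curious, that figure falls out of simple arithmetic; the overhead term in this sketch is an assumption, not a measurement:

```python
# Back-of-envelope VRAM estimate for a 4-bit 70B model; the overhead figure is an
# assumption, and real usage varies with quant format and context length.
params = 70e9
bytes_per_param = 0.5                                  # 4-bit weights
weights_gb = params * bytes_per_param / 1e9            # ~35 GB of weights
kv_cache_and_overhead_gb = 5                           # assumed KV cache + runtime buffers
print(f"~{weights_gb + kv_cache_and_overhead_gb:.0f} GB")  # ~40 GB, i.e. two 24 GB cards
```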
 

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,493 (0.95/day)
How much of a high-end PC do you mean? To run a 4-bit, 70-billion-parameter LLM at acceptable speeds, you'd still need 48 GB of VRAM, which implies 2×24 GB consumer-tier GPUs.
There are options to use threadripper+128GB RAM. That should cost less than two GPUs :)
 
Joined
May 22, 2024
Messages
400 (2.68/day)
System Name Kuro
Processor AMD Ryzen 7 7800X3D@65W
Motherboard MSI MAG B650 Tomahawk WiFi
Cooling Thermalright Phantom Spirit 120 EVO
Memory Corsair DDR5 6000C30 2x48GB (Hynix M)@6000 30-36-36-76 1.36V
Video Card(s) PNY XLR8 RTX 4070 Ti SUPER 16G@200W
Storage Crucial T500 2TB + WD Blue 8TB
Case Lian Li LANCOOL 216
Power Supply MSI MPG A850G
Software Ubuntu 24.04 LTS + Windows 10 Home Build 19045
Benchmark Scores 17761 C23 Multi@65W
IIRC, 4-bit quants should allow it to run on a (high-end) PC. I'd love to run a model of this size locally one day! :)
How much of a high-end PC do you mean? To run a 4-bit, 70-billion-parameter LLM at acceptable speeds, you'd need 48 GB of VRAM, which implies 2×24 GB consumer-tier GPUs.
I can only say there is a difference between "acceptable speed for production purposes" and "acceptable speed for some little experiment that takes the time of a cup of tea for each prompt." RAM is probably the only way to do it at a reasonable cost in a consumer PC right now; 2×3090 or 2×4090 for this sole purpose would be rather extravagant.
 
Joined
Jul 26, 2024
Messages
12 (0.14/day)
System Name Karen
Processor AMD Ryzen 7 7800X3D
Motherboard MSI Mag X670E Tomahawk WiFi
Cooling Thermalright PS120SE
Video Card(s) Gigabyte RTX 3080 Gaming OC 10G
Storage Yes
Case Corsair 4000D
Power Supply Corsair RM850x
Up - 1 R

Yep, it works fine.
 
Joined
Jan 3, 2021
Messages
3,376 (2.44/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
There are options to use threadripper+128GB RAM. That should cost less than two GPUs :)
"acceptable speed for some little experiment that takes the time of a cup of tea for each prompt."
Even a Ryzen 8700G with 128 GB RAM might be somewhat usable (depending on user's tolerance to giga-amounts of tea/coffee/beer/popcorn/peanuts/pizza).
 

Jun

Joined
May 6, 2022
Messages
59 (0.07/day)
System Name Alpha
Processor AMD Ryzen 7 5800X3D [PBO2 tuner -30 all cores]
Motherboard GIGABYTE B550I AORUS PRO AX (rev. 1.0)
Cooling ekwb EK-AIO 240 D-RGB
Memory Trident Z Neo DDR4-3600 CL16 32GB GTZN [15-15-15-35 3800MHz@1.45V]
Video Card(s) INNO3D GEFORCE RTX 3080 TI X3 OC [2010MHz@993mV, +1300MHz]
Storage Kingston FURY Renegade 2TB
Display(s) Samsung Odyssey G7 32” // ASUS ROG Strix XG16AHP
Case Lian Li A4-H2O
Audio Device(s) CREATIVE Sound BlasterX G6 // polk MagniFi Mini //SHURE SE846 //steelseries Arctis Nova Pro Wireless
Power Supply SilverStone SX750 Platinum V1.1
Mouse Keychron M4
Keyboard Keychron K3 Max
Software Microsoft Windows 11 Pro
It passes the strawberry test, unlike GPT and Bing, but no one told it about the TechPowerUp test.
 
Joined
Feb 10, 2010
Messages
102 (0.02/day)
Location
Thailand
System Name amy-pc
Processor ryzen 5 2600
Motherboard asus a320m-k
Cooling stock cpu fan
Memory 16gb(8*2) bus 3200
Video Card(s) msi rx560 4gb
Storage wd black 500gb sn750 nvme, 2x120gb apacer sata (raid0), 8tb nas synology ds220j
Display(s) msi optix g24 series, freesync 75hz
Audio Device(s) nubwo southpaw ns-12
Power Supply cooler master 550w
Mouse g102
Keyboard philips spk8901
Software windows 11 insider
Up - 1 R

Yep, it works fine.
It seems it doesn't understand common sense, and it needs to learn.
 
Joined
Jan 3, 2021
Messages
3,376 (2.44/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
It passes the strawberry test, unlike GPT and Bing, but no one told it about the TechPowerUp test.
Beware of "reinforcement learning from human feedback (RLHF)". Tell it several times there are 18 Rs in "up", and it will tell you back that there are ~19 in TechPowerUp.
 
Joined
Jan 14, 2019
Messages
11,923 (5.67/day)
Location
Midlands, UK
System Name Nebulon B
Processor AMD Ryzen 7 7800X3D
Motherboard MSi PRO B650M-A WiFi
Cooling be quiet! Dark Rock 4
Memory 2x 24 GB Corsair Vengeance DDR5-4800
Video Card(s) AMD Radeon RX 6750 XT 12 GB
Storage 2 TB Corsair MP600 GS, 2 TB Corsair MP600 R2
Display(s) Dell S3422DWG, 7" Waveshare touchscreen
Case Kolink Citadel Mesh black
Audio Device(s) Logitech Z333 2.1 speakers, AKG Y50 headphones
Power Supply Seasonic Prime GX-750
Mouse Logitech MX Master 2S
Keyboard Logitech G413 SE
Software Windows 10 Pro
So it writes you an essay when you ask something simple that could be answered with a single number. Um... great? :wtf:

I was taught in school that using longer sentences than necessary to convey meaning lowers the quality of any scientific work. Just sayin'.

Beware of "reinforcement learning from human feedback (RLHF)". Tell it several times there are 18 Rs in "up", and it will tell you back that there are ~19 in TechPowerUp.
This is exactly why AI is useless in the hands of critical thinkers, and dangerous in the hands of others.
 
Joined
Jun 29, 2023
Messages
80 (0.17/day)
Until LLMs are reliable to a very small epsilon, they are a gadget and a novelty. If you need an expert to check any output for correctness, it is hard to see where the progress is.
 
Joined
Jan 14, 2019
Messages
11,923 (5.67/day)
Location
Midlands, UK
System Name Nebulon B
Processor AMD Ryzen 7 7800X3D
Motherboard MSi PRO B650M-A WiFi
Cooling be quiet! Dark Rock 4
Memory 2x 24 GB Corsair Vengeance DDR5-4800
Video Card(s) AMD Radeon RX 6750 XT 12 GB
Storage 2 TB Corsair MP600 GS, 2 TB Corsair MP600 R2
Display(s) Dell S3422DWG, 7" Waveshare touchscreen
Case Kolink Citadel Mesh black
Audio Device(s) Logitech Z333 2.1 speakers, AKG Y50 headphones
Power Supply Seasonic Prime GX-750
Mouse Logitech MX Master 2S
Keyboard Logitech G413 SE
Software Windows 10 Pro
Until LLMs are reliable to a very small epsilon, they are a gadget and a novelty. If you need an expert to check any output for correctness, it is hard to see where the progress is.
It will never be reliable because
1. It is trained by unreliable humans, and
2. The more LLM-generated answers form the basis of future answers, the more errors any new answer will produce.
 
Joined
Apr 13, 2022
Messages
1,053 (1.15/day)
I can only say there is a difference between "acceptable speed for production purposes" and "acceptable speed for some little experiment that takes the time of a cup of tea for each prompt." RAM is probably the only way to do it at a reasonable cost in a consumer PC right now; 2×3090 or 2×4090 for this sole purpose would be rather extravagant.
It's not extravagant. The point of the 3090 and the 4090 is not gaming but stuff like this. Granted the people that do stuff like this do far more than just this. But this is less of a waste of a GPU than gaming is.
 
Joined
Jun 22, 2012
Messages
301 (0.07/day)
Processor Intel i7-12700K
Motherboard MSI PRO Z690-A WIFI
Cooling Noctua NH-D15S
Memory Corsair Vengeance 4x16 GB (64GB) DDR4-3600 C18
Video Card(s) MSI GeForce RTX 3090 GAMING X TRIO 24G
Storage Samsung 980 Pro 1TB, SK hynix Platinum P41 2TB
Case Fractal Define C
Power Supply Corsair RM850x
Mouse Logitech G203
Software openSUSE Tumbleweed
It's not extravagant. The point of the 3090 and the 4090 is not gaming but stuff like this. Granted the people that do stuff like this do far more than just this. But this is less of a waste of a GPU than gaming is.

For what it's worth, there's a relatively sizable number of people with multiple high-end GPU rigs for what is essentially "LLM entertainment" (text-based interactive roleplay, etc.).

Some can be found here: https://www.reddit.com/r/LocalLLaMA/

 
Joined
Mar 14, 2024
Messages
10 (0.05/day)
For what it's worth, there's a relatively sizable number of people with multiple high-end GPU rigs for what is essentially "LLM entertainment" (text-based interactive roleplay, etc.).

Some can be found here: https://www.reddit.com/r/LocalLLaMA/

Some of those can get well into "do you really need all that?" territory. I remember quite well a 10×3090 build that once showed up in there; it made me feel extremely poor with "only" a dual-3090 build. That thing looked like we had gone back to the GPU mining era.

Edit: this one right here.
 
Joined
Jul 24, 2024
Messages
141 (1.64/day)
System Name AM4_TimeKiller
Processor AMD Ryzen 5 5600X @ all-core 4.7 GHz
Motherboard ASUS ROG Strix B550-E Gaming
Cooling Arctic Freezer II 420 rev.7 (push-pull)
Memory G.Skill TridentZ RGB, 2x16 GB DDR4, B-Die, 3800 MHz @ CL14-15-14-29-43 1T, 53.2 ns
Video Card(s) ASRock Radeon RX 7800 XT Phantom Gaming
Storage Samsung 990 PRO 1 TB, Kingston KC3000 1 TB, Kingston KC3000 2 TB
Case Corsair 7000D Airflow
Audio Device(s) Creative Sound Blaster X-Fi Titanium
Power Supply Seasonic Prime TX-850
Mouse Logitech wireless mouse
Keyboard Logitech wireless keyboard
Up - 1 R

Yep, it works fine.
That's it, this is Jensen's golden-egg-laying duck "AI". Dumb as f*ck. We are wasting electricity and money on accelerating this hallucinating "AI". What a waste of resources...
 
Joined
May 10, 2023
Messages
190 (0.36/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
How much of a high-end PC do you mean? To run a 4-bit, 70-billion-parameter LLM at acceptable speeds, you'd still need 48 GB of VRAM, which implies 2×24 GB consumer-tier GPUs.
Yeah, I have run Llama 70B with my 2×3090 (VRAM usage is closer to 35-40 GB, FWIW).
If the 5090 really comes with 32 GB, those 64 GB with a pair will be really awesome.
 
Joined
Mar 14, 2024
Messages
10 (0.05/day)
Yeah, I have run Llama 70B with my 2×3090 (VRAM usage is closer to 35-40 GB, FWIW).
If the 5090 really comes with 32 GB, those 64 GB with a pair will be really awesome.
64 GB is the dream, but if the 5090 comes out with the price and TDP that are being rumored, the trend will stay on "just hoard more 3090s". If a single 5090 ends up being the $2,500 some people are claiming, you can get three 3090s and 72 GB for that amount, maybe even four if you really scour the used market. Or at least you can, assuming it doesn't pull a P40 and used prices jump to double what they were before people found out about getting used Tesla cards.
 
Joined
May 22, 2024
Messages
400 (2.68/day)
System Name Kuro
Processor AMD Ryzen 7 7800X3D@65W
Motherboard MSI MAG B650 Tomahawk WiFi
Cooling Thermalright Phantom Spirit 120 EVO
Memory Corsair DDR5 6000C30 2x48GB (Hynix M)@6000 30-36-36-76 1.36V
Video Card(s) PNY XLR8 RTX 4070 Ti SUPER 16G@200W
Storage Crucial T500 2TB + WD Blue 8TB
Case Lian Li LANCOOL 216
Power Supply MSI MPG A850G
Software Ubuntu 24.04 LTS + Windows 10 Home Build 19045
Benchmark Scores 17761 C23 Multi@65W
For what it's worth, there's a relatively sizable number of people with multiple high-end GPU rigs for what is essentially "LLM entertainment" (text-based interactive roleplay, etc.).

Some can be found here: https://www.reddit.com/r/LocalLLaMA/

I had no idea people were doing that, but I can see why. Quite a lot of those roleplays would be unprintably intense one way or another, and they wouldn't fly under the ToS of almost any mass-market model behind an API, which isn't specifically finetuned to do well in this use case anyway.

It is also somewhere no one had gone before until very recently; I don't think many could do that kind of roleplay with the level of responsiveness, variety, and apparent quality a GPU-hosted, roleplay-finetuned LLM provides. Play it like a holodeck, or even like the Matrix where you control everything. Yep, I can see the appeal there.

But then again, many of the roleplay finetunes are also quite excellent for light scene writing and for bouncing ideas off of to see what sticks. They are already fit for use where no mathematics-level correctness or absolute fidelity is required.

For that matter, they are good for saving developer time too - sometimes all you need is a pointer in the right direction, and LLMs are good for that, even if the specific details might not be right.
 
Joined
Jun 21, 2021
Messages
3,056 (2.52/day)
System Name daily driver Mac mini M2 Pro
Processor Apple proprietary M2 Pro (6 p-cores, 4 e-cores)
Motherboard Apple proprietary
Cooling Apple proprietary
Memory Apple proprietary 16GB LPDDR5 unified memory
Video Card(s) Apple proprietary M2 Pro (16-core GPU)
Storage Apple proprietary onboard 512GB SSD + various external HDDs
Display(s) LG UltraFine 27UL850W (4K@60Hz IPS)
Case Apple proprietary
Audio Device(s) Apple proprietary
Power Supply Apple proprietary
Mouse Apple Magic Trackpad 2
Keyboard Keychron K1 tenkeyless (Gateron Reds)
VR HMD Oculus Rift S (hosted on a different PC)
Software macOS Sonoma 14.7
Benchmark Scores (My Windows daily driver is a Beelink Mini S12 Pro. I'm not interested in benchmarking.)
Read the original article (post #1). It explains what the headline states. This is typical in expository writing.

What is implied but not explicitly stated is that whittling down LLaMa 3.1 to a 70 billion parameter model means that Nvidia improved on performance-per-watt and performance-per-dollar metrics by reducing the hardware requirements necessary to run an AI chatbot that compares favorably to GPT-4o and Claude 3.5 Sonnet.

Why is this important? One achievement that many are striving to attain is putting a useful model on a handheld device like a smartphone. I don't know if these model sizes scale linearly, but perhaps an 8-10 billion parameter model could fit on a modern premium smartphone (8 GB of RAM).

AI adoption will take off once it runs well (and practically) on a smartphone, the primary computing modality of consumers in 2024.

Running an LLM AI chatbot on some PC loaded with two 24GB GPUs or pumped up with 128GB system RAM is not where consumer innovation is going to happen. It will happen in the handheld sector.
 
Joined
May 10, 2023
Messages
190 (0.36/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
64 GB is the dream, but if the 5090 comes out with the price and TDP that are being rumored, the trend will stay on "just hoard more 3090s". If a single 5090 ends up being the $2,500 some people are claiming, you can get three 3090s and 72 GB for that amount, maybe even four if you really scour the used market. Or at least you can, assuming it doesn't pull a P40 and used prices jump to double what they were before people found out about getting used Tesla cards.
Yeah, at some point it may even make more sense to just rent H100s to fine-tune your models and then use your local setup for inference. The model in question here is proof that a good finetune can take you a long way.

Getting more than two GPUs would be hard for me since I'm on a consumer platform that is limited in lanes/slots :(
Using a used/older server or workstation platform would consume way too much power and be lacking in CPU performance, not to mention the space required.

Hacky workarounds to split a single x8 slot into x4/x4 for more GPUs could be an option, but then the bandwidth limitation for training would still be annoying (not so much of an issue for inference, though).

But that's pretty much a rich-people problem anyway, haha.

This is about LLMs (such as what's running behind ChatGPT), and how NVIDIA created a model based on Meta's Llama (which is open source) with far fewer parameters than usual, but with amazing performance.
 
Joined
Jun 22, 2012
Messages
301 (0.07/day)
Processor Intel i7-12700K
Motherboard MSI PRO Z690-A WIFI
Cooling Noctua NH-D15S
Memory Corsair Vengeance 4x16 GB (64GB) DDR4-3600 C18
Video Card(s) MSI GeForce RTX 3090 GAMING X TRIO 24G
Storage Samsung 980 Pro 1TB, SK hynix Platinum P41 2TB
Case Fractal Define C
Power Supply Corsair RM850x
Mouse Logitech G203
Software openSUSE Tumbleweed
Why is this important? One achievement that many are striving to attain is putting a useful model on a handheld device like a smartphone. I don't know if these model sizes scale linearly, but perhaps an 8-10 billion parameter model could fit on a modern premium smartphone (8 GB of RAM).

Smaller models that can fit within the memory of a modern high-end smartphone already exist, but they're nowhere near as capable as larger ones: they have less attention to detail, less knowledge, lower answer accuracy, go off the rails or become incoherent more easily, can't follow complex instructions, and so on. Additionally, quantization beyond a certain level (think of quantization like reducing the color depth of an image to save memory) can degrade performance compared to the standard 16-bit baseline as well.

An 8 billion parameter model in 4-bit (0.5 bytes per parameter) would take about 4 GB of memory, plus additional memory for processing text/requests/chatting, but 8 GB should be sufficient. The biggest limitation would probably be the bandwidth of even fast LPDDR5X memory, since the entirety of the model weights must be read for every token generated (although the maximum theoretical speed for a 64-bit LPDDR5X-8400 module iterating over a 4 GB LLM would be acceptable, roughly 17 tokens/s), and then possibly power consumption.
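For reference, that token-rate estimate is just bandwidth arithmetic; the sketch below uses the same illustrative module numbers as the paragraph above.

```python
# The ~17 tokens/s figure is just bandwidth arithmetic; these module numbers are the
# same illustrative assumptions as in the paragraph above.
bus_width_bytes = 8                                          # 64-bit LPDDR5X channel
transfer_rate_mts = 8400                                     # mega-transfers per second
bandwidth_gbs = bus_width_bytes * transfer_rate_mts / 1000   # ~67 GB/s theoretical peak
model_size_gb = 4                                            # 8B parameters at 4 bit
tokens_per_s = bandwidth_gbs / model_size_gb                 # every token reads all weights once
print(f"~{tokens_per_s:.0f} tokens/s upper bound")           # ~17 tokens/s
```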
 