Thursday, October 17th 2024

NVIDIA Fine-Tunes Llama3.1 Model to Beat GPT-4o and Claude 3.5 Sonnet with Only 70 Billion Parameters

Oct 17th, 2024 04:21

NVIDIA has officially released its Llama-3.1-Nemotron-70B-Instruct model. Based on Meta's Llama 3.1 70B, Nemotron is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses. NVIDIA fine-tunes the model on structured data to steer it toward more helpful responses. With only 70 billion parameters, the model is punching far above its weight class. The company claims that it beats the current top models from leading labs, such as OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet, which lead most AI benchmarks. In Arena Hard, Llama-3.1-Nemotron-70B scores 85 points, while GPT-4o and Claude 3.5 Sonnet score 79.3 and 79.2, respectively. NVIDIA's model also holds the top spot in AlpacaEval and MT-Bench, with scores of 57.6 and 8.98. Claude and GPT reach 52.4 / 8.81 and 57.5 / 8.74, just below Nemotron.

The model was trained using reinforcement learning from human feedback (RLHF), specifically the REINFORCE algorithm. The process involved a reward model based on a large language model architecture and custom preference prompts designed to guide the model's behavior. Training started from an existing instruction-tuned model: Llama-3.1-70B-Instruct served as the initial policy, with Llama-3.1-Nemotron-70B-Reward as the reward model and HelpSteer2-Preference as the source of prompts. Running the model locally requires either four 40 GB or two 80 GB VRAM GPUs and 150 GB of free disk space. We managed to take it for a spin on NVIDIA's website to say hello to TechPowerUp readers. The model also passes the infamous "strawberry" test, where it has to count the number of specific letters in a word. However, that test appears to have been part of the fine-tuning data, as the model fails the next test, shown in the image below.
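To make the REINFORCE idea concrete, here is a toy sketch in Python. It is not NVIDIA's training code: the three fixed rewards stand in for scores from a reward model, the three-way softmax stands in for the policy, and the function names, learning rate, and moving-average baseline are all illustrative assumptions.

```python
import math
import random


def softmax(logits):
    """Turn raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(rewards, steps=3000, lr=0.1, seed=0):
    """Train a softmax policy over len(rewards) canned responses.

    rewards[i] is a fixed, made-up score for response i; in real RLHF
    the score would come from a learned reward model instead.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    baseline = 0.0  # running average reward, used to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        advantage = reward - baseline
        # For a softmax policy, d log pi(action) / d logit_k
        # equals (1 if k == action else 0) - probs[k].
        for k in range(len(logits)):
            grad = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * advantage * grad
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)


policy = reinforce([0.2, 0.5, 0.9])
print(policy)  # probability mass should shift toward the highest-reward response
```

The update raises the probability of responses that score above the running average and lowers it for those that score below, which is the core of REINFORCE with a baseline.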

Sources: NVIDIA, HuggingFace


31 Comments on NVIDIA Fine-Tunes Llama3.1 Model to Beat GPT-4o and Claude 3.5 Sonnet with Only 70 Billion Parameters

JWNoctis

FWIW you can try the thing on your own PC, quantized, with Ollama/llama.cpp and a reasonably new setup with >=64GB of RAM. 32GB RAM + 16GB VRAM might also work with GPU offload, and a newer NVIDIA video card with the proper CUDA setup could speed up prompt processing to something tolerable, but generation would still be slow, unless you have that much memory in pure VRAM.

65B-70B model size had been where things start to get really interesting, and the largest model a good home PC could locally run without excessively damaging quantization. They can be fun to play around with, and put quite a few things into perspective.

AleksandarK

News Editor

JWNoctis said: FWIW you can try the thing on your own PC, quantized, with Ollama/llama.cpp and a reasonably new setup with >=64GB of RAM. 32GB RAM + 16GB VRAM might also work with GPU offload, and a newer NVIDIA video card with the proper CUDA setup could speed up prompt processing to something tolerable, but generation would still be slow, unless you have that much memory in pure VRAM.

65B-70B model size had been where things start to get really interesting, and the largest model a good home PC could locally run without excessively damaging quantization. They can be fun to play around with, and put quite a few things into perspective.

IIRC 4-bit quants should allow it to run on a (high-end) PC. I'd love to one day run a model of this size locally! :)

Solid State Brain

How much of a high-end PC do you mean? To run a 4-bit, 70-billion-parameter LLM at acceptable speeds you'd still need 48GB of VRAM, which implies 2×24GB consumer-tier GPUs.
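For anyone who wants to check the sizing themselves, here is a rough Python sketch of the weights-only arithmetic. The function names are made up for illustration, and it ignores the KV cache, activations, and runtime buffers, so real usage will be higher than these figures.

```python
import math


def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


def gpus_needed(total_gb: float, vram_per_gpu_gb: float) -> int:
    """Smallest number of identical GPUs whose combined VRAM holds total_gb."""
    return math.ceil(total_gb / vram_per_gpu_gb)


weights_4bit = weight_memory_gb(70, 4)    # 70B parameters at 4 bits each
weights_16bit = weight_memory_gb(70, 16)  # 70B parameters at 16 bits each

print(weights_4bit, gpus_needed(weights_4bit, 24))     # 35.0 2
print(weights_16bit, gpus_needed(weights_16bit, 80))   # 140.0 2
print(weights_16bit, gpus_needed(weights_16bit, 40))   # 140.0 4
```

The 16-bit results line up with the article's "four 40 GB or two 80 GB" figure, and the 4-bit weights alone come to 35 GB, which is why two 24GB cards are the usual consumer answer once some headroom is added.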

AleksandarK

News Editor

Solid State Brain said: How much of a high-end PC do you mean? To run a 4-bit, 70-billion-parameter LLM at acceptable speeds you'd still need 48GB of VRAM, which implies 2×24GB consumer-tier GPUs.

There are options to use a Threadripper + 128GB RAM. That should cost less than two GPUs :)

JWNoctis

AleksandarK said: IIRC 4-bit quants should allow it to run on a (high-end) PC. I'd love to one day run a model of this size locally! :)

Solid State Brain said: How much of a high-end PC do you mean? To run a 4-bit, 70-billion-parameter LLM at acceptable speeds you'd need 48GB of VRAM, which implies 2×24GB consumer-tier GPUs.

I could only say there is a difference between "acceptable speed for production purposes" and "acceptable speed for some little experiment that takes the time of a cup of tea for each prompt." RAM is probably the only way to do it at reasonable cost in a consumer PC right now. 2x3090 or 2x4090 for this sole purpose would be rather extravagant.

Bloste

Up - 1 R

Yep, it works fine.

Wirko

AleksandarK said: There are options to use a Threadripper + 128GB RAM. That should cost less than two GPUs :)

JWNoctis said: "acceptable speed for some little experiment that takes the time of a cup of tea for each prompt."

Even a Ryzen 8700G with 128 GB RAM might be somewhat usable (depending on user's tolerance to giga-amounts of tea/coffee/beer/popcorn/peanuts/pizza).

Jun

It passes the strawberry test unlike GPT and Bing, but no one told it about the techpowerup test.
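For reference, the ground truth for both tests is a one-liner. A minimal Python sketch (the helper name is made up for illustration):

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


print(count_letter("strawberry", "r"))   # 3
print(count_letter("TechPowerUp", "r"))  # 1
print(count_letter("Up", "r"))           # 0
```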

beautyless

Bloste said: Up - 1 R

Yep, it works fine.

It doesn't seem to understand common sense, and it needs to learn.

#10

Wirko

Jun said: It passes the strawberry test unlike GPT and Bing, but no one told it about the techpowerup test.

Beware of "reinforcement learning from human feedback (RLHF)". Tell it several times there are 18 Rs in "up", and it will tell you back that there are ~19 in TechPowerUp.

#11

AusWolf

So it writes you an essay when you ask something simple that could be answered with a single number. Um... great? :wtf:

I was taught in school that using longer sentences than necessary to convey meaning lowers the quality of any scientific work. Just sayin'.

Wirko said: Beware of "reinforcement learning from human feedback (RLHF)". Tell it several times there are 18 Rs in "up", and it will tell you back that there are ~19 in TechPowerUp.

This is exactly why AI is useless in the hands of critical thinkers, and dangerous in the hands of others.

#12

MacZ

Until LLMs are reliable to a very small epsilon, they are a gadget and a novelty. If you need an expert to check any output for correctness, it is hard to see where the progress is.

#13

AusWolf

MacZ said: Until LLMs are reliable to a very small epsilon, they are a gadget and a novelty. If you need an expert to check any output for correctness, it is hard to see where the progress is.

It will never be reliable because
1. It is trained by unreliable humans, and
2. The more LLM-generated answers form the basis of future answers, the more errors any new answer will produce.

#14

SOAREVERSOR

JWNoctis said: I could only say there is a difference between "acceptable speed for production purposes" and "acceptable speed for some little experiment that takes the time of a cup of tea for each prompt." RAM is probably the only way to do it at reasonable cost in a consumer PC right now. 2x3090 or 2x4090 for this sole purpose would be rather extravagant.

It's not extravagant. The point of the 3090 and the 4090 is not gaming but stuff like this. Granted the people that do stuff like this do far more than just this. But this is less of a waste of a GPU than gaming is.

#15

Solid State Brain

SOAREVERSOR said: It's not extravagant. The point of the 3090 and the 4090 is not gaming but stuff like this. Granted the people that do stuff like this do far more than just this. But this is less of a waste of a GPU than gaming is.

For what it's worth, there's a relatively sizable number of people with multiple high-end GPU rigs for what is essentially "LLM entertainment" (text-based interactive roleplay, etc).

Some can be found here: www.reddit.com/r/LocalLLaMA/

#16

jonny2772

Solid State Brain said: For what it's worth, there's a relatively sizable number of people with multiple high-end GPU rigs for what is essentially "LLM entertainment" (text-based interactive roleplay, etc).

Some can be found here: www.reddit.com/r/LocalLLaMA/

Some of those can get well into "do you really need all that?" territory. I remember quite well a 10 x 3090 build that once showed up in there; it made me feel extremely poor with "only" a dual 3090 build. That thing looked like we went back to the GPU mining era.

Edit: this one right here.

#17

LittleBro

Bloste said: Up - 1 R

Yep, it works fine.

That's it, this is Jensen's golden-egg-laying duck "AI". Dumb as f*ck. We are wasting electricity and money for accelerating this hallucinating "AI". What a waste of resources ...

#18

igormp

Solid State Brain said: How much of a high-end PC do you mean? To run a 4-bit, 70-billion-parameter LLM at acceptable speeds you'd still need 48GB of VRAM, which implies 2×24GB consumer-tier GPUs.

Yeah, I have run llama 70b with my 2x3090 (vram usage is closer to 35~40GB fwiw).
If the 5090 really comes with 32GB, those 64GB with a pair will be really awesome.

#19

jonny2772

igormp said: Yeah, I have run llama 70b with my 2x3090 (vram usage is closer to 35~40GB fwiw).
If the 5090 really comes with 32GB, those 64GB with a pair will be really awesome.

64GB is the dream, but if the 5090 comes out with the price and TDP that are being rumored, the trend will stay on "just hoard more 3090s". If a single 5090 ends up being the $2500 some people are claiming, you can get 3 3090s and 72GB for that amount, maybe even 4 if you really scour the used market. Or at least you can, assuming it doesn't pull a P40 and the used prices jump to double of what they were before people found out about getting used Tesla cards.

#20

JWNoctis

Solid State Brain said: For what it's worth, there's a relatively sizable number of people with multiple high-end GPU rigs for what is essentially "LLM entertainment" (text-based interactive roleplay, etc).

Some can be found here: www.reddit.com/r/LocalLLaMA/

I had no idea people were doing that, but I could see why. Quite a lot of those roleplays would be unprintably intense one way or another, and wouldn't fly under the ToS of almost any mass-market model behind an API, which are not specifically finetuned to do well in this use case anyway.

It is somewhere no one had gone before until very recently, too; I don't think many could do that kind of roleplay with the level of responsiveness, variety, and apparent quality a GPU-hosted roleplay-finetuned LLM would provide. Play it like a holodeck, or even play it like the Matrix where you controlled everything. Yep, could see the appeal there.

But then again, many of the roleplay finetunes are also quite excellent for light scene writing and for bouncing ideas off of to see what sticks. They are already fit for use where neither mathematics-level correctness nor absolute fidelity is required.

For that matter they are good for saving developer time too - sometimes all you need is a pointer to the right direction, and LLMs are good for that, even if the specific details might not be right.

#21

Zazigalka

NVIDIA Fine-Tunes Llama3.1 Model to Beat GPT-4o and Claude 3.5 Sonnet with Only 70 Billion Parameters

wat ?

#22

cvaldes

Zazigalka said: wat ?

Read the original article (post #1). It explains what the headline states. This is typical in expository writing.

What is implied but not explicitly stated is that whittling down LLaMa 3.1 to a 70 billion parameter model means that Nvidia improved on performance-per-watt and performance-per-dollar metrics by reducing the hardware requirements necessary to run an AI chatbot that compares favorably to GPT-4o and Claude 3.5 Sonnet.

Why is this important? One achievement that many are striving to attain is putting a useful model on a handheld device like a smartphone. I don't know if these model sizes scale linearly but perhaps an 8-10 billion parameter model could fit on a modern premium smartphone (8GB RAM).

AI adoption will take off once it runs well (and practically) on a smartphone, the primary computing modality of consumers in 2024.

Running an LLM AI chatbot on some PC loaded with two 24GB GPUs or pumped up with 128GB system RAM is not where consumer innovation is going to happen. It will happen in the handheld sector.

#23

igormp

jonny2772 said: 64GB is the dream, but if the 5090 comes out with the price and TDP that are being rumored, the trend will stay on "just hoard more 3090s". If a single 5090 ends up being the $2500 some people are claiming, you can get 3 3090s and 72GB for that amount, maybe even 4 if you really scour the used market. Or at least you can, assuming it doesn't pull a P40 and the used prices jump to double of what they were before people found out about getting used Tesla cards.

Yeah, at some point it may even be more worthwhile to just rent H100s to fine-tune your models and then just use your local setup for inference. The model in question here is proof that a good finetune can take you a long way.

Getting more than 2 GPUs would be hard for me since I'm on a consumer platform that is limited in lanes/slots :(
Using a used/older server/WS platform would consume way too much power and be lacking in CPU perf, not to mention the space required.

Doing wack jobs for splitting a single x8 slot into x4/x4 for more GPUs could be an option, but then the bandwidth limitation for training would still be annoying (not so much of an issue for inference tho).

But that's pretty much a rich-people-problem situation anyway haha

Zazigalka said: wat ?

This is talking about LLMs (such as what's running behind ChatGPT), and how Nvidia created a model based on Meta's Llama (which is open source) with way fewer parameters than what's usual, but with amazing performance.

#24

Solid State Brain

cvaldes said: Why is this important? One achievement that many are striving to attain is putting a useful model on a handheld device like a smartphone. I don't know if these model sizes scale linearly but perhaps an 8-10 billion parameter model could fit on a modern premium smartphone (8GB RAM).

Smaller models that can fit within the memory of a modern high-end smartphone already exist, but they're nowhere near as capable as larger ones: they have less attention to detail, less knowledge, less answer accuracy, can go off the rails or become incoherent more easily, can't understand complex instructions, etc. Additionally, quantization beyond a certain level (think of quantization like reducing the color depth of an image to save memory) can degrade performance relative to the standard 16-bit baseline as well.

An 8 billion parameter model in 4-bit (0.5 bytes per parameter) would take about 4 GB of memory, plus additional memory on top of that for processing text/requests/chatting, but 8GB should be sufficient. The biggest limitation would probably be memory bandwidth, even with fast LPDDR5X, since the entirety of the model weights must be read for every token generated. That said, the maximum theoretical speed for a 64-bit LPDDR5X-8400 module iterating over a 4GB LLM would be acceptable, roughly 17 tokens/s. After that, the next limit would possibly be power consumption.
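The arithmetic behind that estimate can be sketched in a few lines of Python. The function names are made up for illustration, and the result is a theoretical ceiling that assumes every weight is read exactly once per token and ignores compute, caches, and controller overhead.

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s: float, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s: transfers per second times bytes per transfer."""
    return transfer_rate_mt_s * 1e6 * (bus_width_bits / 8) / 1e9


def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on generation speed if all weights are read once per token."""
    return bandwidth_gb_s / model_size_gb


model_gb = 8e9 * 0.5 / 1e9                    # 8B parameters at 0.5 bytes each
bandwidth = peak_bandwidth_gb_s(8400, 64)     # 64-bit LPDDR5X-8400
tokens_per_s = max_tokens_per_second(bandwidth, model_gb)

print(model_gb)                 # 4.0
print(round(bandwidth, 1))      # 67.2
print(round(tokens_per_s, 1))   # 16.8
```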

#25

kondamin

Would have been nice if Intel hadn't dumped 3D XPoint, so a TB of that could be put on a 5090
