
NVIDIA Fine-Tunes Llama3.1 Model to Beat GPT-4o and Claude 3.5 Sonnet with Only 70 Billion Parameters

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,626 (0.98/day)
NVIDIA has officially released its Llama-3.1-Nemotron-70B-Instruct model. Based on Meta's Llama 3.1 70B, the Nemotron model is a large language model customized by NVIDIA to improve the helpfulness of LLM-generated responses. NVIDIA uses fine-tuning on structured data to steer the model toward more helpful answers. With only 70 billion parameters, the model is punching far above its weight class. The company claims that it beats the current top models from leading labs, such as OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet, which are the current leaders across AI benchmarks. In evaluations such as Arena Hard, NVIDIA's Llama-3.1-Nemotron-70B scores 85 points, while GPT-4o and Claude 3.5 Sonnet score 79.3 and 79.2, respectively. Other benchmarks, like AlpacaEval and MT-Bench, also see NVIDIA holding the top spot, with scores of 57.6 and 8.98; Claude and GPT reach 52.4 / 8.81 and 57.5 / 8.74, just below Nemotron.

This language model underwent training using reinforcement learning from human feedback (RLHF), specifically employing the REINFORCE algorithm. The process involved a reward model based on a large language model architecture and custom preference prompts designed to guide the model's behavior. The training began with a pre-existing instruction-tuned language model as the starting point: Llama-3.1-70B-Instruct served as the initial policy, which was then trained against the Llama-3.1-Nemotron-70B-Reward model using HelpSteer2-Preference prompts. Running the model locally requires either four 40 GB or two 80 GB VRAM GPUs and 150 GB of free disk space. We managed to take it for a spin on NVIDIA's website to say hello to TechPowerUp readers. The model also passes the infamous "strawberry" test, where it has to count the occurrences of a specific letter in a word; however, it appears that this test was part of the fine-tuning data, as it fails the next test, shown in the image below.
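To make the REINFORCE idea above a bit more concrete, here is a toy sketch in Python/PyTorch of a single policy-gradient step against a frozen reward model. Everything in it is a stand-in miniature for illustration only; it is not NVIDIA's training code, and the real setup uses the full Llama-3.1-70B-Instruct policy and the Nemotron-70B reward model rather than tiny linear layers.

```python
# Toy sketch of the REINFORCE step used in RLHF. All components are stand-in
# miniatures for illustration; the real pipeline uses a 70B policy model and
# the Llama-3.1-Nemotron-70B-Reward model instead of these tiny linear layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab = 8                                # tiny stand-in "vocabulary"
policy = nn.Linear(vocab, vocab)         # stand-in policy: prompt -> response logits
reward_model = nn.Linear(2 * vocab, 1)   # stand-in reward model: scores (prompt, response)
opt = torch.optim.SGD(policy.parameters(), lr=1e-2)
baseline = 0.0                           # running mean reward, reduces gradient variance

for step in range(200):
    prompt = F.one_hot(torch.randint(vocab, (1,)), vocab).float()
    dist = torch.distributions.Categorical(logits=policy(prompt))
    response = dist.sample()             # policy samples a "response"
    log_prob = dist.log_prob(response)

    with torch.no_grad():                # reward model stays frozen during RL
        resp = F.one_hot(response, vocab).float()
        reward = reward_model(torch.cat([prompt, resp], dim=-1)).squeeze()

    # REINFORCE: increase the log-probability of responses the reward model prefers
    loss = -((reward - baseline) * log_prob).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    baseline = 0.9 * baseline + 0.1 * reward.item()
```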




View at TechPowerUp Main Site | Source
 
Joined
May 22, 2024
Messages
412 (2.01/day)
System Name Kuro
Processor AMD Ryzen 7 7800X3D@65W
Motherboard MSI MAG B650 Tomahawk WiFi
Cooling Thermalright Phantom Spirit 120 EVO
Memory Corsair DDR5 6000C30 2x48GB (Hynix M)@6000 30-36-36-76 1.36V
Video Card(s) PNY XLR8 RTX 4070 Ti SUPER 16G@200W
Storage Crucial T500 2TB + WD Blue 8TB
Case Lian Li LANCOOL 216
Power Supply MSI MPG A850G
Software Ubuntu 24.04 LTS + Windows 10 Home Build 19045
Benchmark Scores 17761 C23 Multi@65W
FWIW, you can try the thing on your own PC, quantized, with Ollama/llama.cpp and a reasonably new setup with >=64 GB of RAM. 32 GB RAM + 16 GB VRAM might also work with GPU offload, and a newer NVIDIA video card with the proper CUDA setup can speed up prompt processing to something tolerable, but generation will still be slow unless you have that much memory in pure VRAM.

The 65B-70B model size has been where things start to get really interesting, and it's about the largest model a good home PC can run locally without excessively damaging quantization. They can be fun to play around with, and they put quite a few things into perspective.
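For anyone wanting to try that route, here is a minimal llama-cpp-python sketch, assuming you have a GGUF quant of the model on disk (the file name below is hypothetical) and built the package with CUDA support for partial offload:

```python
# Minimal sketch: running a quantized GGUF build locally with llama-cpp-python.
# The model file name is hypothetical; use whichever Nemotron-70B GGUF quant
# fits your RAM/VRAM. n_gpu_layers controls how many layers go to the GPU.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.1-Nemotron-70B-Instruct-Q4_K_M.gguf",  # hypothetical path
    n_ctx=4096,         # context window; larger contexts need more memory
    n_gpu_layers=30,    # partial offload for a 16 GB card; 0 = CPU only, -1 = all layers
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello to TechPowerUp readers."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```

On a mostly-CPU setup, generation is bandwidth-bound, so expect on the order of a couple of tokens per second for a 70B quant.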
 

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,626 (0.98/day)
FWIW, you can try the thing on your own PC, quantized, with Ollama/llama.cpp and a reasonably new setup with >=64 GB of RAM. 32 GB RAM + 16 GB VRAM might also work with GPU offload, and a newer NVIDIA video card with the proper CUDA setup can speed up prompt processing to something tolerable, but generation will still be slow unless you have that much memory in pure VRAM.

The 65B-70B model size has been where things start to get really interesting, and it's about the largest model a good home PC can run locally without excessively damaging quantization. They can be fun to play around with, and they put quite a few things into perspective.
IIRC 4-bit quants should allow it to run on a (high-end) PC. I'd love to one day run a model of this size locally! :)
 
Joined
Jun 22, 2012
Messages
301 (0.07/day)
Processor Intel i7-12700K
Motherboard MSI PRO Z690-A WIFI
Cooling Noctua NH-D15S
Memory Corsair Vengeance 4x16 GB (64GB) DDR4-3600 C18
Video Card(s) MSI GeForce RTX 3090 GAMING X TRIO 24G
Storage Samsung 980 Pro 1TB, SK hynix Platinum P41 2TB
Case Fractal Define C
Power Supply Corsair RM850x
Mouse Logitech G203
Software openSUSE Tumbleweed
How much of a high-end PC do you mean? To run a 4-bit, 70-billion-parameter LLM at acceptable speeds you'd still need about 48 GB of VRAM, which implies 2×24 GB consumer-tier GPUs.
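A rough back-of-the-envelope for where that 48 GB figure comes from (ballpark only; real usage also depends on the quant format, context length, and KV cache):

```python
# Ballpark memory footprint of quantized LLM weights; KV cache and runtime
# overhead come on top, which is why ~48 GB is a comfortable target for a
# 4-bit 70B model rather than the bare ~35 GB of weights.
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit: ~{weight_memory_gb(70, bits):.0f} GB of weights")
# 70B @ 16-bit: ~140 GB, @ 8-bit: ~70 GB, @ 4-bit: ~35 GB
```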
 

AleksandarK

News Editor
Staff member
Joined
Aug 19, 2017
Messages
2,626 (0.98/day)
How much of a high-end PC do you mean? To run a 4-bit, 70-billion-parameter LLM at acceptable speeds you'd still need about 48 GB of VRAM, which implies 2×24 GB consumer-tier GPUs.
There are options like a Threadripper + 128 GB of RAM. That should cost less than two GPUs :)
 
Joined
May 22, 2024
Messages
412 (2.01/day)
System Name Kuro
Processor AMD Ryzen 7 7800X3D@65W
Motherboard MSI MAG B650 Tomahawk WiFi
Cooling Thermalright Phantom Spirit 120 EVO
Memory Corsair DDR5 6000C30 2x48GB (Hynix M)@6000 30-36-36-76 1.36V
Video Card(s) PNY XLR8 RTX 4070 Ti SUPER 16G@200W
Storage Crucial T500 2TB + WD Blue 8TB
Case Lian Li LANCOOL 216
Power Supply MSI MPG A850G
Software Ubuntu 24.04 LTS + Windows 10 Home Build 19045
Benchmark Scores 17761 C23 Multi@65W
IIRC 4-bit quants should allow it to run on a (high-end) PC. I'd love to one day run a model of this size locally! :)
How much of a high-end PC do you mean? To run a 4-bit, 70-billion-parameter LLM at acceptable speeds you'd need about 48 GB of VRAM, which implies 2×24 GB consumer-tier GPUs.
I could only say there is a difference between "acceptable speed for production purposes" and "acceptable speed for some little experiment that takes the time of a cup of tea for each prompt." RAM is probably the only way to do it at a reasonable cost in a consumer PC right now. 2x3090 or 2x4090 for this sole purpose would be rather extravagant.
 
Joined
Jan 3, 2021
Messages
3,562 (2.48/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
There are options like a Threadripper + 128 GB of RAM. That should cost less than two GPUs :)
"acceptable speed for some little experiment that takes the time of a cup of tea for each prompt."
Even a Ryzen 8700G with 128 GB of RAM might be somewhat usable (depending on the user's tolerance for giga-amounts of tea/coffee/beer/popcorn/peanuts/pizza).
 

Jun

Joined
May 6, 2022
Messages
66 (0.07/day)
System Name Alpha
Processor AMD Ryzen 7 5800X3D [PBO2 tuner -30 all cores]
Motherboard GIGABYTE B550I AORUS PRO AX (rev. 1.0)
Cooling ekwb EK-AIO 240 D-RGB
Memory Trident Z Neo DDR4-3600 CL16 32GB GTZN [15-15-15-35 3800MHz@1.45V]
Video Card(s) INNO3D GEFORCE RTX 3080 TI X3 OC [2010MHz@993mV, +1300MHz]
Storage Kingston FURY Renegade 2TB
Display(s) Samsung Odyssey G7 32” // ASUS ROG Strix XG16AHP
Case Lian Li A4-H2O
Audio Device(s) CREATIVE Sound BlasterX G6 // polk MagniFi Mini //SHURE SE846 //steelseries Arctis Nova Pro Wireless
Power Supply SilverStone SX750 Platinum V1.1
Mouse Keychron M4
Keyboard Keychron K3 Max
Software Microsoft Windows 11 Pro
It passes the strawberry test unlike GPT and Bing, but no one told it about the techpowerup test.
 
Joined
Feb 10, 2010
Messages
103 (0.02/day)
Location
Thailand
System Name amy-pc
Processor ryzen 5 2600
Motherboard asus a320m-k
Cooling stock cpu fan
Memory 16gb(8*2) bus 3200
Video Card(s) msi rx560 4gb
Storage wd black 500gb sn750 nvme, 2x120gb apacer sata (raid0), 8tb nas synology ds220j
Display(s) msi optix g24 series, freesync 75hz
Audio Device(s) nubwo southpaw ns-12
Power Supply cooler master 550w
Mouse g102
Keyboard philips spk8901
Software windows 11 insider
Up - 1 R

Yep, it works fine.
It seems it doesn't understand common sense, though, and it needs to learn.
1729158618529.png



1729158690971.png
 
Joined
Jan 3, 2021
Messages
3,562 (2.48/day)
Location
Slovenia
Processor i5-6600K
Motherboard Asus Z170A
Cooling some cheap Cooler Master Hyper 103 or similar
Memory 16GB DDR4-2400
Video Card(s) IGP
Storage Samsung 850 EVO 250GB
Display(s) 2x Oldell 24" 1920x1200
Case Bitfenix Nova white windowless non-mesh
Audio Device(s) E-mu 1212m PCI
Power Supply Seasonic G-360
Mouse Logitech Marble trackball, never had a mouse
Keyboard Key Tronic KT2000, no Win key because 1994
Software Oldwin
It passes the strawberry test unlike GPT and Bing, but no one told it about the techpowerup test.
Beware of "reinforcement learning from human feedback (RLHF)". Tell it several times there are 18 Rs in "up", and it will tell you back that there are ~19 in TechPowerUp.
 
Joined
Jan 14, 2019
Messages
12,479 (5.78/day)
Location
Midlands, UK
System Name Nebulon B
Processor AMD Ryzen 7 7800X3D
Motherboard MSi PRO B650M-A WiFi
Cooling be quiet! Dark Rock 4
Memory 2x 24 GB Corsair Vengeance DDR5-4800
Video Card(s) AMD Radeon RX 6750 XT 12 GB
Storage 2 TB Corsair MP600 GS, 2 TB Corsair MP600 R2
Display(s) Dell S3422DWG, 7" Waveshare touchscreen
Case Kolink Citadel Mesh black
Audio Device(s) Logitech Z333 2.1 speakers, AKG Y50 headphones
Power Supply Seasonic Prime GX-750
Mouse Logitech MX Master 2S
Keyboard Logitech G413 SE
Software Bazzite (Fedora Linux) KDE
So it writes you an essay when you ask something simple that could be answered with a single number. Um... great? :wtf:

I was taught in school that using longer sentences than necessary to convey meaning lowers the quality of any scientific work. Just sayin'.

Beware of "reinforcement learning from human feedback (RLHF)". Tell it several times there are 18 Rs in "up", and it will tell you back that there are ~19 in TechPowerUp.
This is exactly why AI is useless in the hands of critical thinkers, and dangerous in the hands of others.
 
Joined
Jun 29, 2023
Messages
107 (0.20/day)
Until LLMs are reliable to a very small epsilon, they are a gadget and a novelty. If you need an expert to check any output for correctness, it is hard to see where the progress is.
 
Joined
Jan 14, 2019
Messages
12,479 (5.78/day)
Location
Midlands, UK
System Name Nebulon B
Processor AMD Ryzen 7 7800X3D
Motherboard MSi PRO B650M-A WiFi
Cooling be quiet! Dark Rock 4
Memory 2x 24 GB Corsair Vengeance DDR5-4800
Video Card(s) AMD Radeon RX 6750 XT 12 GB
Storage 2 TB Corsair MP600 GS, 2 TB Corsair MP600 R2
Display(s) Dell S3422DWG, 7" Waveshare touchscreen
Case Kolink Citadel Mesh black
Audio Device(s) Logitech Z333 2.1 speakers, AKG Y50 headphones
Power Supply Seasonic Prime GX-750
Mouse Logitech MX Master 2S
Keyboard Logitech G413 SE
Software Bazzite (Fedora Linux) KDE
Until LLMs are reliable to a very small epsilon, they are a gadget and a novelty. If you need an expert to check any output for correctness, it is hard to see where the progress is.
It will never be reliable because
1. It is trained by unreliable humans, and
2. The more LLM-generated answers form the basis of future answers, the more errors any new answer will produce.
 
Joined
Apr 13, 2022
Messages
1,185 (1.22/day)
I could only say there is a difference between "acceptable speed for production purposes" and "acceptable speed for some little experiment that takes the time of a cup of tea for each prompt." RAM is probably the only way to do it at reasonable cost in consumer PC right now. 2x3090 or 2x4090 for this sole purpose would be rather extravagant.
It's not extravagant. The point of the 3090 and the 4090 is not gaming but stuff like this. Granted the people that do stuff like this do far more than just this. But this is less of a waste of a GPU than gaming is.
 
Joined
Jun 22, 2012
Messages
301 (0.07/day)
Processor Intel i7-12700K
Motherboard MSI PRO Z690-A WIFI
Cooling Noctua NH-D15S
Memory Corsair Vengeance 4x16 GB (64GB) DDR4-3600 C18
Video Card(s) MSI GeForce RTX 3090 GAMING X TRIO 24G
Storage Samsung 980 Pro 1TB, SK hynix Platinum P41 2TB
Case Fractal Define C
Power Supply Corsair RM850x
Mouse Logitech G203
Software openSUSE Tumbleweed
It's not extravagant. The point of the 3090 and the 4090 is not gaming but stuff like this. Granted the people that do stuff like this do far more than just this. But this is less of a waste of a GPU than gaming is.

For what it's worth, there's a relatively sizable number of people with multiple high-end GPU rigs for what is essentially "LLM entertainment" (text-based interactive roleplay, etc.).

Some can be found here: https://www.reddit.com/r/LocalLLaMA/

1729163068032.png
 
Joined
Mar 14, 2024
Messages
12 (0.04/day)
For what it's worth, there's a relatively sizable number of people with multiple high-end GPU rigs for what is essentially "LLM entertainment" (text-based interactive roleplay, etc.).

Some can be found here: https://www.reddit.com/r/LocalLLaMA/

View attachment 367943
Some of those can get well into "do you really need all that?" territory. I remember quite well a 10x 3090 build that once showed up in there; it made me feel extremely poor with "only" a dual-3090 build. That thing looked like a throwback to the GPU mining era.

Edit: this one right here.
 
Joined
Jul 24, 2024
Messages
263 (1.87/day)
System Name AM4_TimeKiller
Processor AMD Ryzen 5 5600X @ all-core 4.7 GHz
Motherboard ASUS ROG Strix B550-E Gaming
Cooling Arctic Freezer II 420 rev.7 (push-pull)
Memory G.Skill TridentZ RGB, 2x16 GB DDR4, B-Die, 3800 MHz @ CL14-15-14-29-43 1T, 53.2 ns
Video Card(s) ASRock Radeon RX 7800 XT Phantom Gaming
Storage Samsung 990 PRO 1 TB, Kingston KC3000 1 TB, Kingston KC3000 2 TB
Case Corsair 7000D Airflow
Audio Device(s) Creative Sound Blaster X-Fi Titanium
Power Supply Seasonic Prime TX-850
Mouse Logitech wireless mouse
Keyboard Logitech wireless keyboard
Up - 1 R

Yep, it works fine.
That's it, this is Jensen's golden-egg-laying duck "AI". Dumb as f*ck. We are wasting electricity and money on accelerating this hallucinating "AI". What a waste of resources ...
 
Joined
May 10, 2023
Messages
304 (0.52/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
How much of a high-end PC do you mean? To run a 4-bit, 70 billion parameters LLM at acceptable speeds you'd still need 48GB of VRAM, which implies 2×24GB consumer-tier GPUs.
Yeah, I have run Llama 70B with my 2x3090 (VRAM usage is closer to 35-40 GB FWIW).
If the 5090 really comes with 32 GB, the 64 GB from a pair will be really awesome.
 
Joined
Mar 14, 2024
Messages
12 (0.04/day)
Yeah, I have run Llama 70B with my 2x3090 (VRAM usage is closer to 35-40 GB FWIW).
If the 5090 really comes with 32 GB, the 64 GB from a pair will be really awesome.
64GB is the dream, but if the 5090 comes out with the price and TDP that are being rumored, the trend will stay on "just hoard more 3090s". If a single 5090 ends up being the $2500 some people are claiming, you can get 3 3090s and 72GB for that amount, maybe even 4 if you really scour the used market. Or at least you can, assuming it doesn't pull a P40 and the used prices jump to double of what they were before people found out about getting used Tesla cards.
 
Joined
May 22, 2024
Messages
412 (2.01/day)
System Name Kuro
Processor AMD Ryzen 7 7800X3D@65W
Motherboard MSI MAG B650 Tomahawk WiFi
Cooling Thermalright Phantom Spirit 120 EVO
Memory Corsair DDR5 6000C30 2x48GB (Hynix M)@6000 30-36-36-76 1.36V
Video Card(s) PNY XLR8 RTX 4070 Ti SUPER 16G@200W
Storage Crucial T500 2TB + WD Blue 8TB
Case Lian Li LANCOOL 216
Power Supply MSI MPG A850G
Software Ubuntu 24.04 LTS + Windows 10 Home Build 19045
Benchmark Scores 17761 C23 Multi@65W
For what it's worth, there's a relatively sizable number of people with multiple high-end GPU rigs for what is essentially "LLM entertainment" (text-based interactive roleplay, etc.).

Some can be found here: https://www.reddit.com/r/LocalLLaMA/

View attachment 367943
I had no idea people were doing that, but I can see why. Quite a lot of those roleplays would be unprintably intense one way or another, and wouldn't fly under the ToS of almost any mass-market model behind an API, which aren't specifically finetuned to do well in this use case anyway.

It is also somewhere almost no one had gone until very recently; I don't think many could get that kind of roleplay with the level of responsiveness, variety, and apparent quality a GPU-hosted, roleplay-finetuned LLM provides. Play it like a holodeck, or even like the Matrix where you control everything. Yep, I can see the appeal there.

But then again, many of the roleplay finetunes are also quite excellent for light scene writing and for bouncing ideas off of to see what sticks. They are already fit for use where no mathematics-level correctness nor absolute fidelity is required.

For that matter, they are good for saving developer time too - sometimes all you need is a pointer in the right direction, and LLMs are good for that, even if the specific details might not be right.
 
Joined
Jun 21, 2021
Messages
3,121 (2.46/day)
System Name daily driver Mac mini M2 Pro
Processor Apple proprietary M2 Pro (6 p-cores, 4 e-cores)
Motherboard Apple proprietary
Cooling Apple proprietary
Memory Apple proprietary 16GB LPDDR5 unified memory
Video Card(s) Apple proprietary M2 Pro (16-core GPU)
Storage Apple proprietary onboard 512GB SSD + various external HDDs
Display(s) LG UltraFine 27UL850W (4K@60Hz IPS)
Case Apple proprietary
Audio Device(s) Apple proprietary
Power Supply Apple proprietary
Mouse Apple Magic Trackpad 2
Keyboard Keychron K1 tenkeyless (Gateron Reds)
VR HMD Oculus Rift S (hosted on a different PC)
Software macOS Sonoma 14.7
Benchmark Scores (My Windows daily driver is a Beelink Mini S12 Pro. I'm not interested in benchmarking.)
Read the original article (post #1). It explains what the headline states. This is typical in expository writing.

What is implied but not explicitly stated is that whittling down LLaMa 3.1 to a 70 billion parameter model means that Nvidia improved on performance-per-watt and performance-per-dollar metrics by reducing the hardware requirements necessary to run an AI chatbot that compares favorably to GPT-4o and Claude 3.5 Sonnet.

Why is this important? One achievement that many are striving to attain is putting a useful model on a handheld device like a smartphone. I don't know if these model sizes scale linearly, but perhaps an 8-10 billion parameter model could fit on a modern premium smartphone (8 GB RAM).

AI adoption will take off once it runs well (and practically) on a smartphone, the primary computing modality of consumers in 2024.

Running an LLM AI chatbot on some PC loaded with two 24GB GPUs or pumped up with 128GB system RAM is not where consumer innovation is going to happen. It will happen in the handheld sector.
 
Joined
May 10, 2023
Messages
304 (0.52/day)
Location
Brazil
Processor 5950x
Motherboard B550 ProArt
Cooling Fuma 2
Memory 4x32GB 3200MHz Corsair LPX
Video Card(s) 2x RTX 3090
Display(s) LG 42" C2 4k OLED
Power Supply XPG Core Reactor 850W
Software I use Arch btw
64GB is the dream, but if the 5090 comes out with the price and TDP that are being rumored, the trend will stay on "just hoard more 3090s". If a single 5090 ends up being the $2500 some people are claiming, you can get 3 3090s and 72GB for that amount, maybe even 4 if you really scour the used market. Or at least you can, assuming it doesn't pull a P40 and the used prices jump to double of what they were before people found out about getting used Tesla cards.
Yeah, at some point it may even be more worthwhile to just rent H100s to fine-tune your models and then use your local setup for inference. The model in question here is proof that a good finetune can take you a long way.

Getting more than 2 GPUs would be hard for me since I'm on a consumer platform that is limited in lanes/slots :(
Using a used/older server/WS platform would consume way too much power and be lacking in CPU perf, not to mention the space required.

Hacky jobs splitting a single x8 slot into x4/x4 for more GPUs could be an option, but then the bandwidth limitation for training would still be annoying (not so much of an issue for inference, though).

But that's pretty much a rich-people-problem situation anyway haha

This is talking about LLMs (such as what's running behind ChatGPT), and how NVIDIA created a model based on Meta's Llama (which is open source) with far fewer parameters than usual, but with amazing performance.
 
Joined
Jun 22, 2012
Messages
301 (0.07/day)
Processor Intel i7-12700K
Motherboard MSI PRO Z690-A WIFI
Cooling Noctua NH-D15S
Memory Corsair Vengeance 4x16 GB (64GB) DDR4-3600 C18
Video Card(s) MSI GeForce RTX 3090 GAMING X TRIO 24G
Storage Samsung 980 Pro 1TB, SK hynix Platinum P41 2TB
Case Fractal Define C
Power Supply Corsair RM850x
Mouse Logitech G203
Software openSUSE Tumbleweed
Why is this important? One achievement that many are striving to attain is putting a useful model on a handheld device like a smartphone. I don't know if these model sizes scale linearly, but perhaps an 8-10 billion parameter model could fit on a modern premium smartphone (8 GB RAM).

Smaller models that can fit within the memory of a modern high-end smartphone already exist, but they're nowhere as capable as larger ones—they have less attention to detail, less knowledge, less answer accuracy, can get off the rails/incoherent more easily, can't understand complex instructions, etc; additionally, quantization beyond a certain level (think of quantization like reducing the color depth of an image to save memory) can degrade performance over the standard 16-bit baseline as well.

An 8-billion-parameter model in 4-bit (0.5 bytes per parameter) would take about 4 GB of memory, plus additional memory on top of that for processing text/requests/chatting, but 8 GB should be sufficient. The biggest limitation would probably be the bandwidth of even fast LPDDR5X memory (the entirety of the model weights must be read for every token generated, although the maximum theoretical speed of a 64-bit LPDDR5X-8400 module iterating over a 4 GB LLM would be acceptable—roughly 17 tokens/s), then possibly power consumption.
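The arithmetic behind that ~17 tokens/s figure, as a quick sanity-check sketch (theoretical peak only; real-world throughput will be lower):

```python
# Bandwidth-bound upper limit for token generation: essentially all model
# weights stream from memory once per generated token, so peak bandwidth
# divided by model size gives a rough ceiling on tokens per second.
def peak_bandwidth_gbs(bus_bits: int, mega_transfers_per_s: int) -> float:
    return bus_bits / 8 * mega_transfers_per_s / 1000   # bytes/transfer * MT/s -> GB/s

bw = peak_bandwidth_gbs(64, 8400)                       # 64-bit LPDDR5X-8400 -> ~67 GB/s
model_gb = 4.0                                          # 8B parameters at 4-bit
print(f"~{bw:.0f} GB/s / {model_gb:.0f} GB -> ~{bw / model_gb:.0f} tokens/s ceiling")
```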
 