Llama 3.1 INT4 Quantization: Cut Costs by 75% Without Sacrificing Performance!

Agent Issue
3 min readAug 14, 2024

This is important news for LLM practitioners who have been working with large language models across business and product use cases.

The Neural Magic team just hit a major milestone. They successfully quantized all of the Llama 3.1 models to INT4, and here's where it gets really exciting: this includes the massive 405B and 70B models, both of which retain roughly 100% of their original accuracy!

Llama 3.1 benchmarks against GPT-4, Gemma 2 and Claude 3.5 Sonnet

The 405B model, a beast that usually requires two 8x80GB GPU nodes, can now run on a single server with just 4 GPUs, whether you're working with A100s or H100s. That's roughly a 4x reduction in deployment cost! For those of us who've had to justify the resource demands of these large models, this is a huge win.
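To see where that "single server with 4 GPUs" figure comes from, here's a back-of-the-envelope sketch. The numbers are illustrative estimates for weight storage only (KV cache and activations need extra headroom), not measured figures from the Neural Magic release:

```python
import math

def weight_gb(params_billions: float, bits: int) -> float:
    """Approximate weight storage in GB for a model with the given
    parameter count (in billions) and bits per parameter."""
    return params_billions * 1e9 * bits / 8 / 1e9

# Llama 3.1 405B at 16-bit vs. 4-bit weights
fp16_gb = weight_gb(405, 16)  # 810.0 GB -> spills across two 8x80GB nodes
int4_gb = weight_gb(405, 4)   # 202.5 GB

# Minimum 80GB GPUs needed just to hold the weights
gpus_fp16 = math.ceil(fp16_gb / 80)  # 11, hence two 8-GPU nodes in practice
gpus_int4 = math.ceil(int4_gb / 80)  # 3, so 4 GPUs leaves serving headroom
```

In practice you provision more than the bare minimum, which is why full-precision serving uses two 8-GPU nodes rather than 11 GPUs, and why 4 GPUs (not 3) is the quoted INT4 setup.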

So, what's the magic behind this? The team quantized the weights of the Meta-Llama-3.1 models (405B, 70B, and 8B) to the INT4 data type. By shrinking the number of bits per parameter from 16 to 4, they slashed the disk size and GPU memory requirements by about 75% without compromising performance.
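The 75% figure falls straight out of the bit widths. A quick sketch of the arithmetic across the three model sizes (weight storage only, ignoring any small quantization metadata overhead):

```python
BITS_FP16 = 16
BITS_INT4 = 4

# Going from 16-bit to 4-bit cuts per-parameter storage
# from 2 bytes to 0.5 bytes: a 75% reduction.
reduction = 1 - BITS_INT4 / BITS_FP16  # 0.75

# Approximate weight footprint for each Llama 3.1 size, in GB
for params_billions in (8, 70, 405):
    before_gb = params_billions * BITS_FP16 / 8  # e.g. 405B -> 810 GB
    after_gb = params_billions * BITS_INT4 / 8   # e.g. 405B -> 202.5 GB
    print(f"{params_billions}B: {before_gb:.0f} GB -> {after_gb:.1f} GB")
```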

Llama 3.1: Tool use and multi-lingual agents

INT4 Performance


Written by Agent Issue

Your front-row seat to the future of Agents.
