Quantization, distillation, pruning: making models fit

Three ways to shrink an LLM. Quantization (Q2-Q8 with K-quants in GGUF), distillation (teacher to student), pruning. Why Q4_K_M is the community default and what each lever costs.

A 70B-parameter model at 16-bit precision is 140GB of weights. No consumer hardware loads that. Yet a 70B model runs on a 64GB MacBook Pro, and on a 24GB RTX 4090 if you use a very low-bit quant or offload some layers to system RAM. Three techniques close that gap, and one of them does almost all the work.

This is post 3 of 13 in the Local LLMs series. After the vocabulary post, you can read a model name. After this one, you'll know exactly how that 70B model file shrank to 35GB on disk and what the trade-off was.

The full-precision baseline

The classic training format is 32-bit floating point (FP32): 4 bytes per weight. So a 7B model at FP32 is 28GB; a 70B is 280GB. Training runs at FP32 or BF16 (a 16-bit variant) for numerical stability, and open-weight models are usually published at 16 bits, which is where the 140GB figure for a 70B model comes from.

For inference, you don't need that precision. The forward pass works fine on lower-precision numbers if you do the conversion carefully. That's the entire game.

The three levers, in order of how much they matter:

  1. Quantization. Use fewer bits per weight. Same weights, lower precision.
  2. Distillation. Train a smaller model to mimic a larger one. Different weights, fewer of them.
  3. Pruning. Delete weights that don't contribute. Same precision, fewer weights.

As a rough split, quantization is doing about 95% of the work in the 2026 open-weights ecosystem, distillation about 4%, and pruning about 1%. We'll cover all three, but quantization is where you spend your attention.

Quantization, in detail

The basic idea: take a number stored in 16 bits (range about ±65,000) and squeeze it into 4 bits (16 distinct values, typically -8 to 7, rescaled by a scale factor shared across a small block of weights). You lose precision. You don't lose much capability, because LLM weights are noisy and most of the precision was wasted.
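To make the squeeze concrete, here is a minimal sketch of symmetric 4-bit quantization on one small block of weights. It is a simplified illustration, not llama.cpp's actual K-quant layout: the function names, the block contents, and the single per-block scale are all assumptions made for the example.

```python
def quantize_block(weights, bits=4):
    """Symmetric quantization of one block: returns (scale, integer codes)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit codes
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0  # one scale shared by the block
    # Round each weight to the nearest code and clamp to the signed range.
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from the stored scale and codes."""
    return [c * scale for c in codes]


# Hypothetical block of eight weights, chosen only for illustration.
block = [0.12, -0.53, 0.07, 0.91, -0.33, 0.02, -0.78, 0.44]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("codes:", codes)
print("largest round-trip error:", round(max_error, 4))
```

The round-trip error per weight is at most half the scale, which is why small blocks with their own scale preserve more detail than one scale for the whole tensor.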

A 7B model:

  • FP16: 7B × 2 bytes = 14 GB
  • Q8: 7B × 1 byte = 7 GB
  • Q4: 7B × 0.5 bytes = 3.5 GB
  • Q2: 7B × 0.25 bytes = 1.75 GB

Same model, four sizes, four quality levels. The art is picking the right level for what you can run.
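The arithmetic above is just parameters times bits divided by eight. A short sketch, assuming nominal bit-widths and 1 GB = 10^9 bytes (the helper name `weight_size_gb` is made up for this example):

```python
def weight_size_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB, using 1 GB = 1e9 bytes."""
    return params_billions * bits_per_weight / 8


for label, bits in [("FP32", 32), ("FP16", 16), ("Q8", 8), ("Q4", 4), ("Q2", 2)]:
    small = weight_size_gb(7, bits)
    large = weight_size_gb(70, bits)
    print(f"{label:>4}: 7B = {small:6.2f} GB   70B = {large:6.1f} GB")
```

Real GGUF files come out somewhat larger than these nominal numbers, because each block also stores scale metadata and some layers are kept at higher precision.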

What "Q4_K_M" means

llama.cpp's GGUF format defines a family of quantization schemes. The naming has three parts:

  • Q[N]: roughly N bits per weight on average. Q4 = about 4 bits.
  • _K: K-quants. A smarter scheme than the original quantization. Always pick K-quants over the legacy ones.
  • _S / _M / _L: small, medium, large. Different layers get different bit-widths to preserve quality where it matters most. _M is the standard default; _L is slightly bigger and sharper; _S is slightly smaller and rougher.
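The naming scheme is regular enough to parse mechanically. The sketch below is a hypothetical helper, not part of llama.cpp; it handles only the patterns named in this post (K-quants with an optional size suffix, plus legacy names ending in _0 or _1) and rejects everything else.

```python
import re
from typing import NamedTuple, Optional


class QuantName(NamedTuple):
    bits: int            # nominal bits per weight
    k_quant: bool        # True for K-quants, False for legacy _0 / _1
    size: Optional[str]  # "S", "M", "L", or None when there is no size suffix


_QUANT_RE = re.compile(r"^Q(\d)_(K(?:_([SML]))?|[01])$")


def parse_quant(name: str) -> QuantName:
    """Split a quant tag such as 'Q4_K_M' into its three parts."""
    match = _QUANT_RE.match(name.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized quant name: {name!r}")
    bits, scheme, size = match.groups()
    return QuantName(int(bits), scheme.startswith("K"), size)


for tag in ["Q4_K_M", "Q5_K_M", "Q8_0", "Q3_K_S", "Q2_K"]:
    print(tag, "->", parse_quant(tag))
```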

The practical defaults:

  • Q4_K_M. The community standard. Good quality, small file. 90% of local users want this.
  • Q5_K_M. When you have spare VRAM and want a small quality bump.
  • Q8_0. When you have lots of VRAM and want near-perfect quality. Almost identical to FP16.
  • Q3_K_S or Q2_K. Last-resort sizes when the next quant up doesn't fit. Quality drops noticeably.

What you actually lose

Quality loss with quantization is real but smaller than you'd guess. Rough rules:

  • Q8 vs FP16: indistinguishable in chat. Benchmarks within margin of error.
  • Q5 vs FP16: tiny degradation. A retry usually fixes the occasional miss on tricky prompts.
  • Q4 vs FP16: noticeable on hard reasoning, invisible on chat and drafting.
  • Q3 vs FP16: feels dumber. Wrong answers on math and code go up.
  • Q2 vs FP16: substantial quality loss. The model still talks, but it's worse at everything.

The practical sweet spot is Q4_K_M. The community converged on it because it's where the curve goes flat: smaller files cost a lot of quality, bigger files cost a lot of memory for not much quality gain.

KV cache quantization

Bonus lever: you can also quantize the KV cache, not just the weights. Post 4 covers this in detail, but it's worth flagging now: a Q8 KV cache takes roughly half the memory of the default 16-bit cache, with almost no quality cost. Most local-LLM beginners leave this off and run out of memory at long contexts they could have run.
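A back-of-the-envelope calculation shows where the halving comes from. The cache stores one key and one value vector per layer per token, so its size scales with bytes per value. The model shape below (32 layers, 8 KV heads, head dimension 128) is an assumed example roughly matching an 8B model with grouped-query attention, and the 1 byte per value for Q8 ignores the small per-block scale overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value):
    """Size of the KV cache: keys and values for every layer and token."""
    # The leading 2 accounts for one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


GIB = 1024 ** 3

# Assumed model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

fp16_cache = kv_cache_bytes(**shape, context_tokens=32_768, bytes_per_value=2)
q8_cache = kv_cache_bytes(**shape, context_tokens=32_768, bytes_per_value=1)

print(f"FP16 KV cache at 32k context: {fp16_cache / GIB:.1f} GiB")
print(f"Q8 KV cache at 32k context:   {q8_cache / GIB:.1f} GiB")
```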

Distillation, in detail

A different approach. Instead of compressing a big model, you train a small model to behave like the big one.

The mechanic: run a very large set of prompts through the teacher (a large, slow, expensive model). Record its outputs. Train a smaller student model to produce the same outputs. The student ends up with a fraction of the parameters but most of the teacher's behavior.
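The paragraph above describes training on the teacher's recorded outputs. A common variant, when you have access to the teacher's logits, trains the student to match the teacher's full next-token probability distribution instead. The sketch below shows that variant's loss for a single token position in plain Python; the function names, the toy logits, and the temperature of 2.0 are assumptions for illustration, and the usual temperature-squared scaling factor is omitted for brevity.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over softened next-token distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]
bad_student = [0.5, 1.0, 4.0]

print("close student loss:", round(distillation_loss(teacher, good_student), 4))
print("far student loss:  ", round(distillation_loss(teacher, bad_student), 4))
```

Training minimizes this loss over many positions and prompts, so the student learns not just the teacher's top answer but how it ranks the alternatives.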

Where you see this in 2026:

  • DeepSeek-R1-Distill-Llama-70B / DeepSeek-R1-Distill-Qwen-32B. Existing Llama and Qwen models fine-tuned on DeepSeek R1's reasoning traces. Reasoning capability that punches well above the parameter count.
  • Phi family (Microsoft). Small models trained on synthetic data generated by larger models. Phi-3 mini at 3.8B parameters competes with much larger models on many tasks.
  • Gemma 2 (Google). The smaller variants were trained with knowledge distillation from a larger teacher model. Punches above its weight class.

Distillation is what made the 2026 small-model renaissance possible. A 3B model in 2024 was useful for autocomplete; a 3B model in 2026, distilled from a frontier teacher, is useful for real coding. The size didn't change. The training data did.

What you give up: you can't distill capabilities the teacher didn't have, and the small model often loses some long-context coherence the teacher had. But for the size, distilled small models are extraordinary.

Pruning, in detail

The least-used of the three. The idea: identify weights in the model that aren't doing useful work, and delete them.

There are two flavors:

  • Unstructured pruning. Set individual weights to zero based on magnitude. Doesn't shrink the file, but creates sparsity that some hardware can exploit. Mostly research.
  • Structured pruning. Remove whole rows/columns of weight matrices, or whole attention heads, or whole layers. The file actually gets smaller. Risky: deleting the wrong head can tank accuracy.
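The difference between the two flavors is easiest to see on a tiny matrix. This is a minimal sketch using magnitude as the importance score, which is only one of several criteria used in practice; the helper names and the example matrix are made up for illustration.

```python
def prune_unstructured(matrix, sparsity):
    """Zero the smallest-magnitude weights. The shape does not change."""
    magnitudes = sorted(abs(w) for row in matrix for w in row)
    n_drop = int(len(magnitudes) * sparsity)
    if n_drop == 0:
        return [row[:] for row in matrix]
    threshold = magnitudes[n_drop - 1]
    # Ties at the threshold are all zeroed, so slightly more may be dropped.
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]


def prune_structured(matrix, keep_rows):
    """Keep only the rows with the largest L2 norm. The matrix shrinks."""
    ranked = sorted(
        range(len(matrix)),
        key=lambda i: sum(w * w for w in matrix[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep_rows])  # preserve original row order
    return [matrix[i][:] for i in kept]


weights = [
    [0.90, -0.05, 0.40, 0.03],
    [0.01, 0.02, -0.04, 0.06],
    [-0.70, 0.60, 0.08, -0.50],
]

sparse = prune_unstructured(weights, sparsity=0.5)
smaller = prune_structured(weights, keep_rows=2)

print("unstructured:", sparse)   # same 3x4 shape, half the entries zeroed
print("structured:  ", smaller)  # 2x4, the weakest row removed
```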

In practice, pruning hasn't taken off in the open-weights ecosystem. The math is harder than quantization and the gains are smaller. You'll see "pruned" tags occasionally on Hugging Face, but they're a footnote.

If you're picking a model, you don't need to think about pruning. If you're researching efficient inference, you do.

Combining the three

The three techniques compose. A 2026 production pipeline often:

  1. Trains a giant teacher model (Llama 3.1 405B or DeepSeek V4).
  2. Distills to a smaller student (Llama 3.2 3B).
  3. Quantizes the student (Q4_K_M GGUF).
  4. Optionally prunes (rare).

The end product on your laptop is a 2GB GGUF file that traces back to a frontier teacher. Every step lost something; together they got you a working model that fits.

Picking a quantization for your hardware

Practical algorithm:

  1. Start with Q4_K_M. The community default exists for a reason.
  2. If the model file plus 1–2 GB headroom (KV cache, runtime overhead) fits comfortably in your VRAM, try Q5_K_M for a small quality bump.
  3. If it doesn't fit at Q4, drop to Q3_K_S. If that doesn't fit either, you need a smaller model, not a smaller quant.
  4. Q8 only when you have huge VRAM headroom. The quality gain over Q5 is usually invisible.

Don't agonize. Pick Q4_K_M, run for a week, switch only if you have a specific complaint.
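The decision steps above can be sketched as a small function. The bits-per-weight figures are rough assumed values for illustration (real files vary by model and llama.cpp version), the 2 GB default headroom comes from the 1–2 GB range in step 2, and Q8 is left out because the post recommends it only with very large headroom.

```python
# Rough effective bits per weight; assumed values, not measured from real files.
BITS_PER_WEIGHT = {"Q3_K_S": 3.5, "Q4_K_M": 4.8, "Q5_K_M": 5.7}


def file_size_gb(params_billions, quant):
    """Approximate GGUF size in GB for a given quant."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8


def pick_quant(params_billions, vram_gb, headroom_gb=2.0):
    """Return the largest quant that fits, or None if a smaller model is needed."""
    budget = vram_gb - headroom_gb  # leave room for KV cache and runtime overhead
    for quant in ("Q5_K_M", "Q4_K_M", "Q3_K_S"):
        if file_size_gb(params_billions, quant) <= budget:
            return quant
    return None


for params, vram in [(8, 24), (32, 24), (13, 8), (70, 24)]:
    choice = pick_quant(params, vram) or "pick a smaller model"
    print(f"{params}B on {vram} GB: {choice}")
```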

What's next

Now you understand how the model file got small. The next post is what happens when you actually ask it to generate text: streaming, throughput, time-to-first-token, and the KV cache that makes the second token cheap.

From the dictionary

Terms used in this post

Quick reference for the 18 terms you met above. Each one comes from the AI dictionary.

Algorithm (ML)
In ML, the recipe used to turn data into a model — the architecture plus the training procedure. Different algorithms (decision trees, gradient-boosted trees, neural networks, transformers) produce different model types.
Attention (DL)
The mechanism inside a transformer that lets the model look at every previous token in the sequence and weigh its relevance. Attention scales quadratically with sequence length, which is why long context is expensive.
Context Window (NLP)
The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window — anything outside is invisible.
Dataset (Data)
The collection of examples a model learns from during training. The shape, size, quality, and bias of the dataset determines almost everything about the resulting model.
Gemini (AI)
Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps.
e.g. Gemini is Google's answer to ChatGPT, with native access to Search.
GGUF (ML)
GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
Inference (ML)
Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
llama.cpp (AI)
A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language Model (AI)
A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
Loss (ML)
A number that measures how wrong the model's prediction is, compared to the truth. Training is the process of changing weights so this number goes down.
Model (ML)
In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Parameters (ML)
The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
Prompt (NLP)
The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
Quantization (ML)
Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
Token (NLP)
The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
Training (ML)
The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
VRAM (General)
Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
Weights (ML)
The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
