April 7, 20266 min read

The local-LLM vocabulary

Parameters, B, dense vs MoE, base vs instruct, tokens, context windows, chat templates, GGUF, and quant suffixes. Read it once and any HuggingFace model card stops being scary.

The first time I opened a HuggingFace model card I bounced right off it. Walls of labels, suffixes, numbers with letters stuck on the end. But once you know maybe a dozen words, every card reads the same way: a few labels, a few numbers, and the rest is marketing. This post is those labels and numbers.

This is post 2 of 13 in the Local LLMs series. By the end, "Llama-3.1-8B-Instruct-Q4_K_M.gguf" will be a sentence you can read.

Parameters and "B"

A parameter is one number stored inside the model. A weight is the same idea, same family. A modern LLM has billions of them, which is where "8B" (8 billion) and "70B" (70 billion) come from. That count tells you two practical things.

First, how much disk and memory it needs. At 16-bit precision it's parameters times 2 bytes, so 8B at FP16 is roughly 16GB on disk. At 4-bit quantization it's parameters times 0.5 bytes, about 4GB.

Second, how smart it is, very roughly. A 70B model has about 9x more parameters than an 8B, so more room to encode patterns. The catch is that the ratio isn't linear with intelligence. A well-trained 8B beats a poorly-trained 70B. But inside the same model family, bigger is sharper.

So when you see "Llama 3.2 3B," the 3B is the parameter count. That's the headline number. Everything else is detail.

Dense vs Mixture-of-Experts

This is the newer split, and it trips people up. Models come in two architectures.

Dense means every parameter fires on every token. Llama 3.x 8B is dense, so all 8B parameters get computed for every token you generate. Simple, predictable, runs anywhere.

Mixture-of-Experts (MoE) is different. The model holds many "experts," each one a sub-network, and a router picks a small subset for each token. Total parameters can be enormous, but only a fraction (the "active" parameters) actually computes per token.

That's why you'll see two numbers on MoE models. DeepSeek V4: 1.6T total, 49B active. Llama 4 Maverick: 400B, 17B. Qwen 3.5 235B-A22B: 235B total, 22B active. The first number is your file size. The second is the cost per token.

For running things locally, dense is usually what you want. MoE files are huge to download and need a lot of RAM and VRAM just to load, even though the inference itself is cheap. A 14B dense model fits easily on a 16GB GPU. A 200B MoE doesn't, no matter how few active parameters it claims.

Base vs instruct

Every released model exists in two versions, and picking the wrong one is a classic rookie mistake.

Base is trained only on next-token prediction. It completes text, nothing more. Ask it "What is 2+2?" and it might fire back another question, or just continue with the next sentence from some textbook. Base models are for research and fine-tuning, not chat.

Instruct (sometimes "Chat" or "IT") is the base model trained further to follow instructions and hold a conversation. This is what you want 99% of the time.

The filename usually tells you: Llama-3.1-8B-Instruct, Qwen2.5-3B-Instruct. No suffix? Check the card. Reaching for a base model when you wanted instruct is the most common "why is this thing ignoring my prompt" mistake new users run into.

Tokens and the tokenizer

A token is the unit a model reads and writes. Not a character, not a word, something in between. For English, the rough conversions are worth memorizing:

1 token ≈ 4 characters
1 token ≈ 0.75 of a word
1,000 tokens ≈ 750 words ≈ a page
1M tokens ≈ a 750-page book

The thing doing the chopping is the model's tokenizer. Different models use different ones (BPE, SentencePiece, tiktoken), so the same input text can come out to different token counts. Hindi and Devanagari scripts often tokenize 2 to 3x worse than English on most tokenizers, which means the same Hindi paragraph costs more tokens than its English twin.

When you compare bills, context lengths, or output speeds, the number that matters is always tokens. Never characters, never words.

Context window

The context window is the maximum number of tokens the model can see at once. Think of it as the model's working memory for a single conversation.

Common sizes in 2026:

4K (4,096 tokens). Old. Fits a few messages.
8K. Llama 3 base. Workable for short chat.
32K. Comfortable for medium documents.
128K. The 2026 default for serious open models.
1M. Llama 4 Scout, Gemini 3.1 Pro. Whole codebases.
10M. Llama 4 Maverick experimental. A small library.

Bigger context costs memory (more on this when we hit the KV cache in a later post). On local hardware you usually pick a context that fits your VRAM, not the model's stated maximum.

Chat template

You don't usually write these by hand. Ollama, LM Studio, and llama.cpp's server endpoint handle the template for you. But "wrong chat template" is the second most common "why is this not working," and it's usually someone running a base model with an instruct template, or the other way around.

Quantization (preview)

This is the shorthand stuck on the end of every GGUF filename: Q4_K_M, Q5_K_S, Q8_0. The "Q" plus a number is the quantization level, meaning bits per parameter.

Q8: 8 bits per parameter. Almost no quality loss vs full precision.
Q5/Q4: 5 or 4 bits. The sweet spot for local. Tiny quality drop.
Q3/Q2: 3 or 2 bits. Noticeably degraded. Only when you must.

The next post is entirely about quantization, plus its cousins distillation and pruning. For now, read the suffix as "how hard the file was compressed," and reach for Q4_K_M as a safe default.

GGUF

GGUF is the file format that holds the model. One file with the weights, the tokenizer, the metadata, the chat template, all of it. Fully self-contained. You could email a model to a friend, assuming you've got gigabit upload.

It's a llama.cpp invention, and almost every local-LLM tool reads it. When you run ollama pull on a model, the file sitting behind the scenes is a GGUF.

Reading a model name

GGUF filename anatomy

Now put it together: Llama-3.1-8B-Instruct-Q4_K_M.gguf

Llama: model family (Meta).
3.1: version.
8B: 8 billion parameters, dense.
Instruct: instruction-tuned, chat-ready.
Q4_K_M: 4-bit quantized, K-quants, M (medium) variant. A good default.
.gguf: the file format.

Translation: "Meta's Llama 3.1, the 8-billion-parameter dense version, fine-tuned for chat, compressed to 4 bits using the medium-quality K-quant scheme, in the standard local-LLM file format."

That one sentence is the entire metadata. Once you can read it, you can shop on HuggingFace without flinching.

A few more terms you'll see

Context length / n_ctx. The runtime knob that controls how big a context window you actually allocate. Often smaller than the model's maximum, to save memory.
Temperature. A dial from 0 to 2 that controls randomness. 0 is deterministic, 1 is default, 2 is chaos. Default 0.7 is fine for most chat.
System prompt. Instructions you send before the conversation starts that shape how the model behaves. "You are a helpful coding assistant who answers in Python."
Stop tokens. Tokens that halt output when generated. The chat template defines them. Get this wrong and a model sometimes runs past its turn and starts writing a fake user message.
Embedding. A different kind of model. It turns text into a vector of numbers for similarity search. We get to these in post 10 (RAG).

What's next

The words are sorted. Next post zooms into the part most local-LLM users hand-wave straight past: how a 70B model file gets small enough to sit on a 24GB GPU. Three techniques (quantization, distillation, pruning), three trade-offs.

AI GGUF LLM Local Llms Vocabulary

From the dictionary

Terms used in this post

Quick reference for the 22 terms you met above. Each one comes from the AI dictionary.

Context WindowNLP: The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window — anything outside is invisible.
EmbeddingNLP: A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
Fine-TuningML: Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
GeminiAI: Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps.; e.g. Gemini is Google's answer to ChatGPT, with native access to Search.
GGUFML: GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
GPUGeneral: A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s — GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
InferenceML: Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
llama.cppAI: A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language ModelAI: A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.; e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
LM StudioAI: A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
LossML: A number that measures how wrong the models prediction is, compared to the truth. Training is the process of changing weights so this number goes down.
ModelML: In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Next-Token PredictionNLP: The training objective of every modern LLM: given a sequence of tokens so far, predict the most likely next token. Run this in a loop and you get ChatGPT.
OllamaAI: A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
ParametersML: The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
PromptNLP: The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
QuantizationML: Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
RAGNLP: Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLMs context window, then ask the question. The models weights are unchanged; only the prompt is augmented.
System PromptNLP: A special instruction at the start of an LLM conversation that sets role, behaviour, format, and constraints. Most "the model isnt doing what I want" problems are solved here, before reaching for RAG or fine-tuning.
TokenNLP: The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
VRAMGeneral: Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
WeightsML: The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.

Rate this article

How helpful did you find this?

Series

Local Llms

2 / 13 posts

Browse all in Local Llms →

Newsletter

Get new articles in your inbox

AI engineering, LLM systems, and software architecture — no filler.

No spam. Unsubscribe any time.

Discussion

Comments

Leave a note about the article, architecture choices, or what you would build next.

Loading comments...

On this page

From the dictionary

Terms in this post

Context WindowNLP: The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window — anything outside is invisible.
EmbeddingNLP: A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
Fine-TuningML: Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
GeminiAI: Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps.
GGUFML: GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
GPUGeneral: A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s — GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
InferenceML: Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
llama.cppAI: A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language ModelAI: A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
LM StudioAI: A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
LossML: A number that measures how wrong the models prediction is, compared to the truth. Training is the process of changing weights so this number goes down.
ModelML: In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Next-Token PredictionNLP: The training objective of every modern LLM: given a sequence of tokens so far, predict the most likely next token. Run this in a loop and you get ChatGPT.
OllamaAI: A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
ParametersML: The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
PromptNLP: The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
QuantizationML: Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
RAGNLP: Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLMs context window, then ask the question. The models weights are unchanged; only the prompt is augmented.
System PromptNLP: A special instruction at the start of an LLM conversation that sets role, behaviour, format, and constraints. Most "the model isnt doing what I want" problems are solved here, before reaching for RAG or fine-tuning.
TokenNLP: The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
VRAMGeneral: Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
WeightsML: The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.

Series

Local Llms

2 / 13 posts

Browse all in Local Llms →