The local-LLM vocabulary
Parameters, B, dense vs MoE, base vs instruct, tokens, context window, chat template, GGUF, quantization suffixes. After this post you can read any HuggingFace model card.

A HuggingFace model card is intimidating until you know the words. Once you do, every card reads the same way: a few labels, a few numbers, and the rest is marketing. This post covers the labels and numbers.
This is post 2 of 13 in the Local LLMs series. By the end, "Llama-3.1-8B-Instruct-Q4_K_M.gguf" will be a sentence you can read.
Parameters and "B"
A parameter is one number stored inside the model. A "weight," in the same family. A modern LLM has billions of them, hence "8B" (8 billion) or "70B" (70 billion). The number tells you two practical things:
- How much disk and memory it needs. At 16-bit precision, parameters × 2 bytes. So 8B at FP16 is roughly 16GB on disk. At 4-bit quantization, parameters × 0.5 bytes, about 4GB.
- How smart it is, very roughly. A 70B model has about 9x as many parameters as an 8B, so more capacity to encode patterns. Capability does not scale linearly with parameter count (a well-trained 8B beats a poorly-trained 70B), but within the same model family, bigger is usually sharper.
When you see "Llama 3.2 3B," the 3B is the parameter count. That's the headline number. Everything else is detail.
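The size arithmetic above can be sketched in a few lines. This is a rough estimate of the weights alone, assuming 1 GB = 1e9 bytes and ignoring file metadata and runtime overhead; the function name `weight_size_gb` is made up for illustration.

```python
def weight_size_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# Compare 16-bit and 4-bit sizes for a few common parameter counts.
for name, params in [("3B", 3), ("8B", 8), ("70B", 70)]:
    fp16 = weight_size_gb(params, 16)
    q4 = weight_size_gb(params, 4)
    print(f"{name}: ~{fp16:.0f} GB at FP16, ~{q4:.1f} GB at 4-bit")
```

Real files come out somewhat larger than this, because quantized formats store extra scale data alongside the weights.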
Dense vs Mixture-of-Experts
The newer split. Models come in two architectures:
- Dense. Every parameter fires on every token. Llama 3.x 8B is dense: all 8B parameters are computed for every token you generate. Simple, predictable, fits any runtime.
- Mixture-of-Experts (MoE). The model has many "experts," each a sub-network. A router picks a small subset for each token. Total parameters is huge, but only a fraction (the "active" parameters) computes per token.
You'll see two numbers on MoE models. DeepSeek V4: 1.6T total / 49B active. Llama 4 Maverick: 400B / 17B. Qwen 3.5 235B-A22B: 235B total, 22B active. The first number sets the file size and the memory needed to load the model; the second sets the compute cost per token.
For local use, dense is usually what you want. MoE files are huge to download and need lots of RAM/VRAM to load, even though the inference itself is cheap. A 14B dense model fits easily on a 16GB GPU; a 200B MoE doesn't, regardless of how few active parameters it has.
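To make "a router picks a small subset for each token" concrete, here is a toy top-k router. It is a sketch, not how any particular model does it: real routers are learned layers inside the network and their details vary by architecture. The scores, the expert count, and the name `top_k_route` are all made up for illustration.

```python
import math


def top_k_route(router_scores: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalise their weights to sum to 1."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    exps = {i: math.exp(router_scores[i]) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Hypothetical router scores for one token across 8 experts.
scores = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
chosen = top_k_route(scores, k=2)

print(sorted(chosen))                                    # indices of the experts that run
print(f"active fraction: {len(chosen) / len(scores):.0%}")
```

All 8 experts still have to sit in memory, which is why the total parameter count drives the download and RAM cost while only the chosen experts drive per-token compute.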
Base vs instruct
Most released models come in two versions:
- Base. Trained only on next-token prediction. It completes text. Ask it "What is 2+2?" and it might respond with another question, or with the next sentence in a textbook. Base models are for research and fine-tuning, not chat.
- Instruct (or "Chat", "IT"). The base model further trained to follow instructions and have conversations. This is what you want 99% of the time.
Filenames usually include the suffix: Llama-3.1-8B-Instruct, Qwen2.5-3B-Instruct. If a model name has no suffix, check the card. Using a base model when you wanted instruct is the most common "why is this thing not following my prompt" mistake new users hit.
Tokens and the tokenizer
A token is the unit a model reads and writes. Not a character, not a word, but a chunk in between. For English:
- 1 token ≈ 4 characters
- 1 token ≈ 0.75 of a word
- 1,000 tokens ≈ 750 words ≈ a page
- 1M tokens ≈ a 750-page book
The mapping is the model's tokenizer. Different models use different tokenizers (typically BPE-style, built with libraries such as SentencePiece or tiktoken), so the same input text produces different token counts. Hindi and other Devanagari-script text often takes 2–3x as many tokens as English on most tokenizers, so a Hindi paragraph costs more tokens than its English equivalent.
The number that matters when you compare bills, context lengths, or output speeds is always tokens, never characters or words.
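The rules of thumb above translate into a quick estimator. This is only a back-of-envelope sketch for English text; an exact count needs the model's real tokenizer. Both function names are made up for illustration.

```python
def estimate_tokens(text: str) -> int:
    """Rough English-only estimate: about 4 characters per token."""
    return max(1, round(len(text) / 4))


def estimate_tokens_from_words(word_count: int) -> int:
    """Rough English-only estimate: about 0.75 words per token."""
    return round(word_count / 0.75)


page = "word " * 750                      # a stand-in for a 750-word page
print(estimate_tokens(page))              # estimate from characters
print(estimate_tokens_from_words(750))    # estimate from words
```

The two estimates will not agree exactly, which is a useful reminder that both are approximations.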
Context window
The maximum number of tokens the model can see at once. The model's "working memory" for a single conversation.
Common sizes in 2026:
- 4K (4,096 tokens). Old. Fits a few messages.
- 8K. Llama 3 base. Workable for short chat.
- 32K. Comfortable for medium documents.
- 128K. The 2026 default for serious open models.
- 1M. Llama 4 Maverick, Gemini 3.1 Pro. Whole codebases.
- 10M. Llama 4 Scout (experimental). A small library.
Bigger context costs memory (more on this when we hit the KV cache in post 4). On local hardware, you usually pick a context that fits your VRAM, not the model's maximum.
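One consequence of a fixed window is that long conversations eventually have to drop something. The sketch below shows one possible policy, keeping the newest messages that fit. It is an assumption for illustration, not what any specific runtime does, and it uses the rough 4-characters-per-token estimate instead of a real tokenizer. The name `trim_to_context` is made up.

```python
def trim_to_context(messages: list[str], n_ctx: int, reserve_for_reply: int) -> list[str]:
    """Drop the oldest messages until the rest fit in the token budget."""
    budget = n_ctx - reserve_for_reply
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):        # walk from newest to oldest
        cost = max(1, len(msg) // 4)      # assumed ~4 characters per token
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))           # restore chronological order


history = ["a" * 400, "b" * 400, "c" * 400]   # about 100 tokens each
print(len(trim_to_context(history, n_ctx=300, reserve_for_reply=100)))
```

Reserving part of the window for the reply matters because the model's output counts against the same limit as the prompt.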
Chat template
Each instruct model is trained to expect a specific format around messages. Llama uses one wrapper (<|begin_of_text|><|start_header_id|>user<|end_header_id|>...), Mistral uses another ([INST]...[/INST]), Qwen uses a third (<|im_start|>user...<|im_end|>).
You don't usually write these by hand: Ollama, LM Studio, and llama.cpp's server endpoint handle the template for you. But "wrong chat template" is the second most common cause of "why is this not working", usually because someone is using a base model with an instruct template, or vice versa.
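To see what a runtime does on your behalf, here is a hand-rolled version of the `<|im_start|>` / `<|im_end|>` wrapper mentioned above. Treat it as a sketch: exact whitespace and special tokens vary between models, and the template shipped with the model is the authority. The function name `render_chatml` is made up for illustration.

```python
def render_chatml(messages: list[dict[str, str]]) -> str:
    """Wrap each message in <|im_start|>role ... <|im_end|> markers and
    leave an open assistant turn for the model to complete."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml([
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "What is 2+2?"},
])
print(prompt)
```

Feeding this string to a model trained on a different wrapper is exactly the "wrong chat template" failure: the markers arrive as ordinary text the model never learned to treat as turn boundaries.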
Quantization (preview)
The shorthand at the end of every GGUF filename: Q4_K_M, Q5_K_S, Q8_0. The "Q" plus a number is the quantization level: roughly the number of bits per parameter.
- Q8: 8 bits per parameter. Almost no quality loss vs full precision.
- Q5/Q4: 5 or 4 bits. The sweet spot for local. Tiny quality drop.
- Q3/Q2: 3 or 2 bits. Noticeably degraded. Use only when you must.
The next post is entirely about quantization (and its cousins, distillation and pruning). For now, treat the suffix as "how aggressively the file was compressed," and pick Q4_K_M as a safe default.
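As a preview of what "compressed to 4 bits" means, the sketch below squeezes a handful of floats into small integers with one shared scale, then restores them. This is the basic idea only. The K-quant schemes used in GGUF files are more elaborate, and the function names and sample weights here are made up for illustration.

```python
def quantize_block(values: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Map floats to signed integers that fit in `bits` bits, using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                          # 7 for 4-bit
    scale = (max(abs(v) for v in values) / qmax) or 1.0  # avoid a zero scale
    return [round(v / scale) for v in values], scale


def dequantize_block(ints: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the stored integers."""
    return [i * scale for i in ints]


weights = [0.12, -0.70, 0.33, 0.04]       # hypothetical weights
ints, scale = quantize_block(weights)
restored = dequantize_block(ints, scale)

for original, approx in zip(weights, restored):
    print(f"{original:+.2f} -> {approx:+.2f}")
```

The restored values are close but not identical to the originals. That rounding error is the quality loss, and fewer bits means coarser steps and more of it.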
GGUF
The file format that holds the model. One file with weights, tokenizer, metadata, chat template, everything. Self-contained. You can email a model to a friend (assuming gigabit upload).
GGUF is a llama.cpp invention. Almost every local-LLM tool reads it. When you ollama pull a model, the file behind the scenes is a GGUF.
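A quick way to confirm that a file really is GGUF is to look at its first bytes. The sketch below assumes the usual layout: the 4-byte magic `GGUF` followed by a little-endian 32-bit version number. It reads nothing beyond that, and the demo writes a fake 8-byte header so it runs without a real model file. The function name `read_gguf_version` is made up.

```python
import os
import struct
import tempfile


def read_gguf_version(path: str) -> int:
    """Return the GGUF version number, or raise ValueError if the magic bytes are wrong."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file (magic={magic!r})")
        (version,) = struct.unpack("<I", f.read(4))   # assumes little-endian
    return version


# Demo on a fake header: magic bytes plus version 3.
with tempfile.NamedTemporaryFile(delete=False, suffix=".gguf") as tmp:
    tmp.write(b"GGUF" + struct.pack("<I", 3))
demo_version = read_gguf_version(tmp.name)
os.remove(tmp.name)
print(demo_version)
```

The tokenizer, chat template, and other metadata sit further into the file as key-value pairs; reading those is best left to an existing GGUF library.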
Reading a model name

Now the full sentence: Llama-3.1-8B-Instruct-Q4_K_M.gguf
- Llama: model family (Meta).
- 3.1: version.
- 8B: 8 billion parameters, dense.
- Instruct: instruction-tuned, chat-ready.
- Q4_K_M: 4-bit quantized, K-quants, M (medium) variant. A good default.
- .gguf: the file format.
Translation: "Meta's Llama 3.1, the 8-billion-parameter dense version, fine-tuned for chat, compressed to 4 bits using the medium-quality K-quant scheme, in the standard local-LLM file format."
That sentence is the entire metadata. Once you can read it, you can shop on HuggingFace.
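The same reading can be done mechanically. The pattern below is a heuristic that fits the example name only. Naming is a convention and not a standard, so plenty of real filenames (Qwen2.5-3B-Instruct, for one) lay their parts out differently and will not match. The names `NAME_PATTERN` and `parse_model_name` are made up for illustration.

```python
import re

# Family-Version-Size-Variant-Quant.gguf, as in the example above.
NAME_PATTERN = re.compile(
    r"^(?P<family>[A-Za-z]+)-(?P<version>[\d.]+)-(?P<size>\d+(?:\.\d+)?B)"
    r"-(?P<variant>[A-Za-z]+)-(?P<quant>Q\d\w*)\.gguf$"
)


def parse_model_name(filename: str) -> dict[str, str]:
    """Split a model filename into its labelled parts, or raise ValueError."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognised name layout: {filename}")
    return match.groupdict()


for key, value in parse_model_name("Llama-3.1-8B-Instruct-Q4_K_M.gguf").items():
    print(f"{key:8} {value}")
```

When a name does not parse, the model card is the fallback, same as when you read it by eye.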
A few more terms you'll see
- Context length / n_ctx. The runtime parameter that controls how big a context window you actually allocate. Often smaller than the model's maximum to save memory.
- Temperature. A knob, typically from 0 to 2, that controls randomness. 0 = effectively deterministic, 1 = the model's unmodified distribution, 2 = chaos. A setting around 0.7 is fine for most chat.
- System prompt. Instructions you send before the conversation that shape how the model behaves. "You are a helpful coding assistant who answers in Python."
- Stop tokens. Tokens that, when generated, halt output. The chat template defines these. Misconfiguration is why a model sometimes runs past its turn into a fake user message.
- Embedding. A different kind of model: it turns text into a vector of numbers for similarity search. We hit these in post 10 (RAG).
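The temperature knob in the list above has a simple mechanical meaning: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The sketch below uses made-up logits for three candidate tokens and treats a temperature of 0 as "always pick the top token", which is a common convention but an assumption here.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat 0 as greedy: all probability on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                           # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]                         # hypothetical scores for three tokens
for t in (0, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 2) for p in probs])
```

Lower temperatures pile probability onto the top token; higher ones spread it out, which is where the extra randomness comes from.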
What's next
Now that the words are sorted, the next post zooms into the part most local-LLM users hand-wave: how a 70B model file gets small enough to fit on a 24GB GPU. Three techniques (quantization, distillation, pruning), three trade-offs.
From the dictionary
Terms used in this post
Quick reference for the 22 terms you met above. Each one comes from the AI dictionary.
- Context Window (NLP). The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
- Embedding (NLP). A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
- Fine-Tuning (ML). Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
- Gemini (AI). Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps. Example: Gemini is Google's answer to ChatGPT, with native access to Search.
- GGUF (ML). GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
- GPU (General). A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s: GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
- Inference (ML). Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
- llama.cpp (AI). A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
- Large Language Model (AI). A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. Example: Claude is an LLM; it reads your message as tokens and generates a response one token at a time.
- LM Studio (AI). A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
- Loss (ML). A number that measures how wrong the model's prediction is, compared to the truth. Training is the process of changing weights so this number goes down.
- Model (ML). In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
- Next-Token Prediction (NLP). The training objective of every modern LLM: given a sequence of tokens so far, predict the most likely next token. Run this in a loop and you get ChatGPT.
- Ollama (AI). A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
- Parameters (ML). The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
- Prompt (NLP). The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
- Quantization (ML). Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
- RAG (NLP). Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLM's context window, then ask the question. The model's weights are unchanged; only the prompt is augmented.
- System Prompt (NLP). A special instruction at the start of an LLM conversation that sets role, behaviour, format, and constraints. Most "the model isn't doing what I want" problems are solved here, before reaching for RAG or fine-tuning.
- Token (NLP). The unit an LLM operates on, roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
- VRAM (General). Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
- Weights (ML). The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.