Streaming, throughput, and the KV cache
TTFT vs tok/s, why streaming feels faster, and the KV cache that makes the 1000th token cost the same as the first. KV cache quantization (Q8/Q4 KV) and why it should be your default.

When you watch a chat model type out a response one token at a time, that's not a UI animation. The model literally generates one token, then runs again to generate the next, then runs again. Streaming is just the choice to show you each token as it lands.
This is post 4 of 13 in the Local LLMs series. After this one, you'll know what time-to-first-token means, why streaming feels faster, and the single optimization (the KV cache) that makes generating a 1,000-token response cheaper than running the model 1,000 times.
How an LLM produces output

Inference happens in two phases:
- Prefill. The model reads your entire prompt at once. All the input tokens go through every layer of the model in parallel. The GPU eats this for breakfast: matrix multiplies are exactly what GPUs are for.
- Decode. The model generates one token. Then to generate the next token, it needs to "see" everything so far (your prompt plus the tokens it just wrote). Then again, and again, until it produces a stop token or hits a length limit.
Prefill is fast: thousands of tokens per second on consumer hardware, because it's parallel. Decode is slow: 30–120 tok/s on consumer hardware, because each token depends on the previous ones, so you can't parallelize it across output tokens.
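To make the two phases concrete, here is a minimal sketch of the loop a runtime runs. The `toy_next_token` function is a made-up stand-in for a real model and `STOP` is a hypothetical stop-token id; only the shape of the loop is the point.

```python
from typing import Callable, Iterator, List

STOP = 0  # hypothetical id for the stop token


def toy_next_token(tokens: List[int]) -> int:
    """Stand-in for a real model: picks the next token from the whole sequence."""
    # Counts down from the last token so the loop eventually reaches STOP.
    return max(tokens[-1] - 1, STOP)


def generate(prompt: List[int],
             next_token: Callable[[List[int]], int],
             max_new_tokens: int = 16) -> Iterator[int]:
    """Yield new tokens one at a time, the way decode produces them."""
    tokens = list(prompt)            # prefill: the whole prompt is available at once
    for _ in range(max_new_tokens):  # decode: one model step per output token
        tok = next_token(tokens)
        if tok == STOP:
            break
        tokens.append(tok)           # the next step sees this token too
        yield tok


streamed = list(generate([7, 5], toy_next_token))
print(streamed)  # [4, 3, 2, 1]
```

Each pass through the loop needs the token from the previous pass, which is why decode can't be spread across output tokens the way prefill is spread across prompt tokens.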
Two speed metrics matter:
- Time-to-first-token (TTFT). How long from "I hit enter" to "first character appears." Dominated by prefill. For a short prompt this is milliseconds; for a 50K-token prompt it can be several seconds.
- Tokens per second (tok/s). How fast the model produces output during decode. This is the typing speed.
A reasonable Llama 3.2 3B run on a 16GB MacBook Air: TTFT around 100ms for a short prompt, decode around 35 tok/s. At roughly three-quarters of a word per token, that's about 25 words per second, several times faster than even a brisk human reader.
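If you want to measure these two numbers yourself, one way is to record when each token arrives. This is a small sketch with hypothetical helper names (`speed_metrics`, `total_time`), not part of any runtime's API; the timestamps in the demo are invented.

```python
from typing import List, Tuple


def speed_metrics(request_time: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (ttft_seconds, decode_tokens_per_second) from arrival timestamps."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode speed")
    ttft = token_times[0] - request_time
    # Decode speed counts the gaps after the first token, not the first token itself.
    decode_time = token_times[-1] - token_times[0]
    tok_per_s = (len(token_times) - 1) / decode_time
    return ttft, tok_per_s


def total_time(ttft: float, tok_per_s: float, n_tokens: int) -> float:
    """Estimated wall-clock seconds for a response of n_tokens."""
    return ttft + (n_tokens - 1) / tok_per_s


# Invented timestamps: request at t=0, then five tokens.
ttft, tps = speed_metrics(0.0, [0.125, 0.25, 0.375, 0.5, 0.625])
print(ttft, tps)  # 0.125 8.0

# The numbers from the paragraph above: 100ms TTFT, 35 tok/s, a 280-token answer.
print(round(total_time(0.1, 35, 280), 1))  # 8.1
```

Keeping the first token out of the tok/s figure separates prefill cost from decode cost, which is the whole reason there are two metrics.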
Why streaming feels faster
Streaming and one-shot output produce the same number of tokens in the same total time. But streaming starts showing output immediately, so:
- Perceived latency drops to TTFT. The user sees something happening within 100ms, even if the full response takes 8 seconds.
- The user can read while the model writes. Output and reading happen in parallel.
- Generation can be canceled mid-flight. If the answer is going wrong by token 50, the user kills it. With one-shot mode, you wait the whole 8 seconds and then realize.
The only reason to use one-shot output is when you need the full answer before doing anything (parsing JSON, running a tool call, sending the result somewhere). Otherwise, stream.
Both modes are first-class on every local runtime. Ollama returns streaming JSON by default. LM Studio's OpenAI-compatible server supports streaming via the stream: true flag. llama.cpp's llama-server accepts the same stream: true field on its OpenAI-compatible endpoints.
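On the client side, "streaming JSON" means one small JSON object per line, each carrying a fragment of text. Here is a rough sketch of consuming that shape, assuming the `response` and `done` fields that Ollama's generate endpoint emits. The sample lines are illustrative, not captured from a real server, and `iter_text` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_text(ndjson_lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from a stream of newline-delimited JSON chunks."""
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue                 # skip keep-alive blank lines
        chunk = json.loads(line)
        if chunk.get("response"):
            yield chunk["response"]  # show this fragment right away
        if chunk.get("done"):
            break                    # final chunk: stop reading


# Illustrative sample chunks.
sample = [
    '{"response": "Hello", "done": false}',
    '{"response": ", world", "done": false}',
    '{"response": "", "done": true}',
]
pieces = list(iter_text(sample))
print("".join(pieces))  # Hello, world
```

With a live server you would pass in the line iterator from your HTTP client instead of `sample`, and print each fragment as it arrives rather than joining at the end.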
The KV cache, finally
Here's the question that has been hiding under the surface. To generate token #1000, the model needs to "see" tokens 1–999. So does it re-process all of them, and then all 1,000 again to generate token #1001? That would make the total cost grow quadratically with output length.
No. It uses the KV cache.
The mechanic: every layer of a transformer has a self-attention block. Self-attention computes three things from each input token: a Query (Q), a Key (K), and a Value (V). Each new token's Q is scored against every prior token's K to decide what to "pay attention to," and those scores weight the corresponding V vectors.
The trick: K and V for past tokens never change. Once you've computed them, they're fixed. So you cache them. To generate token #1001, the model computes Q, K, and V for the newest position only, appends that K and V to the cache, and looks up K and V for everything before. The compute per new token is roughly constant, regardless of how many tokens came before.
Without the KV cache: token #1000 costs ~1000x token #1. Inference is unworkably slow. With the KV cache: token #1000 costs ~1x token #1, plus a small lookup overhead. Inference works.
Every modern LLM runtime has a KV cache. It is not optional. It is what makes long contexts and long outputs feasible.
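A toy version makes the saving visible. The sketch below is a single attention head in one layer, with made-up 2×2 projection matrices and no positional encoding, so it is nowhere near a real transformer. It only shows that caching K and V gives the same output while doing far fewer projections.

```python
import math
from typing import List

Vector = List[float]

# Hypothetical tiny projection matrices, chosen only for illustration.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.4]]
W_V = [[0.3, 0.7], [0.6, 0.2]]


def project(w: List[Vector], x: Vector) -> Vector:
    """Matrix-vector product w @ x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def attend(q: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Softmax of scaled q.k scores, used to weight the value vectors."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]


class CachedAttention:
    """Keeps K and V for every past token so each step projects one token only."""

    def __init__(self) -> None:
        self.keys: List[Vector] = []
        self.values: List[Vector] = []
        self.projections = 0  # K/V projections performed so far

    def step(self, x: Vector) -> Vector:
        self.keys.append(project(W_K, x))
        self.values.append(project(W_V, x))
        self.projections += 1
        return attend(project(W_Q, x), self.keys, self.values)


def uncached_step(xs: List[Vector]) -> Vector:
    """Recompute K and V for every token seen so far, then attend from the last."""
    keys = [project(W_K, x) for x in xs]
    values = [project(W_V, x) for x in xs]
    return attend(project(W_Q, xs[-1]), keys, values)


inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cache = CachedAttention()
uncached_projections = 0
for t in range(len(inputs)):
    cached_out = cache.step(inputs[t])
    plain_out = uncached_step(inputs[: t + 1])
    uncached_projections += t + 1
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached_out, plain_out))

print(cache.projections, uncached_projections)  # 4 10
```

Four tokens need 4 projections with the cache and 1 + 2 + 3 + 4 = 10 without it, and the gap widens with every token. The `attend` call still reads every cached entry, which is the lookup overhead mentioned above.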
What the KV cache costs
It's stored in memory. The size scales with:
- Context length (more tokens = more cached K and V).
- Model dimensions (more layers and bigger hidden size = more K and V per token).
- Batch size (each parallel request has its own cache).
A rough formula for a single sequence:
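KV cache size ≈ 2 × layers × kv_heads × head_dim × context_length × bytes_per_value
Here kv_heads × head_dim is the width of the K (or V) vector stored per token per layer. In older models it equals the hidden size; models with grouped-query attention, Llama 3.1 included, keep fewer KV heads, which shrinks the cache.
For Llama 3.1 8B (32 layers, 8 KV heads, head dimension 128) at FP16, full 128K context: 2 × 32 × 8 × 128 × 131,072 × 2 bytes, about 16 GB just for the KV cache. The weights are 16 GB; the cache can match them.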
This is why a 16GB GPU can run a quantized 8B model fine at 4K context but blows up at 128K. The KV cache, not the weights, is the wall.
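The 16 GB figure is easy to check by hand. The sketch below assumes the published Llama 3.1 8B shape (32 layers, 8 KV heads of dimension 128, thanks to grouped-query attention), so the per-token width is kv_heads × head_dim rather than the full hidden size. `kv_cache_bytes` is a hypothetical helper, not a runtime API.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_length: int, bytes_per_value: float) -> float:
    """Bytes for one sequence: K and V (the factor 2) per layer per token."""
    return 2 * layers * kv_heads * head_dim * context_length * bytes_per_value


GIB = 1024 ** 3

# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads, head dimension 128, FP16 values.
for ctx in (4096, 131072):
    size = kv_cache_bytes(32, 8, 128, ctx, bytes_per_value=2)
    print(ctx, size / GIB)
# 4096 0.5
# 131072 16.0
```

Half a gigabyte at 4K is a rounding error next to the weights; 16 GB at 128K is not.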
Practical implications:
- Pick a context length you actually need. Most chat doesn't need 128K. Set n_ctx to 8K or 16K and save gigabytes.
- Watch RAM/VRAM usage as the conversation grows. The cache grows with every turn.
- When OOM hits at "long conversations" but not at startup, the KV cache is the cause.
KV cache quantization
Same trick as quantizing weights, applied to the cache. Q8 KV cache: 1 byte per value instead of 2. Q4 KV cache: 0.5 bytes.
The savings are huge. That same Llama 3.1 8B at 128K context drops from 16 GB KV to 8 GB at Q8 KV, or 4 GB at Q4 KV.
The quality cost is small for Q8, modest for Q4. Most users should turn on Q8 KV cache by default and not think about it again. Ollama and llama.cpp both support it; LM Studio exposes it in advanced settings.
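To see why the Q8 quality cost is small, here is a sketch of the basic idea: store each value as an 8-bit integer plus one shared scale. This is a plain absmax scheme over a whole vector with made-up numbers. As far as I know, llama.cpp's q8_0 format works on small blocks with a scale per block, so treat this as the concept rather than the real layout.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric absmax quantization: floats -> int8 codes plus one scale."""
    peak = max(abs(v) for v in values)
    scale = peak / 127 if peak > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the stored codes."""
    return [c * scale for c in codes]


key_vector = [0.82, -0.31, 0.05, -1.27, 0.64, 0.0, 0.93, -0.48]  # made-up values
codes, scale = quantize_int8(key_vector)
restored = dequantize(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))

print(codes)
# Rounding to the nearest code can be off by at most half a step.
print(max_error <= scale / 2 + 1e-12)  # True
```

With 255 levels the worst-case error is half of one step, under half a percent of the largest value in the vector. Drop to 4 bits and there are only 15 levels, which is why Q4 KV costs more quality.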
# llama.cpp server with Q8 KV cache
./llama-server -m model.gguf -c 16384 --cache-type-k q8_0 --cache-type-v q8_0
Ollama's equivalent is the OLLAMA_KV_CACHE_TYPE=q8_0 environment variable, which takes effect only when flash attention is enabled (OLLAMA_FLASH_ATTENTION=1).
The first time you turn this on and notice your 8B model running at 32K context where it used to die at 16K, you'll wonder why this isn't the default everywhere.
A worked example
You're chatting with Llama 3.1 8B Q4 on a 16GB MacBook Air. Setup:
- Model weights at Q4: ~5 GB
- Context length: 8K
- KV cache at FP16, 8K context: ~1 GB
- Runtime overhead: ~0.5 GB
- Total memory: ~6.5 GB, fits comfortably.
You crank context to 32K because someone said "more context is better":
- Weights: 5 GB (unchanged)
- KV cache at FP16, 32K: ~4 GB
- Total: ~9.5 GB, still fits.
You crank to 128K:
- KV cache at FP16, 128K: ~16 GB
- Total: ~21.5 GB. Doesn't fit. OOM.
You enable Q8 KV cache and try 128K again:
- KV cache at Q8, 128K: ~8 GB
- Total: ~13.5 GB. Fits, with little to spare.
Same model, same hardware. The configuration knob decided whether 128K context was possible.
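The four scenarios above are plain arithmetic, so they can be scripted. This sketch reuses the example's round numbers (5 GB of weights, 0.5 GB overhead) and an assumed Llama 3.1 8B shape. It compares against the full 16 GB and ignores whatever the OS itself is using, so a real machine has less headroom. All function names are hypothetical.

```python
GIB = 1024 ** 3


def kv_gib(context_length: int, bytes_per_value: float,
           layers: int = 32, kv_heads: int = 8, head_dim: int = 128) -> float:
    """KV cache size in GB for one sequence (assumed Llama 3.1 8B shape)."""
    return 2 * layers * kv_heads * head_dim * context_length * bytes_per_value / GIB


def total_gib(context_length: int, bytes_per_value: float,
              weights_gib: float = 5.0, overhead_gib: float = 0.5) -> float:
    """Weights + KV cache + runtime overhead."""
    return weights_gib + kv_gib(context_length, bytes_per_value) + overhead_gib


def max_context_tokens(budget_gib: float, bytes_per_value: float,
                       weights_gib: float = 5.0, overhead_gib: float = 0.5) -> int:
    """Largest context whose KV cache fits in what the weights leave over."""
    per_token = kv_gib(1, bytes_per_value)
    return int((budget_gib - weights_gib - overhead_gib) / per_token)


scenarios = [("8K FP16", 8192, 2), ("32K FP16", 32768, 2),
             ("128K FP16", 131072, 2), ("128K Q8", 131072, 1)]
for label, ctx, bpv in scenarios:
    total = total_gib(ctx, bpv)
    print(f"{label}: {total:.1f} GB, fits in 16 GB: {total <= 16}")
# 8K FP16: 6.5 GB, fits in 16 GB: True
# 32K FP16: 9.5 GB, fits in 16 GB: True
# 128K FP16: 21.5 GB, fits in 16 GB: False
# 128K Q8: 13.5 GB, fits in 16 GB: True

print(max_context_tokens(16, 2), max_context_tokens(16, 1))  # 86016 172032
```

Read the last line as a ceiling, not a target: under these assumptions FP16 tops out around 84K tokens on this machine, and Q8 KV doubles that.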
TTFT on long prompts
Prefill is parallel but it's still proportional to prompt size. A 50K-token prompt has to flow through every layer once before the first output token can come out. On a 4090, that's around 1–2 seconds for a 7B model. On a MacBook Air, 4–8 seconds.
If your application has long prompts and needs fast first response, the workaround is prefix caching (some runtimes call it "prompt caching"): if the same long prefix appears at the start of two requests, cache it after the first prefill and skip the work the second time. Ollama supports this for repeat requests with stable system prompts. llama.cpp's --cache-reuse flag controls it.
This is the local-LLM equivalent of the prompt caching feature in cloud APIs (post 6 of the Running series, if you read that one).
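The bookkeeping behind prefix caching is small: compare the new prompt's tokens with what is already cached and prefill only the part that differs. This sketch tracks token ids only and stores no actual K/V tensors. `PrefixCache` is a hypothetical class, not any runtime's implementation.

```python
from typing import List


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Number of leading tokens the two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixCache:
    """Remembers the last prompt's tokens and reports how much prefill remains."""

    def __init__(self) -> None:
        self.cached_tokens: List[int] = []

    def tokens_to_prefill(self, prompt: List[int]) -> int:
        reused = common_prefix_len(self.cached_tokens, prompt)
        self.cached_tokens = list(prompt)  # the cache now covers this prompt
        return len(prompt) - reused


system_prompt = list(range(1000))        # stand-in for a long, stable prefix
first = system_prompt + [5001, 5002]
second = system_prompt + [6001, 6002, 6003]

cache = PrefixCache()
print(cache.tokens_to_prefill(first), cache.tokens_to_prefill(second))  # 1002 3
```

The second request prefills 3 tokens instead of 1,003. The catch is that the match has to start at token zero, so anything that changes early in the prompt (a timestamp in the system prompt, say) throws away the whole saving.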
What's next
You now understand how the model produces text and why the KV cache exists. The next post is the practical question: which model do you actually pick? Coding, chat, summarization, structured output, multimodal, embeddings: every task has a 2026 leader, and they're not all the same model.
From the dictionary
Terms used in this post
Quick reference for the 17 terms you met above. Each one comes from the AI dictionary.
- Attention (DL)
- The mechanism inside a transformer that lets the model look at every previous token in the sequence and weigh its relevance. Attention scales quadratically with sequence length, which is why long context is expensive.
- Context Window (NLP)
- The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window — anything outside is invisible.
- Embedding (NLP)
- A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
- GPU (General)
- A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s — GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
- Inference (ML)
- Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
- llama.cpp (AI)
- A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
- Large Language Model (AI)
- A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
- e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
- LM Studio (AI)
- A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
- Model (ML)
- In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
- Ollama (AI)
- A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
- Prompt (NLP)
- The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
- Prompt Caching (AI)
- Caching the model state for a stable prefix of a prompt so repeat calls skip recomputing it. Anthropic and OpenAI both expose this via API; cached tokens cost 5-10x less and have a 5-minute TTL on Anthropic. Critical for cost when you reuse system prompts or RAG context across requests.
- Quantization (ML)
- Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
- Token (NLP)
- The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
- Transformer (DL)
- The neural-network architecture every modern LLM is built on. Introduced by Google in the 2017 paper "Attention Is All You Need". GPT, Claude, Gemini, Llama, Mistral — all transformers.
- VRAM (General)
- Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
- Weights (ML)
- The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
Streaming, throughput, and the KV cache
April 14, 2026