← Blog

Context Window

NLP

/dictionary/context-window

Definition

The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window — anything outside is invisible.

Posts that use this term

  • Troubleshooting local LLMs and keeping up

    The catalog of common local-LLM failures: OOM, slow tok/s, garbage output, instruction drift, RAG miss, tool-call hallucination. Plus where to follow the field as it moves.

  • Local RAG and embeddings

    A complete local RAG pipeline in 30 lines: nomic-embed-text for embeddings, Chroma for the vector DB, Llama 3.2 for the chat model. Why local RAG often beats cloud RAG for personal knowledge bases.

  • Every machine can run a local LLM (here's what fits)

    Per-tier guide: 8GB integrated graphics, 16GB MacBook Air, 8/12/16/24/32GB VRAM PCs, 24/32/64/128/192GB Macs. Specific models, specific tok/s, specific configs. Every tier runs something useful.

  • Picking a local model by task

    The 2026 open leaders by task: coding (Qwen 2.5 Coder, DeepSeek-Coder), chat (Llama, Qwen, Mistral), small-model renaissance (Phi-3, Gemma 2), structured output, multimodal, embeddings.

  • Streaming, throughput, and the KV cache

    TTFT vs tok/s, why streaming feels faster, and the KV cache that makes the 1000th token cost the same as the first. KV cache quantization (Q8/Q4 KV) and why it should be your default.

  • Quantization, distillation, pruning: making models fit

    Three ways to shrink an LLM. Quantization (Q2-Q8 with K-quants in GGUF), distillation (teacher to student), pruning. Why Q4_K_M is the community default and what each lever costs.

  • The local-LLM vocabulary

    Parameters, B, dense vs MoE, base vs instruct, tokens, context window, chat template, GGUF, quantization suffixes. After this post you can read any HuggingFace model card.

  • The pitch for local LLMs in 2026

    Why every engineer should run a local LLM in 2026: privacy, zero marginal cost, lower latency, no rate limits, and offline. Even a 16GB MacBook Air runs Llama 3.2 3B at 30 tok/s.

  • What leaves your machine when you use AI

    What providers actually see, log, and retain when you call an LLM API in 2026. What 'we don't train on your data' really means, free vs paid tier differences, and when local is the only safe option.

  • LLM APIs and the economics of tokens

    How input vs output tokens are priced, why output is 5-6x more, what prompt caching saves you (10x), and the hidden costs (tokenizer drift, reasoning tokens, tool-call loops) that surprise people.

  • What it takes to run a model on your machine

    Why VRAM is the hard ceiling on local LLMs, what quantization actually does to a model file, and the practical hardware ladder from 8GB laptops to 192GB workstations.

  • The major LLMs in 2026

    A tour of the closed frontier models (Claude, GPT, Gemini) and the open weights (Llama, Qwen, DeepSeek, Mistral). What 'B' means, what each is good at, and which size to actually run.

  • Where AI actually runs: cloud, local, edge

    Where the model file actually sits when you use AI: a datacenter GPU (cloud), your own machine (local), or the device's silicon (edge). The trade-offs and how to pick.

  • Prompt, RAG, fine-tune: three ways to shape a model

    Three levers for shaping what an LLM does: prompting (ask better), RAG (give it the right context), fine-tuning (change the weights). What each costs, what each fixes, and how to pick.

  • RAG: giving a model memory it doesn't have

    RAG is the pattern of fetching relevant text from a search system and putting it in the LLM's context window before asking your question. Not magic, not fine-tuning — just better prompts.

  • The context window, and why models hallucinate

    An LLM only sees a fixed-size slice of text at a time. When it doesn't know something, it predicts anyway — that's a hallucination, not a bug.

  • From models to LLMs

    An LLM is one kind of ML model — trained on text, predicts the next token. That single trick at scale gets you ChatGPT, and also explains where it breaks.