The context window, and why models hallucinate

An LLM only sees a fixed-size slice of text at a time. When it doesn't know something, it predicts anyway — that's a hallucination, not a bug.

Post 5 ended on a constraint: an LLM can only see so much at a time. That constraint has a name (the context window), and its size is the single most important number to understand about any LLM you use.

This is post 6 of 8 in the Foundations series.

What the context window is

[Figure: the context window split into system prompt, chat history, retrieved context, question, and response headroom]

The context window is the maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation (system prompt, chat history, the document you pasted, your last question) has to fit inside that window, along with headroom for the response. Anything outside is invisible.

Think of it as the model's working memory. Long-term memory lives in the trained weights (what the model learned during training). Short-term memory is the context window (what it can see right now).
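The "everything has to fit" rule can be sketched as a simple budget check. This is a minimal illustration, not any provider's API: the window size, the headroom, and the token counts per part are all made-up numbers, and `fits_in_window` and `overflow` are hypothetical helper names.

```python
# Hypothetical numbers throughout: the window size and token counts are
# illustrative, not the limits of any particular model.
CONTEXT_WINDOW = 200_000     # assumed window size, in tokens
RESPONSE_HEADROOM = 4_000    # assumed tokens reserved for the model's reply


def fits_in_window(parts: dict[str, int],
                   window: int = CONTEXT_WINDOW,
                   headroom: int = RESPONSE_HEADROOM) -> bool:
    """Return True if every part plus the reply headroom fits in the window."""
    return sum(parts.values()) + headroom <= window


def overflow(parts: dict[str, int],
             window: int = CONTEXT_WINDOW,
             headroom: int = RESPONSE_HEADROOM) -> int:
    """Number of tokens that do not fit (0 if everything fits)."""
    return max(0, sum(parts.values()) + headroom - window)


request = {
    "system_prompt": 800,
    "chat_history": 12_000,
    "pasted_document": 150_000,
    "question": 60,
}
print(fits_in_window(request))            # True
print(overflow({"huge_dump": 250_000}))   # 54000
```

Whatever ends up in the overflow is what the model never sees, so something has to be dropped, shortened, or summarised before the request is sent.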

Numbers as of early 2026: limits range from about 8K tokens (small open models) to 1M+ (frontier closed models).

A token is roughly 4 characters of English. So 200K tokens is about 150,000 words, roughly a long novel. 1M tokens is several novels.
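The arithmetic behind those estimates, as a rough sketch. The 4-characters-per-token figure is the rule of thumb from the text; the 0.75-words-per-token ratio is an assumption derived from it (an English word plus its space is a little over 5 characters). Real tokenizers vary by model and language, so treat this as a back-of-envelope estimate only.

```python
import math

CHARS_PER_TOKEN = 4       # rule of thumb for English text
WORDS_PER_TOKEN = 0.75    # assumption: ~5.3 characters per word, space included


def estimate_tokens(text: str) -> int:
    """Rough token count for English text, from its length in characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def tokens_to_words(tokens: int) -> int:
    """Rough number of English words that fit in a given token budget."""
    return round(tokens * WORDS_PER_TOKEN)


print(tokens_to_words(200_000))     # 150000
print(tokens_to_words(1_000_000))   # 750000
print(estimate_tokens("hello world!"))  # 3
```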

What "long context" actually costs

Longer windows are not free. In its simple form, the transformer's attention mechanism scales quadratically with sequence length. Doubling the context doesn't double the attention cost; it roughly quadruples it.
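To see the quadratic growth concretely, here is a small sketch that counts the token pairs naive full attention has to score. It models only the simple version described above; real systems use optimisations that change the constants, and the function names are hypothetical.

```python
def attention_pairs(n_tokens: int) -> int:
    """Token pairs scored by naive full self-attention: n * n."""
    return n_tokens * n_tokens


def relative_cost(n_tokens: int, baseline: int) -> float:
    """Attention work at n_tokens, relative to a baseline context length."""
    return attention_pairs(n_tokens) / attention_pairs(baseline)


for n in (2_000, 4_000, 8_000, 200_000):
    print(f"{n:>7} tokens -> {relative_cost(n, 2_000):>6.0f}x the attention work of 2K")
```

Going from 2K to 4K tokens gives 4x the work, and going from 2K to 200K gives 10,000x, which is why providers do not run the simple version at long lengths.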

Providers have engineered around the worst of this with sparse attention and caching tricks, but the bill still climbs:

  • Input is billed per token, so sending 200K input tokens to a frontier model costs about 100 times as much as sending 2K, and you pay that on every request.
  • Responses get slower as the context grows. The first token after a 500K-token prompt takes seconds, not milliseconds.
  • Quality also degrades past a certain length. Models retrieve facts most reliably near the start and end of a long context, roughly the first and last 10%. The middle is where they miss things, the so-called "lost in the middle" effect.
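The cost point in the list above is plain multiplication, sketched here. The price is a hypothetical placeholder, not a quote from any provider, and `input_cost` is an illustrative helper.

```python
# Hypothetical price, for illustration only; check your provider's price list.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD


def input_cost(tokens: int, requests: int = 1,
               price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input cost in USD for `requests` calls that each send `tokens` tokens."""
    return tokens * requests * price_per_million / 1_000_000


small = input_cost(2_000, requests=1_000)
large = input_cost(200_000, requests=1_000)
print(f"2K-token prompts:   ${small:,.2f} per 1,000 requests")
print(f"200K-token prompts: ${large:,.2f} per 1,000 requests")
print(f"ratio: {large / small:.0f}x")
```

The ratio stays at 100x whatever the actual price is, because the cost is linear in the number of input tokens.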

More context isn't always better. The right context, ordered well, beats a dump of everything you have.
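One way to act on "ordered well" is to keep the strongest material away from the middle. The sketch below is a heuristic under assumptions, not a guaranteed fix: it assumes you already have a relevance score per chunk (how you get one is out of scope here), and `order_for_edges` is a hypothetical name. How much it helps depends on the model.

```python
def order_for_edges(chunks: list[tuple[str, float]]) -> list[str]:
    """Place the highest-scoring chunks at the start and end of the context.

    `chunks` is a list of (text, relevance_score) pairs. Chunks are ranked
    by score, then dealt alternately to the front and the back, so the
    weakest chunks end up in the middle.
    """
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)
    front: list[str] = []
    back: list[str] = []
    for i, (text, _score) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]


chunks = [("A", 0.9), ("B", 0.8), ("C", 0.5), ("D", 0.2), ("E", 0.1)]
print(order_for_edges(chunks))   # ['A', 'C', 'E', 'D', 'B']
```

The best chunk opens the context, the second best closes it, and the least relevant one sits in the middle where a miss costs the least.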

Why models hallucinate

This is where the next-token-prediction picture from post 5 pays off. An LLM is, mechanically, a function that produces a probability distribution over the next token. It always produces one. It cannot output "I don't know" unless it has been trained to recognise the situations where "I don't know" is the most likely next token.
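A toy version of that function makes the point. The four-entry vocabulary and the scores are invented for illustration, and treating "I don't know" as a single token is a simplification; a real model works over tens of thousands of tokens. The mechanism is the same: softmax always returns a distribution, and something always has the highest probability.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary and nearly flat scores: the model has no real evidence
# for any option, yet a distribution still comes out.
vocab = ["Paris", "London", "Smith", "I don't know"]
logits = [0.2, 0.1, 0.3, 0.0]

probs = softmax(logits)
best = vocab[probs.index(max(probs))]

print([round(p, 3) for p in probs])
print(best)   # Smith
```

"Smith" wins with barely more than a quarter of the probability mass. Nothing in the mechanism flags that as a weak guess; "I don't know" only wins if training pushed its score above the others.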

When you ask a model about something that genuinely isn't in its training data and isn't in its current context, the math doesn't stop. It picks the most plausible-looking continuation. That continuation is a hallucination: output that looks like an answer but isn't grounded in anything real.

Hallucinations are not bugs. They are the model doing exactly what it was trained to do, applied to a question where the right answer was never available. The fix is not "make the model better at admitting ignorance" alone. It is putting the right information into the context window so the model has something true to predict from.

Which hints at the practical pattern: if the model needs to answer questions about your company's docs, your codebase, or last week's news, you have to get those tokens into the window. That's where retrieval comes in.

Things this explains

  • Why ChatGPT can't tell you the date. It's not in the context unless the system prompt put it there.
  • Why a model confidently invents a paper citation that doesn't exist. The shape "Smith et al., 2021" is a high-probability continuation in academic-flavoured text, regardless of whether that specific paper exists.
  • Why pasting a long PDF and asking a question often works better than asking the model from memory. You moved the answer from "hopefully the model remembers" to "definitely in the context".
  • Why your AI coding assistant sometimes calls a function that doesn't exist in your codebase. It's hallucinating a plausible-sounding API.
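The date and pasted-PDF cases above both come down to what gets assembled into the prompt. Here is a minimal sketch of that assembly, with an invented prompt layout and a hypothetical `build_prompt` helper; it does not call any model and is not a specific provider's format.

```python
from datetime import date


def build_prompt(question: str, documents: list[str], today: date) -> str:
    """Assemble a prompt that carries the date and the source text with it."""
    context = "\n\n".join(
        f"[doc {i + 1}]\n{doc}" for i, doc in enumerate(documents)
    )
    return (
        f"Today's date is {today.isoformat()}.\n"
        "Answer using only the documents below. "
        "If the answer is not in them, say you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    question="When does the warranty expire?",
    documents=["The warranty runs for 24 months from the purchase date."],
    today=date(2026, 1, 15),
)
print(prompt)
```

The model can now read the date and the warranty terms directly instead of having to guess at them.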

What to take away

  • The context window is the model's working memory, measured in tokens. Everything outside is invisible.
  • 2026 limits range from 8K (small open models) to 1M+ (frontier closed models). 200K is comfortable for most real work.
  • Long context costs proportionally more, gets slower, and degrades quality in the middle. Right-sized context beats stuffed context.
  • Hallucinations are the model doing its job (predict the next token) on a question where the answer wasn't available. The fix is changing what's in the window, not begging the model to be careful.

If the answer to most LLM problems is "put the right tokens in the window", the obvious next question is: how do you choose them? That's RAG. Post 7.

From the dictionary

Terms used in this post

Quick reference for the 18 terms you met above. Each one comes from the AI dictionary.

Artificial Intelligence [AI]
Umbrella term for software that performs tasks usually associated with human reasoning: language, perception, decision-making. Coined by John McCarthy in the 1955 proposal for the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
Attention [DL]
The mechanism inside a transformer that lets the model look at every previous token in the sequence and weigh its relevance. Attention scales quadratically with sequence length, which is why long context is expensive.
ChatGPT [AI]
OpenAI's consumer chat product, launched November 30, 2022. The first LLM product to reach mass adoption, with a reported 100 million users in two months. The product most people mean when they say "AI" today.
Claude [AI]
Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.
e.g. This blog's create-post skill drafts inline using Claude.
Context Window [NLP]
The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
Dataset [Data]
The collection of examples a model learns from during training. The shape, size, quality, and bias of the dataset determine almost everything about the resulting model.
Gemini [AI]
Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps.
e.g. Gemini is Google's answer to ChatGPT, with native access to Search.
GPT [AI]
OpenAI's family of large language models; the name stands for Generative Pre-trained Transformer. GPT-4 (2023) and its successors are among the most widely used closed-source LLMs in production.
Hallucination [NLP]
When an LLM produces output that looks like an answer but isn't grounded in anything real. Not a bug: it is the consequence of next-token prediction applied to questions where the right answer wasn't available.
Large Language Model [AI]
A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
e.g. Claude is an LLM: it reads your message as tokens and generates a response one token at a time.
Model [ML]
In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Prompt [NLP]
The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
RAG [NLP]
Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLM's context window, then ask the question. The model's weights are unchanged; only the prompt is augmented.
System Prompt [NLP]
A special instruction at the start of an LLM conversation that sets role, behaviour, format, and constraints. Most "the model isn't doing what I want" problems are solved here, before reaching for RAG or fine-tuning.
Token [NLP]
The unit an LLM operates on, roughly a word or a piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
Training [ML]
The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training costs $100M+ and uses tens of thousands of GPUs.
Transformer [DL]
The neural-network architecture every modern LLM is built on. Introduced by Google in the 2017 paper "Attention Is All You Need". GPT, Claude, Gemini, Llama, Mistral: all transformers.
Weights [ML]
The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.