RAG: giving a model memory it doesn't have
RAG is the pattern of fetching relevant text from a search system and putting it in the LLM's context window before asking your question. Not magic, not fine-tuning — just better prompts.

Post 6 ended with: an LLM only knows what's in its weights and what's in its context window right now. RAG is the most common way of getting the right things into that window.
This is post 7 of 8 in the Foundations series.
What RAG actually means

RAG stands for Retrieval-Augmented Generation. It's three steps:
- Retrieve. Take the user's question. Search a corpus (your docs, your codebase, your wiki) for the most relevant chunks of text.
- Augment. Stick those chunks into the LLM's prompt, alongside the question.
- Generate. Let the LLM answer, with the fetched text now visible in its context window.
That's it. There's no special model required. RAG works with the same LLM you'd use without it. The only thing that changes is what's sitting in the context window when the model answers.
The word "retrieval" makes it sound exotic. It isn't. It's a search query whose results get pasted into the prompt before the question. Every RAG system, from a 100-line script to a full enterprise stack, is a variation on that pattern.
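The three steps above can be sketched in a few lines. This is a minimal illustration, not a production design: the keyword-overlap `retrieve`, the prompt wording in `augment`, and the `llm` callable (any function that takes a prompt string and returns text) are all assumptions made for the example.

```python
from typing import Callable


def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Step 1: rank chunks by how many words they share with the question (toy search)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        corpus,
        key=lambda chunk: len(q_words & set(chunk.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def augment(question: str, chunks: list[str]) -> str:
    """Step 2: paste the retrieved chunks into the prompt, ahead of the question."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def answer(question: str, corpus: list[str], llm: Callable[[str], str], k: int = 2) -> str:
    """Step 3: the same LLM as always, just with a fuller context window."""
    return llm(augment(question, retrieve(question, corpus, k)))
```

Nothing here is specific to one model: swapping the `llm` callable changes the generator, and swapping `retrieve` changes the search, without touching the other two steps.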
How retrieval is actually done
The naive version of step 1 is keyword search. It works for some cases. The standard 2026 version is vector search using embeddings.
An embedding is a list of numbers that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings that are close together in space, even if they don't share keywords. "How do I cancel my subscription?" and "Stop my membership" embed near each other; a keyword search would miss the second one.
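To make "close together" concrete, here is a small sketch using cosine similarity, one common way to measure closeness between vectors (the post doesn't name a specific metric, so treat that choice as an assumption). The three-number vectors are hand-made for illustration; a real embedding model produces hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented vectors, NOT output from a real embedding model.
cancel_subscription = [0.9, 0.1, 0.2]  # "How do I cancel my subscription?"
stop_membership = [0.8, 0.2, 0.3]      # "Stop my membership"
pizza_recipe = [0.1, 0.9, 0.1]         # an unrelated sentence

similar = cosine_similarity(cancel_subscription, stop_membership)
unrelated = cosine_similarity(cancel_subscription, pizza_recipe)
print(similar > unrelated)  # the two cancellation phrasings score higher
```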
The pipeline:
- Once, ahead of time: split your corpus into chunks (paragraphs, sections), run each chunk through an embedding model, store the resulting vectors in a vector database (Pinecone, Weaviate, pgvector, or just a flat file for small corpora).
- At query time: embed the user's question, find the K nearest chunks in the vector store, return those chunks.
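The two phases above might look like this in miniature. One loud caveat: `embed` below is a toy stand-in that just counts words, so it does not capture meaning the way a real embedding model does. It is there only so the sketch runs without any external service; the index is the "flat file" case, a plain list in memory.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Ahead of time: embed every chunk once and keep the vectors."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def search(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """At query time: embed the question and return the K nearest chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

With a real embedding model in place of `embed`, and a vector database in place of the list, the shape of the two functions stays the same.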
Real systems usually do hybrid retrieval — vector search plus keyword search, with a re-ranker on top. Pure vector search has its own failure modes (it'll happily return semantically related chunks that don't actually answer the question). Production RAG is mostly the engineering of getting retrieval to be both relevant and precise.
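One common way to merge a vector ranking and a keyword ranking is reciprocal rank fusion (RRF). The post doesn't prescribe a merging method, so this is an assumed choice; `k=60` is a conventional default constant, and the document IDs are hypothetical. The re-ranker that would sit on top is not shown.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several best-first rankings into one.

    Each document earns 1 / (k + rank) from every ranking it appears in,
    so items that rank well in more than one list rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results from the two retrievers, best first.
vector_hits = ["doc_b", "doc_a", "doc_c"]
keyword_hits = ["doc_a", "doc_d", "doc_b"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['doc_a', 'doc_b', 'doc_d', 'doc_c']
```

`doc_a` and `doc_b` appear in both lists, so they outrank the documents that only one retriever found.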
Why RAG isn't fine-tuning
This distinction trips a lot of people up.
Fine-tuning changes the model's weights. You take an existing LLM and continue training it on your data, so the new patterns get baked into the parameters. It's expensive and slow, and the only way to undo it is to roll back to the old weights.
RAG doesn't touch the model. The weights stay frozen. On every query, the model is handed fresh text in its context, and it keeps no memory of the previous query. If you update your docs and re-index them, RAG sees the updated version on the next query. Fine-tuning would need a re-train.
The practical implication: for changing factual knowledge, RAG is almost always the right tool. Knowledge changes. Re-training is expensive. Updating an index is cheap.
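A tiny sketch of why the update is cheap: the "index" here is just a dictionary and the pricing text is made up for the example, but the point carries over. Changing the stored text changes the next prompt, and no model is involved in the update at all.

```python
def build_prompt(question: str, docs: dict[str, str]) -> str:
    """Paste every stored doc into the prompt (fine for a toy corpus)."""
    context = "\n".join(docs.values())
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical document store.
docs = {"pricing": "The Pro plan costs $10 per month."}
before = build_prompt("How much is Pro?", docs)

# The docs change: update the store, nothing else.
docs["pricing"] = "The Pro plan costs $12 per month."
after = build_prompt("How much is Pro?", docs)

print("$12" in after and "$10" not in after)  # the next query sees the new text
```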
For changing the model's style, behaviour, or domain-specific reasoning, fine-tuning is sometimes the right tool. Knowledge that's stable, reasoning patterns specific to your data, output formats — those bake well.
When RAG fails
RAG isn't a silver bullet. Three common failure modes:
The retrieval misses. If your search returns the wrong chunks, the LLM has nothing relevant to ground its answer in. It'll either say "I don't know" (if you're lucky and prompted well) or hallucinate (if you aren't). Most "my RAG system is bad" complaints are retrieval problems, not LLM problems.
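One way to tilt the odds toward "I don't know" is to refuse early when nothing relevant came back, and to tell the model what to do when the context falls short. This is a sketch under assumptions: the `min_score` threshold, the refusal wording, and the `(chunk, score)` input shape are all invented for the example, and a sensible threshold depends on your retriever.

```python
from typing import Callable, Optional

NO_ANSWER = "I don't know based on the provided documents."


def build_grounded_prompt(
    question: str, scored_chunks: list[tuple[str, float]], min_score: float = 0.3
) -> Optional[str]:
    """Return a prompt, or None when no chunk clears the relevance bar."""
    relevant = [chunk for chunk, score in scored_chunks if score >= min_score]
    if not relevant:
        return None
    context = "\n\n".join(relevant)
    return (
        "Answer using only the context below. "
        f'If the context does not contain the answer, reply exactly: "{NO_ANSWER}"\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def answer(
    question: str,
    scored_chunks: list[tuple[str, float]],
    llm: Callable[[str], str],
    min_score: float = 0.3,
) -> str:
    """Skip the LLM call entirely when retrieval found nothing usable."""
    prompt = build_grounded_prompt(question, scored_chunks, min_score)
    return NO_ANSWER if prompt is None else llm(prompt)
```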
The chunks are too small or too big. Tiny chunks lose context. Huge chunks crowd out other relevant material in the window. There's no universal right size — between 200 and 1000 tokens is typical, tuned per corpus.
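A fixed-size chunker is a reasonable starting point for that tuning. Two assumptions in this sketch: it counts whitespace-separated words as a rough proxy for tokens, and it overlaps neighbouring chunks so a sentence cut at a boundary still appears whole in one of them. Overlap is a common practice, not something the text above prescribes.

```python
def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, max_words=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Real pipelines often split on paragraph or section boundaries first and only fall back to fixed sizes for oversized sections.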
The model ignores the context. This is the lost-in-the-middle problem from post 6. If you stuff 50 chunks into the prompt and the relevant one is at chunk 25, the model might miss it. Re-rankers and ordering tricks help.
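One ordering trick, sketched under the assumption that the model attends best to the start and end of its context: place the strongest chunks at the edges of the prompt and let the weakest ones sit in the middle. The input is assumed to be sorted best first, for example by a re-ranker.

```python
def reorder_for_edges(chunks_best_first: list[str]) -> list[str]:
    """Alternate chunks between the front and the back of the prompt.

    The best chunk ends up first, the second best ends up last,
    and the lowest-ranked chunks land in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for i, chunk in enumerate(chunks_best_first):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant
print(reorder_for_edges(ranked))  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

Sending fewer chunks in the first place helps too; reordering is a mitigation, not a substitute for precise retrieval.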
A RAG system is only as good as its weakest link, and the weak link is almost always retrieval.
What to take away
- RAG is: search your corpus, paste relevant chunks into the prompt, ask the LLM. The whole pattern is at the prompt layer; the model itself doesn't change.
- Modern retrieval uses embeddings: vectors that capture meaning, so semantically similar text matches even without shared keywords.
- RAG is the right tool for knowledge that changes. Fine-tuning is the right tool for behaviour and style that should stick.
- Most RAG failures are retrieval failures. Tune retrieval before you tune anything else.
RAG is one of three ways to shape what an LLM does for you. The others are prompting (ask better) and fine-tuning (change the weights). Knowing which to reach for is the last piece, and it's the subject of post 8, which closes this series.
From the dictionary
Terms used in this post
Quick reference for the 14 terms you met above. Each one comes from the AI dictionary.
- Artificial Intelligence [AI]
  - Umbrella term for software that performs tasks usually associated with human reasoning: language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
  - e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
- Context Window [NLP]
  - The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
- Embedding [NLP]
  - A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
- Fine-Tuning [ML]
  - Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
- Hallucination [NLP]
  - When an LLM produces output that looks like an answer but isn't grounded in anything real. Not a bug, but the consequence of next-token prediction applied to questions where the right answer wasn't available.
- Large Language Model [AI]
  - A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
  - e.g. Claude is an LLM: it reads your message as tokens and generates a response one token at a time.
- Model [ML]
  - In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
- Parameters [ML]
  - The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
- Prompt [NLP]
  - The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
- RAG [NLP]
  - Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLM's context window, then ask the question. The model's weights are unchanged; only the prompt is augmented.
- Token [NLP]
  - The unit an LLM operates on, roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
- Training [ML]
  - The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training costs $100M+ and uses tens of thousands of GPUs.
- Vector Database [Data]
  - A database optimised for storing and searching embeddings: finding the K nearest vectors to a query vector. Examples: Pinecone, Weaviate, pgvector. The retrieval engine in most RAG systems.
- Weights [ML]
  - The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
Posts in the Foundations series
- 01. AI, in plain words (February 24, 2026)
- 02. Inside AI: machine learning and deep learning (February 26, 2026)
- 03. What makes a model: data and algorithm (March 1, 2026)
- 04. How a model learns: training and inference (March 3, 2026)
- 05. From models to LLMs (March 6, 2026)
- 06. The context window, and why models hallucinate (March 8, 2026)
- 07. RAG: giving a model memory it doesn't have (March 11, 2026)
- 08. Prompt, RAG, fine-tune: three ways to shape a model (March 13, 2026)