March 11, 2026 · 4 min read

RAG: giving a model memory it doesn't have

RAG is the pattern of fetching relevant text from a search system and putting it in the LLM's context window before asking your question. Not magic, not fine-tuning, just better prompts.

Post 6 ended on a hard limit: an LLM only knows what's baked into its weights plus whatever is in its context window right now. RAG is the most common way people get the right things into that window.

This is post 7 of 8 in the Foundations series.

What RAG actually means

[Figure: the RAG flow (retrieve, augment, generate)]

RAG stands for Retrieval-Augmented Generation, which is a mouthful for three plain steps.

Retrieve. Take the user's question and search a corpus (your docs, your codebase, your wiki) for the chunks of text most relevant to it.
Augment. Drop those chunks into the LLM's prompt, right alongside the question.
Generate. Let the LLM answer, now that the fetched text is sitting in its context window.

That's the whole idea. No special model required. RAG runs on the same LLM you'd use without it. The only thing that changes is what's in the window when the model answers.

The word "retrieval" makes it sound like wizardry. It isn't. It's a search query whose results get pasted into the prompt ahead of the question. Every RAG system on earth, from a 100-line script to a sprawling enterprise stack, is a variation on that one move.
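To make the three steps concrete, here is a minimal sketch in Python. It assumes a naive word-overlap search for step 1 and a stub in place of a real model call for step 3. The function names (`retrieve`, `augment`, `generate`) and the sample corpus are invented for illustration, not taken from any library.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Step 1: rank chunks by how many words they share with the question."""
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda chunk: len(q & tokenize(chunk)), reverse=True)
    return ranked[:k]


def augment(question: str, chunks: list[str]) -> str:
    """Step 2: paste the retrieved chunks into the prompt ahead of the question."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def generate(prompt: str) -> str:
    """Step 3: hypothetical stub standing in for a real LLM call."""
    return f"<LLM answer based on a {len(prompt)}-character prompt>"


corpus = [
    "To cancel your subscription, open Settings and choose Billing.",
    "Our office is closed on public holidays.",
    "Refunds are issued within 5 business days after you cancel.",
]
question = "How do I cancel my subscription?"
prompt = augment(question, retrieve(question, corpus))
print(prompt)
print(generate(prompt))
```

Swapping the stub for a real model call and the word-overlap search for something smarter changes the quality of the answer, not the shape of the flow.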

How retrieval actually works

The naive version of step 1 is keyword search. It's fine for some cases. The standard 2026 version is vector search built on embeddings.

An embedding is a list of numbers that captures the meaning of a piece of text. Two passages that mean similar things land close together in that number space, even when they share no actual words. "How do I cancel my subscription?" and "Stop my membership" embed right next to each other. A keyword search would sail straight past the second one.
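Here is a rough illustration of "close together", using cosine similarity, one common way to measure how near two vectors are. The three-number vectors below are written by hand purely for the example. They are not the output of a real embedding model, which would produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (assumption): NOT produced by an embedding model.
embeddings = {
    "How do I cancel my subscription?": [0.9, 0.1, 0.2],
    "Stop my membership": [0.8, 0.2, 0.3],
    "What is the weather today?": [0.1, 0.9, 0.1],
}

query = embeddings["How do I cancel my subscription?"]
similar = cosine_similarity(query, embeddings["Stop my membership"])
unrelated = cosine_similarity(query, embeddings["What is the weather today?"])
print(f"cancel vs stop membership: {similar:.2f}")
print(f"cancel vs weather:         {unrelated:.2f}")
```

The two subscription questions score far higher against each other than against the weather question, even though they share almost no words.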

The pipeline has two halves.

Once, ahead of time: split your corpus into chunks (paragraphs, sections), run each chunk through an embedding model, and store the resulting vectors in a vector database (Pinecone, Weaviate, pgvector, or just a flat file if the corpus is small).
At query time: embed the user's question, find the K nearest chunks in the vector store, and hand those back.
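The two halves can be sketched in a few lines. To keep the example dependency-free, `toy_embed` below is a stand-in that hashes words into a fixed-size vector. That is an assumption made for the demo: unlike a real embedding model it only matches shared words, not shared meaning. The index is a plain Python list, which corresponds to the "flat file" end of the spectrum.

```python
import math
import re
import zlib

DIM = 256  # toy dimensionality chosen for the demo


def toy_embed(text: str) -> list[float]:
    """Stand-in embedder: hashed bag-of-words, scaled to unit length.

    A real pipeline would call an embedding model here instead.
    """
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(word.encode()) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(document: str) -> list[tuple[str, list[float]]]:
    """Offline half: split into paragraph chunks, embed each, keep the vectors."""
    chunks = [p.strip() for p in document.split("\n\n") if p.strip()]
    return [(chunk, toy_embed(chunk)) for chunk in chunks]


def search(index: list[tuple[str, list[float]]], question: str, k: int = 2) -> list[str]:
    """Query-time half: embed the question and return the k nearest chunks."""
    q = toy_embed(question)
    # Vectors are unit length, so the dot product is the cosine similarity.
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


document = """Reset your password from the account settings page.

Invoices are emailed on the first day of each month.

The API rate limit is 100 requests per minute."""

index = build_index(document)                # run once, ahead of time
print(search(index, "reset password", k=1))  # run on every question
```

Replacing `toy_embed` with a real embedding model and the list with a vector database leaves the structure untouched: embed everything once, then embed each question and look up its nearest neighbours.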

In practice, good systems run hybrid retrieval: vector search and keyword search together, with a re-ranker sitting on top. Pure vector search has its own bad habits, like cheerfully returning chunks that are related to the question without actually answering it. Most of the real engineering in production RAG goes into making retrieval return chunks that actually answer the question, not just chunks on the same topic.
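One way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The post doesn't say which fusion method any given system uses, so treat this as one common option rather than the recipe. The document ids are hypothetical, the constant `k=60` is a conventional default, and the re-ranker step is left out entirely.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one.

    Each document earns 1 / (k + rank) from every list it appears in,
    so documents ranked well by more than one retriever rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from two retrievers for the same question.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

`doc_b` and `doc_a` appear in both lists, so they outrank `doc_d` and `doc_c`, which only one retriever found.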

Why RAG isn't fine-tuning

This is the distinction that trips people up, so it's worth being blunt about it.

Fine-tuning changes the model's weights. You take an existing LLM and keep training it on your data until the new patterns are baked into the parameters. It's expensive, slow, and not easily undone short of rolling back to the old weights.

RAG doesn't lay a finger on the model. The weights stay frozen. On every query, the model is simply handed fresh text in its context window, and it keeps no memory of the previous query. Update your docs, re-index them, and RAG picks up the new version on the very next question. Fine-tuning would need a whole re-training run to do the same.
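A small sketch of why updates are cheap. `DocStore` is a hypothetical in-memory corpus with word-overlap retrieval, invented for this example. The point is only that changing an answer means overwriting a stored document, with no training step anywhere.

```python
class DocStore:
    """Tiny in-memory corpus keyed by document id (hypothetical helper)."""

    def __init__(self) -> None:
        self.docs: dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        """Add or overwrite a document. No model weights are involved."""
        self.docs[doc_id] = text

    def retrieve(self, question: str) -> str:
        """Return the stored document sharing the most words with the question."""
        q = set(question.lower().split())
        return max(self.docs.values(), key=lambda t: len(q & set(t.lower().split())))


store = DocStore()
store.upsert("support", "Support is available by email on weekdays.")
store.upsert("pricing", "The pro plan costs 20 dollars per month.")

question = "what does the pro plan cost"
before = store.retrieve(question)

# The price changes: overwrite one document and the next query sees it.
store.upsert("pricing", "The pro plan costs 25 dollars per month.")
after = store.retrieve(question)

print(before)
print(after)
```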

So the rule of thumb is clean. For factual knowledge that changes, RAG is almost always the right tool, because knowledge moves, re-training is costly, and updating an index is cheap. For changing the model's style, behaviour, or domain-specific reasoning, fine-tuning sometimes wins. Stable knowledge, reasoning patterns peculiar to your data, fixed output formats: those bake in well.

When RAG falls over

RAG is no silver bullet, and it fails in three recognisable ways.

Retrieval misses. If your search hands back the wrong chunks, the LLM has nothing real to ground in. It will either say "I don't know," if you're lucky and prompted it well, or hallucinate, if you weren't. Nearly every "my RAG is bad" complaint I've seen is a retrieval problem wearing an LLM costume.
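One cheap guard against a retrieval miss is to drop low-scoring chunks and tell the model explicitly what to do when the context doesn't help. The sketch below assumes your retriever returns `(score, chunk)` pairs with higher scores meaning a better match. The `min_score=0.3` cutoff and the prompt wording are arbitrary choices to tune, not recommended values.

```python
from typing import Optional


def build_grounded_prompt(
    question: str,
    hits: list[tuple[float, str]],
    min_score: float = 0.3,
) -> Optional[str]:
    """Build a prompt from retrieved chunks, or return None if nothing is good enough.

    hits: (similarity score, chunk text) pairs from any retriever.
    """
    kept = [chunk for score, chunk in hits if score >= min_score]
    if not kept:
        # Nothing relevant was found: the caller can answer "I don't know"
        # directly instead of letting the model guess.
        return None
    context = "\n\n".join(kept)
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


good = build_grounded_prompt("How do I cancel?", [(0.82, "Cancel under Settings > Billing.")])
miss = build_grounded_prompt("How do I cancel?", [(0.11, "Our office is closed on holidays.")])
print(good)
print(miss)
```

Logging every case where the function returns `None` also gives you a running list of the questions your retrieval can't serve.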

Chunks are the wrong size. Tiny chunks lose their surrounding context. Huge chunks elbow other relevant material out of the window. There's no universal right answer here. Somewhere between 200 and 1,000 tokens is typical, tuned per corpus.
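A minimal chunker that makes the size knob explicit. Two assumptions to flag: it counts whitespace-separated words as a rough stand-in for tokens, and it adds a small overlap between neighbouring chunks, a common way to soften the lost-context problem that the post itself doesn't prescribe.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of up to chunk_size words, sharing `overlap` words.

    Words are a rough proxy for tokens; use a real tokenizer for exact budgets.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

With `chunk_size=4` and `overlap=1`, each chunk starts on the last word of the previous one. Tuning per corpus mostly means re-running retrieval tests with different values of these two numbers.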

The model ignores the context. This is the lost-in-the-middle effect from post 6. Stuff 50 chunks into the prompt with the crucial one at position 25, and the model may walk right past it. Re-rankers and smarter ordering help.
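One simple form of "smarter ordering" is to put the strongest chunks at the start and end of the context and let the weakest ones sit in the middle. The sketch assumes the input list is already sorted from most to least relevant, for example by a re-ranker. The function name is made up for illustration.

```python
def order_for_edges(ranked_chunks: list[str]) -> list[str]:
    """Reorder chunks so the most relevant sit at both ends of the context.

    ranked_chunks must be sorted best-first. The best chunk goes first,
    the second best goes last, and relevance falls off towards the middle.
    """
    front = ranked_chunks[0::2]  # 1st, 3rd, 5th, ... most relevant
    back = ranked_chunks[1::2]   # 2nd, 4th, ... most relevant
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant
print(order_for_edges(ranked))
```

Whether this helps depends on the model and the prompt, so it is worth measuring against plain best-first ordering on your own questions.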

A RAG system is only as strong as its weakest link, and the weak link is almost always retrieval. Fix that before you touch anything else.

RAG is one of three ways to shape what an LLM does for you. The other two are prompting (ask better) and fine-tuning (change the weights). Knowing which one to reach for, and when, is the last piece of the puzzle. That's post 8, and the close of this series.

Tags: AI, Embeddings, Fundamentals, LLM, RAG

From the dictionary

Terms used in this post

Quick reference for the 14 terms you met above. Each one comes from the AI dictionary.

Artificial Intelligence [AI]: Umbrella term for software that performs tasks usually associated with human reasoning: language, perception, decision-making. The term was coined for the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader. Example: when a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
Context Window [NLP]: The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
Embedding [NLP]: A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
Fine-Tuning [ML]: Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
Hallucination [NLP]: When an LLM produces output that looks like an answer but isn't grounded in anything real. Not a bug, but the consequence of next-token prediction applied to questions where the right answer wasn't available.
Large Language Model [AI]: A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. Example: Claude is an LLM. It reads your message as tokens and generates a response one token at a time.
Model [ML]: In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Parameters [ML]: The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
Prompt [NLP]: The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
RAG [NLP]: Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLM's context window, then ask the question. The model's weights are unchanged; only the prompt is augmented.
Token [NLP]: The unit an LLM operates on, roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
Training [ML]: The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training runs cost $100M+ and use tens of thousands of GPUs.
Vector Database [Data]: A database optimised for storing and searching embeddings, that is, finding the K nearest vectors to a query vector. Examples: Pinecone, Weaviate, pgvector. The retrieval engine in most RAG systems.
Weights [ML]: The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
