Prompt, RAG, fine-tune: three ways to shape a model
Three levers for shaping what an LLM does: prompting (ask better), RAG (give it the right context), fine-tuning (change the weights). What each costs, what each fixes, and how to pick.

This is post 8 of 8 in the Foundations series. The first seven posts built up to one practical question: when you have an LLM and a use case, how do you make it work for you?
There are three levers. They aren't substitutes; they fix different things, at different costs.
The three levers
Prompting. Change what you write in the message. System prompts, instructions, examples in-context, output format hints. No code, no infrastructure. Iteration is seconds.
RAG. Retrieve relevant data from your own corpus and paste it into the context window before the question. Some infrastructure (a vector store, an embedding model, retrieval code), but no model training. Iteration is minutes to hours.
Fine-tuning. Continue training the model on your own examples so new behaviours get baked into the weights. Real infrastructure (training pipeline, GPU budget, eval harness), and re-training every time you want to change. Iteration is hours to days, sometimes more.
All three change what comes out of the model. None of them change what the model is, except fine-tuning, which changes a copy of it.
What each one actually changes

Prompting changes the input. The weights are frozen. The context window is full of whatever you wrote. Good prompting is mostly about giving the model the right framing, the right format, and one or two concrete examples ("few-shot"). The biggest under-rated technique is just telling the model the role it should play and the structure of the output you want. Most "the model isn't doing what I want" problems are solved here.
RAG changes the input too — it just generates the input dynamically per query, based on retrieved data. The weights are still frozen. The model sees a different prompt every time, depending on what was relevant. RAG fixes "the model doesn't know my data". It does not fix "the model talks in the wrong tone" — that's still a prompting problem.
Fine-tuning changes the weights. The numbers in the model file are different after fine-tuning than before. Patterns that were rare in the original training become common in the fine-tuned version. Used right, fine-tuning bakes in behaviours and styles. Used wrong, it makes the model worse at general tasks (catastrophic forgetting) without making it meaningfully better at the target task.
What each one costs
Numbers that hold roughly true in 2026:
- Prompting: free in dollars. Costs your time. Most of the cost is iteration cycles, which is why short-cycle prompt-and-test loops matter more than clever prompts.
- RAG: tens to hundreds of dollars to build a small system, low monthly bill for vector storage, plus your inference bill. Embedding 10 million tokens costs roughly a dollar at current OpenAI rates.
- Fine-tuning: cheapest case is a hosted fine-tune of a small model, a few hundred dollars. Largest case — a custom fine-tune of a 70B+ model — is tens of thousands. Plus the eval cost; without good evals, fine-tuning is a coin flip.
All three add inference cost on top, since you're still calling the model once per request. RAG adds the most, because it grows the prompt size.
When to pick which
My rules of thumb, after building real things with all three:
Start with prompting. Always. Most problems are prompting problems. If you can fix it in the system prompt, do that and stop.
Reach for RAG when the model needs facts it doesn't have. Your docs, your codebase, your customer's account history, anything that changes or is private. RAG is the right answer 80% of the time when prompting alone isn't.
Reach for fine-tuning when you've maxed prompting and RAG, and the gap is style or behaviour. Specific output formats, domain-specific tone, classification tasks at scale where the prompt would otherwise be huge. Fine-tuning is rarely the first answer; it's often the last 10%.
A mistake I see often: teams jumping to fine-tuning because it sounds technical, then discovering they were really fixing a prompting problem. The fine-tune works, but it would have worked as a prompt for a fraction of the cost.
Another mistake: stuffing everything into the prompt as the system grows. Past a certain size, the prompt is its own corpus, and the right move is to retrieve from it instead of always sending the whole thing.
Combine them
The real systems run all three at once. A production setup looks like:
- A carefully written system prompt with role, format, and constraints.
- RAG fetching the relevant slice of your corpus per query.
- Optionally, a fine-tuned model that's already good at your specific output format.
Each lever is multiplicative. A great prompt with the wrong context still fails. A great context with a sloppy prompt still fails. A great fine-tune with neither doesn't matter.
What to take away
- Three levers, three different problems: prompting (how you ask), RAG (what data the model can see), fine-tuning (what behaviour the model defaults to).
- Costs scale up the same way: prompting is free, RAG is cheap, fine-tuning is expensive and only sometimes worth it.
- Default order: prompt first, then RAG, then fine-tune. Most teams do this in reverse and burn money.
- In production, all three combine. None of them is the answer alone.
That closes the Foundations series. Eight posts, one arc: from "AI is a marketing word" to "here's exactly which lever to reach for and why". The next series picks up from a different angle — now that you understand these models, where do they actually run? Cloud, your laptop, the device in your hand. Tradeoffs, hardware, runtimes.
From the dictionary
Terms used in this post
Quick reference for the 16 terms you met above. Each one comes from the AI dictionary.
- Artificial IntelligenceAI
- Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
- e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
- Context WindowNLP
- The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window — anything outside is invisible.
- EmbeddingNLP
- A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
- Few-ShotNLP
- Showing the LLM a handful of input/output examples in the prompt before the real query, so it picks up the pattern. Cheap and effective; usually the next thing to try after a plain prompt.
- Fine-TuningML
- Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
- GPUGeneral
- A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s — GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
- InferenceML
- Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
- Large Language ModelAI
- A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
- e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
- ModelML
- In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
- PromptNLP
- The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
- RAGNLP
- Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLMs context window, then ask the question. The models weights are unchanged; only the prompt is augmented.
- System PromptNLP
- A special instruction at the start of an LLM conversation that sets role, behaviour, format, and constraints. Most "the model isnt doing what I want" problems are solved here, before reaching for RAG or fine-tuning.
- TokenNLP
- The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
- TrainingML
- The expensive one-time process of running a learning algorithm over data until the models parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
- Vector DatabaseData
- A database optimised for storing and searching embeddings — finding the K nearest vectors to a query vector. Examples: Pinecone, Weaviate, pgvector. The retrieval engine in most RAG systems.
- WeightsML
- The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
Rate this article
How helpful did you find this?
- 01
AI, in plain words
February 24, 2026
- 02
Inside AI: machine learning and deep learning
February 26, 2026
- 03
What makes a model: data and algorithm
March 1, 2026
- 04
How a model learns: training and inference
March 3, 2026
- 05
From models to LLMs
March 6, 2026
- 06
The context window, and why models hallucinate
March 8, 2026
- 07
RAG: giving a model memory it doesn't have
March 11, 2026
- 08
Prompt, RAG, fine-tune: three ways to shape a model
March 13, 2026
Newsletter
Get new articles in your inbox
AI engineering, LLM systems, and software architecture — no filler.
No spam. Unsubscribe any time.
Discussion
Comments
Leave a note about the article, architecture choices, or what you would build next.
Loading comments...