How a model learns: training and inference
Training is the expensive one-time event where a model's numbers get tuned. Inference is the cheap repeated use afterwards. The gap in cost is enormous, and it shapes the whole industry.

Post 3 ended on "a model is learned numbers". This post is about how the learning happens, and why running a model afterwards is a completely different problem.
This is post 4 of 8 in the Foundations series.
Training: the expensive event

Training is the loop that turns random numbers into useful numbers. The shape is always the same:
- Show the model an example from the training data.
- Have it make a prediction.
- Measure how wrong the prediction is. That number is the loss.
- Adjust every parameter slightly in the direction that would reduce the loss.
- Repeat. A few billion times.
The adjustment in the fourth step is gradient descent. The math that figures out which way each parameter should move is called backpropagation. It walks the error backwards through every layer of the network and computes a tiny correction for each weight. Repeat enough times and the numbers settle into values that make the model good at the task.
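To make the loop concrete, here is a minimal sketch in plain Python. It assumes a toy model with just two parameters (`w` and `b`), a squared-error loss, and made-up data generated by y = 2x + 1. The learning rate and step count are illustrative choices, not values from any real training run. For a model this small, backpropagation collapses to two lines of chain rule.

```python
# Minimal sketch: fit y = w*x + b to toy data with plain gradient descent.
# The data, learning rate, and step count are illustrative assumptions.

def train(data, lr=0.05, steps=2000):
    # Start at zero for simplicity; real networks start from random values.
    w, b = 0.0, 0.0
    for step in range(steps):
        x, y_true = data[step % len(data)]  # show the model an example
        y_pred = w * x + b                  # have it make a prediction
        error = y_pred - y_true
        loss = error ** 2                   # measure how wrong it is (shown for clarity)
        # Chain rule ("backpropagation" for this tiny model):
        # d(loss)/dw = 2 * error * x and d(loss)/db = 2 * error
        grad_w = 2 * error * x
        grad_b = 2 * error
        w -= lr * grad_w                    # nudge each parameter downhill
        b -= lr * grad_b
    return w, b

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated by y = 2x + 1
w, b = train(data)
print(round(w, 2), round(b, 2))  # close to 2.0 and 1.0
```

A real network has billions of parameters instead of two, and a library computes the gradients automatically, but the shape of the loop is the same.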
A full pass through the training data is called an epoch. Most large models are trained for about one epoch or less, because the dataset is so big that a single pass is already enough.
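As a quick piece of arithmetic, the number of epochs is just examples processed divided by dataset size. The figures below are hypothetical, chosen only to show what "less than one epoch" looks like.

```python
def epochs_seen(examples_processed, dataset_size):
    """Return how many passes over the dataset a training run amounts to."""
    return examples_processed / dataset_size

# Hypothetical numbers, for illustration only.
dataset_size = 15_000_000_000_000        # examples available
examples_processed = 12_000_000_000_000  # examples the run actually used

print(epochs_seen(examples_processed, dataset_size))  # 0.8, less than one epoch
```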
What training actually costs in 2026:
- A small open-source model (7B parameters) trained from scratch on public data: hundreds of thousands of dollars, weeks on a few hundred GPUs.
- A frontier model like GPT-4 or Claude: reportedly $100M+, months on tens of thousands of GPUs. Meta said Llama 3 405B used about 16,000 H100s.
- The electricity bill alone runs to millions.
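A rough way to sanity-check numbers like these is the common rule of thumb that training takes about 6 floating-point operations per parameter per training token. The sketch below applies it. The token count, the sustained per-GPU throughput, and the hourly GPU price are all assumptions picked for illustration, not quoted figures, and the result covers raw compute only.

```python
def training_cost(params, tokens, flops_per_gpu_per_s, dollars_per_gpu_hour):
    """Estimate GPU-hours and compute cost with the ~6 * N * D approximation."""
    total_flops = 6 * params * tokens
    gpu_hours = total_flops / flops_per_gpu_per_s / 3600
    return gpu_hours, gpu_hours * dollars_per_gpu_hour

# All four inputs are assumptions for illustration.
gpu_hours, dollars = training_cost(
    params=7e9,                # a 7B-parameter model
    tokens=2e12,               # assumed 2 trillion training tokens
    flops_per_gpu_per_s=4e14,  # assumed sustained throughput per GPU
    dollars_per_gpu_hour=2.0,  # assumed rental price
)
print(f"{gpu_hours:,.0f} GPU-hours, about ${dollars:,.0f}")
```

With these inputs the estimate lands in the low hundreds of thousands of dollars, before counting failed runs, experiments, storage, or salaries.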
Training is also irreversible. If your data was bad, you don't fix it by training a little more. You start over, or you do a smaller corrective fine-tune. That's why teams obsess about data curation before they spend a single GPU-hour.
Inference: the cheap repeated use
Once training is done, the file of numbers is frozen. Using the model, that is, taking a new input and producing an output, is called inference.
Inference is a single forward pass through the network. No gradient descent, no backpropagation, just multiplication and addition with the saved weights. For a model the user can talk to, this happens once per output token (more on that in post 5).
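Here is what "just multiplication and addition with the saved weights" looks like as a minimal sketch. The network is a hypothetical one with two inputs, two hidden units, and one output. The weight values are invented, and a real model would load them from a file rather than hard-code them.

```python
# Frozen weights for a hypothetical tiny network (values are made up).
W1 = [[0.5, -0.2], [0.1, 0.8]]  # hidden layer: one row of weights per hidden unit
B1 = [0.0, 0.1]
W2 = [1.0, -1.0]                # output layer
B2 = 0.05

def relu(values):
    return [max(0.0, v) for v in values]

def forward(x):
    """One forward pass: multiply, add, apply ReLU, multiply, add."""
    hidden = relu([
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(W1, B1)
    ])
    return sum(w * h for w, h in zip(W2, hidden)) + B2

print(forward([1.0, 2.0]))  # roughly -1.65
```

Nothing in `forward` modifies `W1`, `B1`, `W2`, or `B2`. That is the whole difference from the training loop: no loss, no gradients, no updates.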
A Llama 3 8B model can do inference comfortably on a recent MacBook. Training that same model took a large GPU cluster and, by Meta's own figures, over a million GPU-hours. The asymmetry is the whole point. Pay the training cost once, do inference billions of times.
That asymmetry is also why the LLM API economy exists. OpenAI and Anthropic reportedly spent hundreds of millions training their models. They charge per million tokens of inference. The marginal cost of one more API call is cents, often a fraction of a cent. The marginal cost of one more training run is millions.
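Per-token pricing is easy to work through by hand. The sketch below assumes separate prices for input and output tokens. The prices and token counts are hypothetical placeholders, not any provider's actual rates.

```python
def call_cost(input_tokens, output_tokens, in_price_per_million, out_price_per_million):
    """Cost in dollars of one API call under per-million-token pricing."""
    return (input_tokens / 1_000_000 * in_price_per_million
            + output_tokens / 1_000_000 * out_price_per_million)

# Hypothetical call: 2,000 tokens in, 500 tokens out,
# at assumed prices of $3 and $15 per million tokens.
cost = call_cost(2_000, 500, 3.0, 15.0)
print(f"${cost:.4f}")  # $0.0135
```

At these assumed prices a single call costs a little over one cent.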
Why GPUs matter, and where they matter less
Training is bottlenecked by raw matrix multiplication throughput across many chips talking to each other. GPUs (and TPUs) are built for exactly that. CPUs would technically work but would take centuries.
Inference is also matrix multiplication, but a much smaller one. One input at a time, no gradient computation, no all-to-all communication between chips. A GPU still helps a lot, but you can do inference on:
- A single high-end GPU for medium-sized models (7B–70B parameters), though the upper end of that range usually needs compressed weights to fit in memory.
- A CPU for small models (1B–3B parameters), slowly but usably.
- Apple Silicon, surprisingly well, because of unified memory. Covered in the Running series.
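A useful back-of-the-envelope check for the list above is how much memory the weights alone take up: parameter count times bytes per parameter. The sketch below assumes 16 bits per parameter as a common default and 4 bits as a compressed ("quantized") variant. It ignores the extra memory a running model needs on top of its weights, so treat the results as a lower bound.

```python
def weight_memory_gb(params, bits_per_param):
    """Memory needed just to hold the weights, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

# Model sizes from the list above; the bit widths are assumptions.
for name, params in [("3B", 3e9), ("7B", 7e9), ("70B", 70e9)]:
    print(name,
          f"16-bit: {weight_memory_gb(params, 16):.1f} GB,",
          f"4-bit: {weight_memory_gb(params, 4):.1f} GB")
```

By this estimate a 7B model at 16 bits needs about 14 GB, which fits on one high-end GPU, while a 70B model needs about 140 GB at 16 bits and about 35 GB at 4 bits.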
The practical takeaway: when a company says "we need GPUs", ask whether they mean for training or for inference. The shape of those two needs is very different. Training wants huge clusters with fat interconnects. Inference wants many small machines close to users.
What weights really are
A throwaway question that people get wrong: what does it mean when you read "the weights are released"?
It means the model file (the trained numbers) is downloadable. Not the training data, not the training code, just the final parameters. With weights you can run inference. You can fine-tune. You cannot reproduce the training without also having the data and the code.
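In miniature, releasing weights looks like the sketch below: the trained numbers get written to a file, and anyone with the file can load them and run inference. The two-parameter model, the JSON format, and the file name are all stand-ins for illustration. Real model files use more compact formats and hold billions of numbers.

```python
import json
import os
import tempfile

def save_weights(weights, path):
    with open(path, "w") as f:
        json.dump(weights, f)

def load_weights(path):
    with open(path) as f:
        return json.load(f)

def predict(weights, x):
    """Inference with a toy model: y = w*x + b."""
    return weights["w"] * x + weights["b"]

trained = {"w": 2.0, "b": 1.0}  # hypothetical result of some training run

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_weights.json")  # hypothetical file name
    save_weights(trained, path)      # what the publisher does
    released = load_weights(path)    # what the downloader does

print(predict(released, 3.0))  # 7.0
```

The file contains `w` and `b` and nothing else. It says nothing about which data produced those values or what code ran the training, which is why the weights alone do not let you reproduce the run.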
"Open weights" (Llama, Mistral, Qwen, DeepSeek) is not the same as "open source". It's closer to a freely-redistributable binary. Most people don't care about the distinction; researchers do.
What to take away
- Training is the one-time expensive process: forward pass, measure loss, adjust weights, repeat. Frontier models cost $100M+ and tens of thousands of GPUs.
- Inference is the cheap repeated use: forward pass only, with frozen weights. The marginal cost is small.
- The asymmetry between training and inference cost is what the entire AI API industry is built on.
- "Weights released" means the trained numbers are downloadable. Not the same as open source.
What I just described (training, inference, weights) applies to every modern ML model. Spam filters, image classifiers, the lot. But the model everyone actually means in 2026 is a specific kind: an LLM. That's where post 5 picks up.
From the dictionary
Terms used in this post
Quick reference for the 19 terms you met above. Each one comes from the AI dictionary.
- Artificial Intelligence [AI]
- Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
- e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
- Backpropagation [DL]
- The algorithm that computes how each parameter in a neural network should be adjusted to reduce the loss. It walks the error backwards through every layer using the chain rule. Combined with gradient descent, it is how networks learn.
- Claude [AI]
- Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.
- e.g. This blog's create-post skill drafts inline using Claude.
- Dataset [Data]
- The collection of examples a model learns from during training. The shape, size, quality, and bias of the dataset determine almost everything about the resulting model.
- Deep Learning [DL]
- A subset of machine learning that uses neural networks with many layers ("deep" stacks). Powers image recognition, speech, and the LLMs behind ChatGPT/Claude/Gemini. Needs much more data and compute than classical ML, but scales further.
- e.g. Every modern LLM is a deep-learning model — a transformer with billions of parameters trained on internet-scale text.
- Epoch [ML]
- One full pass through the training dataset. Most large models are trained for about one epoch or less, because the dataset is so big that a single pass is enough.
- Fine-Tuning [ML]
- Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG and prompting, which change only what goes into the prompt and leave the weights untouched.
- GPT [AI]
- OpenAI's family of large language models. GPT stands for Generative Pre-trained Transformer. GPT-4 (2023) and its successors are among the most widely used closed-source LLMs in production.
- GPU [General]
- A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s — GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
- Gradient Descent [ML]
- The optimisation method behind most ML training: nudge each parameter in the direction that reduces the loss, by a small step. Repeat a few billion times and the model converges.
- Inference [ML]
- Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
- Large Language Model [AI]
- A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
- e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
- Loss [ML]
- A number that measures how wrong the model's prediction is, compared to the truth. Training is the process of changing weights so this number goes down.
- Machine Learning [ML]
- A subset of AI where the system learns patterns from data instead of following hand-written rules. The output is a model — a set of learned numbers that maps inputs to outputs. Spam filters, recommendation systems, and credit-risk scorers are classical ML.
- e.g. Gmail's spam filter learns which emails you mark as junk and updates its model — that's machine learning, not a rule someone wrote.
- Model [ML]
- In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
- Parameters [ML]
- The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
- Token [NLP]
- The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
- Training [ML]
- The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
- Weights [ML]
- The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
The Foundations series
- 01: AI, in plain words (February 24, 2026)
- 02: Inside AI: machine learning and deep learning (February 26, 2026)
- 03: What makes a model: data and algorithm (March 1, 2026)
- 04: How a model learns: training and inference (March 3, 2026)
- 05: From models to LLMs (March 6, 2026)
- 06: The context window, and why models hallucinate (March 8, 2026)
- 07: RAG: giving a model memory it doesn't have (March 11, 2026)
- 08: Prompt, RAG, fine-tune: three ways to shape a model (March 13, 2026)