March 3, 2026 · 4 min read

How a model learns: training and inference

Training is the expensive one-time event where a model's numbers get tuned. Inference is the cheap repeated use afterwards. The gap in cost is enormous, and it shapes the whole industry.

Post 3 left off on "a model is just learned numbers." This post is about how those numbers actually get learned, and why running the model afterwards is a completely different problem with a completely different price tag.

This is post 4 of 8 in the Foundations series.

Training: the expensive part

[Figure: the training loop: forward pass, loss, backprop, update]

Training is the loop that turns random numbers into useful ones. The shape never really changes.

1. Show the model an example from the training data.
2. Let it make a prediction.
3. Measure how wrong that prediction was. That number is the loss.
4. Nudge every parameter a touch in the direction that would shrink the loss.
5. Repeat. A few billion times.

Step 4 is gradient descent. The math that works out which way each parameter should move is backpropagation: it walks the error backwards through every layer and computes a tiny correction for each weight. Do that enough times and the numbers settle into values that make the model good at the job. One full sweep through the training data is an epoch, though the biggest models train on a dataset so large that a fraction of a single pass is already plenty.
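To make the loop concrete, here is a minimal sketch in plain Python. It is a toy, not how any real LLM is trained: the model has just two parameters, the data is five made-up points on the line y = 2x + 1, and the learning rate and epoch count are assumed values picked so the example converges. With a model this small, "backpropagation" collapses to two lines of chain rule.

```python
# Toy training loop: fit y = w*x + b to a handful of points with plain
# gradient descent. All names and numbers here are illustrative.

# Training data generated from the "true" rule y = 2x + 1.
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]

w, b = 0.0, 0.0          # parameters start at arbitrary values
learning_rate = 0.02     # size of each nudge (an assumed value)


def mean_squared_loss(w, b):
    """Average squared error over the whole dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


initial_loss = mean_squared_loss(w, b)

for epoch in range(2000):            # one epoch = one full pass over the data
    for x, y in data:                # 1. show the model an example
        prediction = w * x + b       # 2. forward pass: make a prediction
        error = prediction - y       # 3. the loss for this example is error**2
        # Backpropagation for this tiny model is just the chain rule:
        grad_w = 2.0 * error * x     # d(loss)/dw
        grad_b = 2.0 * error         # d(loss)/db
        w -= learning_rate * grad_w  # 4. gradient descent update
        b -= learning_rate * grad_b

final_loss = mean_squared_loss(w, b)
print(f"w={w:.3f} b={b:.3f} loss {initial_loss:.3f} -> {final_loss:.6f}")
```

The parameters drift from (0, 0) towards (2, 1) and the loss shrinks along the way. A real network does the same thing with billions of parameters and a framework computing the gradients for it.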

Here's what that costs in 2026.

A small open model (7B parameters) trained from scratch on public data: hundreds of thousands of dollars, and weeks on a few hundred GPUs.
A frontier model like GPT-4 or Claude: reportedly north of $100M, months on tens of thousands of GPUs. Meta said Llama 3 405B took roughly 16,000 H100s.
The electricity alone runs into the millions.

And training is close to irreversible. If your data was bad, you don't patch it by training a bit more. You start over, or you run a smaller corrective fine-tune. That's exactly why serious teams obsess over data curation before they burn a single GPU-hour.

Inference: the cheap part you repeat forever

Once training finishes, the file of numbers freezes. Using the model, feeding it a new input and getting an output, is called inference.

Inference is a single forward pass through the network. No gradient descent, no backpropagation, just multiplication and addition with the frozen weights. For a model you can chat with, that pass happens once per output token, which post 5 gets into.
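A rough sketch of what "one forward pass per output token" looks like, using a made-up four-word vocabulary and a hand-written weight table. The names `VOCAB`, `WEIGHTS`, `forward`, and `generate` are all hypothetical, and the toy only looks at the previous token, whereas a real LLM looks at the whole context. The point is only the shape: the weights never change, and the loop calls the forward pass once per token.

```python
# Toy inference: frozen weights, one forward pass per output token.
# The vocabulary and weight values are made up for illustration.

VOCAB = ["<end>", "the", "cat", "sat"]

# Frozen "weights": row i holds the score of each next token given token i.
WEIGHTS = [
    [1.0, 0.0, 0.0, 0.0],  # after <end>
    [0.0, 0.1, 2.0, 0.3],  # after "the"  -> "cat" scores highest
    [0.2, 0.1, 0.0, 1.5],  # after "cat"  -> "sat" scores highest
    [1.8, 0.2, 0.1, 0.0],  # after "sat"  -> <end> scores highest
]


def forward(token_id):
    """One forward pass: a one-hot input vector times the weight matrix."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(len(VOCAB))]
    return [
        sum(one_hot[i] * WEIGHTS[i][j] for i in range(len(VOCAB)))
        for j in range(len(VOCAB))
    ]


def generate(prompt_token, max_tokens=10):
    """Greedy decoding: call forward() once per generated token."""
    output, passes, token_id = [], 0, VOCAB.index(prompt_token)
    for _ in range(max_tokens):
        scores = forward(token_id)
        passes += 1
        token_id = scores.index(max(scores))  # pick the highest-scoring token
        if VOCAB[token_id] == "<end>":
            break
        output.append(VOCAB[token_id])
    return output, passes


tokens, forward_passes = generate("the")
print(tokens, forward_passes)
```

Nothing in `generate` computes a loss or touches `WEIGHTS`. It is multiplication, addition, and picking the largest score, repeated until the model emits its end token.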

A Llama 3 8B model does inference comfortably on a recent MacBook. The same model needed hundreds of GPUs to train. That asymmetry is the whole point: pay the training bill once, run inference billions of times.

It's also the reason the LLM API economy exists at all. OpenAI and Anthropic spent hundreds of millions training their models, then charge per million tokens of inference. One more API call costs them cents, often a fraction of a cent. One more training run costs millions. Guess which one they're happy to sell you.
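The arithmetic behind per-token billing is short enough to write down. In the sketch below every number is a hypothetical placeholder: the price per million tokens, the training bill, and the request size are assumptions chosen for round figures, not anyone's real pricing or budget.

```python
# Back-of-envelope API economics. Every number below is a hypothetical
# placeholder, not a real price or a real training budget.

PRICE_PER_MILLION_TOKENS = 10.00   # assumed price charged, in dollars
TRAINING_COST = 100_000_000        # assumed one-time training bill, in dollars


def request_price(tokens, price_per_million=PRICE_PER_MILLION_TOKENS):
    """What one API call is billed at, given its token count."""
    return tokens / 1_000_000 * price_per_million


def requests_to_cover_training(tokens_per_request):
    """How many calls of this size bring in revenue equal to the training bill."""
    return TRAINING_COST / request_price(tokens_per_request)


price = request_price(2_000)
calls = requests_to_cover_training(2_000)
print(f"one 2,000-token call: ${price:.2f}")
print(f"calls to match the training bill: {calls:,.0f}")
```

Under these assumptions a single call is worth a couple of cents and it takes billions of them to match the training bill. The sketch counts revenue only and ignores what each call costs to serve.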

Why GPUs matter for one and less for the other

Training is bottlenecked by raw matrix-multiplication throughput across many chips that have to talk to each other constantly. GPUs and TPUs are built for precisely that. CPUs would technically get there, in a few centuries.

Inference is also matrix multiplication, but a far smaller one: one input at a time, no gradients, no all-to-all chatter between chips. A GPU still helps, but you have options.

A single high-end GPU handles medium models (7B to 70B parameters).
A CPU runs small models (1B to 3B parameters), slowly but usably.
Apple Silicon does surprisingly well, thanks to unified memory. The Running series digs into that.
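One way to see why the size brackets above fall where they do is to estimate the memory the weights alone take up: parameter count times bytes per parameter. The sketch below is a lower bound, not a sizing tool. It ignores activations and runtime overhead, and the reduced-precision options (`int8`, `int4`) are an assumption about how the model is stored, which this post does not cover.

```python
# Rough memory needed just to hold a model's weights.
# Ignores activations, caches, and runtime overhead.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision="fp16"):
    """Parameters times bytes per parameter, in gigabytes (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for name, params in [("3B", 3e9), ("8B", 8e9), ("70B", 70e9)]:
    sizes = ", ".join(
        f"{p}: {weight_memory_gb(params, p):.1f} GB" for p in BYTES_PER_PARAM
    )
    print(f"{name:>4}  {sizes}")
```

At 16 bits per parameter, an 8B model needs about 16 GB for its weights, which is roughly laptop territory. A 70B model needs about 140 GB at the same precision, which is why it wants a high-end GPU or a smaller number format.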

So when a company says "we need GPUs," the useful question is: for training or for inference? The two needs barely resemble each other. Training wants giant clusters with fat interconnects. Inference wants lots of small machines sitting close to users.

What "the weights are released" actually means

People trip over this one, so it's worth thirty seconds. When you read that a lab "released the weights," it means the model file, the trained numbers, is downloadable. Not the training data. Not the training code. Just the final parameters.

With the weights you can run inference and you can fine-tune. What you can't do is reproduce the training, because you don't have the data or the code. So "open weights" (Llama, Mistral, Qwen, DeepSeek) isn't the same thing as "open source." It's closer to a binary you're free to redistribute. Most people never need to care about the gap. Researchers do.
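A miniature version of a weights release, as a sketch. The JSON file, the two-parameter model, and the function names are all hypothetical stand-ins. Real releases use binary formats and far bigger architectures, but the division of labour is the same: the file holds numbers, and separate code knows how to use them.

```python
# A "released weights" file, in miniature: only the trained numbers are saved.
import json
import os
import tempfile

# Pretend these came out of a training run we cannot see or reproduce.
trained = {"w": 2.0, "b": 1.0}

path = os.path.join(tempfile.mkdtemp(), "toy_weights.json")
with open(path, "w") as f:
    json.dump(trained, f)  # the release: numbers only


def load_weights(path):
    """Read the released numbers back from disk."""
    with open(path) as f:
        return json.load(f)


def predict(weights, x):
    """The architecture: the code that says how to use the numbers."""
    return weights["w"] * x + weights["b"]


weights = load_weights(path)
print(predict(weights, 3.0))  # inference works with nothing but the file
print(sorted(weights))        # the file holds parameters, no data, no training code
```

Anyone holding the file can run `predict`, and could keep nudging `w` and `b` on new data, which is fine-tuning. What the file cannot tell them is which examples produced those values in the first place.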

Everything here, training, inference, weights, applies to every modern ML model: spam filters, image classifiers, all of it. But the model everyone actually means when they say "AI" in 2026 is one specific kind, the LLM. That's post 5.

From the dictionary

Terms used in this post

Quick reference for the 20 terms you met above. Each one comes from the AI dictionary.

Artificial Intelligence (AI): Umbrella term for software that performs tasks usually associated with human reasoning: language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader. Example: when a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
Backpropagation (DL): The algorithm that computes how each parameter in a neural network should be adjusted to reduce the loss. It walks the error backwards through every layer using the chain rule. Combined with gradient descent, it is how networks learn.
Claude (AI): Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent. Example: this blog's create-post skill drafts inline using Claude.
Dataset (Data): The collection of examples a model learns from during training. The shape, size, quality, and bias of the dataset determine almost everything about the resulting model.
Deep Learning (DL): A subset of machine learning that uses neural networks with many layers ("deep" stacks). Powers image recognition, speech, and the LLMs behind ChatGPT/Claude/Gemini. Needs much more data and compute than classical ML, but scales further. Example: every modern LLM is a deep-learning model, a transformer with billions of parameters trained on internet-scale text.
Epoch (ML): One full pass through the training dataset. Most large models are trained for less than a single epoch on a dataset so big that one pass is enough.
Fine-Tuning (ML): Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
GPT (AI): OpenAI's family of large language models. The name stands for Generative Pre-trained Transformer. GPT-4 (2023) and successors are among the most widely used closed-source LLMs in production.
GPU (General): A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s: GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
Gradient Descent (ML): The optimisation method behind most ML training: nudge each parameter in the direction that reduces the loss, by a small step. Repeat a few billion times and the model converges.
Inference (ML): Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
Large Language Model (AI): A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. Example: Claude is an LLM. It reads your message as tokens and generates a response one token at a time.
Loss (ML): A number that measures how wrong the model's prediction is, compared to the truth. Training is the process of changing weights so this number goes down.
Machine Learning (ML): A subset of AI where the system learns patterns from data instead of following hand-written rules. The output is a model, a set of learned numbers that maps inputs to outputs. Spam filters, recommendation systems, and credit-risk scorers are classical ML. Example: Gmail's spam filter learns which emails you mark as junk and updates its model. That's machine learning, not a rule someone wrote.
Model (ML): In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Parameters (ML): The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
Token (NLP): The unit an LLM operates on, roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
Training (ML): The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training costs $100M+ and takes tens of thousands of GPUs.
Weights (ML): The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
API (General): Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
