March 6, 2026 | 4 min read

From models to LLMs

An LLM is one kind of ML model: trained on text, it predicts the next token. That single trick at scale gets you ChatGPT, and also explains where it breaks.

Posts 1 through 4 walked through AI, ML, DL, and what a model really is. The model everyone actually points at when they say "AI" in 2026 has one specific shape: the LLM. This post is about what makes an LLM an LLM, and why the same trick that makes it brilliant also tells you exactly where it will fall over.

This is post 5 of 8 in the Foundations series.

What an LLM is, mechanically

[Figure: the next-token prediction loop: tokens in, probabilities out, sample, append]

A Large Language Model is a deep neural network trained on an enormous amount of text to do one thing: given the text so far, predict what comes next.

That's the entire trick. Next-token prediction. Hand the model a string like "The capital of France is" and it returns a probability spread over its whole vocabulary: maybe 90% on "Paris," 3% on "located," 1% on "the," and so on down the list. You pick one, stick it onto the end of the input, feed the whole thing back in, and go again. Run that loop somewhere between 50 and 5,000 times and you get the answer ChatGPT prints for you.
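The loop above can be sketched in a few lines of Python. This is a toy illustration, not how any real LLM is implemented: `TOY_MODEL`, `next_token_probs`, and `generate` are hypothetical names, and the lookup table stands in for a neural network that would condition on the whole sequence rather than only the last token. The probabilities for "is" reuse the numbers from the example; the rest of the vocabulary is simply dropped.

```python
import random

# Hypothetical toy "model": maps the last token to relative weights over a
# tiny vocabulary. A real LLM conditions on the entire sequence and scores
# every token in a vocabulary of tens of thousands of entries.
TOY_MODEL = {
    "is": {"Paris": 0.90, "located": 0.03, "the": 0.01},
    "Paris": {".": 1.0},
    "located": {"in": 1.0},
    "in": {"Paris": 1.0},
    "the": {"city": 1.0},
    "city": {".": 1.0},
}


def next_token_probs(tokens):
    """Return the (unnormalized) next-token weights for the text so far."""
    return TOY_MODEL.get(tokens[-1], {".": 1.0})


def generate(prompt_tokens, max_new_tokens=10, seed=0):
    """Tokens in, probabilities out, sample one, append, repeat."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        choices = list(probs.keys())
        weights = list(probs.values())
        # random.choices treats the weights as relative, so they need not sum to 1.
        token = rng.choices(choices, weights=weights, k=1)[0]
        tokens.append(token)
        if token == ".":  # toy stop condition, standing in for an end-of-text token
            break
    return tokens


print(" ".join(generate(["The", "capital", "of", "France", "is"])))
```

Changing `seed` can change the continuation, which is the same reason a chatbot may answer the same prompt differently on two runs when it samples instead of always taking the top token.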

A token is roughly a chunk of text: sometimes a whole word, sometimes a fragment of one. The exact split depends on the model's tokenizer, but a common word like "Paris" is typically a single token, while a longer word like "unbelievable" is often split into pieces such as "un," "believ," and "able." English runs about 4 characters to a token. Tokens are the unit the model thinks in, the unit you pay for, and the unit the context window is measured in. Worth holding onto, because they turn up everywhere from here on.
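The 4-characters-per-token rule of thumb is enough for back-of-the-envelope estimates. The sketch below is only that: `estimate_tokens` and `fits_in_context` are hypothetical helpers, the ratio is the rough English average quoted above, and a real count requires running the specific model's tokenizer.

```python
import math

# Rough English average from the text; real ratios vary by tokenizer and language.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Estimate the token count of a string from its character length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text: str, context_window_tokens: int) -> bool:
    """Check the estimate against a context window measured in tokens."""
    return estimate_tokens(text) <= context_window_tokens


prompt = "The capital of France is"
print(estimate_tokens(prompt))  # 24 characters -> 6 estimated tokens
```

An estimate like this is useful for budgeting cost or checking whether a document is anywhere near the context limit, not for exact billing.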

The machinery inside is the transformer, which Google introduced in 2017 in a paper with the now-famous title "Attention Is All You Need." Every modern LLM, whether that's GPT, Claude, Gemini, Llama, Mistral, Qwen, or DeepSeek, is a transformer with its own tweaks. That one paper has been cited more than 150,000 times. It's probably the single most consequential piece of computer-science research this century.

Why "predict the next word" gets you ChatGPT

This is the part that throws people, and fairly so. "Predict the next word" sounds like phone autocomplete. How does autocomplete write essays, debug code, and pass medical licensing exams?

Two things: scale, and a second training stage.

Take scale first. A 7B-parameter model trained on a few hundred billion tokens of text is genuinely interesting. It finishes sentences and writes shallow code. A 70B model trained on a few trillion tokens starts to feel like it follows you. A 400B-plus model trained on most of the public internet is what sits behind the chatbots you actually use. Same architecture every time, just more parameters, more data, more compute. The capability climbs out of sheer scale in a way that genuinely surprised the people building these things.

The second stage is what turns a raw text-predictor into a chatbot. After pre-training on raw text, the model gets fine-tuned on examples of helpful, accurate, well-behaved answers, and is usually tuned further with reinforcement learning from human feedback (RLHF), where people rank its outputs and it learns to produce the ones they prefer. Pre-training teaches it the shape of language. Fine-tuning teaches it to act like an assistant.

You can see the difference most clearly when it's missing. Give a raw, un-fine-tuned LLM the prompt "What is the capital of France?" and it might answer with another question, because in raw web text a question is often followed by more questions. Fine-tuning is what teaches it that the right continuation is an answer, not an echo.

Where the illusion of understanding cracks

LLMs feel like they understand because their output looks, on the surface, exactly like understanding. Underneath it's still next-token prediction over patterns soaked up during training. Once you know that, you can predict where they break, and there are three spots worth burning into memory.

Math beyond short arithmetic. An LLM isn't running a calculator inside. Faced with "347 x 891 =", it predicts whatever tokens most plausibly follow strings like that, based on its training. For small, often-seen numbers it's right. For arbitrary ones it can confabulate digits that look plausible. That's why serious AI products typically bolt on a real calculator as a tool.
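To make the "real calculator as a tool" idea concrete, here is a minimal sketch of such a tool. It assumes the surrounding product has already extracted an arithmetic expression from the conversation. The `calculate` function is a hypothetical name, not any vendor's API, and it handles only the four basic operators.

```python
import ast
import operator

# Only plain arithmetic is allowed; anything else is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression: str):
    """Evaluate a basic arithmetic expression exactly, without using eval()."""

    def _eval(node):
        # Numeric literals (bool is excluded because it is a subclass of int).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        # Binary operations such as 347 * 891.
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        # Unary minus, e.g. -5.
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    tree = ast.parse(expression, mode="eval")
    return _eval(tree.body)


print(calculate("347 * 891"))  # 309177
```

The point of the design is the division of labour: the model decides that a calculation is needed and what it is, and deterministic code produces the digits.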

Knowledge past the training cutoff. A model trained on data up to, say, October 2025 has no idea what happened in December 2025. Sometimes it admits that. Sometimes it guesses. The guess sounds identical to the truth.

Anything genuinely novel that needs unseen reasoning. LLMs are superb at recombining things they've already seen. They're weak at problems that need a reasoning step nobody has written down on the internet before.

The framing I keep coming back to: an LLM is a search engine over compressed patterns of human-written text. When your answer already lives in those patterns, it shines. When it doesn't, it invents something that fits the shape and hands it over with exactly the same confidence.

So the loop is simple, the scale is enormous, and the failure modes are predictable. But there's one limit sitting underneath all of it that I haven't touched yet: a model can only look at a finite amount of text at once. That ceiling shapes almost everything you do with LLMs, and it's the subject of post 6.


From the dictionary

Terms used in this post

Quick reference for the 19 terms you met above. Each one comes from the AI dictionary.

Artificial Intelligence (AI): Umbrella term for software that performs tasks usually associated with human reasoning: language, perception, decision-making. Coined for the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader. Example: When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
Attention (DL): The mechanism inside a transformer that lets the model look at every previous token in the sequence and weigh its relevance. Attention scales quadratically with sequence length, which is why long context is expensive.
ChatGPT (AI): OpenAI's consumer chat product, launched November 30, 2022. The first LLM product to reach mass adoption, with 100 million users in two months. The product most people mean when they say "AI" today.
Claude (AI): Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent. Example: This blog's create-post skill drafts inline using Claude.
Context Window (NLP): The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
Deep Learning (DL): A subset of machine learning that uses neural networks with many layers ("deep" stacks). Powers image recognition, speech, and the LLMs behind ChatGPT/Claude/Gemini. Needs much more data and compute than classical ML, but scales further. Example: Every modern LLM is a deep-learning model: a transformer with billions of parameters trained on internet-scale text.
Fine-Tuning (ML): Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
Gemini (AI): Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps. Example: Gemini is Google's answer to ChatGPT, with native access to Search.
GPT (AI): OpenAI's family of large language models; the name stands for Generative Pre-trained Transformer. GPT-4 (2023) and its successors are among the most widely used closed-source LLMs in production.
Large Language Model (AI): A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. Example: Claude is an LLM: it reads your message as tokens and generates a response one token at a time.
Machine Learning (ML): A subset of AI where the system learns patterns from data instead of following hand-written rules. The output is a model: a set of learned numbers that maps inputs to outputs. Spam filters, recommendation systems, and credit-risk scorers are classical ML. Example: Gmail's spam filter learns which emails you mark as junk and updates its model. That's machine learning, not a rule someone wrote.
Model (ML): In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Neural Network (DL): A model architecture loosely inspired by neurons in the brain; in practice, a stack of matrix multiplications with non-linear functions between them. Deep learning is what you get when you stack many layers of these and train them on a lot of data.
Next-Token Prediction (NLP): The training objective of every modern LLM: given a sequence of tokens so far, predict the most likely next token. Run this in a loop and you get ChatGPT.
Parameters (ML): The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
RLHF (ML): The training stage where a pre-trained LLM is tuned with human preferences (people rank the model's outputs, and the model learns to produce the ones humans prefer). Turns a raw text predictor into a useful assistant.
Token (NLP): The unit an LLM operates on, roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
Training (ML): The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training runs cost $100M+ and use tens of thousands of GPUs.
Transformer (DL): The neural-network architecture every modern LLM is built on. Introduced by Google in the 2017 paper "Attention Is All You Need". GPT, Claude, Gemini, Llama, and Mistral are all transformers.

