From models to LLMs
An LLM is one kind of ML model — trained on text, predicts the next token. That single trick at scale gets you ChatGPT, and also explains where it breaks.

Posts 1–4 covered AI, ML, DL, and what a model is. The model everyone actually points at in 2026 is a specific shape — an LLM. This post is about what makes an LLM an LLM.
This is post 5 of 8 in the Foundations series.
What an LLM is, mechanically

A Large Language Model is a deep neural network trained on a massive amount of text, with one job: given a sequence of text so far, predict what comes next.
That's the entire trick. Next-token prediction. The model takes a string like "The capital of France is" and outputs a probability distribution over its vocabulary: maybe 90% on "Paris", 3% on "located", 1% on "the", and so on. You sample one, append it to the input, feed it back in, and repeat. That loop, running 50 to 5000 times, is what produces the answer ChatGPT shows you.
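The loop described above can be sketched in a few lines. This is a toy under stated assumptions: `TOY_DISTRIBUTIONS` is a made-up lookup table standing in for a trained model, and `next_token_distribution`, `generate`, and the `<end>` marker are hypothetical names chosen for illustration. A real LLM computes the distribution with a transformer rather than looking it up.

```python
import random

# Made-up table standing in for a trained model: text so far -> next-token weights.
# The first row reuses the illustrative numbers from the paragraph above.
TOY_DISTRIBUTIONS = {
    "The capital of France is": {" Paris": 0.90, " located": 0.03, " the": 0.01},
    "The capital of France is Paris": {".": 1.0},
}


def next_token_distribution(text):
    """Return {token: weight} for the text so far (toy stand-in for a model)."""
    # Any text the table does not know simply ends the generation.
    return TOY_DISTRIBUTIONS.get(text, {"<end>": 1.0})


def generate(prompt, max_new_tokens=50, seed=0):
    """Sample one token, append it, feed the longer text back in, repeat."""
    rng = random.Random(seed)
    text = prompt
    for _ in range(max_new_tokens):
        dist = next_token_distribution(text)
        # random.choices treats the weights as relative, so they need not sum to 1.
        token = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if token == "<end>":
            break
        text += token
    return text


print(generate("The capital of France is"))
```

Most runs end in "Paris." because that token carries most of the weight, but a different seed can pick a lower-probability token, which is the same reason a chatbot can answer the same question differently twice.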
A token is roughly a chunk of text: sometimes a whole word, sometimes a piece of one. A short, common word like "Paris" is typically one token, while a longer word like "unbelievable" is often split into pieces such as "un", "believ", "able"; the exact split depends on the model's tokenizer. English averages around 4 characters per token. Tokens are the unit the model thinks in, the unit you're billed for, and the unit the context window is measured in.
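The 4-characters-per-token average is enough for back-of-the-envelope estimates. The sketch below is only that: it does not run a real tokenizer, and the function names, the price, and the window size are hypothetical values picked for illustration.

```python
import math

CHARS_PER_TOKEN = 4  # rough English average; real tokenizers vary by text and model


def estimate_tokens(text: str) -> int:
    """Rough token count from character length. Not a real tokenizer."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(text: str, price_per_million_tokens: float) -> float:
    """Rough input cost, given a (hypothetical) price per million tokens."""
    return estimate_tokens(text) * price_per_million_tokens / 1_000_000


def fits_in_context(text: str, context_window_tokens: int) -> bool:
    """Whether the text plausibly fits in a context window of the given size."""
    return estimate_tokens(text) <= context_window_tokens


document = "word " * 10_000  # 50,000 characters of filler text
print(estimate_tokens(document))          # rough token count
print(estimate_cost(document, 2.0))       # assumes a made-up $2 per million tokens
print(fits_in_context(document, 8_000))   # assumes a made-up 8,000-token window
```

For anything that matters (billing, hard context limits), count with the tokenizer of the model you are actually calling; the estimate is for sanity checks only.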
The architecture inside is the transformer, introduced by Google in 2017 in a paper called "Attention Is All You Need". Every modern LLM (GPT, Claude, Gemini, Llama, Mistral, Qwen, DeepSeek) is a transformer with variations. The 2017 paper has been cited over 150,000 times; it's one of the most consequential pieces of computer-science research this century.
Why next-token prediction gets you ChatGPT
This is the part that confuses people, fairly. "Predict the next word" sounds like autocomplete. How does autocomplete write essays, debug code, and pass medical licensing exams?
The answer is scale plus a second training stage.
Scale first. A 7B-parameter model trained on a few hundred billion tokens of text is interesting: it can complete sentences and write surface-level code. A 70B model trained on a few trillion tokens starts to feel like it understands. The chatbots you use are generally believed to run on larger models still, in the hundreds of billions of parameters and trained on a large share of the public internet, though vendors rarely publish exact sizes. Same architecture, more parameters, more data, more compute. Capability emerges from scale in a way that surprised even the people building these systems.
The second training stage is what turns the raw model into a chatbot. After pre-training on raw text, the model is fine-tuned on examples of helpful, harmless, accurate responses, often using reinforcement learning from human feedback (RLHF). Pre-training teaches it the shape of language. Fine-tuning teaches it to be a useful assistant.
Without fine-tuning, a raw LLM given "What is the capital of France?" might continue with another question, because in raw text questions are often followed by more questions. With fine-tuning, it learns that the right continuation is an answer.
Where the illusion of understanding breaks
LLMs feel like they understand because they produce outputs that, on the surface, look exactly like understanding. But the mechanism is still next-token prediction over patterns it saw in training.
Three predictable failure modes:
Math past short arithmetic. LLMs do not run a calculator internally. They predict that the next token after "347 × 891 =" is whatever token most often followed similar strings in training. For numbers small enough to be memorised, they're correct. For arbitrary numbers, they confabulate plausible-looking digits. This is why every serious AI product wires up an actual calculator as a tool.
Knowledge past the training cutoff. A model trained on data up to October 2025 does not know what happened in December 2025. It will sometimes admit that. It will sometimes guess. The guess will sound exactly like the truth.
Anything novel that requires reasoning the model has never seen. LLMs are very good at recombining things they've seen. They are bad at problems that require a reasoning step no human has previously written down on the internet.
The useful framing: an LLM is a search engine over compressed patterns of human-written text. When the answer is already in the patterns, it shines. When it isn't, it makes one up that fits the shape.
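For the arithmetic failure mode, the fix is to compute the answer rather than predict it. Below is a minimal sketch of what such a calculator tool might look like, assuming the product only needs +, -, *, and / on plain numbers. The name `calculate` is hypothetical, and how a model decides to call the tool (the function-calling protocol) is left out entirely.

```python
import ast
import operator

# Only these four binary operators are allowed; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression: str):
    """Evaluate basic arithmetic exactly, without handing the string to eval()."""

    def walk(node):
        # Plain numbers (bool is excluded because it is a subclass of int).
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    # Accept the multiplication sign used in prose as well as "*".
    tree = ast.parse(expression.replace("×", "*"), mode="eval")
    return walk(tree.body)


print(calculate("347 × 891"))  # 309177
```

The design point is that the model's job shrinks to recognising "this is arithmetic" and writing the expression down; the digits themselves come from ordinary code that cannot confabulate.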
What to take away
- An LLM is a transformer trained to predict the next token in a sequence. The whole user-facing experience is that loop run repeatedly.
- Tokens are the unit of computation, of pricing, and of context. English averages roughly 4 characters per token.
- Capability comes from scale plus a fine-tuning step that turns the raw text predictor into an assistant.
- The illusion of understanding breaks predictably: arithmetic past memorisation, knowledge past the cutoff, and novel reasoning. Pattern-match the failure mode, not the surface.
The biggest practical limit on LLMs isn't covered yet: the model can only see a finite amount of text at once. That limit shapes everything you do with them, and it is the subject of post 6.
From the dictionary
Terms used in this post
Quick reference for the 19 terms you met above. Each one comes from the AI dictionary.
- Artificial Intelligence [AI]
  - Umbrella term for software that performs tasks usually associated with human reasoning: language, perception, decision-making. Coined for the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
  - e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
- Attention [DL]
  - The mechanism inside a transformer that lets the model look at every previous token in the sequence and weigh its relevance. The cost of standard attention scales quadratically with sequence length, which is why long context is expensive.
- ChatGPT [AI]
  - OpenAI's consumer chat product, launched November 30, 2022. The first LLM product to reach mass adoption, with a reported 100 million users in about two months. The product most people mean when they say "AI" today.
- Claude [AI]
  - Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.
  - e.g. This blog's create-post skill drafts inline using Claude.
- Context Window [NLP]
  - The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
- Deep Learning [DL]
  - A subset of machine learning that uses neural networks with many layers ("deep" stacks). Powers image recognition, speech, and the LLMs behind ChatGPT/Claude/Gemini. Needs much more data and compute than classical ML, but scales further.
  - e.g. Every modern LLM is a deep-learning model: a transformer with billions of parameters trained on internet-scale text.
- Fine-Tuning [ML]
  - Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG and plain prompting, which change only what goes into the prompt and leave the weights untouched.
- Gemini [AI]
  - Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps.
  - e.g. Gemini is Google's answer to ChatGPT, with native access to Search.
- GPT [AI]
  - OpenAI's family of large language models; the name stands for Generative Pre-trained Transformer. GPT-4 (2023) and its successors are among the most widely used closed-source LLMs in production.
- Large Language Model [AI]
  - A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
  - e.g. Claude is an LLM: it reads your message as tokens and generates a response one token at a time.
- Machine Learning [ML]
  - A subset of AI where the system learns patterns from data instead of following hand-written rules. The output is a model: a set of learned numbers that maps inputs to outputs. Spam filters, recommendation systems, and credit-risk scorers are classical ML.
  - e.g. Gmail's spam filter learns which emails you mark as junk and updates its model. That's machine learning, not a rule someone wrote.
- Model [ML]
  - In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
- Neural Network [DL]
  - A model architecture loosely inspired by neurons in the brain. In practice, it is a stack of matrix multiplications with non-linear functions between them. Deep learning is what you get when you stack many layers of these and train them on a lot of data.
- Next-Token Prediction [NLP]
  - The training objective behind every modern LLM: given a sequence of tokens so far, predict the most likely next token. Run this in a loop and you get ChatGPT.
- Parameters [ML]
  - The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
- RLHF [ML]
  - Reinforcement learning from human feedback: the training stage where a pre-trained LLM is tuned with human preferences (people rank the model's outputs, and the model learns to produce the ones humans prefer). Turns a raw text predictor into a useful assistant.
- Token [NLP]
  - The unit an LLM operates on, roughly a word or a piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
- Training [ML]
  - The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training runs are reported to cost $100M+ and use tens of thousands of GPUs.
- Transformer [DL]
  - The neural-network architecture every modern LLM is built on. Introduced by Google in the 2017 paper "Attention Is All You Need". GPT, Claude, Gemini, Llama, Mistral: all transformers.
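The quadratic-cost claim in the Attention entry can be checked by counting. A small sketch, assuming standard attention where each token attends to itself and every earlier token; the function name is hypothetical, and the count is per attention head per layer, ignoring every optimisation real systems apply.

```python
def causal_attention_pairs(sequence_length: int) -> int:
    """Count (query, key) pairs when token i attends to tokens 1..i."""
    return sequence_length * (sequence_length + 1) // 2


for n in (1_000, 2_000, 4_000):
    print(n, causal_attention_pairs(n))
# 1000 500500
# 2000 2001000
# 4000 8002000
```

Doubling the sequence length roughly quadruples the number of pairs to score, which is the arithmetic behind "long context is expensive".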
Posts in the Foundations series
- 01: AI, in plain words (February 24, 2026)
- 02: Inside AI: machine learning and deep learning (February 26, 2026)
- 03: What makes a model: data and algorithm (March 1, 2026)
- 04: How a model learns: training and inference (March 3, 2026)
- 05: From models to LLMs (March 6, 2026)
- 06: The context window, and why models hallucinate (March 8, 2026)
- 07: RAG: giving a model memory it doesn't have (March 11, 2026)
- 08: Prompt, RAG, fine-tune: three ways to shape a model (March 13, 2026)