The major LLMs in 2026
A tour of the closed frontier models (Claude, GPT, Gemini) and the open weights (Llama, Qwen, DeepSeek, Mistral). What 'B' means, what each is good at, and which size to actually run.

"Which LLM should I use" is the wrong first question. The right one is: closed or open, and at what size. Once you answer that, the shortlist writes itself.
This is post 2 of 7 in the Running series. The previous post sorted out where AI runs (cloud, local, edge). This one is the model tour: who the players are in 2026, what their numbers actually mean, and what each is good at.
The closed frontier

Closed models live behind APIs. You don't get the weights, you don't run them, you pay per token. In return you get the absolute frontier of capability.
Three labs matter:
- Anthropic, Claude. Opus 4.7 is the current heavyweight, around 57 on Artificial Analysis' Intelligence Index and leading SWE-bench Pro at 64.3%. Sonnet 4.6 is the daily driver: cheaper, fast, smart enough for almost everything. Haiku 4.5 is the small-fast tier.
- OpenAI, GPT. GPT-5.5 sits at the top of the leaderboard, 60 on Intelligence Index in xhigh-reasoning mode and 82.7% on Terminal-Bench 2.0. GPT-5.4 and GPT-5-mini fill out the tier below.
- Google, Gemini. Gemini 3.1 Pro is at 57 on Intelligence Index, leads scientific reasoning at 94.3% on GPQA Diamond, and ships with the longest production context window (1M tokens, experimental builds at 10M).
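You will not run any of these on your laptop. Anthropic does not publish parameter counts, so any figure is an estimate, but a model in Opus 4.7's class plausibly needs somewhere between eight and thirty-two H100s just to hold the weights. The frontier is a datacenter phenomenon.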
What "closed" actually buys you in 2026:
- Tool use that works on the first try. The frontier models are trained heavily on agentic loops, and it shows.
- Long-context reasoning that doesn't fall apart at 200K tokens.
- A pricing curve that drops every six months without you having to do anything.
What it costs you: privacy (every token leaves your machine), vendor lock-in (your prompts are tuned to one model's quirks), and a per-call bill that scales linearly with usage.
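To make the "scales linearly" part concrete, here is a minimal sketch of the per-token bill. The prices are placeholders picked for illustration, not any provider's real rates, and `monthly_cost` is a hypothetical helper, not part of any SDK.

```python
def monthly_cost(calls_per_day, input_tokens, output_tokens,
                 usd_per_m_input, usd_per_m_output, days=30):
    """Estimate a monthly API bill. Prices are quoted per million tokens."""
    per_call = (input_tokens * usd_per_m_input
                + output_tokens * usd_per_m_output) / 1_000_000
    return calls_per_day * days * per_call


# Placeholder prices (assumed, not real rates): $3 per M input, $15 per M output.
small = monthly_cost(1_000, 2_000, 500, 3.0, 15.0)
large = monthly_cost(10_000, 2_000, 500, 3.0, 15.0)

print(f"1k calls/day:  ${small:,.2f}")
print(f"10k calls/day: ${large:,.2f}")
```

Ten times the calls is ten times the bill. There is no volume at which the API starts amortizing the way owned hardware does.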
The open weights
Open models ship the weights. You can download them, run them, fine-tune them, embed them in a product. You don't need a permission slip from anyone.
The 2026 lineup, in rough order of how often you will actually hear about them:
- Llama 4 (Meta). Maverick is 400B total / 17B active (Mixture-of-Experts) with a 1M-token context. Scout is the smaller variant (109B total / 17B active) and is the one with the 10M-token context. Llama is the default open model: best documented, best supported in every runtime.
- Qwen 3.5 (Alibaba). Ships dense (0.5B to 32B) and MoE (235B-A22B, 397B-A17B). Qwen punches well above its weight on reasoning and coding benchmarks. The dense small models are what you actually run on a laptop.
- DeepSeek V4 (DeepSeek). 1.6T total / 49B active, with Hybrid Compressed Sparse Attention that cuts inference FLOPs 27% vs V3.2 at 1M tokens. The reasoning leader on the open side. Each DeepSeek release has pushed the price floor down for everyone.
- Mistral Large 3 (Mistral). 675B / 41B MoE, European, strong on European languages and code.
- Gemma 4 (Google). Smaller, dense, designed for on-device. The model you pick when you want open weights but don't want MoE complexity.
- Kimi K2.6 (Moonshot). Highest-ranked open-weights model on Intelligence Index right now (54). Worth knowing; not yet the default.
What "B" actually means
When you see "Llama 4 Maverick: 400B / 17B", the two numbers tell different stories.
The first is total parameters: every weight in the model file. That sets your storage and your VRAM floor. 400B parameters at FP16 is 800GB on disk; quantized to 4-bit it is around 200GB. Either way, you are not loading that on a 16GB laptop.
The second is active parameters: how many of those weights actually fire on any given token. Mixture-of-Experts models route each token through a small subset of the network, not all of it. 17B active means the cost-per-token math behaves like a 17B dense model, even though the file is 400B.
The rule of thumb:
- Total parameters set the VRAM floor (you have to fit the file).
- Active parameters set inference cost and speed (only those compute on each token).
- Dense models have total = active. MoE models split them.
This is why DeepSeek V4 at 1.6T total / 49B active needs as much memory as a 1.6T dense model would, but computes each token at roughly the cost of a 49B dense model. MoE is the default at frontier scale in 2026 because it decouples size from cost.
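The arithmetic behind the two numbers fits in a few lines. This is a rough sketch that counts raw weights only (no KV cache, no runtime buffers) and treats 1 GB as 10^9 bytes; `weights_gb` is a hypothetical helper name.

```python
def weights_gb(params_billion, bits_per_weight):
    """Size of the raw weights in GB: parameters times bytes per parameter."""
    return params_billion * bits_per_weight / 8


# Figures from the text: Llama 4 Maverick, 400B total / 17B active.
total_b, active_b = 400, 17

fp16_gb = weights_gb(total_b, 16)   # 800.0
q4_gb = weights_gb(total_b, 4)      # 200.0
active_share = active_b / total_b   # fraction of weights used per token

# DeepSeek V4 from the text: 1.6T total. Even at 4-bit the file is huge.
deepseek_q4_gb = weights_gb(1600, 4)  # 800.0

print(f"Maverick FP16: {fp16_gb:.0f} GB, 4-bit: {q4_gb:.0f} GB")
print(f"Active share per token: {active_share:.2%}")
print(f"DeepSeek V4 at 4-bit: {deepseek_q4_gb:.0f} GB")
```

The memory line depends only on the total; the active share is what drives per-token compute.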
Sizes that matter on a laptop
Frontier models live in datacenters. The ones you'll actually run locally are smaller. The useful tiers, with rough VRAM at 4-bit quantization:
- 1B to 3B. Tiny. Runs on anything, including a phone. Good for autocomplete, classification, structured extraction. Not for reasoning. (Gemma 4 1B, Qwen 3.5 0.5B/1.5B/3B.)
- 7B to 8B. The sweet spot for laptops. Around 5GB on disk, 6 to 8GB VRAM. Smart enough for most chat, drafting, summarization. (Llama 4 8B, Qwen 3.5 7B, Mistral 7B.)
- 13B to 32B. The serious local tier. 8 to 24GB VRAM. Noticeably better at multi-step reasoning. (Qwen 3.5 32B, Gemma 4 27B.)
- 70B class. Workstation territory. 40 to 50GB at 4-bit. Needs an M-series Mac with 64GB+ unified memory or a multi-GPU rig. (Llama 4 70B equivalent, DeepSeek-Distill 70B.)
- MoE flagships (200B to 1.6T total). Server territory. You don't run these unless you have a real machine.
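If you want to place your own machine on that ladder, a back-of-the-envelope check looks like this. It is a sketch under a loud assumption: a flat 30% on top of the 4-bit weights for KV cache and runtime buffers. Real overhead depends on context length and runtime, so treat the output as a first guess. The function names and the list of sizes are illustrative.

```python
# Assumption: 30% headroom over the 4-bit weights for KV cache and buffers.
OVERHEAD = 1.3


def vram_needed_gb(params_billion, bits=4, overhead=OVERHEAD):
    """Rough memory needed to run a dense model of the given size."""
    return params_billion * bits / 8 * overhead


def largest_that_fits(vram_gb, sizes_billion=(1, 3, 8, 14, 32, 70)):
    """Largest dense model size (in billions) that fits, or None."""
    fitting = [s for s in sizes_billion if vram_needed_gb(s) <= vram_gb]
    return max(fitting) if fitting else None


for vram in (8, 16, 24, 64):
    print(f"{vram} GB -> up to {largest_that_fits(vram)}B at 4-bit")
```

With these assumptions an 8GB machine tops out at the 8B tier and a 24GB card reaches 32B, which lines up with the tiers above.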
What each is actually good at
Benchmarks are noisy. Here is the working version, from sitting with these models day to day:
- Coding agents and tool use. Claude Opus 4.7, then GPT-5.5, then DeepSeek V4 on the open side. Anthropic's lead on SWE-bench is real and you feel it in long agentic loops.
- Pure reasoning and math. GPT-5.5 in high-reasoning mode, Gemini 3.1 Pro for scientific topics, DeepSeek V4 if you want open.
- Cheap fast generation at volume. Haiku 4.5, GPT-5-mini, Qwen 3.5 7B locally. You are paying for throughput, not depth.
- Long context (1M+ tokens). Gemini 3.1 Pro is the production answer. Llama 4 Maverick is the open answer.
- Indian-language work. Qwen and Llama both do reasonable Hindi / Tamil / Bengali. Gemini handles Indic scripts cleanly. None of them are great at Hinglish (mixed-script) input — that's still a mess.
I would not pick a model based on a benchmark chart. Pick two or three from the list above, run them on your actual workload for a week, and let the one that costs least to keep on staff win.
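That bake-off can be sketched as a tiny harness: score each candidate on your own prompts, drop the ones that miss your quality bar, and keep the cheapest survivor. Everything here is hypothetical scaffolding. `Candidate.run` stands in for whatever API call or local runtime you use, the costs are made-up numbers, and the pass/fail checks are toy examples.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    name: str
    run: Callable[[str], str]   # hypothetical: wraps an API or local call
    usd_per_call: float         # assumed average cost you measured


def pick_cheapest_passing(candidates, workload, min_pass_rate=0.9):
    """Return the cheapest candidate that meets the pass rate, or None."""
    passing = []
    for c in candidates:
        passed = sum(1 for prompt, check in workload if check(c.run(prompt)))
        if passed / len(workload) >= min_pass_rate:
            passing.append(c)
    return min(passing, key=lambda c: c.usd_per_call, default=None)


# Toy workload: (prompt, check) pairs. Replace with your real tasks.
workload = [
    ("2+2?", lambda out: "4" in out),
    ("capital of France?", lambda out: "Paris" in out),
]

# Stub models standing in for real calls.
big = Candidate("big", lambda p: "4" if "2+2" in p else "Paris", 0.02)
small = Candidate("small", lambda p: "4" if "2+2" in p else "Paris", 0.002)
tiny = Candidate("tiny", lambda p: "no idea", 0.0001)

winner = pick_cheapest_passing([big, small, tiny], workload)
print(winner.name)
```

The cheapest model overall loses because it fails the checks; among the ones that pass, price decides.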
The pricing floor keeps falling
A useful frame: every six months, the model that was frontier-class becomes the price-floor model. Sonnet 4.6 today does what Opus 4 did eighteen months ago, at a fraction of the price. The same will be true of whatever Anthropic ships next.
The implication for anything you build: don't optimize your prompts for a specific model unless you have to. The model you target today will be obsolete by the time you ship; the model two tiers below it will probably be free.
What's next
This post sorted out who the models are. The next one zooms into the hardware you'd need to run them: what CPU, GPU, RAM, and VRAM each actually do for inference, and why VRAM is the bottleneck almost every time.
From the dictionary
Terms used in this post
Quick reference for the 18 terms you met above. Each one comes from the AI dictionary.
- Artificial Intelligence [AI]
- Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
- e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
- Attention [DL]
- The mechanism inside a transformer that lets the model look at every previous token in the sequence and weigh its relevance. Attention scales quadratically with sequence length, which is why long context is expensive.
- Claude [AI]
- Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.
- e.g. This blog's create-post skill drafts inline using Claude.
- Context Window [NLP]
- The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window — anything outside is invisible.
- Edge AI [AI]
- Running model inference on the device the data is captured on (phone, camera, sensor) rather than sending it to a server. Models are usually quantized and under 500M parameters. Latency is 5-50ms because there is no network in the loop. Powers Face ID, on-device speech recognition, doorbell person detection.
- Fine-Tuning [ML]
- Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG and prompting, which change only the input and leave the weights untouched.
- Gemini [AI]
- Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps.
- e.g. Gemini is Google's answer to ChatGPT, with native access to Search.
- GPT [AI]
- OpenAI's family of large language models; the name stands for Generative Pre-trained Transformer. GPT-4 (2023) and successors are the most widely used closed-source LLMs in production.
- GPU [General]
- A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s — GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
- Inference [ML]
- Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper per request than training, which is what makes pay-per-token LLM APIs viable.
- Large Language Model [AI]
- A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
- e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
- Model [ML]
- In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
- Parameters [ML]
- The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity and more memory needed, and, for dense models, slower inference.
- Prompt [NLP]
- The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
- Quantization [ML]
- Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
- Token [NLP]
- The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
- VRAM [General]
- Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
- Weights [ML]
- The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
- 01
Where AI actually runs: cloud, local, edge
March 16, 2026
- 02
What leaves your machine when you use AI
March 31, 2026
- 03
LLM APIs and the economics of tokens
March 28, 2026
- 04
The runtimes: llama.cpp, Ollama, LM Studio
March 26, 2026
- 05
Why Apple Silicon punches above its weight on local LLMs
March 23, 2026
- 06
What it takes to run a model on your machine
March 21, 2026
- 07
The major LLMs in 2026
March 18, 2026