March 18, 2026 · 6 min read

The major LLMs in 2026

A field guide to the closed frontier models and the open weights you can actually run. What the "B" numbers mean, and which size fits your machine.

"Which LLM should I use" is the wrong first question. Ask this instead: closed or open, and at what size. Answer that and the shortlist writes itself.

This is post 2 of 7 in the AI Running series. The last one sorted out where AI runs: cloud, local, or the edge. This one is the model tour. Who the players are in 2026, what their numbers actually mean, and what each one is good at.

The closed frontier

[Figure: Closed vs open weights]

Closed models live behind APIs. You don't get the weights, you don't run them, you pay per token. What you get back is the absolute frontier of capability.

Three labs matter.

Anthropic, Claude. Opus 4.7 is the current heavyweight, sitting around 57 on Artificial Analysis' Intelligence Index and leading SWE-bench Pro at 64.3%. Sonnet 4.6 is the daily driver: cheaper, fast, smart enough for almost everything. Haiku 4.5 is the small-fast tier.
OpenAI, GPT. GPT-5.5 sits at the top of the leaderboard. It hits 60 on the Intelligence Index in xhigh-reasoning mode and 82.7% on Terminal-Bench 2.0. GPT-5.4 and GPT-5-mini fill out the tier below.
Google, Gemini. Gemini 3.1 Pro lands at 57 on the Intelligence Index, leads scientific reasoning at 94.3% on GPQA Diamond, and ships the longest production context window going: 1M tokens, with experimental builds at 10M.

You will not run any of these on your laptop. Opus 4.7 needs eight to thirty-two H100s just to load the weights. The frontier is a datacenter thing, full stop.

So what does "closed" actually buy you in 2026?

Tool use that works on the first try. The frontier models are trained hard on agentic loops, and you feel it.
Long-context reasoning that doesn't fall apart at 200K tokens.
A pricing curve that drops every six months while you do nothing.

And what does it cost you? Privacy, because every token leaves your machine. Vendor lock-in, because your prompts end up tuned to one model's quirks. And a per-call bill that climbs in a straight line with usage.
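To make the "straight line" concrete, here is a minimal sketch of per-token billing. The prices and traffic numbers are placeholders invented for illustration, not any vendor's actual rates, and `monthly_cost_usd` is a hypothetical helper.

```python
def monthly_cost_usd(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
    days: int = 30,
) -> float:
    """Monthly API bill when every request is billed per token."""
    per_request = (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000
    return requests_per_day * days * per_request


# Placeholder prices: $3 per million input tokens, $15 per million output tokens.
base = monthly_cost_usd(1_000, 2_000, 500, 3.0, 15.0)
doubled = monthly_cost_usd(2_000, 2_000, 500, 3.0, 15.0)
print(f"1,000 requests/day: ${base:,.2f}")
print(f"2,000 requests/day: ${doubled:,.2f}")
```

Double the traffic and the bill doubles with it. There is no fixed cost to amortize, which is the opposite of owning the hardware.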

The open weights

Open models ship the weights. Download them, run them, fine-tune them, bake them into a product. Nobody hands you a permission slip, and nobody can take it back.

Here's the 2026 lineup, in rough order of how often you'll actually hear the names.

Llama 4 (Meta). Maverick is 400B total / 17B active (Mixture-of-Experts) with a 1M-token context. Scout is the smaller variant, and the one with the 10M-token context. Llama is the default open model: best documented, best supported in every runtime.
Qwen 3.5 (Alibaba). Ships dense (0.5B to 32B) and MoE (235B-A22B, 397B-A17B). Qwen punches well above its weight on reasoning and coding benchmarks. The dense small models are what you actually run on a laptop.
DeepSeek V4 (DeepSeek). 1.6T total / 49B active, with Hybrid Compressed Sparse Attention that cuts inference FLOPs 27% vs V3.2 at 1M tokens. The reasoning leader on the open side. Each release has dragged the price floor down for everyone.
Mistral Large 3 (Mistral). 675B / 41B MoE, European, strong on European languages and code.
Gemma 4 (Google). Smaller, dense, built for on-device. The one you reach for when you want open weights without the MoE complexity.
Kimi K2.6 (Moonshot). Highest-ranked open-weights model on the Intelligence Index right now, at 54. Worth knowing. Not the default yet.

What "B" actually means

When you see "Llama 4 Maverick: 400B / 17B", those two numbers tell different stories.

The first is total parameters: every weight in the model file. That sets your storage and your VRAM floor. 400B parameters at FP16 is 800GB on disk. Quantized to 4-bit it drops to around 200GB. Either way, that file is not loading on a 16GB laptop.
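The arithmetic behind those figures is just parameters times bytes per parameter. A minimal sketch, where `weight_size_gb` is a hypothetical helper and 1 GB is taken as 10^9 bytes:

```python
def weight_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Size of the raw weights in GB (1 GB = 1e9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return params_billion * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"400B at {bits}-bit: {weight_size_gb(400, bits):.0f} GB")
# 400B at 16-bit: 800 GB
# 400B at 8-bit: 400 GB
# 400B at 4-bit: 200 GB
```

This counts weights only. Real quantized files carry some extra metadata, and running the model needs additional memory on top.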

The second is active parameters: how many of those weights actually fire on any given token. Mixture-of-Experts models route each token through a small slice of the network, not the whole thing. 17B active means the cost-per-token math behaves like a 17B dense model, even though the file weighs 400B.

The rule of thumb:

Total parameters set the VRAM floor. You have to fit the file.
Active parameters set inference cost and speed. Only those compute on each token.
Dense models have total = active. MoE models split the two apart.

This is why DeepSeek V4 at 1.6T total / 49B active needs the memory of a 1.6T model but pays roughly the per-token compute of a 49B dense one. MoE is the default at frontier scale in 2026 for exactly this reason: it decouples model size from per-token cost.
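A rough sketch of the split between memory and compute. It leans on a common approximation, about 2 FLOPs per active parameter per generated token, which ignores attention and other overhead. `ModelSpec` is a hypothetical helper, and the dense model is an imaginary comparison point rather than a real release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    total_b: float   # total parameters, in billions
    active_b: float  # parameters used per token, in billions

    def weights_gb(self, bits: int = 4) -> float:
        """Memory for the weights alone: set by TOTAL parameters."""
        return self.total_b * bits / 8

    def gflops_per_token(self) -> float:
        """Approximate forward-pass compute: set by ACTIVE parameters."""
        return 2 * self.active_b  # assumption: ~2 FLOPs per active parameter


moe = ModelSpec("DeepSeek V4 (figures from this post)", 1600, 49)
dense = ModelSpec("imaginary 1.6T dense model", 1600, 1600)

for m in (moe, dense):
    print(f"{m.name}: {m.weights_gb():.0f} GB at 4-bit, "
          f"~{m.gflops_per_token():.0f} GFLOPs per token")

ratio = dense.gflops_per_token() / moe.gflops_per_token()
print(f"Same memory floor, about {ratio:.1f}x less compute per token for the MoE")
```

Both models need the same room for their weights. Only the compute per token differs, and that is the number that drives speed and serving cost.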

Sizes that matter on a laptop

Frontier models live in datacenters. The ones you'll actually run locally are smaller. Here are the useful tiers, with rough VRAM at 4-bit quantization.

0.5B to 3B. Tiny. Runs on anything, a phone included. Good for autocomplete, classification, structured extraction. Not for reasoning. (Gemma 4 1B, Qwen 3.5 0.5B/1.5B/3B.)
7B to 8B. The sweet spot for laptops. Around 5GB on disk, 6 to 8GB VRAM. Smart enough for most chat, drafting, summarization. (Llama 4 8B, Qwen 3.5 7B, Mistral 7B.)
13B to 32B. The serious local tier. 8 to 24GB VRAM. Noticeably better at multi-step reasoning. (Qwen 3.5 32B, Gemma 4 27B.)
70B class. Workstation territory. 40 to 50GB at 4-bit. Needs an M-series Mac with 64GB+ unified memory or a multi-GPU rig. (Llama 4 70B equivalent, DeepSeek-Distill 70B.)
MoE flagships (200B to 1.6T total). Server territory. You don't run these unless you've got a real machine.
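The tiers above can be turned into a quick "what fits my machine" check. This is a sketch under one loud assumption: a flat 20% overhead on top of the 4-bit weights for runtime buffers and a modest KV cache. The real overhead depends on the runtime and context length. The helper names and the list of sizes are illustrative.

```python
from typing import Optional, Sequence

OVERHEAD = 1.2  # assumption: 20% on top of the weights; varies in practice


def vram_needed_gb(params_billion: float, bits: int = 4) -> float:
    """Estimated VRAM for a dense model: quantized weights plus overhead."""
    return params_billion * bits / 8 * OVERHEAD


def largest_fit(
    vram_gb: float, sizes_b: Sequence[float] = (3, 8, 32, 70)
) -> Optional[float]:
    """Largest model size (in billions) whose estimate fits in vram_gb."""
    fitting = [s for s in sizes_b if vram_needed_gb(s) <= vram_gb]
    return max(fitting) if fitting else None


for vram in (8, 16, 24, 64):
    print(f"{vram} GB -> up to {largest_fit(vram)}B at 4-bit")
```

With these assumptions an 8B model estimates at about 4.8GB and a 70B model at about 42GB, which lands inside the ranges quoted in the tiers.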

What each is actually good at

Benchmarks are noisy. Here's the working version, the kind you only get from sitting with these models day to day.

Coding agents and tool use. Claude Opus 4.7, then GPT-5.5, then DeepSeek V4 on the open side. Anthropic's lead on SWE-bench is real, and you feel it in long agentic loops.
Pure reasoning and math. GPT-5.5 in high-reasoning mode, Gemini 3.1 Pro for scientific topics, DeepSeek V4 if you want open.
Cheap fast generation at volume. Haiku 4.5, GPT-5-mini, Qwen 3.5 7B locally. You're paying for throughput here, not depth.
Long context (1M+ tokens). Gemini 3.1 Pro is the production answer. Llama 4 Maverick is the open one.
Indian-language work. Qwen and Llama both do reasonable Hindi, Tamil, and Bengali. Gemini handles Indic scripts cleanly. None of them is great at Hinglish (code-mixed Hindi and English, often mixed-script) input. That's still a mess.

I wouldn't pick a model off a benchmark chart. I'd pick two or three from the list above, run them on my actual workload for a week, and keep the one that costs least to keep on staff.
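That bake-off can be a very small harness. The sketch below assumes you wrap each model in a plain function from prompt to text and write your own pass/fail check per case. `Candidate`, `Case`, and the stub models are hypothetical, and the per-call costs are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    generate: Callable[[str], str]  # hypothetical wrapper around an API or local runtime
    usd_per_call: float             # placeholder cost, not a real price


@dataclass(frozen=True)
class Case:
    prompt: str
    passes: Callable[[str], bool]   # your own check on the model's answer


def pass_rate(candidate: Candidate, cases: Sequence[Case]) -> float:
    """Fraction of workload cases whose output passes its check."""
    passed = sum(1 for case in cases if case.passes(candidate.generate(case.prompt)))
    return passed / len(cases)


def cheapest_good_enough(
    candidates: Sequence[Candidate], cases: Sequence[Case], min_pass_rate: float = 0.9
) -> Optional[Candidate]:
    """Cheapest candidate that clears the quality bar, or None if none do."""
    good = [c for c in candidates if pass_rate(c, cases) >= min_pass_rate]
    return min(good, key=lambda c: c.usd_per_call, default=None)


# Stub models so the sketch runs offline; swap in real clients.
cases = [
    Case("2+2", lambda out: "4" in out),
    Case("capital of France", lambda out: "Paris" in out),
]
big = Candidate("big", lambda p: "4" if "2+2" in p else "Paris", usd_per_call=0.02)
small = Candidate("small", lambda p: "4", usd_per_call=0.001)

best = cheapest_good_enough([big, small], cases)
print(best.name if best else "nothing passed")  # big
```

The selection rule is the point: set a quality bar from your own workload first, then let price break the tie. A real run needs a non-empty case list and far more than two cases.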

The pricing floor keeps falling

A useful frame: every six months, the model that was frontier-class becomes the price-floor model. Sonnet 4.6 today does what Opus 4 did eighteen months ago, at a fraction of the price. Whatever Anthropic ships next will tell the same story.

So here's what it means for anything you build. Don't tune your prompts for one specific model unless you genuinely have to. The model you target today will be obsolete by the time you ship, and the model two tiers below it will probably be free.
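One way to keep that option open is to put a thin interface between your application and whichever model sits behind it. A minimal sketch: `ChatModel`, `EchoModel`, and `summarize` are hypothetical names, and the stand-in backend just echoes its input so the example runs without any vendor SDK.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only surface the application is allowed to depend on."""

    def complete(self, system: str, user: str) -> str: ...


class EchoModel:
    """Stand-in backend for the sketch; a real one would call an API or a local runtime."""

    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, system: str, user: str) -> str:
        return f"[{self.label}] {user}"


def summarize(model: ChatModel, text: str) -> str:
    # Plain, model-neutral instructions: nothing here leans on one vendor's quirks.
    system = "Summarize the user's text in one sentence."
    return model.complete(system, text)


# Swapping the model is a one-line change at the call site.
print(summarize(EchoModel("model-a"), "MoE splits total from active parameters."))
print(summarize(EchoModel("model-b"), "MoE splits total from active parameters."))
```

When a cheaper tier catches up, you write one new adapter class and rerun your workload checks. The prompts and the application code stay put.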

What's next

This post sorted out who the models are. The next one drops down to the hardware you'd need to run them: what the CPU, GPU, RAM, and VRAM each do for inference, and why VRAM is the bottleneck almost every single time.

Tags: AI, AI Running, Benchmarks, LLM, Open Weights

From the dictionary

Terms used in this post

Quick reference for the 18 terms you met above. Each one comes from the AI dictionary.

Artificial Intelligence (AI): Umbrella term for software that performs tasks usually associated with human reasoning: language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader. Example: when a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
Attention (DL): The mechanism inside a transformer that lets the model look at every previous token in the sequence and weigh its relevance. Attention scales quadratically with sequence length, which is why long context is expensive.
Claude (AI): Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent. Example: this blog's create-post skill drafts inline using Claude.
Context Window (NLP): The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
Edge AI (AI): Running model inference on the device the data is captured on (phone, camera, sensor) rather than sending it to a server. Models are usually quantized and under 500M parameters. Latency is 5-50ms because there is no network in the loop. Powers Face ID, on-device speech recognition, doorbell person detection.
Fine-Tuning (ML): Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
Gemini (AI): Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps. Example: Gemini is Google's answer to ChatGPT, with native access to Search.
GPT (AI): OpenAI's family of large language models. The name stands for Generative Pre-trained Transformer. GPT-4 (2023) and successors are the most widely used closed-source LLMs in production.
GPU (General): A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s: GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
Inference (ML): Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
Large Language Model (AI): A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. Example: Claude is an LLM. It reads your message as tokens and generates a response one token at a time.
Model (ML): In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Parameters (ML): The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
Prompt (NLP): The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
Quantization (ML): Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
Token (NLP): The unit an LLM operates on, roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
VRAM (General): Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
Weights (ML): The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.

Series

AI Running

2 / 7 posts

