Picking a local model by task
The 2026 open leaders by task: coding (Qwen 2.5 Coder, DeepSeek-Coder), chat (Llama, Qwen, Mistral), small-model renaissance (Phi-3, Gemma 2), structured output, multimodal, embeddings.

There is no single "best local LLM" in 2026. There is a best model for coding, a best model for chat, a different one for vision, and another for embeddings. This post is the shortlist, by task, with honest notes on what each gets right.
This is post 5 of 13 in the Local LLMs series. After this one, you'll know which model to pull on day one for whatever you're trying to do.
How to read a model card
Before the shortlist, the meta-skill: reading a Hugging Face model card. Every card has the same eight things buried somewhere:
- Parameter count. "8B", "70B", "MoE 235B-A22B." From post 2.
- Base or instruct. Look for "Instruct," "Chat," "IT," or no suffix (likely base).
- License. Apache 2.0 and MIT are permissive and allow commercial use. The Llama license requires a separate agreement from Meta once a product passes 700M monthly active users. Some models (Gemma, some Mistral releases) have their own licenses. Read before shipping.
- Context length. "128K context," "1M context."
- Languages. "English-only," "multilingual," "29 languages."
- Training cutoff. What date the training data ends. Determines what the model "knows about."
- Benchmarks. MMLU, HumanEval, MATH, GPQA, IFEval. Take them with a grain of salt: overfitting to benchmarks is real.
- Available formats. Original safetensors, GGUF (community-quantized), MLX (Apple).
If the card answers all eight, the model is well-documented. If it doesn't, be cautious: undocumented models often have weird quirks.
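The eight-item checklist is easy to turn into a quick script. This is a minimal sketch: the field names and the `notes` dictionary are my own hypothetical labels (nothing Hugging Face exposes under these names), and you would fill the dictionary by hand while reading the card.

```python
# Hypothetical field names for the eight things to look for on a model card.
CARD_FIELDS = [
    "parameter_count",
    "base_or_instruct",
    "license",
    "context_length",
    "languages",
    "training_cutoff",
    "benchmarks",
    "formats",
]


def missing_card_fields(card: dict) -> list[str]:
    """Return the checklist items the card omits or leaves blank."""
    return [field for field in CARD_FIELDS if not card.get(field)]


# Notes taken by hand while reading a card (values are illustrative only).
notes = {
    "parameter_count": "8B",
    "base_or_instruct": "instruct",
    "license": "Apache 2.0",
    "context_length": "128K",
    "formats": ["safetensors", "GGUF"],
}
print(missing_card_fields(notes))
# → ['languages', 'training_cutoff', 'benchmarks']
```

Three gaps out of eight is the kind of result that should make you slow down before building on the model.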
Coding

The category where 2026 open models actually rival closed ones for many tasks.
- Qwen 2.5 Coder. The current open coding leader. Sizes from 0.5B to 32B. The 7B punches well above its weight; the 32B is genuinely good at multi-file edits.
- DeepSeek-Coder V3. Strong on benchmarks, especially for less common languages (Rust, Go, Haskell). Larger sizes need real hardware.
- Codestral 22B (Mistral). Mistral's coding-specific model. Excellent at fill-in-the-middle (the autocomplete pattern). Strong on European languages and SQL.
- Llama 3.x base + your own fine-tune. If you have unusual codebases, fine-tuning a Llama base on your repo can outperform any general-purpose coder. Niche but powerful.
For autocomplete on a 16GB Mac, Qwen 2.5 Coder 7B Q4 is the easy default. For a 24GB GPU, the 14B variant is sharper. For a 64GB+ machine, the 32B feels like a real coding assistant.
General chat and reasoning
The "I just want a smart assistant" category.
- Llama 3.x Instruct. The default. Sizes 1B, 3B, 8B, 70B. Best documentation, every runtime supports it, lots of fine-tunes available.
- Qwen 2.5 Instruct. The strongest open model in many benchmarks at the same parameter count. Multilingual is real (29 languages). Underrated.
- Mistral Small / Medium / Large. Strong reasoning, particularly good at following structured instructions.
- DeepSeek V4. When you need the best reasoning available in open weights and you have the hardware to run it.
For a 16GB MacBook Air, pick Llama 3.2 3B Instruct or Qwen 2.5 3B Instruct. For a 24GB+ Mac or PC, Qwen 2.5 14B is the sweet spot.
The small-model renaissance
The 2024–2026 progression in small models is the part most people missed. A 3B model in 2024 was a toy. A 3B model in 2026 is a real assistant. Three families to know:
- Phi-3 / Phi-3.5 (Microsoft). Phi-3 mini is 3.8B, distilled from synthetic data generated by GPT-class models. Punches at the level of much larger models on reasoning. Not great at "knowledge" (it forgets facts) but excellent at thinking.
- Gemma 2 (Google). Sizes 2B, 9B, 27B. Distilled from Gemini. The 2B is the smallest model I'd recommend for chat; the 9B is competitive with much bigger models on many tasks.
- Qwen 2.5 small variants. 0.5B, 1.5B, 3B. Useful for classification and structured extraction at a size that runs on a phone.
If your hardware is modest, you don't need to settle for a worse experience. The small models in 2026 are good. Phi-3 mini on a 5-year-old laptop is a real assistant.
Summarization
Most "general chat" models do this fine. Where it gets interesting is long-document summarization, where context length matters more than parameter count.
- Qwen 2.5 7B / 14B with extended context. Solid quality up to 128K tokens.
- Llama 4 Scout (when it lands locally). 1M context, designed for long documents.
- Phi-3.5-MoE. Smaller but excellent at extracting key points from long inputs.
For summarizing a 100K-token document on a 16GB Mac, Qwen 2.5 7B at 32K context with chunking is the practical answer. Don't try to load the full 128K context unless you have headroom.
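"Chunking" here usually means a map-reduce pass: summarize each piece, then summarize the summaries. The sketch below shows one way to do it, under stated assumptions: token counts are estimated at roughly 4 characters per token rather than measured with a real tokenizer, the 24,000-token default is my guess at leaving headroom inside a 32K window, and `summarize` is a hypothetical callable that wraps whatever model you run.

```python
from typing import Callable

CHARS_PER_TOKEN = 4  # rough English average; an estimate, not a tokenizer


def chunk_text(text: str, max_tokens: int) -> list[str]:
    """Split text into chunks of at most max_tokens (estimated), on paragraph breaks."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if not para.strip():
            continue  # skip empty paragraphs
        # Hard-split any single paragraph that is too long on its own.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) > max_chars:
            chunks.append(current)  # current is non-empty here
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def summarize_long(
    text: str, summarize: Callable[[str], str], max_tokens: int = 24_000
) -> str:
    """Summarize each chunk, then summarize the joined partial summaries."""
    partials = [summarize(chunk) for chunk in chunk_text(text, max_tokens)]
    if not partials:
        return ""
    if len(partials) == 1:
        return partials[0]
    return summarize("\n\n".join(partials))
```

Splitting on paragraph breaks keeps each chunk readable on its own, which tends to matter more for summary quality than packing every chunk to the limit.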
Structured output (JSON, function calls)
Where open models historically struggled. They've gotten better.
- Qwen 2.5 Instruct. Probably the best open model at producing valid JSON on the first try.
- Hermes 3 (Llama-based). Trained specifically for tool use and structured output. Beats vanilla Llama Instruct on these tasks.
- Mistral models with constrained generation. llama.cpp supports grammar-constrained output, which forces any model to produce valid JSON. Slower but reliable.
If you need 100% valid JSON, use grammar-constrained generation. If you need 95% valid JSON and want speed, use Qwen 2.5 14B with a clear system prompt and a one-shot example.
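For the "95% and fast" route, the usual safety net is to parse the reply and ask again when parsing fails. A minimal sketch, assuming a hypothetical `generate` callable that wraps your runtime and returns the model's raw text; it is not a replacement for grammar-constrained generation, only a cheap guard around an unconstrained model.

```python
import json
from typing import Callable, Optional


def generate_json(
    prompt: str, generate: Callable[[str], str], max_attempts: int = 3
) -> Optional[dict]:
    """Call the model until its reply parses as a JSON object, or give up."""
    for _ in range(max_attempts):
        reply = generate(prompt)
        try:
            parsed = json.loads(reply)
        except json.JSONDecodeError:
            continue  # invalid JSON: try again
        if isinstance(parsed, dict):
            return parsed  # valid JSON and the shape we asked for
    return None  # caller decides what to do after repeated failures
```

Returning `None` instead of raising keeps the failure case explicit at the call site. If you see it often, that is the signal to switch to constrained generation.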
Multimodal (vision)
Models that take images as input alongside text. The 2026 open leaders:
- Qwen 2.5 VL. Sizes 3B, 7B, 72B. The best open vision-language model in 2026, with strong OCR for Asian scripts.
- LLaVA-Next. The community-favorite vision wrapper around various base models. Less polished than Qwen-VL but more flexible.
- Pixtral 12B (Mistral). Mistral's vision model. Good for Western languages and document understanding.
Local vision models eat memory: you're loading the language model plus the vision encoder plus the image features in the KV cache. A 7B vision model needs roughly 12 GB at Q4 in practice.
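The arithmetic behind that figure is worth doing once. This is a rough sketch, assuming an idealized 4 bits per weight (real Q4 formats store somewhat more than that); the function name is mine, and the 12 GB total is simply the figure quoted above, not something the code derives.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone: parameters x bits / 8 bytes."""
    return params_billions * bits_per_weight / 8


weights = weight_memory_gb(7, 4)
print(f"{weights:.1f} GB")  # → 3.5 GB

# Whatever is left of the ~12 GB budget goes to the vision encoder,
# the KV cache (including image features), and runtime overhead.
remainder = 12 - weights
print(f"{remainder:.1f} GB")  # → 8.5 GB
```

If the 12 GB figure holds, the language-model weights are the smaller part of the bill, which is why a vision model can fail to load on a machine that runs a text-only model of the same size comfortably.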
Embeddings
A different kind of model: input is text, output is a vector of numbers used for similarity search. We cover RAG fully in post 10; for now, just know which embedding models to use.
- nomic-embed-text v1.5. The current open default. Small (137M), fast, good quality. Apache 2.0 licensed.
- bge-large-en-v1.5. Higher quality than nomic, larger (335M). Pick for English-heavy use cases.
- bge-m3. Multilingual, good for Hindi and other non-Latin scripts.
These run on essentially any hardware. They're meant to be paired with a chat model, not used standalone.
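"Similarity search" over embeddings usually comes down to cosine similarity between vectors. The sketch below uses hand-made 3-dimensional toy vectors purely for illustration; real embedding models return vectors with hundreds of dimensions, and the document names and values here are invented.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def rank(query: list[float], docs: dict[str, list[float]]) -> list[str]:
    """Return document names, most similar to the query first."""
    return sorted(
        docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True
    )


# Toy vectors standing in for real embeddings.
docs = {
    "gpu-guide": [0.9, 0.1, 0.0],
    "pasta-recipe": [0.0, 0.2, 0.9],
}
print(rank([1.0, 0.0, 0.1], docs))
# → ['gpu-guide', 'pasta-recipe']
```

In a real setup the embedding model produces the vectors and a vector store does the ranking, but the comparison underneath is this one.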
Audio (speech-to-text and back)
Not always covered in "LLM" discussions, but worth knowing:
- Whisper Large v3 (OpenAI, open weights). The transcription standard. Runs locally via whisper.cpp.
- Parakeet (NVIDIA). Newer, faster on English. Open weights, growing adoption.
- Piper / Coqui TTS. Open text-to-speech for the other direction.
For a fully-local "talk to your AI" stack: Whisper for input, an LLM in the middle, Piper for output. All three run comfortably on a 16GB Mac.
A working starter set
If you're just beginning local LLMs, three models cover most needs:
# pull a small all-rounder for chat
ollama pull llama3.2:3b
# pull a coding model
ollama pull qwen2.5-coder:7b
# pull an embedding model for rag later
ollama pull nomic-embed-text
Total disk: under 10 GB. Total VRAM at runtime: 6–8 GB. Runs on every machine in this series. You can be productive with this set for weeks before needing anything else.
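Once a model is pulled, you can talk to it from code through Ollama's local HTTP API. A minimal sketch using only the standard library, assuming Ollama is running on its default port (11434) on the same machine; the helper names `build_request` and `ask` are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally pulled model."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (needs Ollama running)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the starter set pulled, `ask("llama3.2:3b", "Say hello in five words.")` should return a short reply. The generous timeout is there because the first call also loads the model into memory.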
What's next
Now you know which model to want. The next post is whether your machine can actually run it: Mac, Windows, Linux, and what each OS demands of a local-LLM setup.
From the dictionary
Terms used in this post
Quick reference for the 20 terms you met above. Each one comes from the AI dictionary.
- Artificial Intelligence (AI). Umbrella term for software that performs tasks usually associated with human reasoning: language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader. For example, when a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
- Context Window (NLP). The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
- Dataset (Data). The collection of examples a model learns from during training. The shape, size, quality, and bias of the dataset determine almost everything about the resulting model.
- Embedding (NLP). A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
- Fine-Tuning (ML). Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
- Gemini (AI). Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps. For example, Gemini is Google's answer to ChatGPT, with native access to Search.
- GGUF (ML). GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
- GPT (AI). OpenAI's family of large language models; the name stands for Generative Pre-trained Transformer. GPT-4 (2023) and successors are the most widely used closed-source LLMs in production.
- GPU (General). A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s: GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
- llama.cpp (AI). A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
- Large Language Model (AI). A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. For example, Claude is an LLM: it reads your message as tokens and generates a response one token at a time.
- Model (ML). In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
- Parameters (ML). The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
- Quantization (ML). Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
- RAG (NLP). Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLM's context window, then ask the question. The model's weights are unchanged; only the prompt is augmented.
- System Prompt (NLP). A special instruction at the start of an LLM conversation that sets role, behaviour, format, and constraints. Most "the model isn't doing what I want" problems are solved here, before reaching for RAG or fine-tuning.
- Token (NLP). The unit an LLM operates on, roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
- Training (ML). The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
- VRAM (General). Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
- Weights (ML). The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.