April 17, 2026 · 6 min read

Picking a local model by task

The 2026 open leaders, sorted by what you actually want to do: coding, chat, the small-model crowd, structured output, vision, embeddings, and audio.

Stop looking for the one best local LLM. It doesn't exist in 2026. There's a best model for coding. A different best for chat. Another for vision, another again for embeddings. What you actually need is a shortlist sorted by job, with honest notes on what each one is good at and where it falls down. That's this post.

This is post 5 of 13 in the Local LLMs series. By the end you'll know exactly which model to pull on day one for whatever you're trying to do.

How to read a model card

Quick skill first, because it saves you from a lot of bad pulls: reading a Hugging Face model card. They all bury the same eight facts somewhere on the page.

Parameter count. "8B", "70B", "MoE 235B-A22B." We covered why this matters back in post 2.
Base or instruct. Look for "Instruct," "Chat," or "IT." No suffix usually means base.
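License. Apache 2.0 and MIT are permissive: commercial use is fine as long as you keep the license and copyright notices. The Llama license requires a separate license from Meta once a product passes 700M monthly active users. Gemma ships under its own terms, and Mistral uses Apache 2.0 for some models and its own licenses for others. Read these before you ship anything.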
Context length. "128K context," "1M context."
Languages. "English-only," "multilingual," "29 languages."
Training cutoff. The date the training data stops. This is what the model "knows about."
Benchmarks. MMLU, HumanEval, MATH, GPQA, IFEval. Take them with salt. Overfitting is real.
Available formats. Original safetensors, GGUF (community-quantized), MLX (Apple).

Card answers all eight? Well-documented model, pull with confidence. Card skips half of them? Be careful. Undocumented models tend to come with weird quirks you only find at 2am.

Coding

This is the category where 2026 open models genuinely rival the closed ones for a lot of real work.

Qwen 2.5 Coder. The open coding leader right now. Sizes run 0.5B to 32B. The 7B punches well above its weight, and the 32B is actually good at multi-file edits.
DeepSeek-Coder V3. Strong on benchmarks, especially for the less common languages like Rust, Go, and Haskell. The bigger sizes want real hardware.
Codestral 22B (Mistral). Mistral's coding-specific model. Great at fill-in-the-middle, which is the autocomplete pattern. Strong on European languages and SQL too.
Llama 3.x base plus your own fine-tune. Got an unusual codebase? Fine-tuning a Llama base on your own repo can beat any general-purpose coder. Niche, but powerful when it fits.

On a 16GB Mac, Qwen 2.5 Coder 7B Q4 is the easy default for autocomplete. On a 24GB GPU, the 14B is sharper. On a 64GB+ machine, the 32B finally feels like a real coding assistant sitting next to you.
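
To sanity-check those pairings on your own hardware, a back-of-envelope estimate helps. The sketch below assumes the weights dominate memory and that a Q4 file averages about 4.5 bits per weight. That figure is an assumption, since real quantization mixes vary. The estimate ignores the KV cache and runtime overhead, which need headroom on top.

```python
# Rough weight-memory estimate for a quantized model.
# ASSUMPTION: these bits-per-weight values are ballpark averages, not exact
# figures for any specific GGUF file. KV cache and runtime overhead are NOT included.
BITS_PER_WEIGHT = {"fp16": 16.0, "q8": 8.5, "q4": 4.5}


def weight_gb(params_billion: float, quant: str = "q4") -> float:
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9


if __name__ == "__main__":
    for size in (7, 14, 32):
        print(f"{size}B at Q4: ~{weight_gb(size):.1f} GB of weights")
```

By this estimate the 7B lands near 4 GB, the 14B near 8 GB, and the 32B near 18 GB of weights alone, which is why the 32B wants a machine with real headroom once context and the rest of your system are counted.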

General chat and reasoning

The "I just want a smart assistant" bucket.

Llama 3.x Instruct. The default. Sizes 1B, 3B, 8B, 70B. Best docs, every runtime supports it, and there are fine-tunes for days.
Qwen 2.5 Instruct. The strongest open model on a lot of benchmarks at the same parameter count. The multilingual claim is real, 29 languages. Genuinely underrated.
Mistral Small / Medium / Large. Strong reasoning, and particularly good at following structured instructions.
DeepSeek V4. Reach for this when you want the best open-weights reasoning available and you've got the hardware to feed it.

For a 16GB MacBook Air, go Llama 3.2 3B Instruct or Qwen 2.5 3B Instruct. For 24GB+ on Mac or PC, Qwen 2.5 14B is the sweet spot.

The small-model renaissance

Here's the part most people slept through. The jump in small models from 2024 to 2026 is wild. A 3B model in 2024 was a toy. A 3B model in 2026 is a real assistant. Three families worth knowing:

Phi-3 / Phi-3.5 (Microsoft). Phi-3 mini is 3.8B, trained on heavily filtered web data plus synthetic data generated by larger models. It reasons well above its size. It's weak on factual recall, but excellent at thinking through a problem.
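Gemma 2 (Google). Sizes 2B, 9B, 27B. The 2B and 9B were trained with knowledge distillation from a larger teacher model, and the family builds on the research behind Gemini. The 2B is the smallest model I'd actually recommend for chat, and the 9B holds its own against far bigger models on plenty of tasks.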
Qwen 2.5 small variants. 0.5B, 1.5B, 3B. Handy for classification and structured extraction at a size that runs on a phone.

If your hardware is modest, you don't have to settle for a worse experience. The small models in 2026 are good. Phi-3 mini on a five-year-old laptop is a real assistant, not a party trick.

Summarization

Most "general chat" models handle this fine. It gets interesting with long documents, where context length starts to matter more than parameter count.

Qwen 2.5 7B / 14B with extended context. Solid quality up to 128K tokens.
Llama 4 Scout (once it lands locally). Advertised context of up to 10M tokens, built for long documents.
Phi-3.5-MoE. Small in active parameters (about 6.6B of 42B total), but excellent at pulling key points out of long inputs.

Summarizing a 100K-token document on a 16GB Mac? The practical answer is Qwen 2.5 7B at 32K context with chunking. Don't try to load the full 128K unless you've got real headroom to spare.
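
The chunking approach can be sketched roughly as follows. It assumes about 4 characters per token for English, which is a crude heuristic rather than a real tokenizer, and `summarize` is a hypothetical callable standing in for whatever local model call you use. The default 24K-token chunk is an illustrative choice that leaves room for the prompt and the output inside a 32K window.

```python
from typing import Callable, List

CHARS_PER_TOKEN = 4  # ASSUMPTION: rough English average, not a real tokenizer


def chunk_text(text: str, max_tokens: int = 24_000, overlap_tokens: int = 500) -> List[str]:
    """Split text into overlapping character windows sized by a token budget."""
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    size = max_tokens * CHARS_PER_TOKEN
    step = (max_tokens - overlap_tokens) * CHARS_PER_TOKEN
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def summarize_long(text: str, summarize: Callable[[str], str], **chunk_kwargs) -> str:
    """Map-reduce: summarize each chunk, then summarize the joined summaries."""
    partials = [summarize(chunk) for chunk in chunk_text(text, **chunk_kwargs)]
    if len(partials) <= 1:
        return partials[0] if partials else ""
    return summarize("\n\n".join(partials))
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks.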

Structured output (JSON, function calls)

Open models used to be bad at this. They've come a long way.

Qwen 2.5 Instruct. Probably the best open model at producing valid JSON on the first try.
Hermes 3 (Llama-based). Trained specifically for tool use and structured output. Beats vanilla Llama Instruct on these.
Mistral models with constrained generation. llama.cpp supports grammar-constrained output, which forces any model to produce valid JSON. Slower, but reliable.

Need 100% valid JSON? Use grammar-constrained generation, full stop. Happy with 95% and want the speed? Qwen 2.5 14B with a clear system prompt and one example to copy.
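
For the 95% path, a common pattern is to wrap the call in a validate-and-retry loop. The sketch below is one way to do that. `generate` is a hypothetical callable wrapping your model call, and the required-keys check is an illustrative stand-in for real schema validation.

```python
import json
from typing import Callable, Iterable, Optional


def parse_json_reply(reply: str, required_keys: Iterable[str] = ()) -> Optional[dict]:
    """Return the parsed object if reply is a JSON object with the required keys, else None."""
    try:
        data = json.loads(reply.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required_keys):
        return None
    return data


def generate_json(generate: Callable[[str], str], prompt: str,
                  required_keys: Iterable[str] = (), max_attempts: int = 3) -> dict:
    """Call the model until it returns valid JSON, or raise after max_attempts."""
    required = tuple(required_keys)  # materialize once so retries can reuse it
    for _ in range(max_attempts):
        data = parse_json_reply(generate(prompt), required)
        if data is not None:
            return data
    raise ValueError(f"no valid JSON after {max_attempts} attempts")
```

Retries cost extra generations, so this only pays off when failures are rare. If they aren't, that's the signal to switch to grammar-constrained output.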

Multimodal (vision)

Models that take images as input alongside text. The 2026 open leaders:

Qwen 2.5 VL. Sizes 3B, 7B, 72B. The best open vision-language model in 2026, with strong OCR for Asian scripts.
LLaVA-Next. The community-favorite vision wrapper around various base models. Less polished than Qwen-VL, but more flexible.
Pixtral 12B (Mistral). Mistral's vision model. Good for Western languages and document understanding.

Heads up: local vision models eat memory. You're loading the language model, plus the vision encoder, plus the image features sitting in the KV cache. A 7B vision model needs roughly 12 GB at Q4 in practice.

Embeddings

A different animal. Input is text, output is a vector of numbers you use for similarity search. We hit RAG properly in post 10. For now, just know which embedding models to grab.

nomic-embed-text v1.5. The current open default. Small (137M), fast, good quality. Apache 2.0 licensed.
bge-large-en-v1.5. Higher quality than nomic, bigger at 335M. Pick it for English-heavy work.
bge-m3. Multilingual, good for Hindi and other non-Latin scripts.

These run on basically any hardware. They're meant to pair with a chat model, not work alone.
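
To make "similarity search" concrete, here is a minimal sketch of the comparison step in plain Python. It uses cosine similarity, a common choice for comparing embeddings. The function names are illustrative, and the vectors would come from whichever embedding model you run.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec: Sequence[float],
                       doc_vecs: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Score every document against the query, best match first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Real embeddings have hundreds of dimensions, and at scale a vector index replaces the linear scan, but the scoring idea is the same.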

Audio (speech-to-text and back)

Not always lumped into "LLM" talk, but worth keeping on your radar:

Whisper Large v3 (OpenAI, open weights). The transcription standard. Runs locally via whisper.cpp.
Parakeet (NVIDIA). Newer, faster on English. Open weights, picking up adoption fast.
Piper / Coqui TTS. Open text-to-speech for going the other way.

Want a fully-local "talk to your AI" stack? Whisper for input, an LLM in the middle, Piper for output. All three run comfortably on a 16GB Mac.

A working starter set

Just getting going with local LLMs? Three models cover most of what you'll want.

# pull a small all-rounder for chat
ollama pull llama3.2:3b

# pull a coding model
ollama pull qwen2.5-coder:7b

# pull an embedding model for RAG later
ollama pull nomic-embed-text

Total disk: under 10 GB. Total VRAM at runtime: 6 to 8 GB. This runs on every machine in this series, and you can stay productive with it for weeks before you need anything heavier.
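
Once the pulls finish, a quick smoke test from Python confirms the models respond. This sketch assumes Ollama is serving on its default local address, http://localhost:11434, and uses its /api/generate endpoint with streaming turned off. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # ASSUMPTION: default Ollama address


def build_payload(model: str, prompt: str) -> bytes:
    """Encode a non-streaming generate request as JSON bytes."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")


def ask(model: str, prompt: str, url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """Send one prompt to a locally served model and return its text reply."""
    request = urllib.request.Request(
        url,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, calling ask("llama3.2:3b", "Say hello in five words.") should return a short reply. The first call to each model is slower because the weights have to load.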

So now you know which model to want. The next post asks the harder question: can your machine actually run it? Mac, Windows, Linux, and what each one demands from a local-LLM setup.

From the dictionary

Terms used in this post

Quick reference for the 20 terms you met above. Each one comes from the AI dictionary.

Artificial Intelligence (AI): Umbrella term for software that performs tasks usually associated with human reasoning: language, perception, decision-making. Coined for the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader. Example: when a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
Context Window (NLP): The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
Dataset (Data): The collection of examples a model learns from during training. The shape, size, quality, and bias of the dataset determine almost everything about the resulting model.
Embedding (NLP): A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
Fine-Tuning (ML): Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
Gemini (AI): Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps. Example: Gemini is Google's answer to ChatGPT, with native access to Search.
GGUF (ML): GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
GPT (AI): OpenAI's family of large language models. The name stands for Generative Pre-trained Transformer. GPT-4 (2023) and successors are the most widely used closed-source LLMs in production.
GPU (General): A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s: GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
llama.cpp (AI): A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language Model (AI): A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. Example: Claude is an LLM; it reads your message as tokens and generates a response one token at a time.
Model (ML): In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Parameters (ML): The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
Quantization (ML): Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
RAG (NLP): Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLM's context window, then ask the question. The model's weights are unchanged; only the prompt is augmented.
System Prompt (NLP): A special instruction at the start of an LLM conversation that sets role, behaviour, format, and constraints. Most "the model isn't doing what I want" problems are solved here, before reaching for RAG or fine-tuning.
Token (NLP): The unit an LLM operates on, roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
Training (ML): The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
VRAM (General): Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
Weights (ML): The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
