Every machine can run a local LLM (here's what fits)
Per-tier guide: 8GB integrated graphics, 16GB MacBook Air, 8/12/16/24/32GB VRAM PCs, 24/32/64/128/192GB Macs. Specific models, specific tok/s, specific configs. Every tier runs something useful.

Tell me your machine and I'll tell you what runs. This post is the per-tier guide: every common laptop and PC class in 2026, with specific models that work, specific configurations that work, and a real "what you can do today" answer for each.
This is post 7 of 13 in the Local LLMs series. The previous post sorted out OS prerequisites. This one is the hardware match-up.
How to read this guide

For every tier, three things:
- What you can do today. The model and config I'd actually run on this machine.
- Comfortable workload. Not pushing limits, runs alongside other apps.
- Stretch. What's possible if you close everything else and accept slower performance.
Each tier is good. There is no "you'd better upgrade." The smaller machines run smaller models that are still genuinely useful in 2026.
Tier 1: 8 GB RAM, integrated graphics laptop
Old laptops, cheap Chromebooks running Linux, entry-level Windows machines. No discrete GPU.
What you can do today. Phi-3 mini (3.8B) at Q4. About 2.2 GB on disk, 2.5 GB or more at runtime depending on context length. Runs at 4–10 tok/s on CPU. Slow by 4090 standards, but perfectly usable for drafting, summarization, and classification.
Comfortable workload.
- Phi-3 mini Q4 for chat and short-form text.
- Llama 3.2 1B for ultra-fast classification (12+ tok/s on CPU).
- Qwen 2.5 0.5B for embeddings or extreme-speed structured extraction.
Stretch. Llama 3.2 3B Q4 (~2 GB). Slow at maybe 3–6 tok/s but it works.
Setup. Ollama on Linux or Windows native. CPU backend.
Tier 2: 16 GB MacBook Air (M1/M2/M3/M4)
The cheapest "real local LLM" machine on the market. Unified memory means the GPU sees about 12 GB of usable space.
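The model sizes quoted throughout this guide follow roughly from parameter count times bits per weight. Here is a minimal sketch of that arithmetic, checked against the 12 GB budget above. The 4.5 bits per weight for Q4 and the flat 1.5 GB allowance for KV cache and runtime are assumptions for illustration, not measured values, and the function names are hypothetical.

```python
# Rough "does it fit?" arithmetic. Every constant here is an assumption.

def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         budget_gb: float, overhead_gb: float = 1.5) -> bool:
    """True if weights plus a flat allowance for KV cache/runtime fit the budget."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= budget_gb


if __name__ == "__main__":
    budget = 12.0  # GPU-usable memory on a 16 GB Air, per the text
    q4_bits = 4.5  # assumed average bits per weight for a Q4 quant
    for label, params in [("3B", 3.0), ("7B", 7.0), ("8B", 8.0), ("14B", 14.0)]:
        size = weights_gb(params, q4_bits)
        print(f"{label} at Q4: ~{size:.1f} GB, fits in {budget:.0f} GB: "
              f"{fits(params, q4_bits, budget)}")
```

Estimates from this formula tend to land a little under real file sizes, since quantized formats usually keep some tensors at higher precision. Treat the result as a first filter, then check the actual download size.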
What you can do today. Llama 3.2 3B Q4 (~2 GB) for general chat at 35 tok/s. Qwen 2.5 Coder 7B Q4 (~5 GB) for coding at 25 tok/s. Both feel fluid in chat UIs.
Comfortable workload.
- Llama 3.2 3B for everything.
- Qwen 2.5 Coder 7B for code.
- Phi-3 medium 14B Q3 (around 7 GB) for harder reasoning, ~12 tok/s.
- nomic-embed-text 137M alongside any of the above for RAG.
Stretch. Llama 3.1 8B Q4 (~5 GB) at 18–22 tok/s. Mistral 7B Q4. With Q8 KV cache enabled, you can run these at 32K context comfortably.
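To see why KV cache quantization matters at 32K context, it helps to work out the cache size. The sketch below uses the standard formula (keys and values, for every layer, KV head, and token). The Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) is taken from the published model config as I understand it, and treating Q8 as exactly 1 byte per value is a simplification.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: float) -> float:
    """KV cache size in GiB: 2 tensors (K and V) per layer, per KV head, per token."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 2**30


# Assumed shape for Llama 3.1 8B (grouped-query attention with 8 KV heads).
LLAMA_3_1_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    ctx = 32 * 1024
    f16 = kv_cache_gib(**LLAMA_3_1_8B, context_tokens=ctx, bytes_per_value=2)
    q8 = kv_cache_gib(**LLAMA_3_1_8B, context_tokens=ctx, bytes_per_value=1)
    print(f"32K context, FP16 cache: {f16:.1f} GiB")
    print(f"32K context, Q8 cache:   {q8:.1f} GiB (approximate)")
```

Under these assumptions the FP16 cache at 32K is about 4 GiB on top of the weights, and Q8 roughly halves it, which is the difference between fitting and not fitting on a 16 GB machine.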
Setup. Ollama via brew. Metal backend, automatic.
Tier 3: 16 GB RAM PC with 8 GB VRAM (RTX 3060, 4060, 4060 Laptop)
The mid-budget gaming or work laptop. Discrete NVIDIA GPU is the differentiator.
What you can do today. Qwen 2.5 7B Q4 (~5 GB) entirely in VRAM at 50–80 tok/s on a desktop 4060, 35–55 on a laptop. Real coding speed.
Comfortable workload.
- Llama 3.1 8B Q4 in VRAM. ~60 tok/s on a 4060 Ti.
- Qwen 2.5 Coder 7B for coding work.
- Phi-3 medium 14B Q3 with partial CPU offload (~25 tok/s).
Stretch. 13B-class models at Q4 with some CPU offload. Expect tok/s to drop to 15–20, but quality goes up.
Setup. Ollama or LM Studio on Windows / Linux. NVIDIA driver + CUDA backend.
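Partial CPU offload means some transformer layers live in VRAM and the rest run from system RAM. Here is a rough sketch of how the split works out, assuming all layers are the same size and reserving a fixed slice of VRAM for the KV cache and driver. The 40-layer count and the 1.5 GB reserve are illustrative assumptions, and the helper name is hypothetical.

```python
import math


def gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
               reserve_gb: float = 1.5) -> int:
    """Number of layers that fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


if __name__ == "__main__":
    # A ~7 GB model with an assumed 40 layers on an 8 GB card.
    on_gpu = gpu_layers(model_gb=7.0, n_layers=40, vram_gb=8.0)
    print(f"{on_gpu} of 40 layers on GPU, {40 - on_gpu} on CPU")
```

Ollama normally picks the split automatically. A number like this is mainly useful if you set it by hand, for example through llama.cpp's `--n-gpu-layers` flag or Ollama's `num_gpu` option.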
Tier 4: 24 GB MacBook Pro M-series
The default Pro tier. About 18 GB GPU-addressable.
What you can do today. Qwen 2.5 14B Q4 (~9 GB) at 25–35 tok/s. Real reasoning improvement over 8B class.
Comfortable workload.
- Llama 3.1 8B Q5 or Q8 for chat (sharper than Q4).
- Qwen 2.5 Coder 14B for serious code work.
- Gemma 2 9B Q5.
- Multiple smaller models loaded simultaneously (chat + embeddings + classifier).
Stretch. Llama 3.1 70B Q2 (~24 GB). That is more than the ~18 GB the GPU can address on this machine, so expect it to be very tight and slow, 5–8 tok/s at best. Better: Qwen 2.5 32B Q3 (~14 GB) at 12–15 tok/s.
Tier 5: 12 GB VRAM PC (RTX 4070, 3060 12GB, 4070 Laptop)
The popular mid-tier desktop / laptop GPU.
What you can do today. Qwen 2.5 14B Q4 entirely in VRAM (~9 GB) at 45–60 tok/s. No CPU offload required.
Comfortable workload.
- Llama 3.1 8B Q5_K_M at 70+ tok/s.
- Qwen 2.5 Coder 14B Q4.
- Mistral Small 22B Q3 with light CPU offload (~25 tok/s).
- Vision: Qwen 2.5 VL 7B for multimodal.
Stretch. 32B-class models at Q3 with CPU offload (~12 tok/s). Better off picking a smaller model at higher quant.
Tier 6: 32 GB MacBook Pro M Pro / Max, or 16 GB VRAM PC (RTX 4080, 5080)
The "comfortable serious user" tier.
What you can do today. Qwen 2.5 32B Q4 (~18 GB) at 18–25 tok/s on Mac, 35–50 on a 4080. Multi-step reasoning starts feeling reliable.
Comfortable workload.
- Qwen 2.5 32B for general work.
- Qwen 2.5 Coder 32B for serious coding tasks.
- Codestral 22B for autocomplete-heavy workflows.
- Vision models at higher quants.
Stretch. Llama 3.1 70B Q2 (~24 GB) on the 32GB Mac at 7–10 tok/s. The 4080 needs CPU offload to fit but is slightly faster than Q2 alone.
Tier 7: 64 GB MacBook Pro M Max, or 24 GB VRAM PC (RTX 3090, 4090, 5080 24GB)
The sweet spot for serious local-LLM work.
What you can do today. Llama 3.1 70B Q4 (~40 GB) on the 64GB Mac at 10–15 tok/s. On the 4090, 70B Q4 needs CPU offload but works at ~12 tok/s; the 4090 is faster on smaller models, for example Qwen 2.5 32B Q5 at ~50 tok/s.
Comfortable workload.
- Llama 3.1 70B Q4 (Mac) or Qwen 2.5 32B Q5 (4090) as daily driver.
- Coding agents that actually plan multi-step.
- Long-context work up to 64K with KV cache quant.
- Multiple concurrent models for serving.
Stretch. 4090: 70B Q4 with full CPU offload, ~10 tok/s. 64GB Mac: 70B Q5 (~48 GB) at 7–10 tok/s.
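A common rule of thumb explains why large models slow down even when they fit. For a dense model, generating each token reads roughly every weight once, so memory bandwidth divided by model size gives a ceiling on decode speed. The sketch below is that rule and nothing more. The bandwidth figures are placeholders to replace with your own chip's spec, and real throughput lands below the ceiling.

```python
def decode_tok_s_ceiling(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/sec if all weights are read once per token."""
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    # Placeholder bandwidth values; look up the real figure for your hardware.
    for label, size_gb, bw in [
        ("40 GB model at 400 GB/s", 40.0, 400.0),
        ("18 GB model at 1000 GB/s", 18.0, 1000.0),
    ]:
        print(f"{label}: at most ~{decode_tok_s_ceiling(size_gb, bw):.0f} tok/s")
```

The rule does not cover mixture-of-experts models, which read only part of their weights per token, or setups with CPU offload, where the slower system RAM dominates.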
Tier 8: 128 GB Mac Studio M Ultra, or 32 GB VRAM PC (RTX 5090)
Workstation territory. Either is genuinely good for solo developer work.
What you can do today. Llama 3.1 70B Q8 (~70 GB) on the 128GB Mac at 8–12 tok/s. On the 5090, 70B Q4 mostly fits in VRAM and runs at ~25 tok/s.
Comfortable workload.
- Llama 3.1 70B at Q8 quality.
- Qwen 2.5 72B Q5.
- Multi-agent systems with several models concurrent.
- Local fine-tuning with QLoRA on 7B–14B models (post 12).
Tier 9: 192 GB Mac Studio M Ultra, or multi-GPU rigs
The top consumer ladder. Niche but real.
What you can do today. Llama 3.1 405B Q4 (~200 GB), 5–8 tok/s on the 192GB Mac. DeepSeek V4 / Llama 4 Maverick MoE models if you have the disk space.
Comfortable workload.
- Frontier-quality models running fully in unified memory.
- Local serving for a small team.
- Real fine-tuning experiments.
For most readers, this tier is overkill. I include it because the question "is it possible to run frontier-class open models at home?" gets a yes here.
What runs on what: a quick-reference table
| Tier | Comfortable model | tok/s estimate |
|---|---|---|
| 8GB integrated | Phi-3 mini Q4 | 4–10 |
| 16GB Air | Llama 3.2 3B Q4 / Qwen Coder 7B Q4 | 25–35 |
| 16GB+8GB VRAM | Qwen 2.5 7B Q4 | 50–80 |
| 24GB Mac | Qwen 2.5 14B Q4 | 25–35 |
| 12GB VRAM | Qwen 2.5 14B Q4 | 45–60 |
| 32GB Mac / 16GB VRAM | Qwen 2.5 32B Q4 | 18–50 |
| 64GB Mac / 24GB VRAM | Llama 3.1 70B Q4 / Qwen 2.5 32B Q5 | 10–50 |
| 128GB Mac / 32GB VRAM | Llama 3.1 70B Q8 (Mac) / 70B Q4 (5090) | 8–25 |
| 192GB+ | Llama 3.1 405B Q4 | 5–8 |
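The tok/s figures in the table are estimates, so it is worth measuring your own machine. Here is a minimal sketch using Ollama's local HTTP API. It assumes Ollama is running on the default port and relies on the `eval_count` and `eval_duration` (nanoseconds) fields that `/api/generate` returns in non-streaming mode. The model tag in the usage note is an assumption; substitute whatever you have pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(result: dict) -> float:
    """Decode speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming generate request and return the parsed JSON."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def measure(model: str, prompt: str) -> float:
    """Run one prompt and return generation speed in tokens per second."""
    return tokens_per_second(generate(model, prompt))
```

With Ollama running, `measure("llama3.2:3b", "Explain unified memory in two sentences.")` returns the decode speed for that one request. Run it a few times and ignore the first call, which includes model load time.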
A pep talk
I've taught this stuff for a few years, and the most common reaction at tier 1 or tier 2 is "but I can't run anything good." That hasn't been true since around mid-2024. A 16GB MacBook Air running Qwen 2.5 Coder 7B is a real coding assistant. A laptop with integrated graphics running Phi-3 mini is a real drafting assistant. The 2024–2026 small-model wave changed the floor.
Pick your tier. Run the recommended model. Use it for a week. Then come back if you need more capacity.
What's next
Enough theory. The next post is the hands-on Hello World: install Ollama, pull a model, have a chat, troubleshoot the common errors. End-to-end setup in one post.
From the dictionary
Terms used in this post
Quick reference for the 11 terms you met above. Each one comes from the AI dictionary.
- Context Window (NLP): The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
- Embedding (NLP): A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
- Fine-Tuning (ML): Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
- GPU (General): A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s: GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
- Large Language Model (AI): A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. For example, Claude is an LLM: it reads your message as tokens and generates a response one token at a time.
- LM Studio (AI): A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
- Model (ML): In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
- Ollama (AI): A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
- Quantization (ML): Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
- RAG (NLP): Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLM's context window, then ask the question. The model's weights are unchanged; only the prompt is augmented.
- VRAM (General): Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
Published April 24, 2026.