April 24, 20267 min read

Every machine can run a local LLM (here's what fits)

A per-tier guide to running local LLMs in 2026, from 8GB integrated graphics to a 192GB Mac Studio. Specific models, specific speeds, specific configs.

Tell me your machine and I'll tell you what runs. That's the whole post. Every common laptop and PC class in 2026, with the specific models that work on it, the configs that work, and a real "what can I do today" answer for each.

This is post 7 of 13 in the Local LLMs series. By now you know which model you want and what your OS needs to run it. This one is the hardware match-up: what actually runs on the machine in front of you.

How to read this guide

Hardware tier ladder

Three things for every tier:

What you can do today. The model and config I'd actually run on this machine.
Comfortable workload. Not pushing limits, runs alongside your other apps.
Stretch. What's possible if you close everything else and accept slower performance.

Every tier here is good. There's no "you'd better upgrade" hiding in this post. The smaller machines run smaller models, and those models are still genuinely useful in 2026.

Tier 1: 8 GB RAM, integrated graphics laptop

Old laptops, cheap Chromebooks with Linux on them, entry-level Windows machines. No discrete GPU.

What you can do today. Phi-3 mini (3.8B) at Q4. About 1.7 GB on disk, 2.5 GB at runtime. It runs at 4 to 10 tok/s on CPU. Slow if you're used to a 4090, but perfectly usable for drafting, summarization, and classification.

Comfortable workload.

Phi-3 mini Q4 for chat and short-form text.
Llama 3.2 1B for ultra-fast classification (12+ tok/s on CPU).
Qwen 2.5 0.5B for embeddings or extreme-speed structured extraction.

Stretch. Llama 3.2 3B Q4 (~2 GB). Slow, maybe 3 to 6 tok/s, but it works.

Setup. Ollama on Linux or Windows native. CPU backend.

Tier 2: 16 GB MacBook Air (M1/M2/M3/M4)

The cheapest "real local LLM" machine on the market. Unified memory means the GPU sees about 12 GB of usable space.

What you can do today. Llama 3.2 3B Q4 (~2 GB) for general chat at 35 tok/s. Qwen 2.5 Coder 7B Q4 (~5 GB) for coding at 25 tok/s. Both feel fluid in a chat UI.

Comfortable workload.

Llama 3.2 3B for everything.
Qwen 2.5 Coder 7B for code.
Phi-3 medium 14B Q3 (around 7 GB) for harder reasoning, ~12 tok/s.
nomic-embed-text 137M alongside any of the above for RAG.

Stretch. Llama 3.1 8B Q4 (~5 GB) at 18 to 22 tok/s. Mistral 7B Q4. Turn on Q8 KV cache and you can run these at 32K context comfortably.

Setup. Ollama via brew. Metal backend, automatic.

Tier 3: 16 GB RAM PC with 8 GB VRAM (RTX 3060, 4060, 4060 Laptop)

The mid-budget gaming or work laptop. The discrete NVIDIA GPU is what changes the game here.

What you can do today. Qwen 2.5 7B Q4 (~5 GB) entirely in VRAM at 50 to 80 tok/s on a desktop 4060, 35 to 55 on a laptop. That's real coding speed.

Comfortable workload.

Llama 3.1 8B Q4 in VRAM. Around 60 tok/s on a 4060 Ti.
Qwen 2.5 Coder 7B for coding work.
Phi-3 medium 14B Q3 with partial CPU offload (~25 tok/s).

Stretch. 13B-class models at Q4 with some CPU offload. Watch tok/s drop to 15 to 20, but the quality goes up.

Setup. Ollama or LM Studio on Windows or Linux. NVIDIA driver plus CUDA backend.

Tier 4: 24 GB MacBook Pro M-series

The default Pro tier. About 18 GB is GPU-addressable.

What you can do today. Qwen 2.5 14B Q4 (~9 GB) at 25 to 35 tok/s. A real reasoning step up from the 8B class.

Comfortable workload.

Llama 3.1 8B Q5 or Q8 for chat (sharper than Q4).
Qwen 2.5 Coder 14B for serious code work.
Gemma 2 9B Q5.
Several smaller models loaded at once (chat plus embeddings plus a classifier).

Stretch. Llama 3.1 70B Q2 (~24 GB). Tight, and slow at 5 to 8 tok/s, but it runs. Better option: Qwen 2.5 32B Q3 (~14 GB) at 12 to 15 tok/s.

Tier 5: 12 GB VRAM PC (RTX 4070, 3060 12GB, 4070 Laptop)

The popular mid-tier desktop and laptop GPU.

What you can do today. Qwen 2.5 14B Q4 entirely in VRAM (~9 GB) at 45 to 60 tok/s. No CPU offload needed.

Comfortable workload.

Llama 3.1 8B Q5_K_M at 70+ tok/s.
Qwen 2.5 Coder 14B Q4.
Mistral Small 22B Q3 with light CPU offload (~25 tok/s).
Vision: Qwen 2.5 VL 7B for multimodal.

Stretch. 32B-class models at Q3 with CPU offload (~12 tok/s). Honestly you're better off picking a smaller model at a higher quant.

Tier 6: 32 GB MacBook Pro M Pro / Max, or 16 GB VRAM PC (RTX 4080, 5080)

The "comfortable serious user" tier.

What you can do today. Qwen 2.5 32B Q4 (~18 GB) at 18 to 25 tok/s on a Mac, 35 to 50 on a 4080. This is where multi-step reasoning starts feeling reliable.

Comfortable workload.

Qwen 2.5 32B for general work.
Qwen 2.5 Coder 32B for serious coding tasks.
Codestral 22B for autocomplete-heavy workflows.
Vision models at higher quants.

Stretch. Llama 3.1 70B Q2 (~24 GB) on the 32GB Mac at 7 to 10 tok/s. The 4080 needs CPU offload to fit, but it ends up slightly faster than Q2 alone.

Tier 7: 64 GB MacBook Pro M Max, or 24 GB VRAM PC (RTX 3090, 4090, 5080 24GB)

The sweet spot for serious local-LLM work.

What you can do today. Llama 3.1 70B Q4 (~40 GB) on the 64GB Mac at 10 to 15 tok/s. On the 4090, 70B Q4 needs CPU offload but works at ~12 tok/s. The 4090 is faster on smaller models: Qwen 2.5 32B Q5 at ~50 tok/s.

Comfortable workload.

Llama 3.1 70B Q4 (Mac) or Qwen 2.5 32B Q5 (4090) as your daily driver.
Coding agents that actually plan multi-step.
Long-context work up to 64K with KV cache quant.
Multiple concurrent models for serving.

Stretch. On the 4090: 70B Q4 with full CPU offload, ~10 tok/s. On the 64GB Mac: 70B Q5 (~48 GB) at 7 to 10 tok/s.

Tier 8: 128 GB Mac Studio M Ultra, or 32 GB VRAM PC (RTX 5090)

Workstation territory. Either one is genuinely good for solo developer work.

What you can do today. Llama 3.1 70B Q8 (~70 GB) on the 128GB Mac at 8 to 12 tok/s. On the 5090, 70B Q4 mostly fits in VRAM and runs at ~25 tok/s.

Comfortable workload.

Llama 3.1 70B at Q8 quality.
Qwen 2.5 72B Q5.
Multi-agent systems with several models running concurrent.
Local fine-tuning with QLoRA on 7B to 14B models (more on that in a later post).

Tier 9: 192 GB Mac Studio M Ultra, or multi-GPU rigs

The top of the consumer ladder. Niche, but real.

What you can do today. Llama 4 405B Q4 (~200 GB) at 5 to 8 tok/s on the 192GB Mac. DeepSeek V4 or Llama 4 Maverick MoE models if you have the disk space.

Comfortable workload.

Frontier-quality models running fully in unified memory.
Local serving for a small team.
Real fine-tuning experiments.

For most readers this tier is overkill. I include it because the question "can I run frontier-class open models at home?" gets a yes right here.

What runs on what: a quick-reference table

Tier	Comfortable model	tok/s estimate
8GB integrated	Phi-3 mini Q4	4-10
16GB Air	Llama 3.2 3B Q4 / Qwen Coder 7B Q4	25-35
16GB+8GB VRAM	Qwen 2.5 7B Q4	50-80
24GB Mac	Qwen 2.5 14B Q4	25-35
12GB VRAM	Qwen 2.5 14B Q4	45-60
32GB Mac / 16GB VRAM	Qwen 2.5 32B Q4	20-50
64GB Mac / 24GB VRAM	Llama 3.1 70B Q4 / Qwen 32B Q5	12-50
128GB Mac / 32GB VRAM	Llama 3.1 70B Q8	12-25
192GB+	Llama 4 405B Q4	5-8

A quick pep talk

I've taught this stuff for a few years now, and the most common reaction from people on tier 1 or tier 2 is "but I can't run anything good." That stopped being true around mid-2024. A 16GB MacBook Air running Qwen 2.5 Coder 7B is a real coding assistant. A laptop with integrated graphics running Phi-3 mini is a real drafting assistant. The 2024 to 2026 small-model wave moved the floor way up.

Pick your tier. Run the recommended model. Use it for a week. Come back if you actually need more capacity.

What's next

Enough theory. The next post is the hands-on Hello World: install Ollama, pull a model, have a chat, fix the errors that always come up the first time. Full setup, end to end, in one post.

AI Hardware LLM Local Llms VRAM

From the dictionary

Terms used in this post

Quick reference for the 11 terms you met above. Each one comes from the AI dictionary.

Context WindowNLP: The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window — anything outside is invisible.
EmbeddingNLP: A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
Fine-TuningML: Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
GPUGeneral: A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s — GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
Large Language ModelAI: A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.; e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
LM StudioAI: A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
ModelML: In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
OllamaAI: A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
QuantizationML: Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
RAGNLP: Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLMs context window, then ask the question. The models weights are unchanged; only the prompt is augmented.
VRAMGeneral: Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.

Rate this article

How helpful did you find this?

Series

Local Llms

7 / 13 posts

Browse all in Local Llms →

Newsletter

Get new articles in your inbox

AI engineering, LLM systems, and software architecture — no filler.

No spam. Unsubscribe any time.

Discussion

Comments

Leave a note about the article, architecture choices, or what you would build next.

Loading comments...

On this page

From the dictionary

Terms in this post

Context WindowNLP: The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window — anything outside is invisible.
EmbeddingNLP: A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
Fine-TuningML: Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
GPUGeneral: A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s — GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
Large Language ModelAI: A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
LM StudioAI: A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
ModelML: In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
OllamaAI: A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
QuantizationML: Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
RAGNLP: Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLMs context window, then ask the question. The models weights are unchanged; only the prompt is augmented.
VRAMGeneral: Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.

Series

Local Llms

7 / 13 posts

Browse all in Local Llms →