Why Apple Silicon punches above its weight on local LLMs

Unified memory means the GPU sees all of RAM. Why that beats discrete-GPU PCs above 32B parameters, what fits in 16/32/64/128/192GB, and where Apple Silicon still loses.

A 16-inch MacBook Pro with 64GB unified memory will run a model that a $1600 RTX 4090 cannot. That outcome surprises people who built PCs in the discrete-GPU era. The reason is one architectural choice: unified memory.

This is post 4 of 7 in the Running series. The previous post called VRAM the hard ceiling. That's true for discrete GPUs. Apple Silicon bends the ceiling in a way that matters specifically for local LLMs.

The discrete-GPU model

A traditional PC has two pools of memory:

  • System RAM. What the CPU sees. 16, 32, 64 GB on a typical desktop.
  • VRAM. What the GPU sees. Soldered to the GPU board. 12, 16, 24 GB on consumer cards.

To run inference on the GPU, the model file has to be in VRAM. If it doesn't fit, you're either offloading layers to the CPU (slow) or you're not running it. Your 64GB of system RAM does not help a 24GB GPU.

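This is why a 4090 maxes out at around a 32B Q4 model. At Q4, 32B parameters come to roughly 18GB of weights, which leaves a few GB of the card's 24GB for the KV cache. A 70B model at Q4 is roughly 40GB and simply cannot land on the chip.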
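A rough way to sanity-check that ceiling is to estimate the weight size from the parameter count. This is a minimal sketch, not a spec: the bits-per-weight table and the fixed 2GB overhead allowance are assumptions chosen for illustration, and `weights_gb` and `fits` are hypothetical helper names.

```python
# Rough weight-size estimate for a quantized model (a sketch, not a spec).
# Assumption: effective bits per weight, including quantization scales.
BITS_PER_WEIGHT = {"Q4": 4.5, "Q8": 8.5, "FP16": 16.0}


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(params_billion: float, quant: str, vram_gb: float,
         overhead_gb: float = 2.0) -> bool:
    """True if the weights plus a fixed allowance for KV cache and buffers fit.

    overhead_gb is an assumed placeholder; real usage grows with context length.
    """
    return weights_gb(params_billion, quant) + overhead_gb <= vram_gb


for size in (7, 32, 70):
    print(f"{size}B Q4: {weights_gb(size, 'Q4'):.1f} GB, "
          f"fits in 24 GB: {fits(size, 'Q4', 24)}")
```

Under these assumptions the 7B and 32B models fit on a 24GB card and the 70B model does not.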

The Apple Silicon model

[Figure: discrete VRAM vs unified memory]

M-series chips put the CPU, GPU, and Neural Engine on the same piece of silicon, sharing one pool of memory. There is no separate "VRAM". The GPU can address every byte of RAM the CPU can.

In practice the GPU can use up to about 75% of unified memory for model weights (Apple reserves the rest for the OS). So a 64GB MacBook Pro effectively gives the GPU around 48GB of usable VRAM, against 24GB on a 4090. The Mac wins on capacity by a factor of two, for roughly the price of a complete 4090 desktop.

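Larger Macs are even more lopsided. A Mac Studio with an M2 Ultra and 192GB of unified memory gives the GPU around 144GB to work with. That fits a 70B model at Q8 with room to spare, or a 200B-class model at Q4. (A 405B model at Q4 needs well over 200GB for weights alone, so it is out of reach at that quantization.) Nothing in the discrete-GPU consumer market touches that capacity.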
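The 75% rule of thumb is easy to turn into a planning number for each memory tier. The sketch below assumes the share is the same on every machine, which is a simplification: the real limit is set by macOS and can differ by configuration. `gpu_budget_gb` is a hypothetical helper name.

```python
# Assumption: the ~75% figure from the text, applied uniformly to every tier.
GPU_SHARE = 0.75


def gpu_budget_gb(unified_gb: float, share: float = GPU_SHARE) -> float:
    """Approximate memory the GPU can use for weights and KV cache."""
    return unified_gb * share


for ram in (16, 32, 64, 128, 192):
    print(f"{ram:>3} GB unified -> ~{gpu_budget_gb(ram):.0f} GB for the GPU")
```

Treat the result as a budget for weights plus KV cache, not for weights alone.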

What about raw speed?

Apple Silicon is not as fast as a 4090 on small models. A 4090 will do 100+ tokens per second on a 7B model; an M3 Max does 60–90. The 4090 has more memory bandwidth (1 TB/s vs ~400 GB/s on M3 Max) and more compute throughput.

But on large models, the 4090 is doing 0 tok/s because the model doesn't fit. The Mac is doing 10–15 tok/s on a 70B Q4. Slow is faster than not at all.

The crossover point sits around 32B parameters. Below that, a 4090 wins. Above that, a 64GB+ Mac wins.
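The bandwidth gap explains most of the small-model difference. During generation, each new token has to read roughly all of the weights once, so memory bandwidth divided by model size gives an upper bound on tokens per second. The sketch below assumes a 7B Q4 file of about 4GB and uses the bandwidth figures quoted above; it ignores compute, KV-cache reads, and runtime overhead, so real numbers land below it. `decode_ceiling_tok_s` is a hypothetical helper name.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/s if each generated token reads all weights once."""
    return bandwidth_gb_s / weights_gb


WEIGHTS_7B_Q4_GB = 4.0  # rough size of a 7B Q4 file (assumption)

for name, bandwidth in (("RTX 4090", 1000.0), ("M3 Max", 400.0)):
    ceiling = decode_ceiling_tok_s(bandwidth, WEIGHTS_7B_Q4_GB)
    print(f"{name}: at most ~{ceiling:.0f} tok/s on 7B Q4")
```

The observed 100+ and 60–90 tok/s figures both sit under these ceilings, in the same order.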

What fits in each tier

Working numbers, mid-2026, for an M3-class Mac, at Q4 quantization unless noted:

  • 16 GB. 7B and 8B models. Maybe a 13B if you close everything else. Tight but workable for chat and small tasks.
  • 32 GB. Comfortable on 7B–13B. 32B is reachable but the OS gets cranky.
  • 64 GB. The first happy tier. 32B Q8 or 70B Q4 with room left over for VS Code, browser, Slack.
  • 128 GB. 70B Q8 or low-quant 200B-class MoE. Real workstation territory.
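  • 192 GB (Mac Studio Ultra only). 200B-class models at Q4 with headroom. A 405B model at Q4 is well over 200GB of weights, so it does not fit even here. The top of the consumer ladder.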

The 64GB tier is where Apple Silicon earns its reputation. That is the cheapest "I can run real models" machine on the market in 2026.

What this looks like in practice

On my M3 Max 64GB:

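# install ollama via homebrew
brew install ollama

# start the Ollama server in the background (ollama run needs it running)
brew services start ollama

# pull and run a 70B model
ollama run llama3.1:70b "explain TLS handshake in three sentences"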

The 70B Q4 file is around 40GB. It downloads once, then maps into unified memory in a few seconds. From the second prompt onward, latency is dominated by the GPU doing matrix multiplies on weights that already live in RAM.

I get roughly 12 tokens per second on that setup. That's slow compared to a Claude API call, but it's instant compared to "this model does not run on your machine."
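To measure the rate on your own machine, one option is to call Ollama's local HTTP API and read the timing fields from the reply. This is a minimal sketch that assumes the server is running on the default port and the model is already pulled; `generate` and `tokens_per_second` are hypothetical helper names, and the model tag is the one from the command above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def generate(prompt: str, model: str = "llama3.1:70b",
             url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generate request and return the parsed JSON reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def tokens_per_second(reply: dict) -> float:
    """Decode speed from the reply's eval_count and eval_duration (nanoseconds)."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


# Usage, with the Ollama server running and the model already pulled:
#   reply = generate("explain TLS handshake in three sentences")
#   print(reply["response"])
#   print(f"{tokens_per_second(reply):.1f} tok/s")
```

The first request after a cold start also includes model load time, so it is worth timing the second prompt rather than the first.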

The Metal runtime

Apple Silicon doesn't run CUDA. The GPU programming model is Metal — Apple's own API. Two consequences:

  • Anything written specifically for CUDA does not work on a Mac. PyTorch, llama.cpp, and the major runtimes all have Metal backends now, but a niche library written by a researcher last year might not.
  • The runtime that matters for local LLMs is llama.cpp's Metal kernels. Ollama and LM Studio both wrap llama.cpp, so by using either you get Metal acceleration for free.

The story used to be "Macs are slow on AI". That was true in 2022. As of 2026, the major local-LLM runtimes are first-class on Apple Silicon, often before they're optimized for any other platform.

What Apple Silicon is bad at

Three honest weaknesses:

  • Training, not inference. Unified memory is great for inference because you can fit the weights. Training needs vastly more memory (gradients, optimizer state) and far more compute, and that is where the Mac's lower bandwidth and throughput start to bite. Train on H100s, infer on Macs.
  • Multi-GPU. A Mac is one chip. You can't tensor-parallel across two M3 Maxes the way you can across two 4090s. The ceiling per machine is the chip you bought.
  • Power-curve scaling. A Mac Studio Ultra at 192GB pulls less than 400W under full load. Impressive, but a server room running H100s does not care about wattage. For pure throughput at scale, datacenter GPUs still win.
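The training gap in the first bullet is easy to underestimate. The sketch below assumes mixed-precision training with Adam, using a common per-parameter accounting (about 16 bytes per parameter before activations); other optimizers and precision schemes change the total. The helper names are hypothetical.

```python
# Assumption: mixed-precision training with Adam, a common textbook accounting.
BYTES_PER_PARAM_TRAINING = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "Adam first moment": 4,
    "Adam second moment": 4,
}


def training_state_gb(params_billion: float) -> float:
    """Memory for weights, gradients and optimizer state, excluding activations."""
    return params_billion * sum(BYTES_PER_PARAM_TRAINING.values())


def q4_inference_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate Q4 weight size, assuming ~4.5 effective bits per weight."""
    return params_billion * bits_per_weight / 8


for size in (7, 70):
    print(f"{size}B: ~{training_state_gb(size):.0f} GB to train "
          f"vs ~{q4_inference_gb(size):.0f} GB to run at Q4")
```

Under these assumptions a 70B model needs over a terabyte of state to train, against roughly 40GB to run.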

For a single developer running models on a laptop, none of those weaknesses matter. Inference is what you actually do.

Should you buy one for this?

If you're already on Apple, the question is just how much RAM. The honest tiers:

  • 24–32 GB: enough for 7B–13B. Decent starting point.
  • 64 GB: the sweet spot. Real local LLM machine without spending Pro Studio money.
  • 96–128 GB: only if you're running 70B Q8 or larger as a daily driver.
  • 192 GB Studio: niche. You either need 200B-class models at home or you don't.

If you're not on Apple and you want a local LLM machine, the comparison is harder. A used 4090 + good PC will be faster on small models for less money. A Mac Studio M2 Ultra 64GB gets you bigger models for similar money. Neither is wrong; they optimize for different workloads.

What's next

Hardware is half the story. The other half is the runtime: the software that takes a GGUF file, loads it into memory, and turns prompts into tokens. Three of them dominate the local-LLM scene — llama.cpp, Ollama, LM Studio. Next post.

From the dictionary

Terms used in this post

Quick reference for the 19 terms you met above. Each one comes from the AI dictionary.

Artificial Intelligence (AI)
Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
API (General)
Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
Claude (AI)
Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.
e.g. This blog's create-post skill drafts inline using Claude.
GGUF (ML)
GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
GPU (General)
A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s — GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
Inference (ML)
Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
llama.cpp (AI)
A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language Model (AI)
A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
LM Studio (AI)
A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
Model (ML)
In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Neural Engine (General)
Apple's on-device NPU, present in every iPhone since the A11 (2017) and every Apple Silicon Mac. Handles Face ID, on-device dictation, photo classification, and increasingly ML model inference via Core ML. 16-core in M3 and M4; the M4 version peaks at 38 TOPS.
Ollama (AI)
A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
Parameters (ML)
The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
Prompt (NLP)
The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
Quantization (ML)
Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
Token (NLP)
The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
Training (ML)
The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
VRAM (General)
Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
Weights (ML)
The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.