March 23, 2026 · 5 min read

Why Apple Silicon punches above its weight on local LLMs

Unified memory lets the GPU see all of RAM. Here's why that beats a discrete-GPU PC past 32B parameters, what fits in 16/32/64/128/192GB, and where Apple Silicon still loses.

A 16-inch MacBook Pro with 64GB unified memory will run a model that a $1600 RTX 4090 cannot. If you built PCs in the discrete-GPU era, that sounds backwards. It comes down to one design choice: unified memory.

This is post 4 of 7 in the AI Running series. Last post made the case that VRAM is the hard ceiling. True for discrete GPUs. Apple Silicon bends that ceiling, and it bends it in exactly the way local LLMs care about.

The discrete-GPU model

A normal PC keeps two separate pools of memory:

System RAM. What the CPU sees. 16, 32, 64 GB on a typical desktop.
VRAM. What the GPU sees. Soldered to the GPU board. 12, 16, 24 GB on consumer cards.

To run inference at full speed on the GPU, the model weights have to sit in VRAM. If they don't fit, you've got two options: offload some layers to the CPU and crawl, or just don't run it. Your 64GB of system RAM does little for a 24GB GPU, because every layer that spills over runs at CPU speed.

That's why a 4090 tops out around a 32B Q4 model. Anything bigger can't land on the chip.
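To put a number on "offload some layers and crawl", here is a rough sketch of how a partial offload splits a model between GPU and CPU. It assumes every layer is the same size and that about 2GB of VRAM is held back for the KV cache and scratch buffers. Both are simplifications, and the function name and the layer counts (64 for a 32B model, 80 for a 70B model) are illustrative assumptions, not values read from a specific runtime.

```python
import math


def gpu_layers(model_gb: float, n_layers: int, vram_gb: float,
               reserve_gb: float = 2.0) -> int:
    """Estimate how many transformer layers fit in VRAM.

    Assumes every layer is the same size and that reserve_gb of VRAM is
    kept free for the KV cache and scratch buffers (hypothetical value).
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Illustrative inputs: (name, approximate file size in GB, assumed layer count).
for name, size_gb, layers in [("32B Q4", 18.0, 64), ("70B Q4", 40.0, 80)]:
    on_gpu = gpu_layers(size_gb, layers, vram_gb=24.0)
    print(f"{name}: {on_gpu}/{layers} layers on a 24GB GPU, "
          f"{layers - on_gpu} left on the CPU")
```

With these inputs the 32B model fits entirely, while the 70B model leaves 36 of its 80 layers on the CPU. That remainder is where the crawl comes from.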

The Apple Silicon model

Figure: Discrete VRAM vs unified memory

M-series chips put the CPU, GPU, and Neural Engine on one piece of silicon, all sharing a single pool of memory. There's no separate "VRAM" at all. The GPU can address every byte the CPU can.

Here's what that buys you. If your Mac has 64GB of unified memory, macOS lets the GPU use up to about 75% of it by default and keeps the rest for the OS and your other apps. So a 64GB MacBook Pro hands the GPU roughly 48GB for model weights, the job VRAM does on a discrete card. A 4090 gives you 24GB. The Mac doubles it for about the same money.

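Bigger Macs tilt the scales harder. A Mac Studio M2 Ultra with 192GB unified memory gives the GPU around 144GB to play with. That's enough for a 200B-class model at Q4, or a 70B model at Q8 with plenty of room for context. A 405B model is still out of reach at Q4, because its weights alone come to roughly 200GB, but nothing in the consumer discrete-GPU market gets close to 144GB either.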

What about raw speed?

Apple Silicon is slower than a 4090 on small models, no point pretending otherwise. A 4090 will do 100+ tokens per second on a 7B model. An M3 Max does 60 to 90. The 4090 has more memory bandwidth (1 TB/s against roughly 400 GB/s on M3 Max) and more raw compute.

But on a large model, the 4090 is doing 0 tok/s, because the model never fits in the first place. The Mac is doing 10 to 15 tok/s on a 70B Q4. Slow beats nothing.

The crossover lands around 32B parameters. Below that, the 4090 wins. Above it, a 64GB+ Mac wins.
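A back-of-envelope way to see the crossover: generating one token means reading roughly every weight once, so decode speed lands somewhere near memory bandwidth divided by model size. The sketch below uses the bandwidth figures quoted above and round model sizes (4GB for 7B Q4, 40GB for 70B Q4). It ignores compute, KV cache traffic, and runtime overhead, so treat the output as a ballpark and not a benchmark. The function and variable names are illustrative.

```python
def decode_tok_per_s(bandwidth_gb_s: float, weights_gb: float,
                     memory_gb: float) -> float:
    """First-order decode speed: bandwidth divided by bytes read per token.

    Returns 0.0 when the weights do not fit in GPU-addressable memory,
    since the model cannot run fully on the GPU at all.
    """
    if weights_gb > memory_gb:
        return 0.0
    return bandwidth_gb_s / weights_gb


# (name, memory bandwidth in GB/s, GPU-addressable memory in GB)
machines = [("RTX 4090", 1000.0, 24.0), ("M3 Max 64GB", 400.0, 48.0)]
# (name, approximate weight size in GB)
models = [("7B Q4", 4.0), ("70B Q4", 40.0)]

for model_name, size_gb in models:
    for machine_name, bandwidth, memory in machines:
        speed = decode_tok_per_s(bandwidth, size_gb, memory)
        print(f"{model_name} on {machine_name}: ~{speed:.0f} tok/s")
```

The shape is what matters: the 4090 leads by its bandwidth ratio while the model fits, then drops to zero the moment it doesn't.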

What fits in each tier

Working numbers as of early 2026, for an M3-class Mac at Q4 quantization unless noted:

16 GB. 7B and 8B models. Maybe a 13B if you close everything else. Tight, but workable for chat and small tasks.
32 GB. Comfortable on 7B to 13B. 32B is reachable, but the OS gets cranky.
64 GB. The first happy tier. 32B Q8 or 70B Q4 with room left for VS Code, a browser, Slack.
128 GB. 70B Q8 or a low-quant 200B-class MoE. Real workstation territory.
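192 GB (Mac Studio Ultra only). 200B-class models at Q4 with room left for context. 405B at Q4 needs roughly 200GB for weights alone, so it stays out of reach. The top tier in this list.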
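The tiers above can be roughly reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: about 4.5 effective bits per weight for Q4 and 8.5 for Q8 (approximations for llama.cpp-style quantization), 75% of unified memory available to the GPU, and a hypothetical 4GB of headroom for the KV cache and buffers. None of these constants are exact.

```python
# Assumed effective bits per weight; real GGUF files vary by quant variant.
BITS_PER_WEIGHT = {"Q4": 4.5, "Q8": 8.5}
GPU_SHARE = 0.75  # share of unified memory the GPU may use (approximate)


def weights_gb(params_billion: float, quant: str) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def gpu_budget_gb(unified_gb: float) -> float:
    """Memory the GPU can address on a Mac with unified_gb of RAM."""
    return unified_gb * GPU_SHARE


def fits(params_billion: float, quant: str, unified_gb: float,
         headroom_gb: float = 4.0) -> bool:
    """True if weights plus headroom fit inside the GPU budget."""
    needed = weights_gb(params_billion, quant) + headroom_gb
    return needed <= gpu_budget_gb(unified_gb)


candidates = [(8, "Q4"), (13, "Q4"), (32, "Q4"),
              (32, "Q8"), (70, "Q4"), (70, "Q8")]

for tier_gb in (16, 32, 64, 128, 192):
    runnable = [f"{size}B {quant}" for size, quant in candidates
                if fits(size, quant, tier_gb)]
    print(f"{tier_gb:>3} GB: {', '.join(runnable)}")
```

With 4GB of headroom, 13B squeezes into 16GB, 70B Q4 first fits at 64GB, and 70B Q8 needs 128GB.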

The 64GB tier is where Apple Silicon earns its reputation. It's the cheapest "I can actually run real models" machine you can buy in 2026.

What this looks like in practice

On my M3 Max 64GB:

# install ollama via homebrew
brew install ollama

# start the ollama server in the background (the CLI talks to it)
brew services start ollama

# pull and run a 70B model
ollama run llama3.1:70b "explain TLS handshake in three sentences"

The 70B Q4 file is around 40GB. It downloads once, then maps into unified memory in a few seconds. From the second prompt on, latency is just the GPU doing matrix multiplies on weights that already live in RAM.

I get roughly 12 tokens per second on that setup. Slow next to a Claude API call. Instant next to "this model does not run on your machine."
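To get a number for your own machine, you can ask Ollama's local HTTP API for the same completion and compute tokens per second from the timing fields it returns. This is a minimal sketch using only the standard library. It assumes the server is listening on localhost:11434 and that the `/api/generate` response carries `eval_count` and `eval_duration` (in nanoseconds), which is my understanding of the Ollama API; check the current docs if the fields have changed. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def generate(model: str, prompt: str, timeout_s: float = 600.0) -> dict:
    """Send one non-streaming generation request to a local Ollama server."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


def tokens_per_second(result: dict) -> float:
    """Decode speed from the response; eval_duration is in nanoseconds."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


# Usage (needs a running Ollama server with the model already pulled):
#   result = generate("llama3.1:70b", "explain TLS handshake in three sentences")
#   print(result["response"])
#   print(f"{tokens_per_second(result):.1f} tok/s")
```

Run it twice and keep the second figure, since the first call may include the time spent loading the model into memory.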

The Metal runtime

Apple Silicon doesn't run CUDA. The GPU programming model is Metal, Apple's own API. Two things follow from that.

Anything written specifically for CUDA won't run on a Mac. PyTorch, llama.cpp, and the major runtimes all ship Metal backends now, but some niche library a researcher wrote last year might not.

The one that matters for local LLMs is llama.cpp's Metal kernels. Ollama and LM Studio both wrap llama.cpp, so pick either and you get Metal acceleration for free.
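If you write your own PyTorch code instead of going through Ollama or LM Studio, the practical consequence is one extra branch when you pick a device. The sketch below shows that branch. `pick_device` is a hypothetical helper, and the `try` block just lets the snippet run on a machine without PyTorch installed; PyTorch exposes its Metal backend under the device name "mps".

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's Metal backend ("mps"), then the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


try:
    import torch

    device = pick_device(torch.cuda.is_available(),
                         torch.backends.mps.is_available())
except ImportError:
    # PyTorch is not installed; fall back to the CPU label.
    device = pick_device(False, False)

print(f"Using device: {device}")
```

On an Apple Silicon Mac with a recent PyTorch build this should report "mps". Code that hard-codes "cuda" is the kind that breaks on a Mac.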

The old line was "Macs are slow on AI." That was true in 2022. By 2026 the major local-LLM runtimes are first-class on Apple Silicon, often tuned for it before any other platform.

What Apple Silicon is bad at

Three honest weaknesses:

Training, not inference. Unified memory is great for inference because you can fit the weights. Training wants far more memory (gradients, optimizer state) and far more compute, and that is where the Mac's lower memory bandwidth and GPU throughput start to bite. Train on H100s, infer on Macs.
Multi-GPU. A Mac is one chip. You can't tensor-parallel across two M3 Maxes the way you can across two 4090s. Your ceiling per machine is the chip you bought.
Power-curve scaling. A Mac Studio Ultra at 192GB pulls under 400W at full load. Genuinely impressive, but a server room full of H100s doesn't care about watts. For pure throughput at scale, datacenter GPUs still win.
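To make the first weakness concrete, here is the usual per-parameter accounting for training versus inference. It assumes one common setup, mixed-precision training with the Adam optimizer, and counts only weights, gradients, and optimizer state. Activations come on top, and other optimizers or precisions change the numbers.

```python
# Bytes held per parameter. Assumes FP16 inference and mixed-precision
# Adam training; activations and the KV cache are not counted.
INFERENCE_BYTES_PER_PARAM = 2  # FP16 weights only
TRAINING_BYTES_PER_PARAM = (
    2    # FP16 weights
    + 2  # FP16 gradients
    + 4  # FP32 master copy of the weights
    + 4  # Adam first moment (FP32)
    + 4  # Adam second moment (FP32)
)


def inference_gb(params_billion: float) -> float:
    """Memory for FP16 weights, in GB (1 GB = 1e9 bytes)."""
    return params_billion * INFERENCE_BYTES_PER_PARAM


def training_gb(params_billion: float) -> float:
    """Memory for weights, gradients, and Adam state, in GB."""
    return params_billion * TRAINING_BYTES_PER_PARAM


for size in (7, 70):
    print(f"{size}B: inference ~{inference_gb(size):.0f} GB, "
          f"training ~{training_gb(size):.0f} GB before activations")
```

Under these assumptions training needs about eight times the memory of FP16 inference, so even a 7B model outgrows a 64GB Mac before activations are counted.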

For one developer running models on a laptop, none of that bites. Inference is the whole job.

Should you buy one for this?

If you're already on Apple, the only question is how much RAM. The honest tiers:

24 to 32 GB: enough for 7B to 13B. Decent starting point.
64 GB: the sweet spot. A real local LLM machine without Mac Studio money.
96 to 128 GB: only if you're running 70B Q8 or larger every day.
192 GB Studio: niche. You either need 200B-class models at home or you don't.

If you're not on Apple and you want a local LLM box, the call is harder. A used 4090 plus a good PC will be faster on small models for less money. A Mac Studio M2 Ultra 64GB gets you bigger models for similar money. Neither is wrong. They just optimize for different work.

What's next

Hardware is half the story. The other half is the runtime: the software that takes a GGUF file, loads it into memory, and turns prompts into tokens. Three of them run the local-LLM scene right now: llama.cpp, Ollama, and LM Studio. That's the next post.


From the dictionary

Terms used in this post

Quick reference for the 19 terms you met above. Each one comes from the AI dictionary.

Artificial Intelligence (AI): Umbrella term for software that performs tasks usually associated with human reasoning: language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader. Example: when a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
API (General): Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
Claude (AI): Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent. Example: this blog's create-post skill drafts inline using Claude.
GGUF (ML): GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
GPU (General): A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s: GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
Inference (ML): Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
llama.cpp (AI): A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language Model (AI): A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. Example: Claude is an LLM. It reads your message as tokens and generates a response one token at a time.
LM Studio (AI): A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
Model (ML): In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Neural Engine (General): Apple's on-device NPU, present in every iPhone since the A11 (2017) and every Apple Silicon Mac. Handles Face ID, on-device dictation, photo classification, and increasingly ML model inference via Core ML. 16-core in M3/M4; 38 TOPS peak on M4.
Ollama (AI): A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
Parameters (ML): The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
Prompt (NLP): The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
Quantization (ML): Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
Token (NLP): The unit an LLM operates on, roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
Training (ML): The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
VRAM (General): Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
Weights (ML): The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
