March 16, 2026 · 6 min read

Where AI actually runs: cloud, local, edge

When you use AI, a model file is sitting on a real machine. There are only three places it can be, and which one it is decides almost everything else.

Every time you "use AI", a model file gets loaded into memory on some specific machine. That machine is in one of three places: a datacenter you reach over the internet, your own laptop or workstation, or the device the model ships inside of. Cloud, local, edge. Every product you've ever touched picks one of these, sometimes two, and the pick is almost never an accident.

This is post 1 of 7 in the Running series. I'm starting here on purpose. The rest of it (which models, what hardware, which runtime, what it costs) only makes sense once you know which of the three boxes you're standing in.

Cloud: someone else's GPU

Call the OpenAI API, open Claude in a browser, fire a message at a Gemini app. In every case the real model, say Claude Opus 4.7, is sitting on datacenter accelerators such as H100 or B200 GPUs. Vendors don't publish parameter counts, so treat the figure I use in this post, around 2 trillion parameters, as a rough assumption. Your prompt goes over the wire, the GPU runs the forward passes, the tokens stream back to you.

curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-opus-4-7",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Three numbers tell you most of what you need to know about cloud inference:

Latency: 200 to 1500ms time-to-first-token, depending on prompt size and model. Streaming hides this, since you watch tokens appear as they generate, but the round trip is still real.
Cost: priced per million tokens. Frontier models in 2026 land around $3 to $15/Mtok input and $15 to $75/Mtok output. Turn on prompt caching and the repeated part of your input gets 5 to 10x cheaper; output tokens are billed as usual.
Capability: this is where the frontier lives. The model topping every benchmark does not run on your laptop. It can't. At around 2T parameters the weights alone come to roughly 2TB even at 8 bits each, so frontier models need something like 8 to 32 H100s (80GB apiece) just to hold them.
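The "just to hold the weights" claim is plain arithmetic, and it's worth seeing once. This is a rough sketch, not a sizing tool: it counts weights only (no KV cache, no activations), the function names are mine, and the 2T figure is the assumption this post uses rather than a published number.

```python
import math

H100_MEMORY_GB = 80  # per-GPU memory, the figure quoted in this post


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def gpus_to_hold(params: float, bits_per_weight: float,
                 gpu_gb: float = H100_MEMORY_GB) -> int:
    """Smallest number of GPUs whose combined memory fits the weights."""
    return math.ceil(weights_gb(params, bits_per_weight) / gpu_gb)


if __name__ == "__main__":
    # "2T" is an assumed size; 70B is included for comparison.
    for label, params in [("70B", 70e9), ("2T (assumed)", 2e12)]:
        for bits in (16, 8, 4):
            print(f"{label} at {bits}-bit: "
                  f"{weights_gb(params, bits):,.0f} GB, "
                  f"{gpus_to_hold(params, bits)} x H100")
```

At 8 bits per weight the assumed 2T model needs 25 H100s before it has served a single token. A real deployment needs headroom on top of that.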

Here's the deal you're signing. Your data leaves the building, you ride a vendor's uptime, they own the pricing, and the feature stops existing the second you stop paying the bill. In exchange you write zero infra and you get the best model on the planet.

For most product features, that's the right trade. Capability beats data sovereignty most of the time. It's the wrong trade when you're handling PII you legally can't ship offshore, or when you're doing 10M calls a day and the bill starts eating the P&L alive.
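To put a number on "the bill starts eating the P&L", here's a back-of-envelope sketch. The prices are the per-Mtok ranges quoted above. The 500 input and 200 output tokens per call are made-up workload assumptions, so swap in your own.

```python
def daily_cost_usd(calls_per_day: int,
                   input_tokens: int,
                   output_tokens: int,
                   input_price_per_mtok: float,
                   output_price_per_mtok: float) -> float:
    """Daily spend for a workload where every call has the same token counts."""
    input_mtok = calls_per_day * input_tokens / 1_000_000
    output_mtok = calls_per_day * output_tokens / 1_000_000
    return (input_mtok * input_price_per_mtok
            + output_mtok * output_price_per_mtok)


if __name__ == "__main__":
    # Assumed workload: 10M calls/day, 500 tokens in, 200 tokens out per call.
    low = daily_cost_usd(10_000_000, 500, 200, 3.0, 15.0)
    high = daily_cost_usd(10_000_000, 500, 200, 15.0, 75.0)
    print(f"low end:  ${low:,.0f} per day")
    print(f"high end: ${high:,.0f} per day")
```

Under those assumptions the low end of the price range is already $45,000 a day, and most of it is output tokens.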

Local: a model file on your disk

Local means the model weights sit on your hard drive and the runtime loads them into RAM (CPU) or VRAM (GPU). When I say "I run llama3.1:8b on my M3 Max", what I mean is a 4.7GB GGUF file on the SSD, mmapped into the M3's unified memory, with Metal kernels doing the matrix multiplies.

brew install ollama

ollama run llama3.1:8b "explain TCP slow start in two sentences"

That's the whole on-ramp. The model downloads once (~5GB), and after that every call runs local. No network. No API key. No rate limits, no per-token bill.
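Once the model is pulled you can also script against it. Ollama serves an HTTP API on localhost:11434, so a local call looks a lot like a cloud call minus the API key. A minimal sketch using only the standard library, assuming the Ollama server is running and llama3.1:8b is already downloaded; the helper names are mine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send the request to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Needs a running Ollama server, so it is left commented out:
# print(generate("explain TCP slow start in two sentences"))
```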

The numbers look different here:

Latency: 50 to 500ms time-to-first-token on consumer hardware. Throughput is hardware-bound. On an M3 Max with 64GB unified memory an 8B model does about 60 to 90 tok/s, while a 70B Q4 quant does 10 to 15 tok/s. An RTX 4090 is faster on the small models and slower on the big ones: a 70B Q4 quant needs about 40GB for weights, so it doesn't fit in 24GB of VRAM and part of the model spills over to much slower system RAM.
Cost: $0 marginal per call. Capex is the laptop or workstation. Opex is electricity, and a 4090 pulls 450W under load.
Capability: the ceiling is your RAM/VRAM. Consumer hardware tops out at 70B-class quantized models. The frontier (1T+) is out of reach without a server.
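Those throughput and power figures turn into wall-clock time and pennies easily. A rough sketch: the 300-token reply and the $0.15/kWh electricity price are assumptions of mine, and I use the 450W figure as a pessimistic upper bound on power draw whatever the machine.

```python
def response_seconds(ttft_ms: float, output_tokens: int,
                     tokens_per_second: float) -> float:
    """Time to a complete reply: time-to-first-token plus generation time."""
    return ttft_ms / 1000 + output_tokens / tokens_per_second


def energy_cost_usd(watts: float, seconds: float, usd_per_kwh: float) -> float:
    """Electricity cost of running at `watts` for `seconds`."""
    kwh = watts * seconds / 3600 / 1000
    return kwh * usd_per_kwh


if __name__ == "__main__":
    # Assumed reply length: 300 tokens. Throughputs come from the ranges above.
    for label, ttft_ms, tok_s in [("8B", 50, 75), ("70B Q4", 500, 12)]:
        secs = response_seconds(ttft_ms, 300, tok_s)
        cost = energy_cost_usd(450, secs, 0.15)  # 450W worst case, $0.15/kWh assumed
        print(f"{label}: {secs:.1f}s per reply, about ${cost:.5f} of electricity")
```

Even at the worst-case draw, a 25-second 70B reply costs a small fraction of a cent in power. That's what "$0 marginal" means in practice: not literally zero, just too small to meter.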

What you give up is a generation or two of model quality. Llama 3.1 70B is genuinely good at a lot of things, but it is not Claude Opus. Tool use, agentic work, long-context reasoning, all of that still leans toward the frontier models.

What you get back is privacy, predictable latency, no rate limits, and a model that still works on a flight with no wifi. There's a quieter payoff too. You start to understand what an LLM actually is. Once you've watched 4.7GB of weights mmap into memory, it gets hard to keep treating these things as magic.
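If you want to see the mmap part without downloading anything, here's a toy version. It writes a small stand-in file that borrows only the first 8 bytes of the GGUF layout (the 4-byte magic "GGUF", then a little-endian 32-bit version) and memory-maps it. The stand-in is not a real model, and real runtimes do far more than this.

```python
import mmap
import os
import struct
import tempfile


def write_stand_in(path: str, payload_bytes: int = 1_000_000) -> None:
    """Write a fake 'model': GGUF magic, a version number, then zero padding."""
    with open(path, "wb") as f:
        f.write(b"GGUF")               # 4-byte magic
        f.write(struct.pack("<I", 3))  # version as little-endian uint32
        f.write(b"\x00" * payload_bytes)


def peek_header(path: str):
    """Map the file and read the header without loading the whole file."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        magic = bytes(mm[:4])
        (version,) = struct.unpack("<I", mm[4:8])
        return magic, version, len(mm)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "stand-in.gguf")
    write_stand_in(path)
    print(peek_header(path))
    os.remove(path)
```

The point of mapping instead of reading is that the OS pages bytes in only when something touches them, which is why a multi-gigabyte file can "load" almost instantly.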

My own setup splits the work. Local model for drafting commit messages, fast text reformatting, anything I run hundreds of times a day where the round trip is the thing that hurts. Cloud model for real reasoning, code generation, anything where the output has to be at the frontier.

Edge: the model is the device

Edge AI means the model ships inside the device and runs on the device's own silicon: your phone's NPU, the chip in a Ring doorbell, an AC controller, an industrial sensor. No internet in the loop at all. The model wakes up, runs inference on local sensor data, takes an action, and goes back to sleep.

You've used edge AI today and probably didn't notice:

Face ID: the model deciding "this is you" runs on the iPhone's Neural Engine in around 100ms. Apple never sees your face.
Live Caption on Pixel: Whisper-class speech-to-text, running right on the device.
Smart compose in Gboard: token suggestions from a tiny language model that ships inside the keyboard.
Garage cameras doing person and vehicle detection with no cloud subscription.

Three constraints define edge:

Model size: usually under 500M parameters, often a quantized vision or audio model under 100MB. Nothing here is a frontier LLM.
Latency: 5 to 50ms. That's the whole point. There's no network in the loop.
Power: it has to run on a battery or a small SoC. NPUs (Apple's Neural Engine, Qualcomm Hexagon, the TPU inside Google Tensor) tune for performance per watt (tokens-per-watt, for language models), not raw throughput.
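The size constraint is mostly a quantization story, so here's the idea in miniature. This is a sketch of symmetric per-tensor int8 quantization in plain Python. It's one common scheme, not necessarily what any particular edge toolchain does, and the function names are mine.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the int8 values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [1.27, -0.64, 0.1, 0.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    print("int8 values:", q)
    print("max error:  ", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight goes from 4 bytes (float32) to 1 byte, so a 25M-parameter vision model drops from about 100MB to about 25MB, at the price of a rounding error of at most half a scale step per weight.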

You don't deploy edge AI the way you deploy cloud or local. You ship it inside the firmware. The toolchain is Core ML on Apple, NNAPI/TFLite on Android, ONNX Runtime if you want cross-platform. All three take a trained model and compile it down to run on the target NPU.

This is the most invisible of the three. Every "smart" device you own is doing some flavor of edge inference, and the promise the user actually feels is just "works fast, works offline".

Choosing between them

There's no universal answer. There is a useful decision sequence, and you run it top-down:

Does the data legally have to stay on-device? Health PII, biometrics, certain financial data under specific rules. If yes: edge or local. Decision's over.
Is the latency budget under 100ms? Real-time speech, AR overlays, sensor loops. Edge or local. Cloud round-trips can't hit it consistently.
Do you need frontier capability? Multi-step reasoning, long-context retrieval, tool-using agents. In 2026 cloud is the only honest answer. Local 70B is good, not frontier.
What's the call volume? At 10M+ calls a day the cloud bill starts to dominate. Local or self-hosted turns into a real option, same model family on your own hardware.
Is offline a hard requirement? Field tools, remote sites, planes. Local or edge.
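Written out as code, the sequence above is just a chain of early returns. This is a sketch of my own reading of it: the field names are invented, the thresholds are the ones in the list, and the final fallback to cloud is an assumption based on the earlier point that cloud is the right trade for most product features.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    data_must_stay_on_device: bool = False
    latency_budget_ms: float = 1000.0
    needs_frontier: bool = False
    calls_per_day: int = 0
    offline_required: bool = False


def choose_placement(w: Workload) -> str:
    """Walk the questions top-down; the first one that applies decides."""
    if w.data_must_stay_on_device:
        return "edge or local"
    if w.latency_budget_ms < 100:
        return "edge or local"
    if w.needs_frontier:
        return "cloud"
    if w.calls_per_day >= 10_000_000:
        return "local or self-hosted"
    if w.offline_required:
        return "local or edge"
    return "cloud"  # assumed default when nothing above forces a choice


if __name__ == "__main__":
    print(choose_placement(Workload(latency_budget_ms=40)))
    print(choose_placement(Workload(needs_frontier=True)))
    print(choose_placement(Workload(calls_per_day=25_000_000)))
```

Because the first match wins, a workload that needs frontier capability and also has to work offline comes out as "cloud" here. The sketch doesn't resolve that conflict; in practice it means splitting the feature or dropping a requirement.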

Most production systems end up hybrid anyway. A voice assistant does wake-word detection on-device (edge), ships the transcript to a cloud LLM (cloud), then renders the answer locally with on-device TTS (edge). A coding agent runs quick edits against a local model and escalates to a cloud model when the reasoning gets hard. Forcing one box on the whole product is usually the wrong move.
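The coding-agent pattern fits in a few lines. Everything here is hypothetical: `stub_local` and `stub_cloud` stand in for real model calls, and how you decide that a prompt is "hard" is the actual engineering problem, which this sketch leaves to a caller-supplied function.

```python
from typing import Callable, Optional, Tuple


def answer(prompt: str,
           local: Callable[[str], Optional[str]],
           cloud: Callable[[str], str],
           needs_frontier: Callable[[str], bool]) -> Tuple[str, str]:
    """Try the local model first; escalate when the prompt is flagged hard
    or the local model declines (returns None)."""
    if not needs_frontier(prompt):
        draft = local(prompt)
        if draft is not None:
            return "local", draft
    return "cloud", cloud(prompt)


# Hypothetical stand-ins so the sketch runs without any model installed.
def stub_local(prompt: str) -> Optional[str]:
    return None if len(prompt) > 200 else f"[local] {prompt}"


def stub_cloud(prompt: str) -> str:
    return f"[cloud] {prompt}"


if __name__ == "__main__":
    is_hard = lambda p: "refactor" in p  # toy heuristic, an assumption
    print(answer("rename this variable", stub_local, stub_cloud, is_hard))
    print(answer("refactor the auth module", stub_local, stub_cloud, is_hard))
```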

What's next in this series

The next six posts walk through the running part:

which models are at the frontier in 2026, and what their context windows actually buy you
what it takes to run a model on your machine (VRAM math, quantization, the real hardware tiers)
why Apple Silicon punches above its weight on local LLMs
the runtimes (llama.cpp, Ollama, LM Studio), and when to pick each
the economics of cloud tokens, and how prompt caching changes the math
what actually leaves your machine when you use AI

The pattern under all of them is the same. The model sits somewhere. The somewhere decides almost everything else.


From the dictionary

Terms used in this post

Quick reference for the 25 terms you met above. Each one comes from the AI dictionary.

Artificial Intelligence [AI]: Umbrella term for software that performs tasks usually associated with human reasoning: language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader. E.g. when a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
API [General]: Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
ChatGPT [AI]: OpenAI's consumer chat product, launched November 30, 2022. The first LLM to reach mass adoption, with 100 million users in two months. The product most people mean when they say AI today.
Claude [AI]: Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent. E.g. this blog's create-post skill drafts inline using Claude.
Context Window [NLP]: The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
Core ML [AI]: Apple's framework for running ML models on-device across iPhone, iPad, and Mac. Runs models converted from PyTorch or TensorFlow (via coremltools) on the CPU, GPU, and Neural Engine. Used by Face ID, Live Text, and most third-party iOS apps that ship local ML.
Edge AI [AI]: Running model inference on the device the data is captured on (phone, camera, sensor) rather than sending it to a server. Models are usually quantized and under 500M parameters. Latency is 5-50ms because there is no network in the loop. Powers Face ID, on-device speech recognition, doorbell person detection.
Gemini [AI]: Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps. E.g. Gemini is Google's answer to ChatGPT, with native access to Search.
GGUF [ML]: GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
GPU [General]: A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s: GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
Inference [ML]: Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
llama.cpp [AI]: A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language Model [AI]: A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. E.g. Claude is an LLM: it reads your message as tokens and generates a response one token at a time.
LM Studio [AI]: A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
Model [ML]: In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Neural Engine [General]: Apple's on-device NPU, present in every iPhone since the A11 (2017) and every Apple Silicon Mac. Handles Face ID, on-device dictation, photo classification, and increasingly ML model inference via Core ML. 16 cores in M3 and M4; Apple quotes 38 TOPS peak for the M4.
NPU [General]: Neural Processing Unit. A dedicated chip optimized for tensor operations at low power, used for on-device AI. Examples: Apple Neural Engine (iPhone, Mac), Qualcomm Hexagon (Android), the TPU inside Google Tensor. Optimized for performance per watt (tokens-per-watt, for language models) rather than peak throughput.
Ollama [AI]: A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
Parameters [ML]: The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
Prompt [NLP]: The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
Prompt Caching [AI]: Caching the model state for a stable prefix of a prompt so repeat calls skip recomputing it. Anthropic and OpenAI both expose this via API; cached tokens cost 5-10x less and have a 5-minute TTL on Anthropic. Critical for cost when you reuse system prompts or RAG context across requests.
Quantization [ML]: Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
Token [NLP]: The unit an LLM operates on, roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
VRAM [General]: Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
Weights [ML]: The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
