Where AI actually runs: cloud, local, edge

Where the model file actually sits when you use AI: a datacenter GPU (cloud), your own machine (local), or the device's silicon (edge). The trade-offs and how to pick.

When you "use AI", a model file is loaded into memory on a specific machine somewhere. That machine sits in one of three places: a datacenter you talk to over the internet, your own laptop or workstation, or the device the model is shipped inside of. Cloud, local, edge. Every product you've ever touched picks one (sometimes two), and the choice is rarely accidental.

This is post 1 of 7 in the Running series. I'll start here because the rest of it — which models, what hardware, which runtime, what it costs — only makes sense once you know which of the three boxes you're standing in.

Cloud: someone else's GPU

When you call the OpenAI API, use Claude on the web, or send a message in the Gemini app, the actual model sits on H100 or B200 GPUs in a datacenter. Take Claude Opus 4.7 as the example and, since vendors do not publish sizes, assume something around 2 trillion parameters of weights. Your prompt goes over the wire, the GPU runs forward passes, and the tokens stream back.

curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-opus-4-7",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Three numbers to understand cloud inference:

  • Latency: 200–1500ms time-to-first-token, depending on prompt size and model. Streaming masks this (you see tokens appearing as they generate), but the round trip is real.
  • Cost: priced per million tokens. Frontier models in 2026 sit around $3–15/Mtok input, $15–75/Mtok output. With prompt caching this drops 5–10x for repeated context.
  • Capability: this is where the frontier lives. The 2T-parameter model that tops every benchmark does not run on your laptop. It cannot. Frontier models need 8 to 32 H100s just to load the weights.
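The GPU count in the capability bullet follows from simple arithmetic: bytes of weights divided by memory per GPU. A minimal sketch, assuming the hypothetical 2T-parameter model from this post and 80GB per H100, and ignoring KV cache and activations (the function names are mine, for illustration only):

```python
import math

H100_VRAM_GB = 80  # memory per H100, in GB


def weights_gb(params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9


def gpus_to_hold(params: float, bits_per_param: float, vram_gb: float = H100_VRAM_GB) -> int:
    """Smallest GPU count whose combined memory fits the weights.

    Ignores KV cache, activations, and runtime overhead, so real
    deployments need more than this.
    """
    return math.ceil(weights_gb(params, bits_per_param) / vram_gb)


if __name__ == "__main__":
    params = 2e12  # assumed size; vendors do not publish real numbers
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        gb = weights_gb(params, bits)
        print(f"{label}: {gb:,.0f} GB of weights, at least {gpus_to_hold(params, bits)} H100s")
```

Under these assumptions the 8-bit and 4-bit cases land inside the 8 to 32 range quoted above, while a 16-bit copy of a model that large would need more.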

What you give up: your data leaves the building, you depend on a vendor's uptime, and pricing is theirs to change. What you gain: zero infrastructure to run, the best model on the planet, and a bill that stops the moment you stop sending requests.

This is the right answer for almost every product feature where capability matters more than data sovereignty. It is the wrong answer when you're processing PII you can't legally ship offshore, or when you're doing 10M calls a day and the bill is starting to dominate the P&L.
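To see when the bill starts to matter, it helps to multiply it out. A rough sketch using the low end of the per-token prices quoted above; the tokens-per-call figures are hypothetical placeholders, not measurements:

```python
def daily_cost_usd(
    calls_per_day: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Daily API spend: tokens (in millions) times the price per million tokens."""
    input_mtok = calls_per_day * input_tokens_per_call / 1e6
    output_mtok = calls_per_day * output_tokens_per_call / 1e6
    return input_mtok * input_price_per_mtok + output_mtok * output_price_per_mtok


if __name__ == "__main__":
    # Assumed workload: 10M calls/day, 1,000 input and 200 output tokens per call.
    cost = daily_cost_usd(10_000_000, 1_000, 200, input_price_per_mtok=3.0, output_price_per_mtok=15.0)
    print(f"${cost:,.0f} per day")
```

With those assumed token counts, 10M calls a day comes to about $60,000 a day before any caching discount, which is the scale at which self-hosting starts to look worth the effort.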

Local: a model file on your disk

Local means the model weights live on your hard drive and the runtime loads them into RAM (CPU) or VRAM (GPU). When I say "I run llama3.1:8b on my M3 Max", I mean a 4.7GB GGUF file on the SSD, mmapped into the M3's unified memory, with Metal kernels doing the matrix multiplies.

brew install ollama
brew services start ollama   # start the local server if it is not already running
ollama run llama3.1:8b "explain TCP slow start in two sentences"

That is the entire on-ramp. The model downloads once (~5GB), and from there every call is local. No network, no API key, no rate limits, no per-token bill.
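Ollama also serves the same model over HTTP on localhost:11434, so a script can call it like any other API. A minimal sketch using only the standard library; it assumes the server is running and the model has already been pulled, and the helper names are mine:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.1:8b", timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text.

    Raises urllib.error.URLError if no server is listening on localhost:11434.
    """
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("explain TCP slow start in two sentences")` should return the same kind of answer as the CLI command above, with no API key and no traffic leaving the machine.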

The numbers are different:

  • Latency: 50–500ms time-to-first-token on consumer hardware. Throughput is hardware-bound. On an M3 Max with 64GB unified memory, an 8B model does about 60–90 tok/s; a 70B Q4 quant does 10–15 tok/s. An RTX 4090 is faster on smaller models and slower on big ones (the 24GB VRAM ceiling becomes the bottleneck).
  • Cost: $0 marginal per call. Capex is the laptop or workstation; opex is the electricity (a 4090 pulls 450W under load).
  • Capability: the ceiling depends on RAM/VRAM. Consumer hardware caps out at 70B-class quantized models. The frontier (1T+) is out of reach without a server.
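The throughput and capability numbers can be sanity-checked with a back-of-the-envelope rule: generating one token reads roughly every weight once, so tokens per second is about memory bandwidth divided by model size. A sketch under stated assumptions: 400 GB/s for the 64GB M3 Max and 1008 GB/s for the RTX 4090 are vendor spec figures I am supplying, and the estimate ignores KV cache reads and compute limits:

```python
def fits(model_gb: float, memory_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the weights fit in memory with some headroom for KV cache and runtime."""
    return model_gb + headroom_gb <= memory_gb


def decode_tok_per_s_estimate(model_gb: float, bandwidth_gb_s: float) -> float:
    """Rough decode speed: each generated token streams all weights from memory once."""
    return bandwidth_gb_s / model_gb


if __name__ == "__main__":
    models = {"8B Q4 (4.7GB)": 4.7, "70B Q4 (~40GB)": 40.0}
    machines = {"M3 Max 64GB": (64.0, 400.0), "RTX 4090 24GB": (24.0, 1008.0)}
    for machine, (memory_gb, bandwidth) in machines.items():
        for name, size_gb in models.items():
            if fits(size_gb, memory_gb):
                rate = decode_tok_per_s_estimate(size_gb, bandwidth)
                print(f"{machine}, {name}: ~{rate:.0f} tok/s")
            else:
                print(f"{machine}, {name}: does not fit")
```

The estimate gives roughly 85 tok/s for the 8B model and 10 tok/s for the 70B quant on the M3 Max, in the same ballpark as the figures above, and it shows why the 4090 wins on small models but cannot hold a 70B quant at all.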

What you give up: a generation or two of model quality. Llama 3.1 70B is brilliant for many tasks but it is not Claude Opus. Tool use, agentic work, and long-context reasoning still favor frontier models.

What you gain: privacy, predictable latency, no rate limits, and the model still works on a flight with no wifi. Plus you start to understand what an LLM actually is. Once you have watched 4.7GB of weights mmap into memory, it is hard to keep treating these things as magic.

In my own setup: I use a local model for drafting commit messages, fast text reformatting, anything I do hundreds of times a day where round-trip latency matters. I use a cloud model for real reasoning, code generation, and anything where the output quality needs to be at the frontier.

Edge: the model is the device

Edge AI means the model is baked into the device's silicon (your phone's NPU, the chip in a Ring doorbell, an AC controller, an industrial sensor). There is no internet in the loop. The model wakes up, runs inference on local sensor data, takes an action, and goes back to sleep.

You have used edge AI today, probably without noticing:

  • Face ID: the model that decides "this is you" runs on the iPhone's Neural Engine in around 100ms. Apple cannot see your face.
  • Live Caption on Pixel: Whisper-class speech-to-text running on the device.
  • Smart compose in Gboard: token suggestions from a tiny language model that ships with the keyboard.
  • Garage cameras doing person/vehicle detection without a cloud subscription.

Three constraints define edge:

  • Model size: usually under 500M parameters, often a quantized vision or audio model under 100MB. Nothing here is a frontier LLM.
  • Latency: 5–50ms. The whole point: there is no network in the loop.
  • Power: must run on battery or a small SoC. NPUs (Apple's Neural Engine, Qualcomm Hexagon, the TPU in Google Tensor) optimize for performance per watt, not raw throughput.
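Quantization is what gets models under those size limits. The sketch below shows one common scheme, symmetric per-tensor int8 quantization, on a toy weight list. It illustrates the idea only and is not necessarily what Core ML or TFLite apply to a given model:

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats; the rounding error is at most scale / 2 per weight."""
    return [q * scale for q in quantized]


def size_mb(n_params: int, bits_per_param: int) -> float:
    """Storage for the weights alone, in MB (1 MB = 1e6 bytes)."""
    return n_params * bits_per_param / 8 / 1e6


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0]
    q, scale = quantize_int8(weights)
    print(q, dequantize(q, scale))
    # A 100M-parameter model: 32-bit floats vs int8.
    print(size_mb(100_000_000, 32), "MB ->", size_mb(100_000_000, 8), "MB")
```

Each weight shrinks from four bytes to one, which is how a 100M-parameter model drops from 400MB to the 100MB range mentioned above.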

You do not deploy edge AI the way you deploy cloud or local. You ship it inside the firmware. The toolchain is Core ML on Apple, NNAPI/TFLite on Android, ONNX Runtime cross-platform. All of them take a trained model and compile it to run on the target NPU.

This is the most invisible of the three. Every "smart" device you own is doing some flavor of edge inference, and the user-visible promise is simply "works fast and works offline".

Choosing between them

There is no universal answer, but there is a useful decision sequence. Run it top-down:

  1. Does the data legally have to stay on-device? Health PII, biometrics, certain financial data under specific regulations. If yes: edge or local. End of decision.
  2. Is the latency budget under 100ms? Real-time speech, AR overlays, sensor loops. Edge or local. Cloud round-trips cannot hit it consistently.
  3. Do you need frontier capability? Multi-step reasoning, long-context retrieval, tool-using agents. Cloud is the only honest answer in 2026. Local 70B is good, not frontier.
  4. What is the call volume? At 10M+ calls a day, cloud bills start dominating. Local or self-hosted becomes a real option (same model family, your hardware).
  5. Is offline a hard requirement? Field tools, remote sites, planes. Local or edge.
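The sequence above can be written down as a first-match-wins function. This is a sketch of the list, not a rule engine: the parameter names are mine, treating every step as terminal is my reading of "top-down", and the final cloud default is an assumption the list does not state:

```python
def choose_placement(
    *,
    data_must_stay_on_device: bool = False,
    latency_budget_ms: float = 1000.0,
    needs_frontier: bool = False,
    calls_per_day: int = 0,
    must_work_offline: bool = False,
) -> str:
    """Walk the five questions in order and return the first answer that applies."""
    if data_must_stay_on_device:      # 1. legal or regulatory constraint
        return "edge or local"
    if latency_budget_ms < 100:       # 2. real-time budget
        return "edge or local"
    if needs_frontier:                # 3. frontier capability
        return "cloud"
    if calls_per_day >= 10_000_000:   # 4. volume where cloud bills dominate
        return "local or self-hosted"
    if must_work_offline:             # 5. hard offline requirement
        return "local or edge"
    return "cloud"                    # assumed default when nothing forces a choice


if __name__ == "__main__":
    print(choose_placement(latency_budget_ms=50))
    print(choose_placement(needs_frontier=True))
    print(choose_placement(calls_per_day=20_000_000))
```

Because earlier questions win, a workload that both needs frontier capability and must keep data on-device resolves to edge or local, which matches the "end of decision" note on step 1.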

Most production systems are hybrid. A voice assistant does wakeword on-device (edge), sends transcribed audio to a cloud LLM (cloud), then renders the answer locally with on-device TTS (edge). A coding agent runs quick edits against a local model and escalates to a cloud model for hard reasoning. Picking one box for the whole product is usually the wrong move.
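The coding-agent pattern can be sketched as a small router: try to classify the request, send easy ones to the local model, and escalate the rest. Everything here is hypothetical, including the keyword heuristic and the callables standing in for real model clients:

```python
from typing import Callable, Tuple

# Hypothetical hints that a request needs more than a quick edit.
HARD_HINTS = ("prove", "refactor", "design", "debug")


def looks_hard(prompt: str, max_easy_chars: int = 400) -> bool:
    """Crude heuristic: long prompts or certain keywords go to the cloud."""
    text = prompt.lower()
    return len(prompt) > max_easy_chars or any(hint in text for hint in HARD_HINTS)


def route(
    prompt: str,
    local_model: Callable[[str], str],
    cloud_model: Callable[[str], str],
    is_hard: Callable[[str], bool] = looks_hard,
) -> Tuple[str, str]:
    """Return (where it ran, answer), escalating to the cloud model for hard prompts."""
    if is_hard(prompt):
        return "cloud", cloud_model(prompt)
    return "local", local_model(prompt)


if __name__ == "__main__":
    # Stub models so the sketch runs without any server or API key.
    local = lambda p: f"[local] {p}"
    cloud = lambda p: f"[cloud] {p}"
    print(route("rename this variable", local, cloud))
    print(route("debug this deadlock", local, cloud))
```

In a real system the heuristic would more likely be a classifier or a confidence signal from the local model, but the shape stays the same: one entry point, two backends, and an explicit escalation rule.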

What's next in this series

The next six posts walk through the running part:

The pattern across all of them: the model sits somewhere. The somewhere decides almost everything else.

From the dictionary

Terms used in this post

Quick reference for the 25 terms you met above. Each one comes from the AI dictionary.

Artificial Intelligence [AI]
Umbrella term for software that performs tasks usually associated with human reasoning: language, perception, decision-making. Coined in the 1955 proposal for the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
API [General]
Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
ChatGPT [AI]
OpenAI's consumer chat product, launched November 30, 2022. The first LLM product to reach mass adoption, with a reported 100 million users in two months. The product most people mean when they say AI today.
Claude [AI]
Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.
e.g. This blog's create-post skill drafts inline using Claude.
Context Window [NLP]
The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
Core ML [AI]
Apple's framework for running ML models on-device across iPhone, iPad, and Mac. Models trained in PyTorch or TensorFlow are converted with coremltools and run on the Neural Engine, GPU, or CPU. Used by Face ID, Live Text, and most third-party iOS apps that ship local ML.
Edge AI [AI]
Running model inference on the device the data is captured on (phone, camera, sensor) rather than sending it to a server. Models are usually quantized and under 500M parameters. Latency is 5-50ms because there is no network in the loop. Powers Face ID, on-device speech recognition, doorbell person detection.
Gemini [AI]
Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps.
e.g. Gemini is Google's answer to ChatGPT, with native access to Search.
GGUF [ML]
GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
GPU [General]
A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s: GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
Inference [ML]
Running a trained model on a new input to produce an output, with the weights frozen. For an LLM that means one forward pass per generated token. Much cheaper than training, which is what makes pay-per-token APIs viable.
llama.cpp [AI]
A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language Model [AI]
A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
e.g. Claude is an LLM: it reads your message as tokens and generates a response one token at a time.
LM Studio [AI]
A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
Model [ML]
In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Neural Engine [General]
Apple's on-device NPU, present in every iPhone since the A11 (2017) and every Apple Silicon Mac. Handles Face ID, on-device dictation, photo classification, and increasingly ML model inference via Core ML. 16 cores in M3 and M4; Apple quotes 18 TOPS for the M3 and 38 TOPS for the M4.
NPU [General]
Neural Processing Unit. A dedicated chip optimized for tensor operations at low power, used for on-device AI. Examples: Apple Neural Engine (iPhone, Mac), Qualcomm Hexagon (Android), the TPU in Google Tensor (Pixel). Optimized for performance per watt rather than peak throughput.
Ollama [AI]
A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
Parameters [ML]
The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
Prompt [NLP]
The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
Prompt Caching [AI]
Caching the model state for a stable prefix of a prompt so repeat calls skip recomputing it. Anthropic and OpenAI both expose this via API; cached tokens cost 5-10x less, and Anthropic's cache has a 5-minute default TTL. Critical for cost when you reuse system prompts or RAG context across requests.
Quantization [ML]
Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
Token [NLP]
The unit an LLM operates on, roughly a word or a piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
VRAM [General]
Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40 or 80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
Weights [ML]
The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
