The runtimes: llama.cpp, Ollama, LM Studio
llama.cpp is the engine; Ollama and LM Studio wrap it. What each does, when to pick which, and why the OpenAI-compatible APIs are mostly but not entirely interchangeable.

A model file is just numbers. The runtime is what reads those numbers, holds them in memory, and turns your prompt into tokens. Three runtimes own the local-LLM space in 2026, and one of them sits underneath the other two.
This is post 5 of 7 in the Running series. Last post finished the hardware tour. This one is about the software layer that actually runs the model on whatever hardware you bought.
llama.cpp: the engine
llama.cpp is a C++ implementation of LLM inference. It loads GGUF model files, supports CPU, CUDA, and Metal backends, and is the reason local LLMs are practical at all. Started as one person's project to run Llama on a MacBook; in 2026 it is the de-facto standard inference engine for open weights.
What llama.cpp gives you:
- GGUF. A single-file model format with weights, tokenizer, and metadata. Self-contained. You can email a model.
- Quantization. Q4_K_M, Q5_K_S, Q8_0. The K-quants live here.
- Hardware backends. CPU, CUDA, Metal, Vulkan, ROCm. Same model file, runs everywhere.
- An OpenAI-compatible HTTP server. Spin it up with
./llama-serverand it speaks the OpenAI Chat Completions API.
It is also a developer tool, not a product. The CLI is fine; the build process is sometimes painful (CUDA flags, Metal flags, version drift). You can use it directly, but most people don't.
# clone and build llama.cpp with metal support
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make LLAMA_METAL=1
# run a gguf model file
./llama-cli -m ~/models/llama-3.1-8b-instruct.Q4_K_M.gguf -p "explain DNS in two sentences"
If you understand what those commands do and you're comfortable with C++ build errors, llama.cpp is yours. If not, keep reading.
Ollama: llama.cpp with a friendly face
Ollama wraps llama.cpp and adds the things llama.cpp deliberately doesn't ship: a model registry, automatic downloads, a daemon, and a clean CLI.
# install ollama via homebrew
brew install ollama
# run a model — pulls if missing
ollama run llama3.1:8b
That second command does a lot: it pulls the right GGUF file from Ollama's registry, picks the quantization that fits your hardware, starts a background daemon, and drops you into a chat. First time, it takes a minute. Every time after, it's instant.
Ollama also exposes an OpenAI-compatible API on localhost:11434 by default. Any tool that talks to OpenAI can talk to Ollama with a base URL change. This is how 90% of local-LLM tooling actually integrates: OpenAI client → Ollama → llama.cpp → your hardware.
What I use Ollama for:
- Drafting commit messages with a 1B model.
- Local code completion as a fallback when the network is down.
- Anything I do dozens of times a day where the round-trip latency to a cloud API is worse than the model quality difference.
Where Ollama is weak: power-user knobs. You don't get fine-grained control over quantization variants, KV cache settings, or the underlying llama.cpp flags without going to a Modelfile. For most people, that's fine. For a few use cases, it's a wall.
LM Studio: the GUI
LM Studio is also a llama.cpp wrapper, but the audience is different: it's a desktop application with a model browser, a chat UI, and a server tab. Zero terminal required.
What LM Studio is best at:
- Model discovery. A built-in browser of Hugging Face GGUF files with quality scores per quantization variant.
- Hardware-aware suggestions. Tells you what fits, what almost fits, what won't.
- Local server. A toggle in the UI starts an OpenAI-compatible server. No daemon to manage.
- Comparing models side by side. A multi-pane chat UI that runs two models against the same prompt.
It is closed-source, which matters to some people and not to others. The free tier is unrestricted for personal use; commercial use needs a license. For a developer evaluating local models before picking one, LM Studio is the fastest path from "I have a Mac" to "I'm chatting with seven different models".
How they actually relate

The architecture, drawn flat:
- llama.cpp is the engine. It does the math.
- Ollama is a CLI + daemon + registry on top of llama.cpp.
- LM Studio is a GUI + registry + server on top of llama.cpp.
Both Ollama and LM Studio expose an OpenAI-compatible HTTP API. Both pull GGUF files from public registries. Both pick a sensible quantization for your hardware. The differences are surface-level: one is a terminal tool, the other is a desktop app.
You can run all three at once. They don't conflict; they just hold copies of the same model files in different cache directories. The disk space is the only real cost.
Picking one
Honest defaults:
- You write code and live in the terminal: Ollama. Scriptable, integrates cleanly with shell pipelines, daemon-based.
- You want a GUI to explore models: LM Studio. Better discovery, side-by-side comparison.
- You're shipping a product, embedding inference, or doing weird quantization research: llama.cpp directly. You'll need the flags.
I run Ollama as my daemon and use LM Studio occasionally to evaluate new models before deciding what to add to my Ollama setup. They are complementary, not rivals.
Other runtimes worth knowing
Three more, in shrinking order of relevance:
- vLLM. Production inference server for serving open models at scale. PagedAttention, continuous batching. If you're hosting a model behind an API for many users, vLLM is what you actually want; llama.cpp is for single-user.
- MLX (Apple). Apple's own ML framework, optimized for unified memory. Fast for prototyping models from scratch on Apple Silicon. Not the path for running pre-trained GGUFs — that's still llama.cpp.
- TensorRT-LLM. NVIDIA's inference toolkit. Very fast on H100/H200, complicated to set up. Datacenter tool.
For "run a model on my laptop", llama.cpp through Ollama or LM Studio is the answer. For "serve a model to my company's customers", vLLM or TensorRT-LLM.
A common gotcha
The OpenAI compatibility on Ollama and LM Studio is mostly real but not perfect. Differences I've hit:
- Tool calling: Ollama's tool-call schema is OpenAI-compatible but not all open models actually produce tool calls reliably. The runtime compatibility doesn't fix the model's training.
- Streaming: works the same way. No surprises here.
- Embeddings endpoints: supported on both, but model availability varies.
If your code runs against the OpenAI API and you point it at Ollama, the basics work. The advanced features depend on the underlying model, not the runtime.
What's next
Most production AI usage isn't local. It's an API call to Anthropic, OpenAI, or Google, where the unit of cost is the token. Next post unpacks how token billing actually works, why bills surprise people, and how prompt caching changes the math.
From the dictionary
Terms used in this post
Quick reference for the 17 terms you met above. Each one comes from the AI dictionary.
- Artificial IntelligenceAI
- Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
- e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
- APIGeneral
- Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
- EmbeddingNLP
- A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
- GGUFML
- GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
- InferenceML
- Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
- llama.cppAI
- A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
- Large Language ModelAI
- A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
- e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
- LM StudioAI
- A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
- Machine LearningML
- A subset of AI where the system learns patterns from data instead of following hand-written rules. The output is a model — a set of learned numbers that maps inputs to outputs. Spam filters, recommendation systems, and credit-risk scorers are classical ML.
- e.g. Gmail's spam filter learns which emails you mark as junk and updates its model — that's machine learning, not a rule someone wrote.
- ModelML
- In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
- OllamaAI
- A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
- PromptNLP
- The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
- Prompt CachingAI
- Caching the model state for a stable prefix of a prompt so repeat calls skip recomputing it. Anthropic and OpenAI both expose this via API; cached tokens cost 5-10x less and have a 5-minute TTL on Anthropic. Critical for cost when you reuse system prompts or RAG context across requests.
- QuantizationML
- Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
- TokenNLP
- The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
- TrainingML
- The expensive one-time process of running a learning algorithm over data until the models parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
- WeightsML
- The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
Rate this article
How helpful did you find this?
- 01
Where AI actually runs: cloud, local, edge
March 16, 2026
- 02
What leaves your machine when you use AI
March 31, 2026
- 03
LLM APIs and the economics of tokens
March 28, 2026
- 04
The runtimes: llama.cpp, Ollama, LM Studio
March 26, 2026
- 05
Why Apple Silicon punches above its weight on local LLMs
March 23, 2026
- 06
What it takes to run a model on your machine
March 21, 2026
- 07
The major LLMs in 2026
March 18, 2026
Newsletter
Get new articles in your inbox
AI engineering, LLM systems, and software architecture — no filler.
No spam. Unsubscribe any time.
Discussion
Comments
Leave a note about the article, architecture choices, or what you would build next.
Loading comments...