March 26, 20266 min read

The runtimes: llama.cpp, Ollama, LM Studio

llama.cpp is the engine. Ollama and LM Studio wrap it. What each one does, when to reach for which, and why the OpenAI-compatible APIs are mostly but not entirely interchangeable.

A model file is just numbers sitting on disk. Something has to read those numbers, hold them in memory, and turn your prompt into tokens. That something is the runtime. Three of them own the local-LLM space in 2026, and here's the part most people miss: one of them sits underneath the other two.

This is post 5 of 7 in the AI Running series. The hardware tour is done. This is the software layer that actually runs the model on whatever box you ended up buying.

llama.cpp: the engine

llama.cpp is a C++ implementation of LLM inference. It loads GGUF model files, runs on CPU, CUDA, and Metal backends, and is basically the reason local LLMs are practical at all. It started as one person's project to get Llama running on a MacBook. In 2026 it's the de-facto standard inference engine for open weights.

What llama.cpp gives you:

GGUF. A single-file model format that bundles weights, tokenizer, and metadata together. Self-contained. You can email a model.
Quantization. Q4_K_M, Q5_K_S, Q8_0. The K-quants live here.
Hardware backends. CPU, CUDA, Metal, Vulkan, ROCm. Same model file, runs everywhere.
An OpenAI-compatible HTTP server. Spin it up with ./llama-server and it speaks the OpenAI Chat Completions API.

It's a developer tool, though, not a product. The CLI is fine. The build process is sometimes painful: CUDA flags, Metal flags, version drift. You can use it directly, but most people don't.

# clone and build llama.cpp with metal support

git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp && make LLAMA_METAL=1

# run a gguf model file

./llama-cli -m ~/models/llama-3.1-8b-instruct.Q4_K_M.gguf -p "explain DNS in two sentences"

If you read those two commands and know what they do, and a C++ build error doesn't ruin your afternoon, llama.cpp is yours. If not, keep reading.

Ollama: llama.cpp with a friendly face

Ollama wraps llama.cpp and adds the parts llama.cpp deliberately leaves out: a model registry, automatic downloads, a background daemon, and a clean CLI.

# install ollama via homebrew

brew install ollama

# run a model (pulls if missing)

ollama run llama3.1:8b

That second command is doing a lot. It pulls the right GGUF file from Ollama's registry, picks a quantization that fits your hardware, starts a background daemon, and drops you into a chat. The first run takes a minute. Every run after that is instant.

Ollama also exposes an OpenAI-compatible API on localhost:11434 by default. Any tool that talks to OpenAI can talk to Ollama with one base-URL change. That's how roughly 90% of local-LLM tooling actually wires up: OpenAI client → Ollama → llama.cpp → your hardware.

What I use Ollama for:

Drafting commit messages with a 1B model.
Local code completion as a fallback when the network's down.
Anything I do dozens of times a day, where the round-trip to a cloud API costs more than the model-quality gap.

Where it gets thin is the power-user knobs. You don't get fine-grained control over quantization variants, KV cache settings, or the raw llama.cpp flags unless you drop down to a Modelfile. For most people that's a non-issue. For a handful of use cases it's a wall.

LM Studio: the GUI

LM Studio is also a llama.cpp wrapper, but it's aimed at a different person. It's a desktop app with a model browser, a chat UI, and a server tab. No terminal required.

What LM Studio does best:

Model discovery. A built-in browser of Hugging Face GGUF files, with quality scores per quantization variant.
Hardware-aware suggestions. It tells you what fits, what almost fits, and what won't.
Local server. A toggle in the UI starts an OpenAI-compatible server. No daemon to babysit.
Side-by-side comparison. A multi-pane chat UI that runs two models against the same prompt.

It's closed-source, which matters to some people and not at all to others. The free tier is unrestricted for personal use. Commercial use needs a license. If you're a developer kicking the tires on local models before committing to one, LM Studio is the fastest route from "I have a Mac" to "I'm chatting with seven different models".

How they actually relate

Runtime stack from app to silicon

The architecture, drawn flat:

llama.cpp is the engine. It does the math.
Ollama is a CLI plus daemon plus registry on top of llama.cpp.
LM Studio is a GUI plus registry plus server on top of llama.cpp.

Both Ollama and LM Studio expose an OpenAI-compatible HTTP API. Both pull GGUF files from public registries. Both pick a sensible quantization for your hardware. The real differences are on the surface: one is a terminal tool, the other is a desktop app.

You can run all three at once. They don't fight each other. They just keep their own copies of the same model files in separate cache directories. Disk space is the only cost.

Picking one

Honest defaults:

You write code and live in the terminal: Ollama. Scriptable, plays nicely with shell pipelines, daemon-based.
You want a GUI to explore models: LM Studio. Better discovery, side-by-side comparison.
You're shipping a product, embedding inference, or doing weird quantization research: llama.cpp directly. You'll need the flags.

I run Ollama as my daemon and pull up LM Studio now and then to evaluate a new model before deciding whether it's worth adding to the Ollama setup. They're complementary, not rivals.

Other runtimes worth knowing

Three more, in shrinking order of relevance:

vLLM. A production inference server for serving open models at scale. PagedAttention, continuous batching. If you're hosting a model behind an API for many users, vLLM is what you actually want. llama.cpp is for single-user.
MLX (Apple). Apple's own ML framework, tuned for unified memory. Fast for prototyping models from scratch on Apple Silicon. Not the path for running pre-trained GGUFs. That's still llama.cpp.
TensorRT-LLM. NVIDIA's inference toolkit. Very fast on H100/H200, and a pain to set up. A datacenter tool.

For "run a model on my laptop," the answer is llama.cpp through Ollama or LM Studio. For "serve a model to my company's customers," it's vLLM or TensorRT-LLM.

A common gotcha

The OpenAI compatibility on Ollama and LM Studio is mostly real but not perfect. Differences I've actually hit:

Tool calling: Ollama's tool-call schema is OpenAI-compatible, but not every open model reliably produces tool calls. Runtime compatibility can't fix the model's training.
Streaming: works the same way. No surprises.
Embeddings endpoints: supported on both, but model availability varies.

If your code runs against the OpenAI API and you point it at Ollama, the basics just work. The advanced features depend on the model underneath, not the runtime.

Where to go from here

That's the local stack, end to end. You know what the hardware costs, what the model files are, and now what reads them. The runtime is the last piece of the how-to-run-it puzzle.

If you'd rather stop reading and start building, the hands-on Local LLMs series walks through exactly that: installing one of these runtimes, pulling a model, and getting it answering you on your own machine. The OpenAI-compatible API you just read about is the hook everything plugs into. If you're still weighing local against cloud, the next two posts close that loop: what the cloud actually costs per token, and what it quietly does with your data.

AI AI Running Llama Cpp LLM Lm Studio Ollama

From the dictionary

Terms used in this post

Quick reference for the 17 terms you met above. Each one comes from the AI dictionary.

Artificial IntelligenceAI: Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.; e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
APIGeneral: Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
EmbeddingNLP: A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
GGUFML: GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
InferenceML: Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
llama.cppAI: A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language ModelAI: A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.; e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
LM StudioAI: A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
Machine LearningML: A subset of AI where the system learns patterns from data instead of following hand-written rules. The output is a model — a set of learned numbers that maps inputs to outputs. Spam filters, recommendation systems, and credit-risk scorers are classical ML.; e.g. Gmail's spam filter learns which emails you mark as junk and updates its model — that's machine learning, not a rule someone wrote.
ModelML: In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
OllamaAI: A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
PromptNLP: The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
Prompt CachingAI: Caching the model state for a stable prefix of a prompt so repeat calls skip recomputing it. Anthropic and OpenAI both expose this via API; cached tokens cost 5-10x less and have a 5-minute TTL on Anthropic. Critical for cost when you reuse system prompts or RAG context across requests.
QuantizationML: Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
TokenNLP: The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
TrainingML: The expensive one-time process of running a learning algorithm over data until the models parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
WeightsML: The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.

Rate this article

How helpful did you find this?

Series

AI Running

5 / 7 posts

Browse all in AI Running →

Newsletter

Get new articles in your inbox

AI engineering, LLM systems, and software architecture — no filler.

No spam. Unsubscribe any time.

Discussion

Comments

Leave a note about the article, architecture choices, or what you would build next.

Loading comments...

On this page

From the dictionary

Terms in this post

Artificial IntelligenceAI: Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
APIGeneral: Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
EmbeddingNLP: A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
GGUFML: GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
InferenceML: Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
llama.cppAI: A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language ModelAI: A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
LM StudioAI: A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
Machine LearningML: A subset of AI where the system learns patterns from data instead of following hand-written rules. The output is a model — a set of learned numbers that maps inputs to outputs. Spam filters, recommendation systems, and credit-risk scorers are classical ML.
ModelML: In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
OllamaAI: A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
PromptNLP: The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
Prompt CachingAI: Caching the model state for a stable prefix of a prompt so repeat calls skip recomputing it. Anthropic and OpenAI both expose this via API; cached tokens cost 5-10x less and have a 5-minute TTL on Anthropic. Critical for cost when you reuse system prompts or RAG context across requests.
QuantizationML: Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
TokenNLP: The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
TrainingML: The expensive one-time process of running a learning algorithm over data until the models parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
WeightsML: The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.

Series

AI Running

5 / 7 posts

Browse all in AI Running →