Ollama
Definition
A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
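As a rough illustration of the HTTP API mentioned above, here is a minimal sketch that sends one prompt to a local Ollama server using only the Python standard library. It assumes Ollama is running on the default port and that a model has already been pulled; the model tag `llama3.2:3b` and the helper names `build_generate_request` and `generate` are placeholders chosen for this example, not part of Ollama itself.

```python
import json
import urllib.request

OLLAMA_HOST = "http://localhost:11434"  # default address from the definition above


def build_generate_request(model, prompt, host=OLLAMA_HOST):
    """Build a POST request for Ollama's /api/generate endpoint (hypothetical helper)."""
    payload = {
        "model": model,    # assumed to be pulled already, e.g. with `ollama pull`
        "prompt": prompt,
        "stream": False,   # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def generate(model, prompt, host=OLLAMA_HOST, timeout=120):
    """Send the request and return the generated text from the "response" field."""
    req = build_generate_request(model, prompt, host)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]


if __name__ == "__main__":
    # Requires a running Ollama server; the model tag is an assumption.
    print(generate("llama3.2:3b", "Explain GGUF in one sentence."))
```

The first call to a model may be noticeably slower than later ones, since Ollama loads the model into memory on demand.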
Posts that use this term
- Troubleshooting local LLMs and keeping up
The catalog of common local-LLM failures: OOM, slow tok/s, garbage output, instruction drift, RAG miss, tool-call hallucination. Plus where to follow the field as it moves.
- Fine-tuning a model locally
When fine-tuning is the right answer (rarely) and how to do it on consumer hardware: LoRA, QLoRA, MLX-LM, Unsloth. A worked example fine-tuning Llama 3.2 3B on a 16GB Mac.
- Local agents and tool use
Function calling on open models in 2026: which models actually work (Qwen 2.5, Hermes 3, Llama 14B+), why local agents fail when they fail, and how to build defensive scaffolding around them.
- Local RAG and embeddings
A complete local RAG pipeline in 30 lines: nomic-embed-text for embeddings, Chroma for the vector DB, Llama 3.2 for the chat model. Why local RAG often beats cloud RAG for personal knowledge bases.
- Integrating a local LLM into your workflow
Wire your local LLM into VS Code (Continue, Cline), web UIs (Open WebUI, LibreChat, Page Assist), and your own apps via the OpenAI-compatible API. The swap-cloud-for-local pattern in real codebases.
- Your first local LLM, end to end
Install Ollama, pull Llama 3.2 3B, chat, hit the OpenAI-compatible API, and troubleshoot the five things that go wrong on first install. By the end of this post you have a working local LLM.
- Every machine can run a local LLM (here's what fits)
Per-tier guide: 8GB integrated graphics, 16GB MacBook Air, 8/12/16/24/32GB VRAM PCs, 24/32/64/128/192GB Macs. Specific models, specific tok/s, specific configs. Every tier runs something useful.
- System requirements by OS for local LLMs
What macOS, Linux, and Windows each need to run a local LLM in 2026. Native Windows now works smoothly; WSL2 for Linux power users; Mac is the smoothest path; Linux gives you the most knobs.
- Streaming, throughput, and the KV cache
TTFT vs tok/s, why streaming feels faster, and the KV cache that keeps the 1000th token costing roughly the same as the first instead of reprocessing the whole context. KV cache quantization (Q8/Q4 KV) and why it should be your default.
- The local-LLM vocabulary
Parameters, B, dense vs MoE, base vs instruct, tokens, context window, chat template, GGUF, quantization suffixes. After this post you can read any HuggingFace model card.
- What leaves your machine when you use AI
What providers actually see, log, and retain when you call an LLM API in 2026. What 'we don't train on your data' really means, free vs paid tier differences, and when local is the only safe option.
- The runtimes: llama.cpp, Ollama, LM Studio
llama.cpp is the engine; Ollama and LM Studio wrap it. What each does, when to pick which, and why the OpenAI-compatible APIs are mostly but not entirely interchangeable.
- Why Apple Silicon punches above its weight on local LLMs
Unified memory means the GPU sees all of RAM. Why that beats discrete-GPU PCs above 32B parameters, what fits in 16/32/64/128/192GB, and where Apple Silicon still loses.
- What it takes to run a model on your machine
Why VRAM is the hard ceiling on local LLMs, what quantization actually does to a model file, and the practical hardware ladder from 8GB laptops to 192GB workstations.
- Where AI actually runs: cloud, local, edge
Where the model file actually sits when you use AI: a datacenter GPU (cloud), your own machine (local), or the device's silicon (edge). The trade-offs and how to pick.
- Install the OpenAI SDK
Install the OpenAI SDK for Python and Node, configure your API key, and verify with a one-line chat.completions call.
- Install LM Studio
Install LM Studio on macOS, Linux, and Windows. The fastest GUI for running local LLMs, with no terminal needed. Includes the local server for OpenAI-compatible API access.
- Install llama.cpp
Build llama.cpp from source with Metal or CUDA acceleration. Run a GGUF model with llama-cli. The closest thing to bare-metal local inference.
- Install Ollama
Install Ollama on macOS, Linux, and Windows. Pull your first model, run it locally, and verify with ollama list. The fastest path to a local LLM.