2 min read

Install Ollama

Install Ollama on macOS, Linux, and Windows. Pull your first model, run it locally, and verify with ollama list. The fastest path to a local LLM.

Install Ollama

Ollama is the easiest way to run open-source LLMs locally. It wraps llama.cpp with a single binary, a model registry, and a daemon that exposes an OpenAI-compatible API at http://localhost:11434. If you want to chat with Llama or Qwen on your laptop without writing any code, this is the install.

macOS

Use Homebrew. There's a .dmg on the website too, but the brew formula keeps it updated.

# install ollama via homebrew

brew install ollama

If you don't have Homebrew yet, see Install Homebrew. The brew install drops a daemon you start manually, or you can install the .app for a menu-bar version that starts on login.

# start the ollama daemon (foreground)

ollama serve

Leave that terminal running, or use the menu-bar app instead.

Linux

One-line installer. It sets up a systemd service so the daemon starts on boot.

# install ollama on linux via the official script

curl -fsSL https://ollama.com/install.sh | sh

The script detects your GPU (NVIDIA via CUDA, AMD via ROCm) and installs matching libraries.

Windows

Download the installer from ollama.com/download or use winget:

# install ollama on windows

winget install --id=Ollama.Ollama -e

After install, Ollama runs as a tray app and starts the daemon automatically.

Pull and run a model

Start with a small model that fits in 8 GB of RAM:

# pull and chat with llama 3.2 3b

ollama run llama3.2:3b

The first run downloads the weights (about 2 GB). After that, every run is instant. Type a message and Ctrl-D to exit.

For a model that fits in 16 GB, try qwen2.5:7b. For 64 GB Apple Silicon, llama3.1:70b works at acceptable speed.

Verify

# list locally pulled models and the running daemon

ollama list && curl -s http://localhost:11434/api/tags | head

You should see the model you pulled and a JSON response from the API. If curl fails, the daemon isn't running — start it with ollama serve or open the tray app.

Common gotchas

  • Disk fills up fast: each model is 2–40 GB. ollama list shows what you have, ollama rm <model> removes one. Models live in ~/.ollama/models on macOS/Linux, %USERPROFILE%\.ollama\models on Windows.
  • Pick the right quant: tags like :7b-instruct-q4_K_M are quantized down to 4-bit. Q4 is the default sweet spot for memory vs quality. Q8 is closer to the original; F16 is the original, full size.
  • VRAM not RAM: on a discrete-GPU PC, the model has to fit in VRAM, not system RAM. A 13B Q4 model is ~7.5 GB — borderline for an 8 GB GPU. On Apple Silicon, unified memory means RAM = VRAM, so a 32 GB Mac can run models a 16 GB PC GPU can't.
  • API compatibility: Ollama's /v1/chat/completions endpoint is OpenAI-compatible, so most OpenAI client libraries work by pointing them at http://localhost:11434/v1.

With ollama list showing a model, you can write code against a local LLM the same way you would against the OpenAI or Anthropic API.

From the dictionary

Terms used in this post

Quick reference for the 8 terms you met above. Each one comes from the AI dictionary.

APIGeneral
Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
GPUGeneral
A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s — GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
Large Language ModelAI
A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
ModelML
In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
OllamaAI
A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
QuantizationML
Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
VRAMGeneral
Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
WeightsML
The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.

Rate this article

How helpful did you find this?

Newsletter

Get new articles in your inbox

AI engineering, LLM systems, and software architecture — no filler.

No spam. Unsubscribe any time.

Discussion

Comments

Leave a note about the article, architecture choices, or what you would build next.

Comments are stored in Supabase and fetched per post slug.

Loading comments...