Your first local LLM, end to end

Install Ollama, pull Llama 3.2 3B, chat, hit the OpenAI-compatible API, and troubleshoot the five things that go wrong on first install. By the end of this post you have a working local LLM.

By the end of this post you'll have a local LLM running on your machine, regardless of what machine you have. Three commands, one model, one chat. The whole on-ramp.

This is post 8 of 13 in the Local LLMs series. Theory's done. Time for muscle memory.

What we're installing

[Figure: Hello World flow]

The stack:

  • Ollama as the runtime. Wraps llama.cpp, handles model downloads, exposes an OpenAI-compatible API on localhost. Runs as a background daemon.
  • Llama 3.2 3B Instruct as the model. Around 2 GB on disk, runs on every tier from this series. Good general-purpose chat model.

If you want a different model later, you'll swap one model tag (for example llama3.2:1b in place of llama3.2:3b). The flow is the same.

Step 1: install Ollama

Pick the line for your OS.

# macos via homebrew

brew install ollama

# linux (auto-detects gpu)

curl -fsSL https://ollama.com/install.sh | sh

For Windows, download the installer from ollama.com/download and run it. Ollama then runs as a background app in the system tray and starts when you log in.

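On Linux and Windows, Ollama is now running in the background. With Homebrew on macOS, start it once with brew services start ollama (or run ollama serve in a spare terminal). After that the CLI talks to the server; you don't have to start anything else.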

Verify:

# check the version

ollama --version

If you get a version number, you're done with install.

Step 2: pull and run a model

The first command does both: it downloads the model if missing, then drops you into an interactive chat.

# pull and chat

ollama run llama3.2:3b

On the first run you'll see the download progress (about 2 GB, which takes roughly 5–15 minutes at typical broadband speeds). Subsequent runs skip the download because the model is already on disk.

Once you see >>> (the prompt indicator), you're chatting with a local model. Try:

>>> what is the capital of australia?

You should get something like "Canberra is the capital of Australia." within a second or two on most hardware.

Type /bye or hit Ctrl-D to exit. The model stays loaded in memory for about five minutes after you exit (the default keep-alive, adjustable with OLLAMA_KEEP_ALIVE), so the next ollama run is fast. After that, it gets unloaded and reloads on next use.

Step 3: the API

You can also talk to Ollama via HTTP. This is how every tool that integrates with Ollama works (VS Code plugins, web UIs, scripts).

In a separate terminal, with no ollama run open:

# generate a one-shot completion via http api

curl http://localhost:11434/api/generate -d '{"model":"llama3.2:3b","prompt":"why is the sky blue?","stream":false}'

You'll get a single JSON object with the model's answer in the response field. Set "stream": true (or leave the field out, since streaming is the default) to get token-by-token output as one JSON object per line.
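The same call from a script, as a minimal Python sketch using only the standard library. It assumes the default port 11434 and that llama3.2:3b is already pulled; the helper names (build_generate_request, join_stream, generate) are made up for this post, not part of any Ollama client. The second helper shows one way to stitch a streamed reply back together.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if you changed it


def build_generate_request(prompt, model="llama3.2:3b", stream=False):
    """Build the POST request for /api/generate without sending it."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    return urllib.request.Request(
        OLLAMA_URL + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def join_stream(lines):
    """Concatenate the "response" fragments from streamed JSON lines."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines, if any
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(prompt, model="llama3.2:3b"):
    """Send a non-streaming request and return the answer text."""
    request = build_generate_request(prompt, model, stream=False)
    with urllib.request.urlopen(request, timeout=120) as reply:
        return json.loads(reply.read())["response"]
```

With the server running, print(generate("why is the sky blue?")) should print the answer. For streaming, open the request with stream=True and pass the reply object (which yields one line at a time) to join_stream.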

Ollama also exposes an OpenAI-compatible endpoint, which is what most existing tools use:

# the openai-compatible chat endpoint

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

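Anything that points at the OpenAI API can be redirected to Ollama by changing the base URL to http://localhost:11434/v1. The response format is the same. Clients that insist on an API key will accept any placeholder string, since Ollama ignores it.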
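Here is a rough sketch of that base-URL swap in Python, again with only the standard library so nothing extra needs installing. The function names are illustrative, and the "ollama" API key is just a placeholder: the assumption is that Ollama does not check it and it is only there because OpenAI-style clients usually expect one.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible base URL


def build_chat_request(messages, model="llama3.2:3b", api_key="ollama"):
    """Build a /chat/completions request in the OpenAI wire format."""
    payload = {"model": model, "messages": messages}
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder token (assumption: the local server ignores it).
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


def first_reply(response_body):
    """Extract the assistant text from a chat-completions response."""
    return response_body["choices"][0]["message"]["content"]


def chat(user_text, model="llama3.2:3b"):
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request([{"role": "user", "content": user_text}], model)
    with urllib.request.urlopen(request, timeout=120) as reply:
        return first_reply(json.loads(reply.read()))
```

Calling chat("Hello") with the server running returns the reply as a plain string. Pointing BASE_URL back at a hosted provider (with a real key) is the only change needed to switch the other way.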

Step 4: a few useful commands

The Ollama CLI commands you'll actually use:

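# list models you have on disk

ollama list

# remove a model to free disk space

ollama rm llama3.2:3b

# show details about a model

ollama show llama3.2:3b

# unload a running model from memory (this does not stop the daemon)

ollama stop llama3.2:3b

To stop the daemon itself (rare, mostly for upgrades), use your service manager: brew services stop ollama on macOS, sudo systemctl stop ollama on Linux, or quit Ollama from the system tray on Windows.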
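The model inventory is also available over HTTP, which is handy when a script needs to check that a model exists before using it. A small sketch, assuming the /api/tags endpoint returns a "models" list whose entries carry "name" and "size" (in bytes); the function names are hypothetical.

```python
import json
import urllib.request


def summarize_models(tags_body):
    """Return (name, size in GB) pairs from an /api/tags response body."""
    return [
        (entry["name"], round(entry["size"] / 1e9, 1))
        for entry in tags_body.get("models", [])
    ]


def list_local_models(base_url="http://localhost:11434"):
    """Fetch the list of models on disk from a running Ollama server."""
    with urllib.request.urlopen(base_url + "/api/tags", timeout=10) as reply:
        return summarize_models(json.loads(reply.read()))
```

With the server up, list_local_models() returns the same names that ollama list prints.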

Common errors and what they mean

The five things that go wrong on first install, in rough frequency:

"Error: model 'llama3.2:3b' not found, try pulling it first." You ran ollama show or similar before pulling. Just ollama run llama3.2:3b and it'll pull.

"Error: Ollama daemon not running" / connection refused. The background service didn't start. On macOS, brew services start ollama. On Linux, sudo systemctl start ollama. On Windows, Ollama should auto-start; check Task Manager for the Ollama process.

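A quick way to tell "the server is down" apart from "my request is wrong" is to probe the server's root URL from a script. This is a sketch, assuming the default address; a running Ollama typically answers that URL with a short plain-text status. The function name is made up for illustration.

```python
import urllib.request


def server_is_up(base_url="http://localhost:11434", timeout=2.0):
    """Return True if something answers HTTP 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as reply:
            return reply.status == 200
    except OSError:
        # URLError, refused connections, and timeouts are all OSError subclasses.
        return False
```

If server_is_up() returns False, start the service with the command for your OS before debugging anything else.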
Very slow output, looks like CPU not GPU. Check ollama ps: the PROCESSOR column should read "100% GPU" for the running model. If it shows CPU, or a CPU/GPU split, the model is not running fully on the GPU, either because the GPU isn't being detected or because the model doesn't fit in VRAM. On Windows/Linux with an NVIDIA card, update the driver and reboot. On Mac, this should never happen on Apple Silicon.
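The same check can be scripted. The sketch below assumes the /api/ps endpoint lists loaded models with "size" and "size_vram" fields (both in bytes), so the ratio approximates how much of the model sits on the GPU. Treat the field names as something to verify against your Ollama version; the helper names are hypothetical.

```python
import json
import urllib.request


def gpu_share(running_model):
    """Fraction of a loaded model that sits in GPU memory (0.0 to 1.0)."""
    total = running_model.get("size", 0)
    if total <= 0:
        return 0.0
    return running_model.get("size_vram", 0) / total


def running_models(base_url="http://localhost:11434"):
    """Fetch the currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(base_url + "/api/ps", timeout=10) as reply:
        return json.loads(reply.read()).get("models", [])
```

A loop such as for m in running_models(): print(m["name"], gpu_share(m)) prints 1.0 for a model that is fully on the GPU and something lower when part of it has spilled to CPU.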

Out of memory or "model is too big for your GPU." Pick a smaller model or smaller quant. Try llama3.2:1b (which is ~1 GB) instead of 3b. Or wait for the next tier of hardware. Memory math from post 7 is your guide.
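For a back-of-the-envelope check before pulling, the weights alone take roughly parameters times bits per weight divided by eight. The sketch below is that arithmetic only, not the full memory math from post 7; the one-gigabyte overhead for KV cache and runtime is a placeholder guess, and both function names are invented here.

```python
def weights_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return params_billions * bits_per_weight / 8


def probably_fits(params_billions, bits_per_weight, available_gb, overhead_gb=1.0):
    """Rough yes/no: weights plus a guessed overhead versus available memory."""
    return weights_gb(params_billions, bits_per_weight) + overhead_gb <= available_gb
```

For example, weights_gb(7, 4) gives 3.5, in line with the "7B at Q4 is roughly 4 GB" rule of thumb once metadata and mixed-precision layers are added. Real quantized files run a little larger than this estimate, so leave headroom.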

Output is gibberish or stops mid-word. Usually a chat-template mismatch (rare with Ollama's curated model list, common when you sideload custom GGUFs). Stick with models pulled from Ollama's own library for your first month.

Try it on your actual workflow

Now that the stack works, the question becomes whether it does anything useful. Try one thing today:

  • Drafting a commit message. git diff | ollama run llama3.2:3b "summarize this diff as a one-line commit message". Pipe a real diff in.
  • Summarizing a doc. cat README.md | ollama run llama3.2:3b "summarize in three bullets".
  • A regex. ollama run llama3.2:3b "write a regex that matches Indian PAN numbers (5 letters, 4 digits, 1 letter)".
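Once one of these one-liners earns its keep, it is easy to wrap in a script. A minimal sketch that mirrors the shell pipe by feeding text to ollama run on standard input; it assumes ollama is on your PATH, and the function names are hypothetical.

```python
import subprocess


def ollama_command(instruction, model="llama3.2:3b"):
    """Build the argument list for `ollama run MODEL "INSTRUCTION"`."""
    return ["ollama", "run", model, instruction]


def run_on_text(text, instruction, model="llama3.2:3b"):
    """Pipe text to ollama run on stdin, like `cat file | ollama run ...`."""
    result = subprocess.run(
        ollama_command(instruction, model),
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise if ollama exits with an error
    )
    return result.stdout.strip()
```

For instance, run_on_text(open("README.md").read(), "summarize in three bullets") does the same job as the second bullet above.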

Any one of these working end-to-end is a real moment. You're using AI without anything leaving your machine, with no rate limits, with no API key.

Where the model lives

Ollama stores models in:

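  • macOS: ~/.ollama/models/
  • Linux: ~/.ollama/models/ if you run ollama serve as your own user; /usr/share/ollama/.ollama/models/ when the install script set Ollama up as a systemd service
  • Windows: %USERPROFILE%\.ollama\models\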

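Each model is stored as a manifest plus content-addressed blob files (the large one is the GGUF weights), so the file names are hashes rather than model names. If you fill your disk, this is where to look first. du -sh ~/.ollama/models shows the total; ollama list shows the size of each model.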
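If you would rather check disk usage from a script, something like the sketch below works on any OS. It assumes the models directory is ~/.ollama/models unless the OLLAMA_MODELS environment variable points elsewhere; the function names are illustrative.

```python
import os
from pathlib import Path


def default_models_dir():
    """Models directory: OLLAMA_MODELS if set, else ~/.ollama/models."""
    override = os.environ.get("OLLAMA_MODELS")
    return Path(override) if override else Path.home() / ".ollama" / "models"


def dir_size_bytes(root):
    """Total size of all regular files under root (0 if root is missing)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):  # don't follow or count symlinks
                total += os.path.getsize(path)
    return total
```

Printing dir_size_bytes(default_models_dir()) / 1e9 gives the total in gigabytes.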

Two helpful switches

A couple of environment variables that make Ollama nicer:

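  • OLLAMA_KV_CACHE_TYPE=q8_0: enables Q8 KV cache (post 4). Roughly halves KV-cache memory compared with the default f16, which matters most at long contexts. It only takes effect with flash attention enabled, so set OLLAMA_FLASH_ATTENTION=1 alongside it. Recommended for everyone.
  • OLLAMA_NUM_PARALLEL=2: handle 2 concurrent requests instead of queuing. Useful when running a chat UI alongside scripts.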

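These are read by the Ollama server, not the CLI, so they must be set where the server starts. If you launch ollama serve yourself, exporting them in your shell profile (~/.zshrc or ~/.bashrc) works. If Ollama runs as a managed service, set them there instead (for example sudo systemctl edit ollama.service on Linux, or launchctl setenv for the macOS app), then restart Ollama.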
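One way to be sure the settings reach the server is to launch it yourself with an explicit environment. The sketch below does that from Python. It adds OLLAMA_FLASH_ATTENTION=1 on the assumption that the quantized KV cache only takes effect with flash attention on (check this against your Ollama version), and the function names are hypothetical.

```python
import os
import subprocess


def server_env(base=None):
    """Copy an environment and add the Ollama settings discussed above."""
    env = dict(os.environ if base is None else base)
    env["OLLAMA_FLASH_ATTENTION"] = "1"  # assumed prerequisite for q8_0 KV cache
    env["OLLAMA_KV_CACHE_TYPE"] = "q8_0"
    env["OLLAMA_NUM_PARALLEL"] = "2"
    return env


def start_server():
    """Start `ollama serve` in the background with those settings applied."""
    return subprocess.Popen(["ollama", "serve"], env=server_env())
```

Only use start_server() when no other Ollama instance is already running, since a second server cannot bind the same port.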

What's next

You have a working local LLM. The next post is integrating it into your real workflow: VS Code plugins (Continue, Cline), web UIs (Open WebUI, LibreChat), and the OpenAI-API-swap pattern that lets existing AI tools point at your local stack.

From the dictionary

Terms used in this post

Quick reference for the 11 terms you met above. Each one comes from the AI dictionary.

Artificial Intelligence (AI)
Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
API (General)
Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
GGUF (ML)
GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
GPU (General)
A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s — GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
llama.cpp (AI)
A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language Model (AI)
A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
Model (ML)
In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Ollama (AI)
A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
Prompt (NLP)
The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
Quantization (ML)
Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
Token (NLP)
The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
