5 min read

Integrating a local LLM into your workflow

Wire your local LLM into VS Code (Continue, Cline), web UIs (Open WebUI, LibreChat, Page Assist), and your own apps via the OpenAI-compatible API. The swap-cloud-for-local pattern in real codebases.

Integrating a local LLM into your workflow

A local LLM in a terminal is a curiosity. A local LLM wired into your editor, your chat UI, and the apps you already use is a tool. This post is the wiring.

This is post 9 of 13 in the Local LLMs series. After post 8 you have Ollama running and a working chat. Now we make it do real work.

The OpenAI-compatible API trick

Local stack masquerading as OpenAI

Almost every AI tool built since 2023 talks to the OpenAI API. The format is a de-facto standard: a POST to /v1/chat/completions with messages, returning a streaming or non-streaming response.

Ollama and LM Studio both expose the exact same API on localhost. So any tool that lets you set "Base URL" and "API key" in its settings can be redirected to your local stack with no code changes.

The pattern, everywhere:

  • Base URL: http://localhost:11434/v1 (Ollama) or http://localhost:1234/v1 (LM Studio default)
  • API key: ollama or any non-empty string. Ollama doesn't validate it.
  • Model name: whatever you pulled, e.g. llama3.2:3b for Ollama or the loaded model in LM Studio.

That's the magic. Most of this post is variations on "now apply that to specific tools."

VS Code: Continue

Continue is the most popular open-source AI coding assistant for VS Code in 2026. Inline completion, chat in a sidebar, code edits. Designed from the start to support local models.

Setup:

  1. Install the Continue extension from the VS Code marketplace.
  2. Open Continue's config (Cmd+Shift+P → "Continue: Open Config"). It opens ~/.continue/config.json.
  3. Replace the model section with a local block:
{
  "models": [
    {
      "title": "Local Qwen Coder",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Local Qwen 1.5B",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b-base"
  }
}

Two models: a chat model (7B) for the sidebar, and a fast small model (1.5B base) for inline tab completion. The base model for autocomplete is correct here , fill-in-the-middle works on base, not instruct.

Pull both before using:

# pull the chat coder

ollama pull qwen2.5-coder:7b
# pull the autocomplete model

ollama pull qwen2.5-coder:1.5b-base

Restart VS Code. You now have local AI inline in your editor. Tab to accept suggestions. Cmd+L (Ctrl+L on non-Mac) to ask a question in the sidebar.

VS Code: Cline

Cline is a different beast , an autonomous coding agent that can read, write, and execute code on your behalf. It works with cloud models by default but supports any OpenAI-compatible endpoint.

Configure Cline to use Ollama: in Cline's settings, set the API provider to "Ollama" or "OpenAI-Compatible," base URL to http://localhost:11434/v1, and pick a model that does function calling well (qwen2.5:14b or larger recommended).

Honest caveat: Cline-style agentic loops put more pressure on the model than chat does. Smaller local models (under 7B) usually fail at multi-step file edits. 14B+ instruct models work; 32B works well. For full agent reliability, frontier cloud models still beat local in 2026.

Other editors

Same pattern, different plugin name:

  • JetBrains IDEs (IntelliJ, PyCharm, etc.). Continue has a JetBrains extension. Same config file, same flow.
  • Neovim. llama.vim for inline completion via llama.cpp's server, or avante.nvim for chat with OpenAI-compatible providers.
  • Zed. Built-in "language model" config supports Ollama natively (Settings → AI → Provider → Ollama).
  • Sublime / VS Code via Cursor fork. Cursor lets you set a custom API base URL in Settings → Models. Point at Ollama.

If your editor isn't here, search "<editor name> ollama" or "<editor name> openai compatible" , odds are someone built a plugin in 2024 or 2025.

Web UIs: Open WebUI

Open WebUI is the most polished local-first web chat UI. Looks like ChatGPT, talks to Ollama (or any OpenAI-compatible API), runs in Docker.

# run open webui pointing at local ollama

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

Open http://localhost:3000. First time, create a local account (purely local, no cloud signup). Then start chatting.

What Open WebUI gives you over the bare ollama run CLI:

  • Multi-conversation history.
  • Markdown rendering in responses.
  • Image uploads (works with vision models).
  • Document uploads with auto-RAG.
  • Multi-user mode (your home network can share an instance).

This is what most non-engineers in your house will actually use. Set it up once on a home server and family members can chat with the local model from their phones.

Web UIs: alternatives

  • LibreChat. Same idea, more features, more setup complexity. Good if you want a serious team-grade UI.
  • AnythingLLM. Bundles a vector DB so RAG works out of the box. Useful if RAG is your primary use case (post 10).
  • Page Assist. Browser extension (Chrome/Firefox). Adds a chat sidebar and lets you query the current webpage's content via your local model. Tiny install, surprisingly useful.
  • LM Studio's built-in UI. If you're already using LM Studio as your runtime, its own chat UI is fine and zero-config.

The "swap cloud for local" pattern

For your own apps, the same trick works. Anywhere you have:

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

You can change to:

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

And the rest of your code is unchanged. client.chat.completions.create(...) works the same. Streaming works. Tool calling works (within model capabilities , see post 11).

Some real-world swaps people do:

  • A nightly script that summarizes the day's commits. Used to call OpenAI; now calls local. Bill went from a few dollars a month to zero.
  • A classification job that tags incoming support tickets. Switched to Llama 3.1 8B local. Faster than the old cloud call because no network round trip.
  • A code review bot. Stayed on Claude for the actual review (capability matters), but the "format the diff" preprocessing step moved local.

The split is the win, not the full replacement. Use cloud for what cloud does best, local for what local does best.

Authentication and remote access

Two questions that come up:

Can I access my home Ollama from my laptop on the train? By default, no , Ollama listens on localhost only. To expose it to your home network, set OLLAMA_HOST=0.0.0.0:11434 and restart the daemon. To access remotely (from outside your home), use Tailscale or a VPN; never expose Ollama directly to the public internet without auth.

Does Ollama have authentication? No, by design. It's a local tool. If you need auth, put it behind a reverse proxy (Caddy, nginx) with basic auth or a token, or use Open WebUI which has user accounts and proxies to Ollama internally.

What's next

The local LLM is now in your tools, but it doesn't know your data. Next post is local RAG: embeddings models, vector DBs, and the all-local stack for "ask my notes folder a question."

From the dictionary

Terms used in this post

Quick reference for the 13 terms you met above. Each one comes from the AI dictionary.

Artificial IntelligenceAI
Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
APIGeneral
Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
ChatGPTAI
OpenAIs consumer chat product, launched November 30, 2022. The first LLM to reach mass adoption — 100 million users in two months. The product most people mean when they say AI today.
ClaudeAI
Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.
e.g. This blog's create-post skill drafts inline using Claude.
EmbeddingNLP
A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
llama.cppAI
A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language ModelAI
A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
LM StudioAI
A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
ModelML
In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
OllamaAI
A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
RAGNLP
Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLMs context window, then ask the question. The models weights are unchanged; only the prompt is augmented.
TokenNLP
The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
TrainingML
The expensive one-time process of running a learning algorithm over data until the models parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.

Rate this article

How helpful did you find this?

Newsletter

Get new articles in your inbox

AI engineering, LLM systems, and software architecture — no filler.

No spam. Unsubscribe any time.

Discussion

Comments

Leave a note about the article, architecture choices, or what you would build next.

Comments are stored in Supabase and fetched per post slug.

Loading comments...