May 1, 20265 min read

Wiring a local LLM into the tools you already use

How to point VS Code (Continue, Cline), web chat UIs (Open WebUI, LibreChat, Page Assist), and your own code at a local model using the OpenAI-compatible API. Swap cloud for local without rewriting anything.

A local LLM sitting in a terminal is a party trick. The same model wired into your editor, your chat UI, and the apps you open every day is a tool you actually keep. This post is the wiring.

This is post 9 of 13 in the Local LLMs series. By now you've got Ollama running and a working chat. Time to make it earn its keep.

The one trick the whole post hangs on

Local stack masquerading as OpenAI

Almost every AI tool built since 2023 talks to the OpenAI API. The format became a de-facto standard: a POST to /v1/chat/completions with messages, and you get back a streaming or non-streaming response.

Here's the useful part. Ollama and LM Studio both expose that exact same API on localhost. So any tool that lets you type in a "Base URL" and "API key" can be pointed at your local stack instead, no code changes at all.

The pattern shows up everywhere:

Base URL: http://localhost:11434/v1 (Ollama) or http://localhost:1234/v1 (LM Studio default)
API key: ollama or any non-empty string. Ollama doesn't actually check it.
Model name: whatever you pulled, e.g. llama3.2:3b for Ollama, or the loaded model in LM Studio.

That's the whole secret. Most of what follows is just "now apply that to a specific tool."

VS Code: Continue

Continue is the most popular open-source AI coding assistant for VS Code in 2026. Inline completion, chat in a sidebar, code edits. It was built to support local models from day one.

Setup goes like this:

Install the Continue extension from the VS Code marketplace.
Open Continue's config (Cmd+Shift+P → "Continue: Open Config"). It opens ~/.continue/config.json.
Replace the model section with a local block:

{
  "models": [
    {
      "title": "Local Qwen Coder",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Local Qwen 1.5B",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b-base"
  }
}

Two models here. A chat model (7B) for the sidebar, and a fast small model (1.5B base) for inline tab completion. Using the base model for autocomplete is correct, not a mistake: fill-in-the-middle works on base models, not instruct ones.

Pull both before you start:

# pull the chat coder

ollama pull qwen2.5-coder:7b

# pull the autocomplete model

ollama pull qwen2.5-coder:1.5b-base

Restart VS Code and you've got local AI right inside the editor. Tab to accept suggestions. Cmd+L (Ctrl+L on non-Mac) to ask a question in the sidebar.

VS Code: Cline

Cline is a different animal. It's an autonomous coding agent that can read, write, and run code on your behalf. It defaults to cloud models but supports any OpenAI-compatible endpoint.

To point it at Ollama: in Cline's settings, set the API provider to "Ollama" or "OpenAI-Compatible," base URL to http://localhost:11434/v1, and pick a model that handles function calling well (qwen2.5:14b or larger is the recommendation).

One honest caveat. Cline-style agentic loops lean on the model far harder than plain chat does. Smaller local models under 7B usually fall apart on multi-step file edits. 14B+ instruct models hold up, and 32B works well. If you want full agent reliability, frontier cloud models still beat local in 2026.

Other editors

Same trick, different plugin name:

JetBrains IDEs (IntelliJ, PyCharm, etc.). Continue has a JetBrains extension. Same config file, same flow.
Neovim. llama.vim for inline completion through llama.cpp's server, or avante.nvim for chat with OpenAI-compatible providers.
Zed. Built-in "language model" config supports Ollama natively (Settings → AI → Provider → Ollama).
Sublime / VS Code via the Cursor fork. Cursor lets you set a custom API base URL in Settings → Models. Point it at Ollama.

If your editor isn't on this list, search "<editor name> ollama" or "<editor name> openai compatible." Odds are someone built a plugin back in 2024 or 2025.

Web UIs: Open WebUI

Open WebUI is the most polished local-first web chat UI going. It looks like ChatGPT, talks to Ollama (or any OpenAI-compatible API), and runs in Docker.

# run open webui pointing at local ollama

docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui ghcr.io/open-webui/open-webui:main

Open http://localhost:3000. First time through, create a local account (purely local, no cloud signup). Then start chatting.

What it gives you over the bare ollama run CLI:

Multi-conversation history.
Markdown rendering in responses.
Image uploads (works with vision models).
Document uploads with auto-RAG.
Multi-user mode, so your home network can share one instance.

This is the thing the non-engineers in your house will actually use. Set it up once on a home server and the family can chat with the local model from their phones.

Web UIs: the alternatives

LibreChat. Same idea, more features, more setup to wrangle. Good if you want a serious team-grade UI.
AnythingLLM. Bundles a vector DB so RAG works out of the box. Handy if RAG is your main use case (that's post 10).
Page Assist. A browser extension for Chrome and Firefox. Adds a chat sidebar and lets you ask your local model about the page you're currently on. Tiny install, surprisingly useful.
LM Studio's built-in UI. If LM Studio is already your runtime, its own chat UI is fine and needs zero config.

The "swap cloud for local" move

For your own apps, the same trick still works. Anywhere you've got:

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

You can switch to:

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

And the rest of your code stays exactly as it was. client.chat.completions.create(...) behaves the same. Streaming works. Tool calling works too, within whatever the model can handle (see post 11).

Some real swaps people actually do:

A nightly script that summarizes the day's commits. It used to call OpenAI. Now it calls local, and the bill went from a few dollars a month to zero.
A classification job that tags incoming support tickets. Moved to Llama 3.1 8B local. It's faster than the old cloud call because there's no network round trip.
A code review bot. The actual review stayed on Claude because capability matters there, but the "format the diff" preprocessing step moved local.

The split is the win, not the wholesale replacement. Use cloud for what cloud does best, and local for what local does best.

Authentication and getting at it remotely

Two questions come up every time.

Can I reach my home Ollama from my laptop on the train? By default, no. Ollama listens on localhost only. To open it up to your home network, set OLLAMA_HOST=0.0.0.0:11434 and restart the daemon. To get at it from outside your home, use Tailscale or a VPN. Never expose Ollama straight to the public internet without auth in front of it.

Does Ollama have authentication? No, and that's by design. It's a local tool. If you need auth, put it behind a reverse proxy (Caddy, nginx) with basic auth or a token, or run Open WebUI, which has user accounts and proxies to Ollama internally.

What's next

Your local LLM now lives inside your tools. What it still doesn't have is your data. The next post is local RAG: embeddings models, vector DBs, and the fully local stack behind "ask my notes folder a question."

AI Integration LLM Local Llms Vscode Workflow

From the dictionary

Terms used in this post

Quick reference for the 13 terms you met above. Each one comes from the AI dictionary.

Artificial IntelligenceAI: Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.; e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
APIGeneral: Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
ChatGPTAI: OpenAIs consumer chat product, launched November 30, 2022. The first LLM to reach mass adoption — 100 million users in two months. The product most people mean when they say AI today.
ClaudeAI: Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.; e.g. This blog's create-post skill drafts inline using Claude.
EmbeddingNLP: A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
llama.cppAI: A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language ModelAI: A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.; e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
LM StudioAI: A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
ModelML: In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
OllamaAI: A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
RAGNLP: Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLMs context window, then ask the question. The models weights are unchanged; only the prompt is augmented.
TokenNLP: The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
TrainingML: The expensive one-time process of running a learning algorithm over data until the models parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.

Rate this article

How helpful did you find this?

Series

Local Llms

9 / 13 posts

Browse all in Local Llms →

Newsletter

Get new articles in your inbox

AI engineering, LLM systems, and software architecture — no filler.

No spam. Unsubscribe any time.

Discussion

Comments

Leave a note about the article, architecture choices, or what you would build next.

Loading comments...

On this page

From the dictionary

Terms in this post

Artificial IntelligenceAI: Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
APIGeneral: Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
ChatGPTAI: OpenAIs consumer chat product, launched November 30, 2022. The first LLM to reach mass adoption — 100 million users in two months. The product most people mean when they say AI today.
ClaudeAI: Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.
EmbeddingNLP: A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
llama.cppAI: A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language ModelAI: A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
LM StudioAI: A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
ModelML: In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
OllamaAI: A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
RAGNLP: Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLMs context window, then ask the question. The models weights are unchanged; only the prompt is augmented.
TokenNLP: The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
TrainingML: The expensive one-time process of running a learning algorithm over data until the models parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.

Series

Local Llms

9 / 13 posts

Browse all in Local Llms →