May 8, 2026 · 5 min read

Local agents and tool use

Function calling on open models in 2026. Which ones actually work, why local agents break when they break, and the scaffolding that keeps them upright.

A chat model that only talks is a smart parrot. Wire the same model up to call tools, so it can read a file, run a query, hit an API, and now it's an agent. Open models can pull this off in 2026. They just come with caveats the cloud-frontier crowd never has to think about.

This is post 11 of 13 in the Local LLMs series. By the end you'll know which open models to trust with tool use, why the loops break when they break, and how to ship a working local agent regardless.

Tool use in one paragraph

[Figure: Agent tool-call loop]

Here's the mechanism. Instead of just generating text, the model can emit a structured "function call": a name plus a JSON object of arguments. Your code parses that, runs the named function, grabs the result, and feeds it back to the model as another message. The model keeps going with the new information. You loop until it decides it's done.

It's a five-step dance:

User asks the model something that needs a tool.
Model emits a function call: {"name": "read_file", "arguments": {"path": "README.md"}}.
Your code runs the function, gets the file contents.
You feed the contents back to the model.
Model generates the actual answer using the file contents.
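
The five steps above can be sketched without a model at all. In this minimal illustration the model's function call is a hard-coded string and read_file is a stub returning canned text (both are assumptions for the demo, and the message shapes are simplified), so only the parse, dispatch, and feed-back steps are real:

```python
import json

# Step 2: what the model would emit (hard-coded here for illustration).
model_output = '{"name": "read_file", "arguments": {"path": "README.md"}}'

def read_file(path):
    # Stub tool: a real one would open the file at `path`.
    return f"(contents of {path})"

# Registry mapping tool names to the Python functions that implement them.
TOOLS = {"read_file": read_file}

def run_tool_call(raw):
    """Parse a function call, run the named tool, return a tool message."""
    call = json.loads(raw)
    result = TOOLS[call["name"]](**call["arguments"])
    return {"role": "tool", "name": call["name"], "content": result}

messages = [{"role": "user", "content": "Summarize README.md"}]  # step 1
messages.append({"role": "assistant", "content": model_output})  # step 2
messages.append(run_tool_call(model_output))                     # steps 3 and 4
# Step 5 would be another model call with `messages` as its context.
```

Everything the model learns about the file arrives through that last message, which is why the loop is only as good as the parsing and dispatch around it.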

Cloud frontier models (Claude, GPT-5) do this nearly flawlessly. Open models do it too. They just have more ways to trip.

Which open models actually do tool use

The 2026 leaderboard for function calling on open weights:

Qwen 2.5 Instruct (7B+). The most reliable open option. Trained heavily on tool use. JSON output is generally well-formed.
Llama 3.1 / 3.2 Instruct (8B+). Tool use is supported, but flakier than Qwen. The 70B is solid. The 8B sometimes invents tool names.
Hermes 3 (Llama-based fine-tunes). Tuned specifically for tool calling. Often beats vanilla Llama Instruct.
Mistral Small / Medium 3. Solid function calling, especially for European languages.
DeepSeek V4. Tool use is supported but less battle-tested in production agent loops.

What doesn't work well: 3B-class models for multi-step agentic loops. Guide them carefully and they can manage a single-shot tool call. They cannot reliably plan and execute a sequence of them. For agents, 7B is the practical floor. 14B and up is where it starts feeling reliable.

A minimal local agent

Ollama exposes function-calling through the OpenAI-compatible API. You define tools as JSON schemas and the model emits structured calls. Here's a working agent that can read files and run shell commands:

# local-agent.py

from openai import OpenAI
import subprocess
import json

# Ollama's OpenAI-compatible endpoint. The client requires an API key, but Ollama ignores it.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from disk",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "run_command",
            "description": "Run a shell command and return stdout",
            "parameters": {
                "type": "object",
                "properties": {"cmd": {"type": "string"}},
                "required": ["cmd"],
            },
        },
    },
]

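def execute(name, args):
    # Dispatch a tool call. Errors come back as text so the model can recover.
    try:
        if name == "read_file":
            with open(args["path"]) as f:
                return f.read()
        if name == "run_command":
            # shell=True runs whatever the model asks for; only use this in a sandbox.
            return subprocess.check_output(args["cmd"], shell=True, text=True)
    except (OSError, KeyError, subprocess.CalledProcessError) as e:
        return f"Error: {e!r}"
    return f"Error: unknown tool {name!r}. Available tools: read_file, run_command."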

def agent(user_prompt, max_steps=5):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="qwen2.5:14b", messages=messages, tools=tools
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content
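        for call in msg.tool_calls:
            # Feed malformed JSON back to the model as an error instead of crashing.
            try:
                args = json.loads(call.function.arguments)
                result = execute(call.function.name, args)
            except json.JSONDecodeError as e:
                result = f"Error: arguments were not valid JSON ({e}). Please retry."
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(result),
            })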
    return "Hit step limit."

print(agent("How many Python files are in the current directory?"))

Run it. With Qwen 2.5 14B, it should:

Decide it needs to run a shell command.
Call run_command with something like find . -name "*.py" | wc -l (which also counts files in subdirectories).
Receive the count.
Reply with the answer.

Three messages, one tool call, one short answer. That's the basic loop, start to finish.

Where local agents fail

An honest catalog of what goes wrong, roughly in order of how often you'll see it:

Wrong tool name. The model invents a function that doesn't exist ("name": "list_files"). Common on smaller models. Defense: validate names against your tool list before executing, and return an error message that lists the actual tool names.
Malformed JSON arguments. Trailing commas, unquoted strings, missing required fields. Common across all sizes. Defense: try to parse, and on failure feed the error back to the model with "your JSON was invalid, please retry."
Loops without progress. The model calls the same tool with the same args twice. Or worse, three times. Defense: detect the repetition, then inject a message saying "you already tried this; pick a different approach."
Premature stopping. The model gives up after one tool call when the task needs three. Defense: nothing automatic. Sometimes you just have to add explicit step planning to the prompt.
Hallucinated results. The model claims a tool returned data when no tool ran at all. Defense: never trust the model's narration of tool output. Always show the user what tools actually ran.
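
The first two defenses in the list above can be combined into one check that runs before any tool executes. This is a minimal sketch, not a definitive implementation: TOOL_SPECS and check_call are hypothetical names, and the registry assumes the two tools from the agent earlier in this post.

```python
import json

# Hypothetical registry: tool name -> required argument names.
TOOL_SPECS = {"read_file": ["path"], "run_command": ["cmd"]}

def check_call(name, raw_args):
    """Return (args, None) if the call is usable, else (None, error_text)."""
    if name not in TOOL_SPECS:
        # List the real tool names so the model can correct itself.
        known = ", ".join(sorted(TOOL_SPECS))
        return None, f"Unknown tool '{name}'. Available tools: {known}."
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as e:
        return None, f"Your JSON was invalid ({e.msg}). Please retry."
    if not isinstance(args, dict):
        return None, "Arguments must be a JSON object. Please retry."
    missing = [k for k in TOOL_SPECS[name] if k not in args]
    if missing:
        return None, f"Missing required fields: {', '.join(missing)}."
    return args, None
```

When check_call returns an error, send that text back as the tool message content instead of running anything. The model then sees exactly what went wrong and gets another attempt within the step limit.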

None of this disqualifies local agents. It just means they need more defensive scaffolding than Claude or GPT-5 do.
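
One piece of that scaffolding, repetition detection, might look like the sketch below. RepeatGuard and NUDGE are hypothetical names. The guard treats a call as a repeat only when the tool name and the arguments match exactly, which is one assumption among several reasonable ones.

```python
import json

class RepeatGuard:
    """Flags a tool call that was already made with identical arguments."""

    def __init__(self):
        self.seen = set()

    def is_repeat(self, name, args):
        # sort_keys gives a canonical key, so argument order does not matter.
        key = (name, json.dumps(args, sort_keys=True))
        if key in self.seen:
            return True
        self.seen.add(key)
        return False

# Message to send back in place of the tool result when a repeat is detected.
NUDGE = "You already tried this; pick a different approach."
```

In the agent loop, create one guard per task. When is_repeat returns True, skip execution and append NUDGE as the tool message content.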

Frameworks that make this easier

You don't have to write agent loops from scratch. The 2026 frameworks that play well with local models:

smolagents (HuggingFace). Lightweight, built for any OpenAI-compatible backend. The closest thing to "just import and use."
LangGraph. Heavier, with state machines and explicit graphs. Good for production agents where you need control-flow guarantees.
CrewAI. Multi-agent framework. Each agent gets a role, tools, and goals. Works with Ollama.
LlamaIndex. Started as a RAG framework, now has agents too. Pairs naturally with the embedding stack from the previous post.

All four can point at Ollama or LM Studio. Pick smolagents for first projects. Graduate to LangGraph when you need real graphs.

Tool design for local models

Cloud frontier models are forgiving about tool design. Local models are pickier. Rules that help:

Few tools per agent. Three to six is the sweet spot for 14B-class models. Push past ten and the model starts mixing them up.
Clear, narrow descriptions. "Read a file from the local filesystem given a path" beats "Get content." Smaller models lean hard on the description.
Required fields, no optional ones. Optional fields confuse smaller models. Make everything required, with sensible defaults documented in the description.
Idempotent where possible. The model will retry. When it does, the second attempt shouldn't break things.
One job per tool. Don't build a manage_database tool. Build read_row, write_row, list_tables. Smaller models compose better than they branch.
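
Some of the rules above are mechanical enough to check in code. The sketch below lints OpenAI-style tool schemas for tool count, optional fields, and thin descriptions. lint_tools is a hypothetical helper, and the thresholds (six tools, five words) are assumptions drawn loosely from the rules, not measured limits.

```python
def lint_tools(tools, max_tools=6, min_desc_words=5):
    """Return a list of warnings for OpenAI-style tool schemas."""
    warnings = []
    if len(tools) > max_tools:
        warnings.append(f"{len(tools)} tools defined; aim for {max_tools} or fewer.")
    for t in tools:
        fn = t["function"]
        params = fn.get("parameters", {})
        props = set(params.get("properties", {}))
        # Any property not listed under "required" is optional.
        optional = props - set(params.get("required", []))
        if optional:
            warnings.append(f"{fn['name']}: optional fields {sorted(optional)}.")
        if len(fn.get("description", "").split()) < min_desc_words:
            warnings.append(f"{fn['name']}: description is too short to be specific.")
    return warnings

# A narrow, one-job tool with every argument required.
good = {"type": "function", "function": {
    "name": "read_row",
    "description": "Read one row from a table given the table name and row id",
    "parameters": {
        "type": "object",
        "properties": {"table": {"type": "string"}, "row_id": {"type": "integer"}},
        "required": ["table", "row_id"],
    },
}}

# A catch-all tool with a vague description and an optional field.
vague = {"type": "function", "function": {
    "name": "manage_database",
    "description": "Get content",
    "parameters": {
        "type": "object",
        "properties": {"action": {"type": "string"}, "table": {"type": "string"}},
        "required": ["action"],
    },
}}
```

Here lint_tools([good]) returns an empty list, while lint_tools([vague]) flags both the optional table field and the two-word description. A lint like this cannot tell you whether a tool does one job, so that rule still needs a human eye.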

These are the same principles that make tools easy for human users. Local models are basically junior developers, and clear interfaces help.

A realistic local agent stack

What I actually run locally:

Model: Qwen 2.5 14B Instruct on a 24GB+ machine.
Framework: smolagents for prototypes, hand-rolled loops for anything I ship.
Tools: 4 to 6 max, each one-purpose, all required-args, all returning strings or numbers.
Defensive scaffolding: tool-name validation, JSON parse retries, repetition detection.
Fallback: if local fails the task three times, escalate to a cloud frontier model.

That escalation pattern is the honest part. In my experience, local agents handle roughly 70% of agent tasks beautifully. The other 30% or so (deep planning, ambiguous instructions, novel tool combinations) still wants a frontier model. Don't fight that. Design for it.
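
The escalation pattern is small enough to sketch. In this illustration, run_with_fallback is a hypothetical helper, and local_agent, cloud_agent, and is_ok are callables you supply. How you judge success (is_ok) is an assumption that depends entirely on your task.

```python
def run_with_fallback(task, local_agent, cloud_agent, is_ok, max_local_tries=3):
    """Try the local agent up to max_local_tries times, then escalate."""
    for attempt in range(1, max_local_tries + 1):
        try:
            answer = local_agent(task)
        except Exception:
            continue  # a crash counts as a failed attempt
        if is_ok(answer):
            return {"answer": answer, "backend": "local", "attempts": attempt}
    # All local attempts failed, so hand the task to the frontier model.
    return {
        "answer": cloud_agent(task),
        "backend": "cloud",
        "attempts": max_local_tries,
    }
```

With the agent from earlier in this post, is_ok could be as simple as checking that the answer is not "Hit step limit." Returning the backend alongside the answer lets you log how often tasks escalate, which tells you whether the local setup is pulling its weight.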

What's next

Chat, RAG, agents, all running on your own hardware. The last big lever is changing the model itself: fine-tuning. Next post covers LoRA, QLoRA, when fine-tuning beats prompting and RAG (rarely), and how to do it on consumer hardware.


From the dictionary

Terms used in this post

Quick reference for the 12 terms you met above. Each one comes from the AI dictionary.

API (General): Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
Claude (AI): Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent. Example: This blog's create-post skill drafts inline using Claude.
Embedding (NLP): A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
Fine-Tuning (ML): Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
GPT (AI): OpenAI's family of large language models (Generative Pre-trained Transformer). GPT-4 (2023) and successors are the most widely used closed-source LLMs in production.
Large Language Model (AI): A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. Example: Claude is an LLM: it reads your message as tokens and generates a response one token at a time.
LM Studio (AI): A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
Model (ML): In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Ollama (AI): A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
Prompt (NLP): The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
RAG (NLP): Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLM's context window, then ask the question. The model's weights are unchanged; only the prompt is augmented.
Weights (ML): The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
