Local agents and tool use

Function calling on open models in 2026: which models actually work (Qwen 2.5, Hermes 3, Llama; 14B+ for reliability), why local agents fail when they fail, and how to build defensive scaffolding around them.

A chat model that only talks is a smart parrot. The same model wired up to call tools (read a file, run a query, hit an API) is an agent. Open models can do this in 2026, but with caveats the cloud-frontier crowd doesn't have to think about.

This is post 11 of 13 in the Local LLMs series. After this you'll know which open models to trust with tool use, why the loops fail when they fail, and how to build a working local agent anyway.

Tool use in one paragraph

Figure: agent tool-call loop.

The mechanism: instead of only generating text, the model can output a structured "function call": a name and a JSON object of arguments. Your code parses that, runs the named function, takes the result, and feeds it back to the model as another message. The model continues with the new information. Loop until the model decides it's done.

It's a five-step dance:

  1. User asks the model something that needs a tool.
  2. Model emits a function call: {"name": "read_file", "arguments": {"path": "README.md"}}.
  3. Your code runs the function, gets the file contents.
  4. You feed the contents back to the model.
  5. Model generates the actual answer using the file contents.
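
Steps 2 through 4 can be sketched in a few lines. This is a minimal illustration, not the exact wire format of any particular server: the TOOLS registry and the run_tool_call helper are hypothetical names, and real OpenAI-style APIs also attach a tool_call_id to the returned message, which is left out here for brevity.

```python
import json

# Hypothetical tool: the function the model is allowed to call.
def read_file(path: str) -> str:
    with open(path, encoding="utf-8") as f:
        return f.read()

# Hypothetical registry mapping tool names to Python functions.
TOOLS = {"read_file": read_file}

def run_tool_call(raw_call: str) -> dict:
    """Parse one function call emitted by the model, run it, and build the reply message."""
    call = json.loads(raw_call)          # step 2: the model's structured output
    func = TOOLS[call["name"]]           # look up the named function
    result = func(**call["arguments"])   # step 3: run it with the JSON arguments
    return {"role": "tool", "content": str(result)}  # step 4: message fed back to the model
```

Calling run_tool_call with the JSON from step 2 returns a message whose content is the text of README.md, ready to append to the conversation.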

Cloud frontier models (Claude, GPT-5) do this nearly flawlessly. Open models do it, but with more failure modes.

Which open models actually do tool use

The 2026 leaderboard for function calling on open weights:

  • Qwen 2.5 Instruct (7B+). The most reliable open option. Trained heavily on tool use. JSON output is generally well-formed.
  • Llama 3.1 / 3.2 Instruct (8B+). Tool use is supported, but flakier than Qwen. The 70B is solid; the 8B sometimes invents tool names.
  • Hermes 3 (Llama-based fine-tunes). Specifically tuned for tool calling. Often beats vanilla Llama Instruct.
  • Mistral Small / Medium 3. Solid function calling, especially for European languages.
  • DeepSeek V4. Tool use supported but less battle-tested in production agent loops.

What does not work well: 3B-class models for multi-step agentic loops. They can do single-shot tool calls if you guide them carefully. They cannot reliably plan and execute a sequence of tool calls. For agents, 7B is the practical floor; 14B+ is where it starts feeling reliable.

A minimal local agent

Ollama exposes function-calling via the OpenAI-compatible API. Define tools as JSON schemas; the model emits structured calls. Here's a working agent that can read files and run shell commands:

# local-agent.py

from openai import OpenAI
import subprocess
import json

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

tools = [
    {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Read a file from disk",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "run_command",
            "description": "Run a shell command and return stdout",
            "parameters": {
                "type": "object",
                "properties": {"cmd": {"type": "string"}},
                "required": ["cmd"],
            },
        },
    },
]

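def execute(name, args):
    # Dispatch a tool call by name; the return value is sent back to the model.
    if name == "read_file":
        with open(args["path"]) as f:
            return f.read()
    if name == "run_command":
        # Caution: this runs whatever command the model asks for; sandbox it in real use.
        return subprocess.check_output(args["cmd"], shell=True, text=True)
    # Unknown tool: report it to the model rather than silently returning None.
    return f"Error: unknown tool {name!r}"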

def agent(user_prompt, max_steps=5):
    messages = [{"role": "user", "content": user_prompt}]
    for _ in range(max_steps):
        resp = client.chat.completions.create(
            model="qwen2.5:14b", messages=messages, tools=tools
        )
        msg = resp.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = execute(call.function.name, args)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(result),
            })
    return "Hit step limit."

print(agent("How many Python files are in the current directory?"))

Run it. With Qwen 2.5 14B, it should:

  1. Decide it needs to run a shell command.
  2. Call run_command with find . -name "*.py" | wc -l.
  3. Receive the count.
  4. Reply with the answer.

Four messages (user prompt, tool call, tool result, final answer), one tool call, one short answer. That's the basic agent loop.

Where local agents fail

Honest catalog of what goes wrong, in rough frequency:

  • Wrong tool name. Model invents a function that doesn't exist ("name": "list_files"). Common on smaller models. Defense: validate names against your tool list before executing; return an error message that lists actual tool names.
  • Malformed JSON arguments. Trailing commas, unquoted strings, missing required fields. Common across all sizes. Defense: try to parse; on failure, feed the error back to the model with "your JSON was invalid, please retry."
  • Loops without progress. Model calls the same tool with the same args twice. Or worse, three times. Defense: detect repetition, inject a message explaining "you already tried this; pick a different approach."
  • Premature stopping. Model gives up after one tool call when the task needs three. Defense: nothing automatic; sometimes you have to add explicit step planning to the prompt.
  • Hallucinated results. Model claims a tool returned data when no tool was called. Defense: never trust the model's narration of tool output; always show the user what tools actually ran.
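
The first two defenses (tool-name validation and JSON parse retries) fit in one wrapper around tool execution. The sketch below is one possible shape, assuming a hypothetical TOOLS registry that stores each function alongside its required argument names. Every failure is returned as a plain string, so the agent loop can pass it back to the model as the tool result and let it retry.

```python
import json

# Hypothetical tool used for illustration.
def read_file(path):
    with open(path, encoding="utf-8") as f:
        return f.read()

# Hypothetical registry: tool name -> (function, required argument names).
TOOLS = {"read_file": (read_file, ["path"])}

def safe_execute(name, raw_args):
    """Run a tool call defensively; return a result or an error string the model can act on."""
    if name not in TOOLS:
        # Wrong tool name: tell the model which tools actually exist.
        return f"Error: unknown tool '{name}'. Available tools: {', '.join(sorted(TOOLS))}."
    try:
        args = json.loads(raw_args)
    except json.JSONDecodeError as exc:
        # Malformed JSON: feed the parser's complaint back so the model can retry.
        return f"Error: your JSON was invalid ({exc.msg}), please retry."
    if not isinstance(args, dict):
        return "Error: arguments must be a JSON object."
    func, required = TOOLS[name]
    missing = [key for key in required if key not in args]
    if missing:
        return f"Error: missing required arguments: {', '.join(missing)}."
    try:
        return str(func(**args))
    except Exception as exc:
        # Surface tool failures to the model instead of crashing the loop.
        return f"Error: {type(exc).__name__}: {exc}"
```

In the agent loop, this would replace the json.loads and execute pair, taking call.function.name and the raw call.function.arguments string directly.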

These don't disqualify local agents. They just mean local agents need more defensive scaffolding than Claude or GPT-5 do.
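
Repetition detection is the easiest piece of that scaffolding to add. A minimal sketch, assuming an exact-match definition of "same call" (same tool name and same arguments); RepetitionGuard and REPEAT_NOTICE are hypothetical names, and a tool that legitimately needs re-running (for example, re-reading a file after it changed) would need a looser rule.

```python
import json

# Message to send back as the tool result when a repeat is detected.
REPEAT_NOTICE = "You already tried this; pick a different approach."

class RepetitionGuard:
    """Remembers every (tool name, arguments) pair seen during one agent run."""

    def __init__(self):
        self.seen = set()

    def is_repeat(self, name, args):
        # Serialize with sorted keys so argument order does not hide a repeat.
        key = (name, json.dumps(args, sort_keys=True))
        if key in self.seen:
            return True
        self.seen.add(key)
        return False
```

Create one guard per agent run. Before executing a call, check is_repeat(name, args); on True, skip execution and return REPEAT_NOTICE as the tool message content.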

Frameworks that make this easier

You don't have to write agent loops from scratch. The 2026 frameworks that work well with local models:

  • smolagents (HuggingFace). Lightweight, designed for any OpenAI-compatible backend. The closest to "just import and use."
  • LangGraph. Heavier framework with state machines and explicit graphs. Good for production agents where you need control flow guarantees.
  • CrewAI. Multi-agent framework. Each agent has a role, tools, and goals. Works with Ollama.
  • LlamaIndex. Originally a RAG framework, now also has agents. Pairs naturally with the embedding stack from post 10.

All four support pointing at Ollama or LM Studio. Pick smolagents for first projects; graduate to LangGraph when you need real graphs.

Tool design for local models

Cloud frontier models are forgiving about tool design. Local models are pickier. Rules that help:

  • Few tools per agent. 3–6 is the sweet spot for 14B-class models. More than 10 and the model starts mixing them up.
  • Clear, narrow descriptions. "Read a file from the local filesystem given a path" beats "Get content." Smaller models lean heavily on the description.
  • Required fields, no optional. Optional fields confuse smaller models. Make everything required, with sensible defaults documented in the description.
  • Idempotent where possible. If the model retries (it will), the second attempt shouldn't break things.
  • One job per tool. Don't make a manage_database tool. Make read_row, write_row, list_tables. Smaller models compose better than they branch.

These are the same principles that make tools easy for human users. Local models are basically junior developers: clear interfaces help.
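
Some of these rules can be checked mechanically before an agent ever runs. The sketch below is a hypothetical lint_tools helper for tool schemas in the OpenAI "function" format. The thresholds (6 tools, 4 words of description) are assumptions chosen to match the rules above, not established limits.

```python
def lint_tools(tools, max_tools=6, min_description_words=4):
    """Return a list of warnings for tool schemas in the OpenAI 'function' format."""
    warnings = []
    if len(tools) > max_tools:
        warnings.append(f"{len(tools)} tools defined; consider {max_tools} or fewer.")
    for tool in tools:
        fn = tool["function"]
        name = fn["name"]
        # Rule: clear, narrow descriptions.
        if len(fn.get("description", "").split()) < min_description_words:
            warnings.append(f"{name}: description is very short.")
        # Rule: required fields, no optional ones.
        params = fn.get("parameters", {})
        optional = set(params.get("properties", {})) - set(params.get("required", []))
        if optional:
            warnings.append(f"{name}: optional fields {sorted(optional)}.")
    return warnings

# A deliberately vague tool definition that trips two of the checks.
EXAMPLE_TOOLS = [{
    "type": "function",
    "function": {
        "name": "get",
        "description": "Get content",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}, "mode": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

for warning in lint_tools(EXAMPLE_TOOLS):
    print(warning)
```

The example prints one warning for the two-word description and one for the optional mode field. Idempotency and one-job-per-tool are design judgments that a check like this cannot verify.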

A realistic local agent stack

What I actually use locally:

  • Model: Qwen 2.5 14B Instruct on a 24GB+ machine.
  • Framework: smolagents for prototypes, hand-rolled loops for anything I ship.
  • Tools: 4–6 tools max, each one-purpose, all required-args, all returning strings or numbers.
  • Defensive scaffolding: tool-name validation, JSON parse retries, repetition detection.
  • Fallback: if local fails the task three times, escalate to a cloud frontier model.

The escalation pattern is honest. In my experience, local agents handle roughly 70% of agent tasks well. The remaining 30% or so (deep planning, ambiguous instructions, novel tool combinations) still wants a frontier model. Don't fight that; design for it.
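
The fallback rule can be written as a small wrapper. This is a sketch under assumptions: local_agent and cloud_agent are any callables that take a task string and return an answer, and is_success is a hypothetical check you supply, since what counts as "failed the task" depends on the task.

```python
def run_with_escalation(task, local_agent, cloud_agent, is_success, max_local_attempts=3):
    """Try the local agent up to max_local_attempts times, then escalate to the cloud agent."""
    for attempt in range(1, max_local_attempts + 1):
        try:
            answer = local_agent(task)
        except Exception:
            continue  # a crashed run counts as a failed attempt
        if is_success(answer):
            return {"answer": answer, "source": "local", "attempts": attempt}
    # Every local attempt failed: hand the task to the frontier model.
    return {"answer": cloud_agent(task), "source": "cloud", "attempts": max_local_attempts}
```

Returning the source alongside the answer makes it easy to log how often escalation happens, which is the number that tells you whether the local model is pulling its weight.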

What's next

You've got chat, RAG, and agents, all local. The last big lever is changing the model itself: fine-tuning. The next post covers LoRA, QLoRA, when fine-tuning beats prompting and RAG (rarely), and how to do it on consumer hardware.

From the dictionary

Terms used in this post

Quick reference for the 12 terms you met above. Each one comes from the AI dictionary.

API (General)
Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
Claude (AI)
Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.
e.g. This blog's create-post skill drafts inline using Claude.
Embedding (NLP)
A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
Fine-Tuning (ML)
Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
GPT (AI)
OpenAI's family of large language models; the name stands for Generative Pre-trained Transformer. GPT-4 (2023) and successors are the most widely used closed-source LLMs in production.
Large Language Model (AI)
A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
e.g. Claude is an LLM: it reads your message as tokens and generates a response one token at a time.
LM Studio (AI)
A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
Model (ML)
In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Ollama (AI)
A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
Prompt (NLP)
The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
RAG (NLP)
Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLM's context window, then ask the question. The model's weights are unchanged; only the prompt is augmented.
Weights (ML)
The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
