May 15, 2026 · 8 min read

Troubleshooting local LLMs (and how to keep up after this series)

The full catalog of local-LLM failures: OOM, slow tok/s, garbage output, instruction drift, bad RAG hits, tool-call hallucination. Plus where to follow the field once you're on your own.

Local LLMs break in a small number of ways, and once you've hit each one a couple of times you'll diagnose it in seconds. Wrong template? You'll know the smell. OOM? You'll know that too. This post is the catalog: every common failure mode, the fix, and the loop for keeping up once the series ends and you're on your own.

This is post 13 of 13 in the Local LLMs series. After this one you've got the full kit: the knowledge, the running stack, and a debugging playbook.

OOM (out of memory)

This is the one you'll hit most. The model file plus the runtime plus the KV cache added up to more than your VRAM (or unified memory) had to give.

Symptom. Ollama or LM Studio errors out with "out of memory," or the model just never finishes loading. On Linux the OOM killer sometimes takes the process outright. On macOS the whole system goes sluggish first, then errors.

Diagnosis.

# what's actually loaded right now

ollama ps

The output shows model size and what percentage sits on GPU versus CPU. If a model is 100% GPU but only just fits, the moment you raise context length or start a second model alongside it, you'll OOM.

Fixes, in order:

Lower context length. ollama run llama3.1:8b defaults to 4K context. Bump it only when you actually need to. With KV cache quantization off, 32K context can eat more memory than the weights themselves.
Enable Q8 KV cache (post 4). Set OLLAMA_KV_CACHE_TYPE=q8_0 in your shell profile and restart Ollama. Cache quantization only takes effect when flash attention is on, so also set OLLAMA_FLASH_ATTENTION=1 if your Ollama version doesn't enable it by default. Roughly halves cache memory.
Drop to a smaller quant. Q5_K_M → Q4_K_M → Q3_K_S. There's a real quality cost, but it's small at the first step.
Drop to a smaller model. Llama 3.1 8B → Llama 3.2 3B. A 3B that fits beats an 8B that doesn't.
Close other GPU apps. On Mac, browsers and Electron apps quietly hold VRAM. Quit them.

If the OOM hits mid-conversation rather than at startup, that's your KV cache growing past your headroom as the context fills up. Cap the context length lower (num_ctx in Ollama, n_ctx in llama.cpp), or turn on cache quantization.
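To see why context length matters so much, here's a back-of-the-envelope estimate of KV cache size. It's a sketch, not what Ollama reports: the layer and head counts are assumed values for a Llama-3.1-8B-shaped model, and q8_0 is rounded to one byte per element (the real format carries a little extra for block scales).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2.0):
    """Approximate KV cache size: one key and one value vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

GIB = 1024 ** 3

# Assumed shape for a Llama-3.1-8B-like model: 32 layers, 8 KV heads, head_dim 128.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (4096, 32768):
    f16 = kv_cache_bytes(n_ctx=n_ctx, **shape) / GIB
    q8 = kv_cache_bytes(n_ctx=n_ctx, bytes_per_elem=1.0, **shape) / GIB
    print(f"n_ctx={n_ctx:>6}: f16 ~ {f16:.2f} GiB, q8_0 ~ {q8:.2f} GiB")
```

Under those assumptions the cache is about 0.5 GiB at 4K and about 4 GiB at 32K in f16, and half that with the Q8 cache. A model without grouped-query attention (say 32 KV heads instead of 8) would need four times as much, which is how the cache ends up bigger than the weights.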

Slow output (low tok/s)

Symptom. A 7B model that should run at 50+ tok/s is crawling along at 5 tok/s.

Diagnosis.

# check gpu vs cpu split

ollama ps

If you see "100% CPU," or any split where CPU is above 0%, your model isn't fully on the GPU. That's your problem right there.
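If you want that check in a script, here's a minimal sketch that reads the CPU share out of a PROCESSOR value. The three value formats ("100% GPU", "100% CPU", "48%/52% CPU/GPU") are the ones I've seen Ollama print; treat them as an assumption and adjust the patterns if your version formats the column differently. The function name is mine, not part of any tool.

```python
import re

def cpu_percent(processor: str) -> int:
    """Return the CPU share from a PROCESSOR value such as '100% GPU' or '48%/52% CPU/GPU'."""
    value = processor.strip()
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", value)
    if split:
        return int(split.group(1))
    single = re.fullmatch(r"(\d+)%\s+(CPU|GPU)", value)
    if single:
        pct, device = int(single.group(1)), single.group(2)
        return pct if device == "CPU" else 100 - pct
    raise ValueError(f"unrecognised PROCESSOR value: {processor!r}")

for value in ("100% GPU", "48%/52% CPU/GPU", "100% CPU"):
    share = cpu_percent(value)
    verdict = "fully on GPU" if share == 0 else f"{share}% on CPU, expect slow output"
    print(f"{value:<18} -> {verdict}")
```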

Fixes:

GPU not detected. Driver issue. Update your NVIDIA driver and reboot (Windows/Linux). On Mac, Apple Silicon should always get detected. If you're on an Intel Mac, that's the issue.
GPU detected but model spilled to CPU. Model plus KV cache went over VRAM. Lower context, lower quant, or pick a smaller model.
Wrong runtime backend. On Linux with NVIDIA, make sure CUDA is enabled in your llama.cpp build. On Mac, make sure Metal is enabled.
Thermal throttling. Laptop running hot? The GPU clocks itself down. Plug into power, prop the laptop up, put a fan nearby. Sounds silly right up until it fixes your tok/s.
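To check whether a fix actually moved the number, measure it rather than eyeballing the stream. Ollama's non-streaming API responses carry eval_count (tokens generated) and eval_duration (nanoseconds), so tok/s falls out of one division. A small sketch, with made-up sample numbers standing in for a real response:

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from eval_count (tokens) and eval_duration (nanoseconds)."""
    return response["eval_count"] / response["eval_duration"] * 1e9

# Hypothetical numbers standing in for a real /api/generate response.
sample = {"eval_count": 290, "eval_duration": 5_800_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```

Run it before and after each change (context, quant, closing apps) on the same prompt, so you're comparing like with like.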

Garbage output, infinite repetition

Symptom. The model spits gibberish, gets stuck repeating the same phrase, or starts inventing fake user turns.

Diagnosis. It's almost always one of two things: the wrong chat template, or a base model being used as an instruct model.

Fixes:

Verify base vs instruct. If you pulled llama3.1:8b from Ollama, that's the instruct version, so you're fine. If you sideloaded a custom GGUF without the Instruct suffix, you might be running the base model, and base models aren't chat-tuned.
Mismatched chat template. Ollama auto-detects the template for its own catalog models. For sideloaded ones, check the GGUF metadata with ollama show <model> and override the template in a Modelfile if you need to.
Stop tokens missing. The model doesn't know when to stop talking. Set the stop parameter in your API calls. For Llama 3 models it's <|eot_id|>. For Qwen it's <|im_end|>.

If the output is gibberish from the very first token, not just after a few good ones, the file is corrupt or in the wrong format. Re-pull it.
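For the stop-token fix, here's roughly what the request body looks like when you set it yourself. This is a sketch against Ollama's /api/chat endpoint, where stop sequences go inside options; the family labels and the helper name are my own shorthand, not anything Ollama defines.

```python
import json

# Stop strings named in the text; the dictionary keys are hypothetical labels.
STOP_TOKENS = {"llama3": ["<|eot_id|>"], "qwen": ["<|im_end|>"]}

def build_chat_payload(model: str, family: str, user_text: str) -> dict:
    """Build a non-streaming /api/chat body with explicit stop sequences."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "stream": False,
        "options": {"stop": STOP_TOKENS[family]},
    }

payload = build_chat_payload("llama3.1:8b", "llama3", "Summarise this in one line.")
print(json.dumps(payload, indent=2))
```

POST that to http://localhost:11434/api/chat. If the runaway output stops, the template or stop configuration baked into the model was the culprit, and a Modelfile override is the permanent fix.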

"Model not following instructions"

Symptom. The model ignores your system prompt, skips the format you asked for, or hands you a general answer when you wanted something specific.

Diagnosis. Smaller models drift more. And the fix is almost always prompt engineering, not a different model.

Fixes:

Move instructions into the user message, not just the system prompt. Smaller open models often weight the system prompt less heavily than instruct-tuned cloud models do.
Add explicit examples (few-shot). Two or three concrete input/output examples beat five paragraphs of instructions every time.
Use grammar-constrained output for strict formats. llama.cpp's --grammar flag forces JSON, regex-like patterns, and so on. It's slower but reliable.
Lower temperature. A 0.7 default is creative. For instruction following, drop it to 0.2-0.3.
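Here's one way the first, second, and fourth fixes can look together: the instruction goes in the first user turn, the examples are replayed as earlier turns, and temperature drops to 0.2. The ticket examples and the helper name are invented for illustration; the body shape assumes Ollama's /api/chat.

```python
def few_shot_messages(instruction, examples, query):
    """Put the instruction in the first user turn, then replay examples as prior turns."""
    messages = []
    for i, (example_in, example_out) in enumerate(examples):
        prefix = f"{instruction}\n\n" if i == 0 else ""
        messages.append({"role": "user", "content": prefix + example_in})
        messages.append({"role": "assistant", "content": example_out})
    # With no examples, the instruction rides along with the real query instead.
    final_prefix = "" if examples else f"{instruction}\n\n"
    messages.append({"role": "user", "content": final_prefix + query})
    return messages

# Hypothetical examples for a ticket-classification task.
examples = [
    ("Ticket: app crashes on login", '{"severity": "high", "area": "auth"}'),
    ("Ticket: typo on pricing page", '{"severity": "low", "area": "web"}'),
]
messages = few_shot_messages(
    "Classify the ticket. Reply with JSON only.", examples, "Ticket: export times out"
)
request_body = {
    "model": "llama3.1:8b",
    "messages": messages,
    "stream": False,
    "options": {"temperature": 0.2},
}
print(len(messages), "messages")
```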

Tried all of that and it still won't behave? Then you probably picked a model below the size threshold for what you're asking of it. A 3B model can't reliably produce nine-step structured outputs. Grab a 7B or bigger and try again.

Tool calls fail or hallucinate

Symptom. The model invents tool names, returns malformed JSON, or claims it got results without ever calling a tool.

We covered this in post 11. Quick recap of the three fixes:

Validate tool names server-side, and return errors that include the actual tool list.
Try to parse the JSON, and on failure feed the parse error straight back to the model.
Detect repetition and inject a "you already tried this" nudge.

None of this is local-specific, but you'll need it more on local than you ever did on Claude.
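The three fixes fit in one small guard function. This is a sketch of the idea rather than the code from post 11: the tool names are hypothetical, and it assumes the model emits a single JSON object with name and arguments keys.

```python
import json

TOOLS = {"search_notes", "read_file"}  # hypothetical tool names

def check_tool_call(raw: str, history: list) -> tuple:
    """Return (ok, message). On failure, message is what you feed back to the model."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as err:
        return False, f"Invalid JSON: {err}. Reply with one JSON object."
    name = call.get("name") if isinstance(call, dict) else None
    if not isinstance(name, str) or name not in TOOLS:
        return False, f"Unknown tool. Available tools: {sorted(TOOLS)}"
    key = json.dumps(call, sort_keys=True)
    if key in history:
        return False, "You already tried this exact call. Use the earlier result or try something else."
    history.append(key)
    return True, key

history = []
print(check_tool_call('{"name": "search_notes", "arguments": {"q": "gpu"}}', history))
print(check_tool_call('{"name": "search_notes", "arguments": {"q": "gpu"}}', history))
print(check_tool_call('{"name": "web_search", "arguments": {}}', history))
print(check_tool_call('{"name": "read_file", ', history))
```

Only the first call passes. The other three come back with a message the model can act on: the repeat nudge, the real tool list, and the parse error.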

RAG retrieves irrelevant chunks

Symptom. Your "ask my notes" pipeline keeps returning documents that have nothing to do with the question.

Diagnosis. It's almost always one of these:

Different embedding models for index and query. Index with nomic-embed, then switch to bge-large for queries, and you're comparing two different vector spaces. Re-index everything with one model.
Chunk size too big or too small. Too big and the chunks get diluted, with multiple ideas mashed together. Too small and they lose context. 300-800 tokens is usually the sweet spot.
No reranking. First-pass retrieval finds the "loosely related" stuff. A reranker like bge-reranker-base re-scores for "actually answers the question." It costs a second model pass, but the precision jump is huge.
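If chunk size is the suspect, it helps to have a chunker where the size is one number you can turn. A minimal sketch: it counts whitespace-separated words as a stand-in for tokens (an assumption; real token counts run somewhat higher), and the overlap is my addition so a sentence cut at a boundary still appears whole in one chunk.

```python
def chunk_words(text: str, size: int = 400, overlap: int = 50) -> list:
    """Split text into chunks of about `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Synthetic 1000-word document, just to show the chunk boundaries.
document = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_words(document)
print(len(chunks), [len(c.split()) for c in chunks])
```

Re-index with a couple of different sizes and run the same handful of questions against each. Retrieval quality is much easier to judge side by side than in the abstract.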

"It worked yesterday and doesn't today"

Symptom. Same model, same prompt, and suddenly it's different.

Common causes:

Ollama upgraded. Run ollama --version and check the release notes. Behavior changes between versions are rare, but they're real.
Model upgraded. ollama pull llama3.1:8b can quietly pull a newer revision. Tags such as :8b and :latest are mutable.
Different KV cache state. Long-running conversations build up context that shifts the responses. Restart the chat.
Random seed. The default 0.7 temperature introduces variance. Set temperature: 0 and seed: 42 for reproducible output.

For anything production-facing, pin your model versions: ollama pull llama3.1:8b-instruct-q4_K_M, not llama3.1:8b. Specific quants and versions stay stable. Latest tags drift on you.
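Once the model is pinned, you can test whether your setup is actually deterministic instead of assuming it. A sketch, assuming a local Ollama server on the default port and the pinned tag above already pulled; the helper names are mine. Even with temperature 0 and a fixed seed, output can still differ across hardware or Ollama versions, so read this as a same-machine check.

```python
import json
import urllib.request

def generate(prompt: str, model: str = "llama3.1:8b-instruct-q4_K_M") -> str:
    """One non-streaming completion from a local Ollama server with sampling pinned."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0, "seed": 42},
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def is_reproducible(prompt: str, run=generate, attempts: int = 3) -> bool:
    """True when every attempt returns exactly the same text."""
    outputs = {run(prompt) for _ in range(attempts)}
    return len(outputs) == 1

# With the server running:
#   is_reproducible("Name three prime numbers.")
```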

Disk filling up

Symptom. "No space left on device" while you're pulling a model, or just in general.

# what's eating disk: the weights live in blobs/ (manifests/ only holds small JSON files)

du -sh ~/.ollama/models/blobs/* 2>/dev/null | sort -rh | head -10

# per-model sizes, by name

ollama list

# remove a model

ollama rm llama3.1:70b

Models pile up fast. A couple of 70B variants and you've got 100+ GB of SSD tied up. Run ollama list now and then, and ollama rm whatever you're not using.

On macOS and Linux, also check ~/.cache/huggingface/ if you've been using PEFT or transformers directly. That cache grows on its own, separate from Ollama's.

Keeping up

The local-LLM space moves fast. Here's where I follow it:

r/LocalLLaMA on Reddit. The community hub. New models, benchmarks, fresh quants, hardware reports. Skim it weekly.
HuggingFace trending models. huggingface.co/models?sort=trending. Filter by GGUF and you'll see what people are actually downloading.
Ollama's library page. ollama.com/library. A curated set, updated when something notable drops.
Simon Willison's blog. Tracks the field broadly, with concrete reports on actually running things.
Papers with Code → LLM Leaderboard. For benchmarks. Take them with a grain of salt: benchmarks are gameable.

How often does anything meaningful change in 2026? A real new model family every 2-4 weeks. A tooling improvement every week. Most of it isn't actionable for users, and the few things that are tend to be obvious (when Llama 4 dropped, half the ecosystem updated within a month).

Me, I check r/LocalLLaMA on Sunday mornings, scan the trending HuggingFace models when I'm bored on a train, and otherwise let the field come to me. The fundamentals (Ollama, Llama, Qwen, the techniques in this series) change slowly. The leaderboard churns every week, but you don't have to ride every wave.

A version-pinning template

For reproducible local-LLM setups, pin everything:

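# pin ollama version (check ollama.com for latest stable)

OLLAMA_VERSION="0.5.7"

# on Linux, the install script reads OLLAMA_VERSION

curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION="$OLLAMA_VERSION" sh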

# pin model and quantization

ollama pull llama3.1:8b-instruct-q4_K_M

# pin embedding model

ollama pull nomic-embed-text:v1.5

Write these into your project's README. Six months from now, you'll thank yourself.

Closing the series

That's the whole thing. Thirteen posts, start to finish:

Post 1 made the case for going local in 2026.
Posts 2-4 covered the vocabulary, compression, and runtime mechanics.
Posts 5-7 covered model selection, OS prerequisites, and per-tier hardware guidance.
Post 8 walked through the Hello World install.
Post 9 wired the model into VS Code, web UIs, and existing apps.
Post 10 added local RAG.
Post 11 added agents and tool use.
Post 12 covered fine-tuning for when you actually need it.
This one is the failure-mode catalog and the keeping-up loop.

Here's what I want you to walk away with. Every machine can run a useful local LLM today. Not a downgraded version. Not a toy. A real tool. The 16GB MacBook Air, the older Windows laptop, the gaming PC with the second-hand 4060. All of them.

So pull a model this week and use it for something real. The cloud APIs aren't going anywhere; they'll be right there when you need them. But spend a month with a local model in your toolbox, and the cloud stops being the default you reach for: "is this worth running locally?" becomes a question you answer per task. That's the whole point. Go run something.

From the dictionary

Terms used in this post

Quick reference for the 23 terms you met above. Each one comes from the AI dictionary.

API (General): Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
Claude (AI): Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent. Example: this blog's create-post skill drafts inline using Claude.
Context Window (NLP): The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
Embedding (NLP): A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
Few-Shot (NLP): Showing the LLM a handful of input/output examples in the prompt before the real query, so it picks up the pattern. Cheap and effective; usually the next thing to try after a plain prompt.
Fine-Tuning (ML): Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
GGUF (ML): GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
GPU (General): A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s: GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
llama.cpp (AI): A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
Large Language Model (AI): A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. Example: Claude is an LLM. It reads your message as tokens and generates a response one token at a time.
LM Studio (AI): A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
Model (ML): In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Ollama (AI): A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
Parameters (ML): The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
Prompt (NLP): The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
Quantization (ML): Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
RAG (NLP): Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLM's context window, then ask the question. The model's weights are unchanged; only the prompt is augmented.
System Prompt (NLP): A special instruction at the start of an LLM conversation that sets role, behaviour, format, and constraints. Most "the model isn't doing what I want" problems are solved here, before reaching for RAG or fine-tuning.
Token (NLP): The unit an LLM operates on, roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
Training (ML): The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
Transformer (DL): The neural-network architecture every modern LLM is built on. Introduced by Google in the 2017 paper "Attention Is All You Need". GPT, Claude, Gemini, Llama, Mistral: all transformers.
VRAM (General): Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
Weights (ML): The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
