Troubleshooting local LLMs and keeping up
The catalog of common local-LLM failures: OOM, slow tok/s, garbage output, instruction drift, RAG miss, tool-call hallucination. Plus where to follow the field as it moves.

Local LLMs break in a small number of recognizable ways. Once you've hit each error a couple of times, the diagnostic step takes seconds. This post is the catalog of failure modes, plus a routine for keeping up after the series ends.
This is post 13 of 13 in the Local LLMs series. After this one, you have the full toolkit: knowledge, hands-on stack, and a debugging playbook.
OOM (out of memory)

By far the most common failure. The model file plus runtime plus KV cache exceeded VRAM (or unified memory).
Symptom. Ollama or LM Studio errors out with "out of memory" or the model never finishes loading. On Linux, the kernel's OOM killer sometimes kills the process instead. On macOS, the system gets sluggish before erroring.
Diagnosis.
# what's actually loaded right now
ollama ps
The output shows model size and percentage on GPU vs CPU. If a model is 100% GPU but barely fits, raising context length or running another model alongside will OOM.
Fixes, in order:
- Lower context length. ollama run llama3.1:8b defaults to 4K context; bump only if you need it. With KV cache quantization off, 32K context can cost more memory than the weights.
- Enable Q8 KV cache (post 4). Set OLLAMA_KV_CACHE_TYPE=q8_0 in your shell profile and restart Ollama. Roughly halves cache memory. Ollama only applies this when flash attention is on (OLLAMA_FLASH_ATTENTION=1).
- Drop to a smaller quant. Q5_K_M to Q4_K_M to Q3_K_S. Quality cost is real but small at the first step.
- Drop to a smaller model. Llama 3.1 8B → Llama 3.2 3B. Better than running an 8B that doesn't fit.
- Close other GPU apps. On Mac, browsers and Electron apps quietly use VRAM. Quit them.
If OOM happens mid-conversation but not at startup, the KV cache is growing past your headroom as context fills up. Cap the context length lower (num_ctx in Ollama, n_ctx in llama.cpp) or enable cache quant.
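To see why context length matters so much, here is a rough back-of-the-envelope sketch of KV cache size. It assumes the Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128) and ignores runtime overhead, so treat the numbers as estimates, not what your runtime will report. The function name is mine, not part of any library.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_element):
    """Approximate KV cache size: one K and one V vector per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * n_ctx


GIB = 1024 ** 3

# Assumed Llama 3.1 8B shape: 32 layers, 8 KV heads (GQA), head dim 128.
fp16_32k = kv_cache_bytes(32, 8, 128, 32 * 1024, 2.0)
# q8_0 stores 32 int8 values plus one fp16 scale: 34 bytes per 32 values.
q8_32k = kv_cache_bytes(32, 8, 128, 32 * 1024, 34 / 32)

print(f"FP16 cache at 32K context: {fp16_32k / GIB:.2f} GiB")
print(f"q8_0 cache at 32K context: {q8_32k / GIB:.2f} GiB")
```

At FP16 this comes out to 4 GiB for the cache alone, which is in the same range as a Q4 8B weight file. Models without grouped-query attention have more KV heads and fare worse.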
Slow output (low tok/s)
Symptom. A 7B model that should run at 50+ tok/s is doing 5 tok/s.
Diagnosis.
# check gpu vs cpu split
ollama ps
If it shows "100% CPU" or any split with CPU above 0%, your model isn't fully on GPU. That's the issue.
Fixes:
- GPU not detected. Driver issue. Update NVIDIA driver and reboot (Windows/Linux). On Mac, Apple Silicon should always be detected; if you're on an Intel Mac, that's the issue.
- GPU detected but model spilled to CPU. Model + KV cache exceeded VRAM. Lower context, lower quant, or pick a smaller model.
- Wrong runtime backend. On Linux with NVIDIA, make sure CUDA is enabled in your llama.cpp build. On Mac, make sure Metal is enabled.
- Thermal throttling. Laptop hot? GPU clocks down. Plug into power, prop the laptop up, run with a fan nearby. Sounds silly until it solves your problem.
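To put a number on "slow" instead of eyeballing it, you can compute tok/s from the timing fields Ollama includes in a non-streaming API response: eval_count (tokens generated) and eval_duration (nanoseconds). A minimal sketch; the sample values below are made up for illustration, not a real measurement.

```python
def tokens_per_second(response):
    """Generation speed from an Ollama-style response dict.

    eval_count is the number of generated tokens;
    eval_duration is the generation time in nanoseconds.
    """
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    return response["eval_count"] / (duration_ns / 1e9)


# Made-up sample: 250 tokens generated in 5 seconds.
sample = {"eval_count": 250, "eval_duration": 5_000_000_000}
print(f"{tokens_per_second(sample):.1f} tok/s")
```

Measure before and after each fix; a jump from single digits to the expected range tells you the model moved fully onto the GPU.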
Garbage output, infinite repetition
Symptom. Model outputs gibberish, gets stuck repeating the same phrase, or starts generating fake user turns.
Diagnosis. Almost always one of two issues: wrong chat template, or base model used as instruct.
Fixes:
- Verify base vs instruct. If you pulled llama3.1:8b from Ollama, that's the instruct version. If you sideloaded a custom GGUF without the Instruct suffix, you might have the base. Base models are not chat-tuned.
- Mismatched chat template. Ollama auto-detects the template for its catalog models. For sideloaded models, check the GGUF metadata (ollama show <model>) and override the template in a Modelfile if needed.
- Stop tokens missing. The model doesn't know when to halt. Set the stop parameter in API calls. For Llama 3 models: <|eot_id|>. For Qwen: <|im_end|>.
If output is gibberish from token 1 (not just after a few good tokens), the file is corrupt or the wrong format. Re-pull.
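For the stop-token fix, this is roughly what the request body looks like when you talk to Ollama's /api/chat endpoint directly. A sketch under assumptions: the family-to-token table only covers the two families named above, and build_chat_request is a helper I made up, not part of any client library.

```python
# Stop tokens from the list above; extend for other model families as needed.
STOP_TOKENS = {
    "llama3": ["<|eot_id|>"],
    "qwen": ["<|im_end|>"],
}


def build_chat_request(model, messages, family):
    """Build a JSON-serializable body for /api/chat with explicit stop tokens."""
    if family not in STOP_TOKENS:
        raise KeyError(f"no stop tokens known for family {family!r}")
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "options": {"stop": STOP_TOKENS[family]},
    }


request = build_chat_request(
    "llama3.1:8b",
    [{"role": "user", "content": "Name three fruits."}],
    family="llama3",
)
print(request["options"])
```

POST the resulting dict as JSON to localhost:11434/api/chat with whatever HTTP client you already use.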
"Model not following instructions"
Symptom. Model ignores your system prompt, doesn't follow the format you asked for, gives general answers when you wanted specific.
Diagnosis. Smaller models drift more. The fix is almost always prompt engineering, not a different model.
Fixes:
- Move instructions into the user message, not just the system prompt. Smaller open models often weight system prompts less heavily than instruct-tuned cloud models do.
- Add explicit examples (few-shot). Two or three concrete input/output examples beat five paragraphs of instructions.
- Use grammar-constrained output for strict formats. llama.cpp's --grammar flag forces JSON, regex-like patterns, etc. Slower but reliable.
- Lower temperature. A default around 0.7 to 0.8 is creative; for instruction following, drop to 0.2–0.3.
If you've tried all of those and it still won't follow instructions, you likely picked a model below the size threshold for what you're asking. A 3B model can't reliably do nine-step structured outputs. Pick a 7B+ and try again.
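The first two fixes combine naturally. Here is one way to assemble the message list: keep the system prompt, add few-shot pairs as prior turns, and restate the instructions in the final user message. The helper name and the sentiment task are illustrative assumptions, not something from earlier posts.

```python
def build_few_shot_messages(instructions, examples, query):
    """Repeat the instructions in the user turn and prepend worked examples."""
    messages = [{"role": "system", "content": instructions}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    # Restate the instructions next to the real query so a small model sees them last.
    messages.append({"role": "user", "content": f"{instructions}\n\n{query}"})
    return messages


messages = build_few_shot_messages(
    "Reply with one word: positive or negative.",
    [("I loved it.", "positive"), ("Waste of money.", "negative")],
    "The battery died in a day.",
)
# Low temperature for instruction following, per the list above.
options = {"temperature": 0.2}
print(len(messages), "messages")
```

Pass messages and options in the chat request. If the model still drifts, add a third example before reaching for a bigger model.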
Tool calls fail or hallucinate
Symptom. Model invents tool names, returns malformed JSON, or claims results without calling tools.
Covered in post 11. Quick recap of the three fixes:
- Validate tool names server-side, return errors with the actual tool list.
- Try-parse JSON; on failure, feed the parse error back to the model.
- Detect repetition and inject "you already tried this" guidance.
These aren't local-LLM specific, but they're more necessary on local than on Claude.
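As a reminder of what those three fixes look like together, here is a compact sketch of a server-side check. It assumes the model emits tool calls as a JSON object with a "name" key; the function and the tool names are hypothetical, and post 11's version may differ in detail.

```python
import json


def check_tool_call(raw, known_tools, previous_calls):
    """Return (call, error). Exactly one of the two is None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"Invalid JSON: {exc}. Reply with a single JSON object."
    if not isinstance(call, dict):
        return None, "Tool call must be a JSON object."
    name = call.get("name")
    if not isinstance(name, str) or name not in known_tools:
        available = ", ".join(sorted(known_tools))
        return None, f"Unknown tool {name!r}. Available tools: {available}."
    # Canonical form so key order does not hide a repeated call.
    fingerprint = json.dumps(call, sort_keys=True)
    if fingerprint in previous_calls:
        return None, "You already tried this exact call. Try a different approach."
    previous_calls.add(fingerprint)
    return call, None


tools = {"search_notes", "read_file"}  # hypothetical tool names
seen = set()
raw_call = '{"name": "search_notes", "arguments": {"q": "kv cache"}}'
call, error = check_tool_call(raw_call, tools, seen)
print(call, error)
```

When error is not None, send it back to the model as the tool result instead of executing anything.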
RAG retrieves irrelevant chunks
Symptom. Your "ask my notes" pipeline returns documents that look unrelated to the question.
Diagnosis. Almost always one of:
- Different embedding models for index and query. If you indexed with nomic-embed and switched to bge-large for queries, you're comparing different vector spaces. Re-index everything with one model.
- Chunk size too big or too small. Too big and chunks are dilute (multiple ideas). Too small and they lack context. 300–800 tokens is usually right.
- No reranking. The first-pass retrieval finds "loosely related"; a reranker like bge-reranker-base re-scores for "actually answers the question." It adds a second model pass but improves precision substantially.
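The embedding-model mismatch is easy to prevent in code: record the model name alongside the index and refuse queries embedded with anything else. A toy sketch; VectorIndex here is a hypothetical in-memory class, and a real vector store would keep this in its collection metadata.

```python
class VectorIndex:
    """Toy in-memory index that remembers which embedding model built it."""

    def __init__(self, embedding_model):
        self.embedding_model = embedding_model
        self.vectors = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, vector):
        self.vectors.append((doc_id, vector))

    def check_query_model(self, query_model):
        """Refuse queries embedded with a different model than the index."""
        if query_model != self.embedding_model:
            raise ValueError(
                f"index built with {self.embedding_model!r}, "
                f"query embedded with {query_model!r}: re-index or switch back"
            )


index = VectorIndex("nomic-embed-text")
index.add("note-1", [0.1, 0.2, 0.3])
index.check_query_model("nomic-embed-text")  # same model: passes silently
```

Call check_query_model before every search. A loud error beats a week of quietly bad retrieval.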
"It worked yesterday and doesn't today"
Symptom. Same model, same prompt, suddenly different.
Common causes:
- Ollama upgraded. Run ollama --version, check release notes. Behavior changes between versions are rare but real.
- Model upgraded. ollama pull llama3.1:8b may pull a newer revision. The :latest tag is mutable.
- Different KV cache state. Long-running conversations build up context that affects responses. Restart the chat.
- Random seed. With the default nonzero temperature, sampling introduces variance. Set temperature: 0 and seed: 42 for reproducible output.
For production stability, pin model versions: ollama pull llama3.1:8b-instruct-q4_K_M, not llama3.1:8b. Specific quants and versions are stable; latest tags drift.
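If you keep model references in config files, a small lint can flag the floating ones. This is a heuristic of my own, not an Ollama feature: it treats a reference as floating when it has no tag, uses latest, or uses a bare size tag like 8b. It assumes library-style names without a registry host.

```python
import re

# A bare size tag such as "8b", "70b" or "1.5b" says nothing about quant or revision.
BARE_SIZE = re.compile(r"^\d+(?:\.\d+)?[bm]$", re.IGNORECASE)


def is_floating(model_ref):
    """True if the reference is likely to resolve to different files over time."""
    _name, sep, tag = model_ref.partition(":")
    if not sep or tag in ("", "latest"):
        return True
    return BARE_SIZE.match(tag) is not None


for ref in ["llama3.1:8b", "llama3.1:8b-instruct-q4_K_M", "nomic-embed-text:v1.5"]:
    status = "floating" if is_floating(ref) else "pinned"
    print(f"{ref}: {status}")
```

Run it over your config in CI and fail the build on any floating reference.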
Disk filling up
Symptom. "No space left on device" while pulling a model, or in general.
# per-model sizes as Ollama reports them
ollama list
# largest files on disk (weights live in blobs/; manifests/ holds only small JSON files)
du -sh ~/.ollama/models/blobs/* 2>/dev/null | sort -rh | head -10
# remove a model
ollama rm llama3.1:70b
Models accumulate fast. A few 70B variants and your SSD has 100+ GB tied up. Periodically run ollama list and ollama rm what you don't use.
On macOS, also check ~/.cache/huggingface/ if you've been using PEFT or transformers directly. That cache grows separately from Ollama's.
Keeping up
The local-LLM space moves quickly. Where to follow:
- r/LocalLLaMA on Reddit. The community hub. New models, benchmarks, quant releases, hardware reports. Skim weekly.
- HuggingFace trending models. huggingface.co/models?sort=trending. Filter by GGUF and you'll see what people are actually downloading.
- Ollama's library page. ollama.com/library. Curated set, updated when notable models drop.
- Simon Willison's blog. Tracks the field generally, with concrete reports on running things.
- Papers with Code → LLM Leaderboard. For benchmarks. Take with salt; benchmarks are gameable.
Frequency of meaningful change in 2026: a real new model family every 2–4 weeks. A tooling improvement every week. Most of it isn't actionable for users, but the few things that are tend to be obvious (e.g., when Llama 4 dropped, half the ecosystem updated within a month).
I check r/LocalLLaMA on Sunday mornings, scan the trending HuggingFace models when I'm bored on a train, and otherwise let the field come to me. The fundamentals (Ollama, Llama, Qwen, the techniques in this series) change slowly. The model leaderboard changes every week, but you don't need to ride every wave.
A version-pinning template
For reproducible local-LLM setups, pin everything:
# pin ollama version
OLLAMA_VERSION="0.5.7" # check ollama.com for latest stable
# pin model and quantization
ollama pull llama3.1:8b-instruct-q4_K_M
# pin embedding model
ollama pull nomic-embed-text:v1.5
Document these in your project's README. Six months from now, you'll thank yourself.
Closing the series
That's the entire series. Thirteen posts, end to end:
- Post 1 made the case for local in 2026.
- Posts 2–4 covered the vocabulary, compression, and runtime mechanics.
- Posts 5–7 covered model selection, OS prerequisites, and per-tier hardware guidance.
- Post 8 walked through the Hello World install.
- Post 9 wired the model into VS Code, web UIs, and existing apps.
- Post 10 added local RAG.
- Post 11 added agents and tool use.
- Post 12 covered fine-tuning when you actually need it.
- This post is the failure-mode catalog and how to keep up.
The thing I want you to leave with: every machine can run a useful local LLM today. Not a downgraded version. Not a toy. A real tool. The 16GB MacBook Air, the older Windows laptop, the gaming PC with the second-hand 4060. All of them.
Pull a model this week. Use it for something real. The cloud APIs aren't going anywhere; they'll still be there when you need them. But once you've spent a month with a local model in your toolbox, the question of "is this worth running locally?" becomes one you can answer per task, not as a default.
From the dictionary
Terms used in this post
Quick reference for the 23 terms you met above. Each one comes from the AI dictionary.
- API (General)
- Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
- Claude (AI)
- Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.
- e.g. This blog's create-post skill drafts inline using Claude.
- Context Window (NLP)
- The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window — anything outside is invisible.
- Embedding (NLP)
- A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
- Few-Shot (NLP)
- Showing the LLM a handful of input/output examples in the prompt before the real query, so it picks up the pattern. Cheap and effective; usually the next thing to try after a plain prompt.
- Fine-Tuning (ML)
- Continuing to train an existing model on new data, so the new patterns get baked into the weights. Distinct from RAG (which only changes the prompt) and prompting (which changes nothing).
- GGUF (ML)
- GPT-Generated Unified Format. A single-file binary format for storing quantized model weights, tokenizer, and metadata. Used by llama.cpp, Ollama, and LM Studio. A 7B model in Q4 quantization is roughly 4GB; the same model in Q8 is roughly 8GB.
- GPU (General)
- A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s — GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
- llama.cpp (AI)
- A C++ implementation of LLM inference designed to run quantized models on consumer hardware (CPU, CUDA, Metal, Vulkan). The de-facto local inference engine that Ollama and LM Studio both wrap. Supports GGUF format, has a built-in HTTP server, and is the reference for what local LLMs can actually do.
- Large Language Model (AI)
- A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
- e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
- LM Studio (AI)
- A GUI app for running local LLMs, wrapping llama.cpp with a chat interface and a model browser. Easier than Ollama for non-CLI users; same underlying engine. Useful for quick model evaluation; less useful for scripting or production-style workflows.
- Model (ML)
- In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
- Ollama (AI)
- A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
- Parameters (ML)
- The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
- Prompt (NLP)
- The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
- Quantization (ML)
- Compressing model weights from 16-bit floats (FP16) to lower-precision integers (Q8, Q5, Q4) to reduce memory footprint and speed up inference. Q4 cuts size by ~4x with minor quality loss; Q2 saves more but degrades noticeably. The standard trick that makes 70B models fit on consumer hardware.
- RAG (NLP)
- Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLM's context window, then ask the question. The model's weights are unchanged; only the prompt is augmented.
- System Prompt (NLP)
- A special instruction at the start of an LLM conversation that sets role, behaviour, format, and constraints. Most "the model isn't doing what I want" problems are solved here, before reaching for RAG or fine-tuning.
- Token (NLP)
- The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
- Training (ML)
- The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
- Transformer (DL)
- The neural-network architecture every modern LLM is built on. Introduced by Google in the 2017 paper "Attention Is All You Need". GPT, Claude, Gemini, Llama, Mistral — all transformers.
- VRAM (General)
- Video RAM on a discrete GPU. The hard ceiling on which models you can run: an RTX 4090 has 24GB, an A100 has 40-80GB, an H100 has 80GB. A 70B Q4 model needs ~40GB just for weights, before activations and KV cache. On Apple Silicon, unified memory plays the same role.
- Weights (ML)
- The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
- 01
Troubleshooting local LLMs and keeping up
May 15, 2026
- 02
Fine-tuning a model locally
May 12, 2026
- 03
Local agents and tool use
May 8, 2026
- 04
Local RAG and embeddings
May 5, 2026
- 05
Integrating a local LLM into your workflow
May 1, 2026
- 06
Your first local LLM, end to end
April 28, 2026
- 07
Every machine can run a local LLM (here's what fits)
April 24, 2026
- 08
System requirements by OS for local LLMs
April 21, 2026
- 09
Picking a local model by task
April 17, 2026
- 10
Streaming, throughput, and the KV cache
April 14, 2026
- 11
Quantization, distillation, pruning: making models fit
April 10, 2026
- 12
The local-LLM vocabulary
April 7, 2026
- 13
The pitch for local LLMs in 2026
April 3, 2026