Local LLMs break in a small number of recognizable ways. Once you've hit each error a couple of times, the diagnostic step takes seconds. This post is the catalog of failure modes and the loop of how to keep up after the series ends. This is post 13 of 13 in the Local LLMs series. After this one, you have the full toolkit: knowledge, hands-on stack, and a debugging playbook. OOM (out of memory) By far the most common failure. The model file plus runtime plus KV cache exceeded VRAM (or unified memory). Symptom. Ollama or LM Studio errors out with "out of memory" or the model never finishes loading. On Linux, sometimes the Linux OOM killer takes the process. On macOS, the system gets sluggish before erroring. Diagnosis. The output shows model size and percentage on GPU vs CPU. If a model is 100% GPU but barely fits, raising context length or running another model alongside will OOM. Fixes, in order: Lower context length. ollama run llama3.1:8b defaults to 4K context; bump only if you need it. With KV cache off, 32K context can cost more memory than the weights. Enable Q8 KV cache (post 4). Set OLLAMAKVCACHETYPE=q80 in your shell profile and restart Ollama. Roughly halves cache memory. Drop to a smaller quant. Q5KM to Q4KM to Q3KS. Quality cost is real but small at the first step. Drop to a smaller model. Llama 3.1 8B → Llama 3.2 3B. Better than running an 8B that doesn't fit. Close other GPU apps. On Mac, browsers and Electron apps quietly use VRAM. Quit them. If OOM happens mid-conversation but not at startup, the KV cache is growing past your headroom as context fills up. Cap n_ctx lower or enable cache quant. Slow output (low tok/s) Symptom. A 7B model that should run at 50+ tok/s is doing 5 tok/s. Diagnosis. If "100% CPU" or some split with CPU > 0%, your model isn't fully on GPU. That's the issue. Fixes: GPU not detected. Driver issue. Update NVIDIA driver and reboot (Windows/Linux). On Mac, Apple Silicon should always be detected , if you're on Intel Mac, that's the issue. GPU detected but model spilled to CPU. Model + KV cache exceeded VRAM. Lower context, lower quant, or pick a smaller model. Wrong runtime backend. On Linux with NVIDIA, make sure CUDA is enabled in your llama.cpp build. On Mac, make sure Metal is enabled. Thermal throttling. Laptop hot? GPU clocks down. Plug into power, prop the laptop up, run with a fan nearby. Sounds silly until it solves your problem. Garbage output, infinite repetition Symptom. Model outputs gibberish, gets stuck repeating the same phrase, or starts generating fake user turns. Diagnosis. Almost always one of two issues: wrong chat template, or base model used as instruct. Fixes: Verify base vs instruct. If you pulled llama3.1:8b from Ollama, that's the instruct version. If you sideloaded a custom GGUF without the Instruct suffix, you might have the base. Base models are not chat-tuned. Mismatched chat template. Ollama auto-detects template for its catalog models. For sideloaded models, check the GGUF metadata (ollama show <model>) and override the template in a Modelfile if needed. Stop tokens missing. The model doesn't know when to halt. Set stop parameter in API calls. For Llama models: <|eotid|>. For Qwen: <|imend|>. If output is gibberish from token 1 (not just after a few good tokens), the file is corrupt or the wrong format. Re-pull. "Model not following instructions" Symptom. Model ignores your system prompt, doesn't follow the format you asked for, gives general answers when you wanted specific. Diagnosis. Smaller models drift more. The fix is almost always prompt engineering, not a different model. Fixes: Move instructions into the user message, not just the system prompt. Smaller open models often weight system prompts less heavily than instruct-tuned cloud models do. Add explicit examples (few-shot). Two or three concrete input/output examples beats five paragraphs of instructions. Use grammar-constrained output for strict formats. llama.cpp's --grammar flag forces JSON, regex-like patterns, etc. Slower but reliable. Lower temperature. A 0.7 default is creative; for instruction following, drop to 0.2–0.3. If you've tried all of those and it still won't follow instructions, you likely picked a model below the size threshold for what you're asking. A 3B model can't reliably do nine-step structured outputs. Pick a 7B+ and try again. Tool calls fail or hallucinate Symptom. Model invents tool names, returns malformed JSON, or claims results without calling tools. Covered in post 11. Quick recap of the three fixes: Validate tool names server-side, return errors with the actual tool list. Try-parse JSON, on failure feed the parse error back. Detect repetition and inject "you already tried this" guidance. These aren't local-LLM specific, but they're more necessary on local than on Claude. RAG retrieves irrelevant chunks Symptom. Your "ask my notes" pipeline returns documents that look unrelated to the question. Diagnosis. Almost always one of: Same embedding model for index and query. If you indexed with nomic-embed and switched to bge-large for queries, you're comparing different vector spaces. Re-index everything with one model. Chunk size too big or too small. Too big and chunks are dilute (multiple ideas). Too small and they lack context. 300–800 tokens is usually right. No reranking. The first-pass retrieval finds "loosely related" , a reranker like bge-reranker-base re-scores for "actually answers the question." Adds a second model pass but huge precision improvement. "It worked yesterday and doesn't today" Symptom. Same model, same prompt, suddenly different. Common causes: Ollama upgraded. Run ollama --version, check release notes. Behavior changes between versions are rare but real. Model upgraded. ollama pull llama3.1:8b may pull a newer revision. The :latest tag is mutable. Different KV cache state. Long-running conversations build up context that affects responses. Restart the chat. Random seed. Default temperature 0.7 introduces variance. Set temperature: 0 and seed: 42 for reproducible output. For production stability, pin model versions: ollama pull llama3.1:8b-instruct-q4KM not llama3.1:8b. Specific quants and versions are stable; latest tags drift. Disk filling up Symptom. "No space left on device" while pulling a model, or in general. Models accumulate fast. A few 70B variants and your SSD has 100+ GB tied up. Periodically ollama list and ollama rm what you don't use. On macOS, also check ~/.cache/huggingface/ if you've been using PEFT or transformers directly. That cache grows separately from Ollama's. Keeping up The local-LLM space moves quickly. Where to follow: r/LocalLLaMA on Reddit. The community hub. New models, benchmarks, Quants released, hardware reports. Skim weekly. HuggingFace trending models. huggingface.co/models?sort=trending. Filter by GGUF and you'll see what people are actually downloading. Ollama's library page. ollama.com/library. Curated set, updated when notable models drop. Simon Willison's blog. Tracks the field generally, with concrete reports on running things. Papers with Code → LLM Leaderboard. For benchmarks. Take with salt; benchmarks are gameable. Frequency of meaningful change in 2026: a real new model family every 2–4 weeks. A tooling improvement every week. Most of it isn't actionable for users, but the few things that are tend to be obvious (e.g., when Llama 4 dropped, half the ecosystem updated within a month). I check r/LocalLLaMA on Sunday mornings, scan the trending HuggingFace models when I'm bored on a train, and otherwise let the field come to me. The fundamentals (Ollama, Llama, Qwen, the techniques in this series) change slowly. The model leaderboard changes every week, but you don't need to ride every wave. A version-pinning template For reproducible local-LLM setups, pin everything: Document these in your project's README. Six months from now, you'll thank yourself. Closing the series That's the entire series. Thirteen posts, end to end: Post 1 made the case for local in 2026. Posts 2–4 covered the vocabulary, compression, and runtime mechanics. Posts 5–7 covered model selection, OS prerequisites, and per-tier hardware guidance. Post 8 walked through the Hello World install. Post 9 wired the model into VS Code, web UIs, and existing apps. Post 10 added local RAG. Post 11 added agents and tool use. Post 12 covered fine-tuning when you actually need it. This post is the failure-mode catalog and how to keep up. The thing I want you to leave with: every machine can run a useful local LLM today. Not a downgraded version. Not a toy. A real tool. The 16GB MacBook Air, the older Windows laptop, the gaming PC with the second-hand 4060. All of them. Pull a model this week. Use it for something real. The cloud APIs aren't going anywhere , they'll still be there when you need them. But once you've spent a month with a local model in your toolbox, the question of "is this worth running locally?" becomes one you can answer per task, not as a default.
Kundan's Notebook
Notes from a working engineer: mostly code and AI, sometimes credit cards
Tuesday, June 2, 2026 · 38 posts
Browse
All series →Software engineering — AI, JS/TS, React Native, DevOps, shipping real things.
AI Foundations
8AI explained from zero. What AI, ML, and deep learning actually are, what a model is, how training and inference differ, and the real meaning of LLMs, context windows, RAG, and fine-tuning.
AI Running
7Where AI actually runs and what it costs. Cloud vs local vs edge, the major models in 2026, what your laptop needs to host them, the runtimes, API token economics, and what really leaves your machine.
Local Llms
13Run a useful LLM on the laptop you already have. Hardware tiers, model picks, runtimes, RAG and tools, fine-tuning, and how to debug it when it breaks.
Setup Toolbox
10One-stop install guides for the CLI tools every other post on this blog assumes you have. Cross-platform: macOS, Linux, Windows. Each post covers install, configure, verify, and the gotchas.
