Kundan's Notebook

Troubleshooting local LLMs (and how to keep up after this series)

Local LLMs break in a small number of ways, and once you've hit each one a couple of times you'll diagnose it in seconds. Wrong template? You'll know the smell. OOM? You'll know that too. This post is the catalog: every common failure mode, the fix, and the loop for keeping up once the series ends and you're on your own. This is post 13 of 13 in the Local LLMs series. After this one you've got the full kit: the knowledge, the running stack, and a debugging playbook. OOM (out of memory) This is the one you'll hit most. The model file plus the runtime plus the KV cache added up to more than your VRAM (or unified memory) had to give. Symptom. Ollama or LM Studio errors out with "out of memory," or the model just never finishes loading. On Linux the OOM killer sometimes takes the process outright. On macOS the whole system goes sluggish first, then errors. Diagnosis. The output shows model size and what percentage sits on GPU versus CPU. If a model is 100% GPU but only just fits, the moment you raise context length or start a second model alongside it, you'll OOM. Fixes, in order: Lower context length. ollama run llama3.1:8b defaults to 4K context. Bump it only when you actually need to. With KV cache off, 32K context can eat more memory than the weights themselves. Enable Q8 KV cache (post 4). Set OLLAMAKVCACHETYPE=q80 in your shell profile and restart Ollama. Roughly halves cache memory. Drop to a smaller quant. Q5KM to Q4KM to Q3KS. There's a real quality cost, but it's small at the first step. Drop to a smaller model. Llama 3.1 8B → Llama 3.2 3B. A 3B that fits beats an 8B that doesn't. Close other GPU apps. On Mac, browsers and Electron apps quietly hold VRAM. Quit them. If the OOM hits mid-conversation rather than at startup, that's your KV cache growing past your headroom as the context fills up. Cap n_ctx lower, or turn on cache quant. Slow output (low tok/s) Symptom. A 7B model that should run at 50+ tok/s is crawling along at 5 tok/s. Diagnosis. If you see "100% CPU," or any split where CPU is above 0%, your model isn't fully on the GPU. That's your problem right there. Fixes: GPU not detected. Driver issue. Update your NVIDIA driver and reboot (Windows/Linux). On Mac, Apple Silicon should always get detected. If you're on an Intel Mac, that's the issue. GPU detected but model spilled to CPU. Model plus KV cache went over VRAM. Lower context, lower quant, or pick a smaller model. Wrong runtime backend. On Linux with NVIDIA, make sure CUDA is enabled in your llama.cpp build. On Mac, make sure Metal is enabled. Thermal throttling. Laptop running hot? The GPU clocks itself down. Plug into power, prop the laptop up, put a fan nearby. Sounds silly right up until it fixes your tok/s. Garbage output, infinite repetition Symptom. The model spits gibberish, gets stuck repeating the same phrase, or starts inventing fake user turns. Diagnosis. It's almost always one of two things: the wrong chat template, or a base model being used as an instruct model. Fixes: Verify base vs instruct. If you pulled llama3.1:8b from Ollama, that's the instruct version, you're fine. If you sideloaded a custom GGUF without the Instruct suffix, you might be running the base model, and base models aren't chat-tuned. Mismatched chat template. Ollama auto-detects the template for its own catalog models. For sideloaded ones, check the GGUF metadata with ollama show <model> and override the template in a Modelfile if you need to. Stop tokens missing. The model doesn't know when to stop talking. Set the stop parameter in your API calls. For Llama models it's <|eotid|>. For Qwen it's <|imend|>. If the output is gibberish from the very first token, not just after a few good ones, the file is corrupt or in the wrong format. Re-pull it. "Model not following instructions" Symptom. The model ignores your system prompt, skips the format you asked for, or hands you a general answer when you wanted something specific. Diagnosis. Smaller models drift more. And the fix is almost always prompt engineering, not a different model. Fixes: Move instructions into the user message, not just the system prompt. Smaller open models often weight the system prompt less heavily than instruct-tuned cloud models do. Add explicit examples (few-shot). Two or three concrete input/output examples beat five paragraphs of instructions every time. Use grammar-constrained output for strict formats. llama.cpp's --grammar flag forces JSON, regex-like patterns, and so on. It's slower but reliable. Lower temperature. A 0.7 default is creative. For instruction following, drop it to 0.2-0.3. Tried all of that and it still won't behave? Then you probably picked a model below the size threshold for what you're asking of it. A 3B model can't reliably produce nine-step structured outputs. Grab a 7B or bigger and try again. Tool calls fail or hallucinate Symptom. The model invents tool names, returns malformed JSON, or claims it got results without ever calling a tool. We covered this in post 11. Quick recap of the three fixes: Validate tool names server-side, and return errors that include the actual tool list. Try to parse the JSON, and on failure feed the parse error straight back to the model. Detect repetition and inject a "you already tried this" nudge. None of this is local-specific, but you'll need it more on local than you ever did on Claude. RAG retrieves irrelevant chunks Symptom. Your "ask my notes" pipeline keeps returning documents that have nothing to do with the question. Diagnosis. It's almost always one of these: Same embedding model for index and query. Index with nomic-embed, then switch to bge-large for queries, and you're comparing two different vector spaces. Re-index everything with one model. Chunk size too big or too small. Too big and the chunks get dilute, multiple ideas mashed together. Too small and they lose context. 300-800 tokens is usually the sweet spot. No reranking. First-pass retrieval finds the "loosely related" stuff. A reranker like bge-reranker-base re-scores for "actually answers the question." It costs a second model pass, but the precision jump is huge. "It worked yesterday and doesn't today" Symptom. Same model, same prompt, and suddenly it's different. Common causes: Ollama upgraded. Run ollama --version and check the release notes. Behavior changes between versions are rare, but they're real. Model upgraded. ollama pull llama3.1:8b can quietly pull a newer revision. The :latest tag is mutable. Different KV cache state. Long-running conversations build up context that shifts the responses. Restart the chat. Random seed. The default 0.7 temperature introduces variance. Set temperature: 0 and seed: 42 for reproducible output. For anything production-facing, pin your model versions: ollama pull llama3.1:8b-instruct-q4KM, not llama3.1:8b. Specific quants and versions stay stable. Latest tags drift on you. Disk filling up Symptom. "No space left on device" while you're pulling a model, or just in general. Models pile up fast. A couple of 70B variants and you've got 100+ GB of SSD tied up. Run ollama list now and then, and ollama rm whatever you're not using. On macOS, also check ~/.cache/huggingface/ if you've been using PEFT or transformers directly. That cache grows on its own, separate from Ollama's. Keeping up The local-LLM space moves fast. Here's where I follow it: r/LocalLLaMA on Reddit. The community hub. New models, benchmarks, fresh quants, hardware reports. Skim it weekly. HuggingFace trending models. huggingface.co/models?sort=trending. Filter by GGUF and you'll see what people are actually downloading. Ollama's library page. ollama.com/library. A curated set, updated when something notable drops. Simon Willison's blog. Tracks the field broadly, with concrete reports on actually running things. Papers with Code → LLM Leaderboard. For benchmarks. Take them with salt, benchmarks are gameable. How often does anything meaningful change in 2026? A real new model family every 2-4 weeks. A tooling improvement every week. Most of it isn't actionable for users, and the few things that are tend to be obvious (when Llama 4 dropped, half the ecosystem updated within a month). Me, I check r/LocalLLaMA on Sunday mornings, scan the trending HuggingFace models when I'm bored on a train, and otherwise let the field come to me. The fundamentals (Ollama, Llama, Qwen, the techniques in this series) change slowly. The leaderboard churns every week, but you don't have to ride every wave. A version-pinning template For reproducible local-LLM setups, pin everything: Write these into your project's README. Six months from now, you'll thank yourself. Closing the series That's the whole thing. Thirteen posts, start to finish: Post 1 made the case for going local in 2026. Posts 2-4 covered the vocabulary, compression, and runtime mechanics. Posts 5-7 covered model selection, OS prerequisites, and per-tier hardware guidance. Post 8 walked through the Hello World install. Post 9 wired the model into VS Code, web UIs, and existing apps. Post 10 added local RAG. Post 11 added agents and tool use. Post 12 covered fine-tuning for when you actually need it. This one is the failure-mode catalog and the keeping-up loop. Here's what I want you to walk away with. Every machine can run a useful local LLM today. Not a downgraded version. Not a toy. A real tool. The 16GB MacBook Air, the older Windows laptop, the gaming PC with the second-hand 4060. All of them. So pull a model this week and use it for something real. The cloud APIs aren't going anywhere, they'll be right there when you need them. But spend a month with a local model in your toolbox and "is this worth running locally?" stops being a default you reach for and becomes a question you answer per task. That's the whole point. Go run something.

Read post →

May 15, 2026AI Debugging LLM Local Llms Troubleshooting

Troubleshooting local LLMs (and how to keep up after this series)

Browse

AI Foundations

AI Running

Local Llms

Setup Toolbox