RAG
Definition
Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLM's context window, then ask the question. The model's weights are unchanged; only the prompt is augmented.
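The search-then-paste flow can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: it assumes a toy keyword-overlap score in place of a real embedding search, the names `tokenize`, `retrieve`, and `build_prompt` are hypothetical, the sample corpus is made up, and the final call to a model is left out because it depends on your runtime.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the question.

    Assumption: word overlap stands in for a real vector search.
    """
    q = tokenize(question)
    ranked = sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, corpus: list[str], k: int = 2) -> str:
    """Paste the retrieved text into the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, corpus, k))
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Made-up corpus for illustration only.
corpus = [
    "The office wifi password rotates every Monday.",
    "Expense reports are due on the fifth of each month.",
    "The staging server is rebuilt nightly at 02:00 UTC.",
]

question = "When are expense reports due?"
prompt = build_prompt(question, corpus, k=1)
print(prompt)  # This string is what you would send to the model.
```

Nothing here touches the model: the only thing that changes between a plain question and a RAG question is the text of the prompt.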
Posts that use this term
- Troubleshooting local LLMs and keeping up
The catalog of common local-LLM failures: OOM, slow tok/s, garbage output, instruction drift, RAG miss, tool-call hallucination. Plus where to follow the field as it moves.
- Fine-tuning a model locally
When fine-tuning is the right answer (rarely) and how to do it on consumer hardware: LoRA, QLoRA, MLX-LM, Unsloth. A worked example fine-tuning Llama 3.2 3B on a 16GB Mac.
- Local agents and tool use
Function calling on open models in 2026: which models actually work (Qwen 2.5, Hermes 3, Llama 14B+), why local agents fail when they fail, and how to build defensive scaffolding around them.
- Local RAG and embeddings
A complete local RAG pipeline in 30 lines: nomic-embed-text for embeddings, Chroma for the vector DB, Llama 3.2 for the chat model. Why local RAG often beats cloud RAG for personal knowledge bases.
- Integrating a local LLM into your workflow
Wire your local LLM into VS Code (Continue, Cline), web UIs (Open WebUI, LibreChat, Page Assist), and your own apps via the OpenAI-compatible API. The swap-cloud-for-local pattern in real codebases.
- Every machine can run a local LLM (here's what fits)
Per-tier guide: 8GB integrated graphics, 16GB MacBook Air, 8/12/16/24/32GB VRAM PCs, 24/32/64/128/192GB Macs. Specific models, specific tok/s, specific configs. Every tier runs something useful.
- Picking a local model by task
The 2026 open leaders by task: coding (Qwen 2.5 Coder, DeepSeek-Coder), chat (Llama, Qwen, Mistral), small-model renaissance (Phi-3, Gemma 2), structured output, multimodal, embeddings.
- The local-LLM vocabulary
Parameters, B, dense vs MoE, base vs instruct, tokens, context window, chat template, GGUF, quantization suffixes. After this post you can read any HuggingFace model card.
- The pitch for local LLMs in 2026
Why every engineer should run a local LLM in 2026: privacy, zero marginal cost, lower latency, no rate limits, and offline. Even a 16GB MacBook Air runs Llama 3.2 3B at 30 tok/s.
- Prompt, RAG, fine-tune: three ways to shape a model
Three levers for shaping what an LLM does: prompting (ask better), RAG (give it the right context), fine-tuning (change the weights). What each costs, what each fixes, and how to pick.
- RAG: giving a model memory it doesn't have
RAG is the pattern of fetching relevant text from a search system and putting it in the LLM's context window before asking your question. Not magic, not fine-tuning — just better prompts.
- The context window, and why models hallucinate
An LLM only sees a fixed-size slice of text at a time. When it doesn't know something, it predicts anyway — that's a hallucination, not a bug.