Local RAG and embeddings
A complete local RAG pipeline in 30 lines: nomic-embed-text for embeddings, Chroma for the vector DB, Llama 3.2 for the chat model. Why local RAG often beats cloud RAG for personal knowledge bases.

A local LLM doesn't know your notes folder, your codebase, your company wiki, or last year's email. Retrieval-Augmented Generation is how you fix that, entirely on your machine, in about an afternoon.
This is post 10 of 13 in the Local LLMs series. After this one, you'll have a working "ask my notes folder a question" stack with no cloud dependency.
The two ideas behind RAG

The chat model has a context window, typically 8K to 128K tokens. Anything not in that window, the model doesn't know. So if you want it to answer questions about a 200-page document, you usually can't just paste the whole document in.
RAG splits the problem in two:
- Embed. A small model (an "embedding model") turns each chunk of your document into a vector (a list of, say, 768 numbers). Two chunks with similar meaning end up with similar vectors.
- Retrieve. When you ask a question, embed the question into the same kind of vector. Find the document chunks whose vectors are closest. Stuff those chunks into the chat model's prompt as context. Let the chat model answer using them.
That's the entire idea. The chat model is doing what it always does (answering with what's in its context); the trick is putting the right context in.
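To make "closest vectors" concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings (a vector database may use a different distance by default). The three-number vectors and the chunk names are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the names of the k chunks whose vectors are closest to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda name: cosine_similarity(query_vec, chunk_vecs[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy "embeddings", invented for this example.
chunks = {
    "kafka offsets": [0.9, 0.1, 0.0],
    "kafka consumers": [0.8, 0.3, 0.1],
    "sourdough recipe": [0.0, 0.2, 0.9],
}
question = [1.0, 0.2, 0.0]  # stands in for the embedded question

print(top_k(question, chunks))
```

The two Kafka chunks rank above the recipe because their vectors point in nearly the same direction as the question's vector. Those top-ranked chunks are what gets pasted into the prompt.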
The all-local stack
Three components, all running on your machine:
- Embedding model. nomic-embed-text (137M parameters, ~280 MB at FP16). Runs in Ollama. Apache 2.0.
- Vector database. Stores embeddings and answers nearest-neighbor queries. Many options; we'll use Chroma for simplicity.
- Chat model. Llama 3.2 3B, Qwen 2.5 7B, or whatever you picked from post 5.
Disk footprint: ~3 GB total. Memory at runtime: ~6 GB. Runs comfortably on every tier from post 7 above the 8GB minimum.
Step 1: pull the embedding model
# pull the embeddings model
ollama pull nomic-embed-text
Verify it works:
# request an embedding for a sentence
curl http://localhost:11434/api/embeddings -d '{"model":"nomic-embed-text","prompt":"local llms are fast"}'
You should get back a JSON response with an embedding field containing 768 numbers. Those 768 numbers are the vector representation of that sentence.
Step 2: pick a vector database
Three reasonable choices:
- Chroma. Pure Python, embedded (no separate server), simplest. Pick this for first projects.
- LanceDB. Embedded, very fast, columnar storage on disk. Pick when your dataset is bigger or you want SQL-like queries.
- Qdrant. Standalone server, scales to millions of vectors, has its own filtering language. Pick when you're building a real product.
For this walkthrough, Chroma:
# install chroma in a python venv
pip install chromadb
Step 3: build a small RAG pipeline
A complete working script. Indexes a folder of markdown files and lets you query them:
# rag-demo.py
import glob

import chromadb
import requests

OLLAMA = "http://localhost:11434"
EMB_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.2:3b"

def embed(text):
    # Ask Ollama for the embedding vector of one piece of text.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": EMB_MODEL, "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def chunk(text, size=500):
    # Split text into consecutive chunks of `size` words.
    words = text.split()
    return [" ".join(words[i:i+size]) for i in range(0, len(words), size)]

def index(folder, collection):
    # Embed every chunk of every markdown file under `folder` and store it.
    for path in glob.glob(f"{folder}/**/*.md", recursive=True):
        with open(path, encoding="utf-8") as f:
            for i, c in enumerate(chunk(f.read())):
                collection.add(
                    documents=[c],
                    embeddings=[embed(c)],
                    metadatas=[{"source": path, "chunk": i}],
                    ids=[f"{path}-{i}"],
                )

def ask(q, collection):
    # Retrieve the 4 nearest chunks and pass them to the chat model as context.
    hits = collection.query(query_embeddings=[embed(q)], n_results=4)
    context = "\n\n".join(hits["documents"][0])
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {q}"
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": CHAT_MODEL, "prompt": prompt, "stream": False})
    r.raise_for_status()
    return r.json()["response"]

if __name__ == "__main__":
    client = chromadb.PersistentClient(path="./rag-store")
    coll = client.get_or_create_collection("notes")
    if coll.count() == 0:  # index only on the first run
        index("/path/to/your/notes", coll)
    print(ask("what did I write about kafka offset management?", coll))
About 30 lines for a working RAG pipeline. Replace /path/to/your/notes with your actual notes folder. The first run indexes the folder; later runs find a non-empty collection, skip indexing, and go straight to the query.
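One consequence of indexing only when the collection is empty is that notes added later are never picked up. A possible way around that, sketched here as an assumption rather than part of the script above, is to derive each chunk's id from its content and add only the ids the store does not already hold. The helper names chunk_id and new_chunks are hypothetical; the only Chroma call relied on is collection.get(ids=...), which returns a dict whose "ids" entry lists the ids it found.

```python
import hashlib

def chunk_id(path, text):
    """Stable id built from the file path and a hash of the chunk's content."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{path}-{digest}"

def new_chunks(path, chunks, collection):
    """Return (id, text) pairs for chunks that the collection does not hold yet."""
    ids = [chunk_id(path, c) for c in chunks]
    if not ids:
        return []
    existing = set(collection.get(ids=ids)["ids"])
    fresh, seen = [], set()
    for cid, text in zip(ids, chunks):
        # Skip chunks already stored and duplicates within the same file.
        if cid in existing or cid in seen:
            continue
        seen.add(cid)
        fresh.append((cid, text))
    return fresh
```

With this in place, index() could run on every start and call collection.add only for the pairs that new_chunks returns, so unchanged files cost no embedding calls. Note the limitation: chunks from text that was edited or deleted stay in the store until they are removed explicitly.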
Step 4: making it good
The 30-line version works. The version that consistently gives good answers needs more care:
- Chunk size. 500 words is okay for prose, terrible for code or tables. For code, chunk by function or by file. For documents with structure, chunk by section header.
- Chunk overlap. Add ~50-word overlap between adjacent chunks. Otherwise, an answer that straddles two chunks gets mangled.
- Top-k. I used n_results=4 above. For longer documents, try 6–8. More chunks = more context but more noise.
- Reranking. A second model that re-scores retrieved chunks for actual relevance. The retrieval model is good at "loosely related"; rerankers are good at "exactly answers the question." bge-reranker-base is a small open option.
- Metadata filters. If your notes have dates, tags, or sources, filter retrieval by metadata before embedding similarity. Massively improves precision.
A production-quality RAG pipeline is 1000+ lines of code dealing with these details. The 30-line version above is the starting point.
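As a starting point for the chunking advice above, here is a minimal sketch of two alternative chunkers: one that adds word overlap between adjacent chunks, and one that splits markdown at section headings. Both function names are hypothetical and the defaults simply mirror the numbers mentioned in the list; neither is tuned.

```python
import re

def chunk_with_overlap(text, size=500, overlap=50):
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

def split_by_heading(markdown):
    """Split markdown into sections, each starting at a line that begins with 1-6 '#' and a space."""
    parts = re.split(r"^(?=#{1,6} )", markdown, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]

sample = " ".join(str(i) for i in range(10))
print(chunk_with_overlap(sample, size=4, overlap=1))
print(split_by_heading("intro\n# A\ntext\n## B\nmore"))
```

Either function can stand in for chunk() in the script. The heading splitter is deliberately naive: a line starting with "# " inside a code block would also be treated as a heading, and very long sections would still need a second pass through a word-based chunker.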
Choosing an embedding model
Three open options worth knowing:
- nomic-embed-text v1.5. 137M parameters. Fast, small, works for most text. Apache 2.0. Default pick.
- bge-large-en-v1.5. 335M. Higher quality for English. Slightly slower. MIT.
- bge-m3. 569M. Multilingual (100+ languages including Hindi, Tamil, Bengali). Best for non-English content.
All three run in Ollama:
# pull bge multilingual for non-english content
ollama pull bge-m3
Switch between them by changing the EMB_MODEL variable in the script above.
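One caveat when switching: vectors from different embedding models are not comparable, and may not even have the same length, so a store indexed with one model has to be re-indexed for another. A simple way to keep them apart, sketched here as an assumption, is to put the model name into the collection name. The helper collection_for is hypothetical, and the character filter reflects my understanding of Chroma's naming rules (letters, digits, underscores, hyphens), which is worth checking against the version you install.

```python
import re

def collection_for(model, base="notes"):
    """Build a collection name that encodes the embedding model, e.g. 'notes-bge-m3'."""
    # Replace anything outside letters, digits, '_' and '-' (such as ':') with '-'.
    safe = re.sub(r"[^a-zA-Z0-9_-]+", "-", model).strip("-_")
    return f"{base}-{safe}"

print(collection_for("nomic-embed-text"))
print(collection_for("bge-m3:latest"))
```

In the script, client.get_or_create_collection(collection_for(EMB_MODEL)) would then give each model its own collection, and changing EMB_MODEL triggers a fresh index instead of mixing incompatible vectors.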
Common things that go wrong
"My RAG retrieves the wrong chunks." Check that you're using the same embedding model for indexing and for queries. Mismatch produces nonsense.
"The chat model isn't using the context." Tighten the prompt: explicitly tell it "answer only using the context below; if the answer isn't there, say so." Smaller local models drift back to their training data without strong steering.
"The model hallucinates citations." Add the source paths to the prompt, and ask for inline citations. Then verify them yourself: local 3B models will sometimes fabricate sources.
"Indexing is slow." Embedding 10,000 chunks at 137M parameters takes minutes, not seconds. Index once, store, reuse. Don't re-embed on every query.
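The prompt and citation fixes above can be sketched together. This is one possible shape, not the only one: build_prompt numbers each retrieved chunk and labels it with its source path, and invalid_citations flags any [n] in the answer that does not correspond to a chunk that was actually in the prompt. Both function names and the [n] citation format are assumptions made for this example.

```python
import re

def build_prompt(question, chunks, sources):
    """Number each chunk, label it with its source path, and ask for [n] citations."""
    blocks = [
        f"[{n}] (source: {src})\n{text}"
        for n, (text, src) in enumerate(zip(chunks, sources), start=1)
    ]
    context = "\n\n".join(blocks)
    return (
        "Answer only using the context below, and cite the passages you used as [n]. "
        "If the answer isn't there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def invalid_citations(answer, n_chunks):
    """Return cited numbers that do not match any chunk in the prompt."""
    cited = {int(m) for m in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= n_chunks)
```

In ask(), the chunks would come from hits["documents"][0] and the sources from the "source" field of each entry in hits["metadatas"][0]. A non-empty result from invalid_citations means the model cited something it was never shown. An empty result only means the numbers are in range; you still need to check that each cited chunk really supports the claim.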
Why local RAG matters
You can do RAG against OpenAI or Claude APIs. Why bother locally?
- Privacy. Your notes, your codebase, your medical records never leave your machine. The cloud RAG path sends every chunk that gets retrieved.
- Cost. Embedding 50,000 chunks via OpenAI costs around $5–10. Locally, it's free after the laptop. Indexing is the most token-heavy part of RAG; locally that cost vanishes.
- Speed at small scales. No network round trip. Querying a 50K-chunk store with a local embedding model is 50–100ms end-to-end.
- Offline. Your notes work in airplane mode.
For knowledge bases under a few million chunks, local RAG is honestly better than cloud RAG on most axes.
What's next
The model can answer using your data. The next piece is letting it take actions: function calling and tool use on open models. That's the bridge from "smart chat" to "agent."
From the dictionary
Terms used in this post
Quick reference for the 13 terms you met above. Each one comes from the AI dictionary.
- Claude (AI). Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent. E.g. this blog's create-post skill drafts inline using Claude.
- Context Window (NLP). The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
- Dataset (Data). The collection of examples a model learns from during training. The shape, size, quality, and bias of the dataset determine almost everything about the resulting model.
- Embedding (NLP). A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
- Large Language Model (AI). A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. E.g. Claude is an LLM: it reads your message as tokens and generates a response one token at a time.
- Model (ML). In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
- Ollama (AI). A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
- Parameters (ML). The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
- Prompt (NLP). The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
- RAG (NLP). Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLM's context window, then ask the question. The model's weights are unchanged; only the prompt is augmented.
- Token (NLP). The unit an LLM operates on, roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
- Training (ML). The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
- Vector Database (Data). A database optimised for storing and searching embeddings: finding the K nearest vectors to a query vector. Examples: Pinecone, Weaviate, pgvector. The retrieval engine in most RAG systems.