May 5, 2026 · 5 min read

Local RAG and embeddings

Build a working local RAG pipeline in about 30 lines using nomic-embed-text, Chroma, and Llama 3.2, and see why running it on your own machine beats the cloud for personal notes.

Your local LLM has never seen your notes folder. It doesn't know your codebase, your company wiki, or what landed in your inbox last year. Retrieval-Augmented Generation is how you fix that, and you can do the whole thing on your own machine in an afternoon.

This is post 10 of 13 in the Local LLMs series. By the end you'll have a working "ask my notes folder a question" stack with nothing touching the cloud.

The two ideas behind RAG

Start with the limit. The chat model has a context window, usually somewhere between 8K and 128K tokens. Anything outside that window may as well not exist. So you can't just paste a 200-page document and ask questions about it. It won't fit.

RAG splits the problem in two.

Embed. A small model (the "embedding model") turns each chunk of your document into a vector, a list of maybe 768 numbers. Two chunks that mean similar things land on similar vectors.
Retrieve. When a question comes in, embed it into the same kind of vector. Find the document chunks whose vectors sit closest. Drop those chunks into the chat model's prompt as context, and let it answer from them.

That's the whole trick. The chat model keeps doing its usual job, answering with whatever is in its context. You're just making sure the right stuff is in there.
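To make "closest vectors" concrete, here is a minimal sketch of the retrieve step using cosine similarity on hand-made three-number vectors. The vectors, the chunk texts, and the helper names `cosine_similarity` and `top_k` are made up for illustration; a real embedding model produces the vectors, and a vector database may use a different distance measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hand-made 3-number "embeddings", for illustration only.
chunks = ["kafka offsets", "kafka consumer groups", "sourdough starter"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.0, 0.1, 0.9]]
query_vec = [1.0, 0.2, 0.0]  # pretend this is the embedded question

best = top_k(query_vec, chunk_vecs, k=2)
print([chunks[i] for i in best])  # the two kafka chunks, not the sourdough one
```

The chunks at those indices are what gets pasted into the chat model's prompt.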

The all-local stack

Three pieces, all running on your machine.

Embedding model. nomic-embed-text (137M parameters, ~280 MB at FP16). Runs in Ollama. Apache 2.0.
Vector database. Stores the embeddings and answers nearest-neighbor queries. Plenty of options here; we'll use Chroma because it's the simplest.
Chat model. Llama 3.2 3B, Qwen 2.5 7B, or whatever you settled on back in post 5.

Disk footprint is around 3 GB total. Memory at runtime is around 6 GB. That runs comfortably on every tier from post 7 that clears the 8 GB floor.

Step 1: pull the embedding model

# pull the embeddings model

ollama pull nomic-embed-text

Check it works:

# request an embedding for a sentence

curl http://localhost:11434/api/embeddings -d '{"model":"nomic-embed-text","prompt":"local llms are fast"}'

Back comes a JSON response with an embedding field holding 768 numbers. That's the vector for that sentence.

Step 2: pick a vector database

Three reasonable picks.

Chroma. Pure Python, embedded, no separate server to run. The simplest of the three. Use it for your first projects.
LanceDB. Also embedded, very fast, columnar storage on disk. Reach for it when your dataset gets bigger or you want SQL-like queries.
Qdrant. A standalone server that scales to millions of vectors and has its own filtering language. This is the one when you're building a real product.

For this walkthrough, Chroma:

# install chroma in a python venv

pip install chromadb

Step 3: build a small RAG pipeline

Here's a complete working script. It indexes a folder of markdown files and lets you query them:

# rag-demo.py

import glob
import requests
import chromadb

OLLAMA = "http://localhost:11434"
EMB_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.2:3b"

def embed(text):
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": EMB_MODEL, "prompt": text})
    r.raise_for_status()  # fail loudly if Ollama is down or the model isn't pulled
    return r.json()["embedding"]

def chunk(text, size=500):
    words = text.split()
    return [" ".join(words[i:i+size]) for i in range(0, len(words), size)]

def index(folder, collection):
    for path in glob.glob(f"{folder}/**/*.md", recursive=True):
        with open(path, encoding="utf-8") as f:
            for i, c in enumerate(chunk(f.read())):
                collection.add(
                    documents=[c],
                    embeddings=[embed(c)],
                    metadatas=[{"source": path, "chunk": i}],
                    ids=[f"{path}-{i}"],
                )

def ask(q, collection):
    hits = collection.query(query_embeddings=[embed(q)], n_results=4)
    context = "\n\n".join(hits["documents"][0])
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {q}"
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": CHAT_MODEL, "prompt": prompt, "stream": False})
    r.raise_for_status()  # surface HTTP errors instead of a confusing KeyError below
    return r.json()["response"]

if __name__ == "__main__":
    client = chromadb.PersistentClient(path="./rag-store")
    coll = client.get_or_create_collection("notes")
    if coll.count() == 0:
        index("/path/to/your/notes", coll)
    print(ask("what did I write about kafka offset management?", coll))

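That's about 30 lines for a working RAG pipeline. Swap /path/to/your/notes for your real notes folder. The first run indexes everything; every run after that skips indexing and goes straight to the query. Note that the script only indexes when the collection is empty, so delete ./rag-store to pick up new or changed notes.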

Step 4: making it good

The 30-line version works. The version that gives good answers every single time takes more care.

Chunk size. 500 words is fine for prose and awful for code or tables. For code, chunk by function or by file. For structured docs, chunk by section header.
Chunk overlap. Add a ~50-word overlap between neighboring chunks. Without it, an answer that straddles two chunks gets sliced in half.
Top-k. I used n_results=4 above. On longer documents, try 6 to 8. More chunks means more context, but also more noise.
Reranking. A second model re-scores the retrieved chunks for real relevance. The retrieval model is good at "loosely related." Rerankers are good at "actually answers the question." bge-reranker-base is a small open option.
Metadata filters. If your notes carry dates, tags, or sources, filter on metadata before you ever compare embeddings. It improves precision a lot.
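The first two tips can be sketched as drop-in replacements for the `chunk` function in the script. This is one possible approach, not the only one: `chunk_with_overlap` and `chunk_by_heading` are hypothetical names, the 500/50 defaults simply mirror the numbers above, and the heading splitter assumes markdown-style `#` headings.

```python
import re

def chunk_with_overlap(text, size=500, overlap=50):
    """Split text into windows of `size` words; each shares `overlap` words with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

def chunk_by_heading(markdown):
    """Split a markdown document into sections, one per '#' heading."""
    # Zero-width split at the start of every line that begins with 1-6 '#' and a space.
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [s.strip() for s in sections if s.strip()]
```

The heading splitter is deliberately naive: a line starting with `# ` inside a code sample will also start a new section, so notes with a lot of embedded code need something smarter.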

A production-quality RAG pipeline is 1000+ lines once you've dealt with all of this. The 30-line version is where you start, not where you stop.

Choosing an embedding model

Three open options worth knowing.

nomic-embed-text v1.5. 137M parameters. Fast, small, fine for most text. Apache 2.0. The default pick.
bge-large-en-v1.5. 335M. Higher quality on English, slightly slower. MIT.
bge-m3. 569M. Multilingual, 100+ languages including Hindi, Tamil, and Bengali. The one for non-English content.

All three run in Ollama:

# pull bge multilingual for non-english content

ollama pull bge-m3

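To switch, change the EMB_MODEL variable in the script above, then delete ./rag-store and re-index. Vectors from different models are not comparable, and bge-m3 produces 1024-dimensional vectors rather than 768, so an index built with one model can't be queried with another.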

Common things that go wrong

"My RAG retrieves the wrong chunks." Make sure you used the same embedding model for indexing and for queries. A mismatch produces pure nonsense.
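One way to make a mismatch hard to hit is to bake the embedding model's name into the collection name, so each model gets its own index. A small sketch follows; `collection_for` is a hypothetical helper, and the sanitizing is deliberately conservative because Chroma restricts which characters a collection name may contain.

```python
import re

def collection_for(base, emb_model):
    """Build a collection name that encodes the embedding model used to fill it."""
    # Keep letters, digits, '_' and '-'; replace anything else (':' or '.') with '-'.
    safe_model = re.sub(r"[^A-Za-z0-9_-]+", "-", emb_model).strip("-_")
    return f"{base}-{safe_model}"

print(collection_for("notes", "nomic-embed-text"))
print(collection_for("notes", "bge-m3"))
```

In the script, that would mean calling `client.get_or_create_collection(collection_for("notes", EMB_MODEL))`. Switching models then starts from an empty collection instead of mixing vectors.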

"The chat model isn't using the context." Tighten the prompt. Tell it plainly: answer only from the context below, and if the answer isn't there, say so. Smaller local models wander back to their training data when you don't steer them hard.

"The model hallucinates citations." Put the source paths in the prompt and ask for inline citations. Then check them yourself, because local 3B models will happily invent a source now and then.
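The last two fixes both come down to how the prompt is built. Here is a sketch of a stricter prompt builder that numbers each chunk, tags it with its source path, and tells the model what to do when the answer is missing. `build_prompt` is a hypothetical helper and the exact wording of the instructions is only a starting point; tune it for your model.

```python
def build_prompt(question, documents, metadatas):
    """Assemble a grounded prompt: numbered chunks, each tagged with its source path."""
    blocks = []
    for n, (doc, meta) in enumerate(zip(documents, metadatas), start=1):
        blocks.append(f"[{n}] (source: {meta['source']})\n{doc}")
    context = "\n\n".join(blocks)
    return (
        "Answer using only the numbered context below. "
        "Cite the chunk numbers you used, like [1]. "
        "If the answer is not in the context, reply exactly: I don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

In the script's `ask` function, the inputs would be `hits["documents"][0]` and `hits["metadatas"][0]`. Because every citation is a number that maps back to a path you supplied, an invented source is easy to spot.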

"Indexing is slow." Embedding 10,000 chunks at 137M parameters takes minutes, not seconds. Index once, store it, reuse it. Don't re-embed on every query.
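"Index once, reuse it" can be extended to "only embed what changed". The sketch below derives each chunk's ID from its content, so unchanged chunks keep their ID and can be skipped on the next indexing run. `chunk_id` and `chunks_to_embed` are hypothetical helpers, and `known_ids` is assumed to be the set of IDs already in your store.

```python
import hashlib

def chunk_id(path, text):
    """Stable ID derived from the file path and the chunk's content."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    return f"{path}-{digest}"

def chunks_to_embed(path, chunks, known_ids):
    """Return (id, text) pairs for chunks whose ID is not already stored."""
    seen = set(known_ids)
    todo = []
    for text in chunks:
        cid = chunk_id(path, text)
        if cid not in seen:
            todo.append((cid, text))
            seen.add(cid)  # also skips duplicate chunks within the same file
    return todo
```

Only the pairs this returns need a call to `embed`. The sketch does not remove stale chunks when a note is edited or deleted; that cleanup is a separate step.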

Why local RAG matters

You can run RAG against the OpenAI or Claude APIs instead. So why do it locally?

Privacy. Your notes, your codebase, your medical records. None of it leaves the machine. The cloud path ships every chunk that gets retrieved.
Cost. Embedding 50,000 chunks through OpenAI runs about $5 to $10. Locally it's free once you own the laptop. Indexing is the most token-heavy part of RAG, and locally that bill just disappears.
Speed at small scales. No network round trip. Embedding the query and searching a 50K-chunk store locally takes roughly 50 to 100 ms, before the chat model starts generating.
Offline. Your notes still answer in airplane mode.

For knowledge bases under a few million chunks, local RAG honestly beats cloud RAG on most of the axes that matter.

What's next

Now the model can answer using your own data. The next piece is letting it do things: function calling and tool use on open models. That's the step from a smart chat box to an actual agent.


From the dictionary

Terms used in this post

Quick reference for the 13 terms you met above. Each one comes from the AI dictionary.

Claude (AI): Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent. Example: this blog's create-post skill drafts inline using Claude.
Context Window (NLP): The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window; anything outside is invisible.
Dataset (Data): The collection of examples a model learns from during training. The shape, size, quality, and bias of the dataset determine almost everything about the resulting model.
Embedding (NLP): A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings close together in space. The basis of vector search and most modern retrieval.
Large Language Model (AI): A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window. Example: Claude is an LLM; it reads your message as tokens and generates a response one token at a time.
Model (ML): In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
Ollama (AI): A wrapper around llama.cpp that makes running local LLMs a one-command operation. Pulls quantized GGUF models from a registry, exposes an HTTP API on localhost:11434, and handles model loading/unloading. The most common on-ramp to local inference in 2026.
Parameters (ML): The individual learned numbers inside a model. "7B parameters" means 7 billion of them. More parameters generally means more capacity, more memory needed, and slower inference.
Prompt (NLP): The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
RAG (NLP): Retrieval-Augmented Generation: search your corpus for relevant text, paste it into the LLM's context window, then ask the question. The model's weights are unchanged; only the prompt is augmented.
Token (NLP): The unit an LLM operates on, roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.
Training (ML): The expensive one-time process of running a learning algorithm over data until the model's parameters settle into useful values. Frontier-model training costs $100M+ and tens of thousands of GPUs.
Vector Database (Data): A database optimised for storing and searching embeddings, that is, finding the K nearest vectors to a query vector. Examples: Pinecone, Weaviate, pgvector. The retrieval engine in most RAG systems.

