Embedding

NLP

Definition

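A list of numbers (a vector) that represents the meaning of a piece of text. Two pieces of text with similar meanings have embeddings that sit close together in the vector space, so "how similar are these texts?" becomes "how close are these vectors?". Embeddings are the basis of vector search and most modern retrieval.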
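To make "close together" concrete, here is a minimal sketch using cosine similarity, one common way to compare embeddings. The three-dimensional vectors are hand-made toy values chosen for illustration, not output from a real embedding model (real models produce hundreds of dimensions), and the helper names `cosine_similarity` and `nearest` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings": made-up numbers for illustration only.
toy_embeddings = {
    "how do I reset my password": [0.9, 0.1, 0.0],
    "I forgot my login credentials": [0.8, 0.2, 0.1],
    "best pizza toppings": [0.0, 0.1, 0.9],
}


def nearest(query, embeddings):
    """Return (text, score) for the stored text most similar to the query text."""
    q = embeddings[query]
    candidates = [
        (text, cosine_similarity(q, vec))
        for text, vec in embeddings.items()
        if text != query
    ]
    return max(candidates, key=lambda pair: pair[1])


if __name__ == "__main__":
    text, score = nearest("how do I reset my password", toy_embeddings)
    print(f"{text!r} (similarity {score:.2f})")
```

With these toy values, the two account-related sentences score far higher against each other than either does against the pizza sentence. That ranking step is what a vector search does at scale, usually with an index instead of a full scan.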

Posts that use this term

  • Troubleshooting local LLMs and keeping up

    The catalog of common local-LLM failures: OOM, slow tok/s, garbage output, instruction drift, RAG miss, tool-call hallucination. Plus where to follow the field as it moves.

  • Local agents and tool use

    Function calling on open models in 2026: which models actually work (Qwen 2.5, Hermes 3, Llama 14B+), why local agents fail when they fail, and how to build defensive scaffolding around them.

  • Local RAG and embeddings

    A complete local RAG pipeline in 30 lines: nomic-embed-text for embeddings, Chroma for the vector DB, Llama 3.2 for the chat model. Why local RAG often beats cloud RAG for personal knowledge bases.

  • Integrating a local LLM into your workflow

    Wire your local LLM into VS Code (Continue, Cline), web UIs (Open WebUI, LibreChat, Page Assist), and your own apps via the OpenAI-compatible API. The swap-cloud-for-local pattern in real codebases.

  • Every machine can run a local LLM (here's what fits)

    Per-tier guide: 8GB integrated graphics, 16GB MacBook Air, 8/12/16/24/32GB VRAM PCs, 24/32/64/128/192GB Macs. Specific models, specific tok/s, specific configs. Every tier runs something useful.

  • Picking a local model by task

    The 2026 open leaders by task: coding (Qwen 2.5 Coder, DeepSeek-Coder), chat (Llama, Qwen, Mistral), small-model renaissance (Phi-3, Gemma 2), structured output, multimodal, embeddings.

  • Streaming, throughput, and the KV cache

    TTFT vs tok/s, why streaming feels faster, and the KV cache that makes the 1000th token cost the same as the first. KV cache quantization (Q8/Q4 KV) and why it should be your default.

  • The local-LLM vocabulary

    Parameters, B, dense vs MoE, base vs instruct, tokens, context window, chat template, GGUF, quantization suffixes. After this post you can read any HuggingFace model card.

  • The runtimes: llama.cpp, Ollama, LM Studio

    llama.cpp is the engine; Ollama and LM Studio wrap it. What each does, when to pick which, and why the OpenAI-compatible APIs are mostly but not entirely interchangeable.

  • Prompt, RAG, fine-tune: three ways to shape a model

    Three levers for shaping what an LLM does: prompting (ask better), RAG (give it the right context), fine-tuning (change the weights). What each costs, what each fixes, and how to pick.

  • RAG: giving a model memory it doesn't have

    RAG is the pattern of fetching relevant text from a search system and putting it in the LLM's context window before asking your question. Not magic, not fine-tuning — just better prompts.