Weights

ML

Definition

The numbers inside a trained model. They start out random and get adjusted during training until they encode the patterns in the data. "Open weights" means the trained numbers are downloadable; it does not mean the training data or code is open.
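
To make "start out random and get adjusted" concrete, here is a minimal sketch in plain Python. It is a toy, not how any real LLM is trained: the model has just two weights (`w` and `b`), the training data is invented points on the line y = 2x + 1, and the function name, learning rate, and step count are arbitrary choices made for illustration.

```python
import random


def train_weights(steps=2000, lr=0.01, seed=0):
    """Fit y = w * x + b to toy data by gradient descent.

    Returns the initial (random) weights and the trained weights.
    """
    rng = random.Random(seed)

    # The weights start out as random numbers.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    initial = (w, b)

    # Invented training data: points on the line y = 2x + 1.
    data = [(x, 2 * x + 1) for x in range(-5, 6)]

    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            err = (w * x + b) - y  # how wrong the current weights are
            grad_w += 2 * err * x / len(data)
            grad_b += 2 * err / len(data)
        # Training: nudge each weight in the direction that reduces the error.
        w -= lr * grad_w
        b -= lr * grad_b

    return initial, (w, b)


initial, trained = train_weights()
print("initial weights:", initial)
print("trained weights:", trained)
```

After training, the two numbers sit very close to 2 and 1: the pattern in the data now lives in the weights. An LLM works on the same principle, only with billions of weights, and an "open weights" release is a download of those trained numbers.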

Posts that use this term

  • Troubleshooting local LLMs and keeping up

    The catalog of common local-LLM failures: OOM, slow tok/s, garbage output, instruction drift, RAG miss, tool-call hallucination. Plus where to follow the field as it moves.

  • Fine-tuning a model locally

    When fine-tuning is the right answer (rarely) and how to do it on consumer hardware: LoRA, QLoRA, MLX-LM, Unsloth. A worked example fine-tuning Llama 3.2 3B on a 16GB Mac.

  • Local agents and tool use

    Function calling on open models in 2026: which models actually work (Qwen 2.5, Hermes 3, Llama 14B+), why local agents fail when they fail, and how to build defensive scaffolding around them.

  • Picking a local model by task

    The 2026 open leaders by task: coding (Qwen 2.5 Coder, DeepSeek-Coder), chat (Llama, Qwen, Mistral), small-model renaissance (Phi-3, Gemma 2), structured output, multimodal, embeddings.

  • Streaming, throughput, and the KV cache

    TTFT vs tok/s, why streaming feels faster, and the KV cache that makes the 1000th token cost roughly the same as the first. KV cache quantization (Q8/Q4 KV) and why it should be your default.

  • Quantization, distillation, pruning: making models fit

    Three ways to shrink an LLM. Quantization (Q2-Q8 with K-quants in GGUF), distillation (teacher to student), pruning. Why Q4_K_M is the community default and what each lever costs.

  • The local-LLM vocabulary

    Parameters, B, dense vs MoE, base vs instruct, tokens, context window, chat template, GGUF, quantization suffixes. After this post you can read any HuggingFace model card.

  • What leaves your machine when you use AI

    What providers actually see, log, and retain when you call an LLM API in 2026. What 'we don't train on your data' really means, free vs paid tier differences, and when local is the only safe option.

  • The runtimes: llama.cpp, Ollama, LM Studio

    llama.cpp is the engine; Ollama and LM Studio wrap it. What each does, when to pick which, and why the OpenAI-compatible APIs are mostly but not entirely interchangeable.

  • Why Apple Silicon punches above its weight on local LLMs

    Unified memory means the GPU sees all of RAM. Why that beats discrete-GPU PCs above 32B parameters, what fits in 16/32/64/128/192GB, and where Apple Silicon still loses.

  • What it takes to run a model on your machine

    Why VRAM is the hard ceiling on local LLMs, what quantization actually does to a model file, and the practical hardware ladder from 8GB laptops to 192GB workstations.

  • The major LLMs in 2026

    A tour of the closed frontier models (Claude, GPT, Gemini) and the open-weight models (Llama, Qwen, DeepSeek, Mistral). What 'B' means, what each is good at, and which size to actually run.

  • Where AI actually runs: cloud, local, edge

    Where the model file actually sits when you use AI: a datacenter GPU (cloud), your own machine (local), or the device's silicon (edge). The trade-offs and how to pick.

  • Prompt, RAG, fine-tune: three ways to shape a model

    Three levers for shaping what an LLM does: prompting (ask better), RAG (give it the right context), fine-tuning (change the weights). What each costs, what each fixes, and how to pick.

  • RAG: giving a model memory it doesn't have

    RAG is the pattern of fetching relevant text from a search system and putting it in the LLM's context window before asking your question. Not magic, not fine-tuning — just better prompts.

  • The context window, and why models hallucinate

    An LLM only sees a fixed-size slice of text at a time. When it doesn't know something, it predicts anyway — that's a hallucination, not a bug.

  • How a model learns: training and inference

    Training is the expensive one-time event where a model's numbers get tuned. Inference is the cheap repeated use afterwards. The gap in cost is enormous, and it shapes the whole industry.

  • What makes a model: data and algorithm

    A model is a file of learned numbers, produced by running an algorithm over data. Both ingredients matter, but bad data will sink even a good algorithm.

  • Install LM Studio

    Install LM Studio on macOS, Linux, and Windows. The fastest GUI for running local LLMs — no terminal needed. Includes the local server for OpenAI-compatible API access.

  • Install llama.cpp

    Build llama.cpp from source with Metal or CUDA acceleration. Run a GGUF model with llama-cli. The closest thing to bare-metal local inference.

  • Install Ollama

    Install Ollama on macOS, Linux, and Windows. Pull your first model, run it locally, and verify with ollama list. The fastest path to a local LLM.