Dataset

Data

Definition

The collection of examples a model learns from during training. The shape, size, quality, and bias of the dataset determine almost everything about the resulting model.
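
To make "size, quality, and bias" a little more concrete, here is a minimal sketch that summarizes a toy dataset. The examples, the `summarize` helper, and the choice of checks (duplicates, empty texts, label share) are all assumptions made up for illustration; real datasets need much richer checks than this.

```python
from collections import Counter

# Hypothetical toy dataset: (text, label) pairs invented for illustration.
examples = [
    ("great battery life", "positive"),
    ("screen cracked in a week", "negative"),
    ("great battery life", "positive"),   # exact duplicate of the first row
    ("works as described", "positive"),
    ("", "negative"),                      # empty text, a quality problem
]

def summarize(dataset):
    """Return basic size, quality, and balance figures for (text, label) pairs."""
    labels = Counter(label for _, label in dataset)
    unique = set(dataset)
    return {
        # Size: how many examples there are.
        "size": len(dataset),
        # Quality: exact duplicates and empty inputs.
        "duplicates": len(dataset) - len(unique),
        "empty_texts": sum(1 for text, _ in dataset if not text.strip()),
        # Bias (one crude proxy): how unevenly the labels are distributed.
        "label_share": {k: v / len(dataset) for k, v in labels.items()},
    }

summary = summarize(examples)
print(summary)
```

A model trained on this toy set would see "positive" more often than "negative", which is the kind of skew that tends to show up later in the model's behavior.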

Posts that use this term

  • Fine-tuning a model locally

    When fine-tuning is the right answer (rarely) and how to do it on consumer hardware: LoRA, QLoRA, MLX-LM, Unsloth. A worked example fine-tuning Llama 3.2 3B on a 16GB Mac.

  • Local RAG and embeddings

    A complete local RAG pipeline in 30 lines: nomic-embed-text for embeddings, Chroma for the vector DB, Llama 3.2 for the chat model. Why local RAG often beats cloud RAG for personal knowledge bases.

  • Picking a local model by task

    The 2026 open leaders by task: coding (Qwen 2.5 Coder, DeepSeek-Coder), chat (Llama, Qwen, Mistral), small-model renaissance (Phi-3, Gemma 2), structured output, multimodal, embeddings.

  • Quantization, distillation, pruning: making models fit

    Three ways to shrink an LLM. Quantization (Q2-Q8 with K-quants in GGUF), distillation (teacher to student), pruning. Why Q4_K_M is the community default and what each lever costs.

  • The context window, and why models hallucinate

    An LLM only sees a fixed-size slice of text at a time. When it doesn't know something, it predicts anyway — that's a hallucination, not a bug.

  • How a model learns: training and inference

    Training is the expensive one-time event where a model's numbers get tuned. Inference is the cheap repeated use afterwards. The gap in cost is enormous, and it shapes the whole industry.

  • What makes a model: data and algorithm

    A model is a file of learned numbers, produced by running an algorithm over data. Both ingredients matter, but bad data will undermine even a good algorithm.

  • Inside AI: machine learning and deep learning

    Open the AI umbrella. Machine learning is the part that learns from data. Deep learning is ML done with neural networks — and that's where today's models live.