Attention

DL

Definition

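The mechanism inside a transformer that lets the model look back at every previous token in the sequence and weigh how relevant each one is to the token it is currently processing. Each token is scored against all the tokens before it, so the cost of attention grows quadratically with sequence length, which is why long context is expensive.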
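To make the definition concrete, here is a minimal sketch of single-head scaled dot-product attention with a causal mask, written in plain Python. It is an illustration under assumptions, not how any particular model implements it: the names `causal_attention` and `score_count` are made up for this example, each token also attends to itself (the usual convention in decoder-only models), and real models add learned projections, multiple heads, and batched matrix math.

```python
import math

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: lists of equal length n, each item a list of floats.
    Returns (outputs, weights), where weights[i] has i + 1 entries because
    position i only attends to positions 0..i.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        # Score the current token against itself and every earlier token.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in range(i + 1)]
        # Softmax turns the scores into relevance weights that sum to 1.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        # The output is the relevance-weighted average of the value vectors.
        out = [sum(w[j] * values[j][k] for j in range(i + 1))
               for k in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights

def score_count(n):
    """Number of query-key scores computed for a sequence of n tokens."""
    return n * (n + 1) // 2

# Toy input: three tokens with two-dimensional vectors (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(x, x, x)
print([len(w) for w in weights])              # [1, 2, 3]
print(score_count(1000), score_count(2000))   # 500500 2001000
```

Doubling the sequence from 1,000 to 2,000 tokens roughly quadruples the number of scores, which is the quadratic growth the definition refers to.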

Posts that use this term

  • Streaming, throughput, and the KV cache

    TTFT vs tok/s, why streaming feels faster, and the KV cache that makes the 1000th token cost roughly the same as the first. KV cache quantization (Q8/Q4 KV) and why it should be your default.

  • Quantization, distillation, pruning: making models fit

    Three ways to shrink an LLM. Quantization (Q2-Q8 with K-quants in GGUF), distillation (teacher to student), pruning. Why Q4_K_M is the community default and what each lever costs.

  • The major LLMs in 2026

    A tour of the closed frontier models (Claude, GPT, Gemini) and the open weights (Llama, Qwen, DeepSeek, Mistral). What 'B' means, what each is good at, and which size to actually run.

  • The context window, and why models hallucinate

    An LLM only sees a fixed-size slice of text at a time. When it doesn't know something, it predicts anyway — that's a hallucination, not a bug.

  • From models to LLMs

    An LLM is one kind of ML model — trained on text, predicts the next token. That single trick at scale gets you ChatGPT, and also explains where it breaks.