Attention

/dictionary/attention

Definition

The mechanism inside a transformer that lets the model look at every previous token in the sequence and weigh its relevance. Attention scales quadratically with sequence length, which is why long context is expensive.

Posts that use this term

Streaming, throughput, and the KV cache
Why TTFT and tok/s are different numbers, why streaming feels faster than it is, and the KV cache that makes the 1000th token cost about the same as the first.
Quantization, distillation, pruning: how a 140GB model fits on your laptop
Three ways to shrink an LLM, and why one of them does almost all the work. What Q4_K_M actually means and what each shortcut costs you.
The major LLMs in 2026
A field guide to the closed frontier models and the open weights you can actually run. What the "B" numbers mean, and which size fits your machine.
The context window, and why models hallucinate
An LLM only sees a fixed-size slice of text at a time. When it doesn't know something, it predicts anyway. That's a hallucination, not a bug.
From models to LLMs
An LLM is one kind of ML model, trained on text, predicts the next token. That single trick at scale gets you ChatGPT, and also explains where it breaks.