Streaming, throughput, and the KV cache
Time to first token (TTFT) vs tokens per second (tok/s), why streaming feels faster, and the KV cache that makes the 1000th token cost roughly the same as the first. KV cache quantization (Q8/Q4 KV) and why it should be your default.
Blog
Posts on AI engineering, LLM systems, and software development.
Where the model file actually sits when you use AI: a datacenter GPU (cloud), your own machine (local), or the device's silicon (edge). The trade-offs and how to pick.
Training is the expensive one-time event where a model's numbers get tuned. Inference is the cheap repeated use afterwards. The gap in cost is enormous, and it shapes the whole industry.