Prompt Caching
AI/dictionary/prompt-caching
Definition
Caching the model state for a stable prefix of a prompt so repeat calls skip recomputing it. Anthropic and OpenAI both expose this via API; cached tokens cost 5-10x less and have a 5-minute TTL on Anthropic. Critical for cost when you reuse system prompts or RAG context across requests.
Related terms
Posts that use this term
- Streaming, throughput, and the KV cache
TTFT vs tok/s, why streaming feels faster, and the KV cache that makes the 1000th token cost the same as the first. KV cache quantization (Q8/Q4 KV) and why it should be your default.
- LLM APIs and the economics of tokens
How input vs output tokens are priced, why output is 5-6x more, what prompt caching saves you (10x), and the hidden costs (tokenizer drift, reasoning tokens, tool-call loops) that surprise people.
- The runtimes: llama.cpp, Ollama, LM Studio
llama.cpp is the engine; Ollama and LM Studio wrap it. What each does, when to pick which, and why the OpenAI-compatible APIs are mostly but not entirely interchangeable.
- Where AI actually runs: cloud, local, edge
Where the model file actually sits when you use AI: a datacenter GPU (cloud), your own machine (local), or the device's silicon (edge). The trade-offs and how to pick.
- Install the Anthropic SDK
Install the Anthropic SDK for Python and Node, configure your API key, and verify with a one-line messages.create call to Claude.