Prompt Caching

/dictionary/prompt-caching

Definition

Caching the model state for a stable prefix of a prompt so repeat calls skip recomputing it. Anthropic and OpenAI both expose this via API; cached tokens cost 5-10x less and have a 5-minute TTL on Anthropic. Critical for cost when you reuse system prompts or RAG context across requests.

Related terms

Posts that use this term

Streaming, throughput, and the KV cache
Why TTFT and tok/s are different numbers, why streaming feels faster than it is, and the KV cache that makes the 1000th token cost about the same as the first.
LLM API bills, and why a token costs what it costs
How input and output tokens get priced, why output runs 5-6x more, and how prompt caching cuts the input bill by 10x. Plus the hidden costs that ambush people.
The runtimes: llama.cpp, Ollama, LM Studio
llama.cpp is the engine. Ollama and LM Studio wrap it. What each one does, when to reach for which, and why the OpenAI-compatible APIs are mostly but not entirely interchangeable.
Where AI actually runs: cloud, local, edge
When you use AI, a model file is sitting on a real machine. There are only three places it can be, and which one decides almost everything else.
Install the Anthropic SDK
Install the official Claude SDK for Python and Node, set your API key the safe way, and prove it works with a one-line call.