← Blog

Prompt Caching

AI

/dictionary/prompt-caching

Definition

Caching the model state for a stable prefix of a prompt so repeat calls skip recomputing it. Anthropic and OpenAI both expose this via API; cached tokens cost 5-10x less and have a 5-minute TTL on Anthropic. Critical for cost when you reuse system prompts or RAG context across requests.

Related terms

Posts that use this term

  • Streaming, throughput, and the KV cache

    TTFT vs tok/s, why streaming feels faster, and the KV cache that makes the 1000th token cost the same as the first. KV cache quantization (Q8/Q4 KV) and why it should be your default.

  • LLM APIs and the economics of tokens

    How input vs output tokens are priced, why output is 5-6x more, what prompt caching saves you (10x), and the hidden costs (tokenizer drift, reasoning tokens, tool-call loops) that surprise people.

  • The runtimes: llama.cpp, Ollama, LM Studio

    llama.cpp is the engine; Ollama and LM Studio wrap it. What each does, when to pick which, and why the OpenAI-compatible APIs are mostly but not entirely interchangeable.

  • Where AI actually runs: cloud, local, edge

    Where the model file actually sits when you use AI: a datacenter GPU (cloud), your own machine (local), or the device's silicon (edge). The trade-offs and how to pick.

  • Install the Anthropic SDK

    Install the Anthropic SDK for Python and Node, configure your API key, and verify with a one-line messages.create call to Claude.