6 min read

LLM APIs and the economics of tokens

How input vs output tokens are priced, why output is 5-6x more, what prompt caching saves you (10x), and the hidden costs (tokenizer drift, reasoning tokens, tool-call loops) that surprise people.

LLM APIs and the economics of tokens

An LLM API bill is one number multiplied across two prices: tokens in, tokens out. Get used to that and the bill stops being a surprise. Miss it, and you'll get a $4,000 invoice for a feature you assumed cost twenty bucks.

This is post 6 of 7 in the Running series. Previous posts covered the local-runtime side. This one is the cloud-API side: how token billing actually works, why output is more expensive than input, and how prompt caching changes the math by a factor of ten.

What a token actually is

A token is a chunk of text the model processes as one unit. Roughly:

  • 1 token ≈ 4 characters of English.
  • 1 token ≈ 0.75 of an English word.
  • 1,000 tokens ≈ 750 words ≈ a page of text.
  • 1M tokens ≈ a 750-page book.

The exact mapping is the model's tokenizer (BPE, SentencePiece, etc.). Different models tokenize the same text differently. Claude Opus 4.7 ships with a new tokenizer that produces up to 35% more tokens than 4.6 for the same input, meaning the same prompt costs 35% more even though the per-token price didn't change. Always compare like-for-like on the same tokenizer.

For practical purposes:

  • A short chat message is 10–50 tokens.
  • A typical prompt with system instructions is 500–2,000 tokens.
  • A document you're summarizing is usually 5,000–50,000 tokens.
  • A full codebase loaded into context is 100,000–1,000,000 tokens.

Input vs output: why output costs more

Every API charges different prices for input and output. Output is always more expensive. This catches everyone the first time.

A representative table for 2026, per million tokens:

ModelInputOutputOutput multiple
Claude Opus 4.7$5$255x
Claude Sonnet 4.6$3$155x
Claude Haiku 4.5$0.80$45x
GPT-5.5$5$306x
Gemini 3.1 Pro$4$205x
DeepSeek V4$0.30$1.204x

Output is more expensive because of how inference works. Input tokens get processed in a single forward pass through the model: the GPU does it all at once, in parallel, in milliseconds. Output tokens are generated one at a time, autoregressively. Each output token requires a fresh forward pass over the entire context. A 1,000-token output runs the model 1,000 times.

The 5x or 6x ratio reflects that asymmetry, not arbitrary vendor pricing.

What this means for your bills

Two patterns surprise people:

Pattern 1: long context, short answers. You stuff 50,000 tokens of documentation into the prompt and ask for a one-line answer. Looks expensive (50K input) but is actually cheap. 50,000 input tokens at $5/M = $0.25. The 50-token answer is negligible.

Pattern 2: short prompts, long answers. You ask a model to write a 5,000-token report from a 200-token brief. The math flips. 200 input + 5,000 output at $5/$25 per M = $0.001 + $0.125 = $0.126. The output dominates. Across 1,000 reports, that's $126.

Code generation, document drafting, agentic loops, all output-heavy. Q&A over a knowledge base is input-heavy. Watch which mode your workload is in.

Prompt caching: the discount nobody applies

Prompt caching savings

The biggest cost lever in 2026 isn't picking a cheaper model. It's caching the parts of your prompt that don't change.

Most production prompts have a long stable header (the system prompt, tool definitions, retrieved documents, few-shot examples), followed by the user's actual message. The header might be 5,000 tokens that never changes between calls. Without caching, you pay full input price for those 5,000 tokens every single call. With caching, you pay full price the first time and 10% of input price on every subsequent call within the cache window (5 minutes by default on Anthropic, longer on others).

The numbers, on Claude Opus 4.7:

  • Without cache: 5,000-token header at $5/M = $0.025 per call. Across 10,000 calls: $250.
  • With cache: $0.025 once, then 5,000 tokens × $0.50/M = $0.0025 per call. Across 10,000 calls: $25 + $0.025 = $25.

That's a 10x discount on the input portion of your bill. Most production AI apps could shave 50–80% off their total spend by enabling caching on their system prompt. Most don't, because they wrote the integration before caching existed and never went back.

If you're building anything that hits an LLM API more than 100 times a day with a stable header, caching is the first optimization. Not a fancier prompt. Not a smaller model. Caching.

A worked example

A code-review bot. For every PR, it gets:

  • A 4,000-token system prompt (rules, style guide, examples). Stable.
  • A 2,000-token diff. Changes per call.
  • It outputs a 1,500-token review.

100 reviews a day, 30 days a month, on Sonnet 4.6 ($3 input, $15 output, $0.30 cached input).

Without caching:

  • Input: (4,000 + 2,000) × 100 × 30 × $3/M = $54.
  • Output: 1,500 × 100 × 30 × $15/M = $67.50.
  • Total: $121.50/month.

With caching:

  • Cached input: 4,000 × 100 × 30 × $0.30/M = $3.60.
  • Fresh input: 2,000 × 100 × 30 × $3/M = $18.
  • Output: $67.50 (unchanged).
  • Total: $89.10/month.

A 27% cut from one configuration change. The percentage gets bigger as the cached portion grows.

Hidden costs

Three things that don't show up in the headline price:

  • Tokenizer drift. As mentioned, Opus 4.7's tokenizer produces ~35% more tokens than 4.6 on the same English input. Claims of "Hindi/Devanagari is more expensive" are real: non-Latin scripts often tokenize 2–3x worse than English on the same model.
  • Reasoning tokens. Reasoning-mode models (GPT-5.5 high, Claude with extended thinking) generate hidden chain-of-thought tokens before the final answer. Those tokens are billed as output. A "high-reasoning" answer can cost 10x what a normal one does.
  • Tool calls. When the model calls a tool, the tool's response goes back into context as input on the next call. An agent doing 20 tool calls in a loop can rack up 50,000+ input tokens for what looked like one user request.

Always log your actual token counts in production. The bill is downstream of those numbers, and the numbers are usually higher than your back-of-envelope estimate.

Picking a model on cost

A working heuristic for 2026:

  • High-volume, low-stakes work (classification, summarization, simple chat): Haiku 4.5, GPT-5-mini, or DeepSeek V4. Below $1/M output. Bills stay small.
  • Daily driver, real work (coding, drafting, agentic loops): Sonnet 4.6 or GPT-5.4. Around $15/M output. Best price-per-intelligence ratio.
  • Frontier reasoning, hard problems: Opus 4.7 or GPT-5.5 high. $25–30/M output. Pay it when it matters; don't pay it for everything.

The trap is using Opus for tasks Haiku could do. The other trap is using Haiku for tasks that need Opus and burning a week debugging the model's mistakes. Neither saves money.

What's next

So far we've talked about the bill. The other thing leaving your machine when you call an API isn't dollars; it's data. The next post is about what providers actually see, log, and do with the tokens you send them.

From the dictionary

Terms used in this post

Quick reference for the 15 terms you met above. Each one comes from the AI dictionary.

Artificial IntelligenceAI
Umbrella term for software that performs tasks usually associated with human reasoning — language, perception, decision-making. Coined at the 1956 Dartmouth Summer Research Project. In everyday 2026 use, "AI" almost always means a large language model like ChatGPT, Claude, or Gemini, even though the textbook definition is much broader.
e.g. When a product page says "AI-powered", it could mean a 70-billion-parameter LLM or a hand-written if-statement. The label moves with the times.
APIGeneral
Application Programming Interface. In LLM context: the HTTP endpoint a hosted model exposes (api.openai.com, api.anthropic.com). You send JSON, you get tokens back. The cloud-inference contract.
ClaudeAI
Anthropic's family of LLMs (Opus, Sonnet, Haiku) and consumer chat product at claude.ai. Used in this blog's tooling for drafting and dictionary work; also powers Claude Code, the CLI agent.
e.g. This blog's create-post skill drafts inline using Claude.
Context WindowNLP
The maximum number of tokens an LLM can take in for a single forward pass. Everything the model knows about your current conversation has to fit inside this window — anything outside is invisible.
Few-ShotNLP
Showing the LLM a handful of input/output examples in the prompt before the real query, so it picks up the pattern. Cheap and effective; usually the next thing to try after a plain prompt.
GeminiAI
Google's family of LLMs and the consumer chat product at gemini.google.com. Tightly integrated with Google's search index and Workspace apps.
e.g. Gemini is Google's answer to ChatGPT, with native access to Search.
GPTAI
OpenAIs family of large language models — Generative Pre-trained Transformer. GPT-4 (2023) and successors are the most widely used closed-source LLMs in production.
GPUGeneral
A chip built for massive parallel arithmetic. The reason deep learning took off in the 2010s — GPUs make matrix multiplication fast enough to train deep networks in days instead of years. Nvidia dominates the market.
InferenceML
Running a trained model on a new input to produce an output. A single forward pass through the network with frozen weights. Much cheaper than training, which is why every LLM API exists.
Large Language ModelAI
A deep-learning model trained on huge volumes of text to predict the next token given the previous ones. Scaling next-token prediction to billions of parameters yields the chat-like behaviour of ChatGPT, Claude, and Gemini. Capabilities are bounded by training data and the context window.
e.g. Claude is an LLM — it reads your message as tokens and generates a response one token at a time.
ModelML
In ML, a model is a file of learned numbers (parameters or weights) plus an architecture that tells the program how to use them. Loading a model means reading those numbers; running it means doing arithmetic with them.
PromptNLP
The text you send to an LLM. Includes any system prompt, conversation history, retrieved context, and your actual question. The prompt is the only thing you can change without retraining.
Prompt CachingAI
Caching the model state for a stable prefix of a prompt so repeat calls skip recomputing it. Anthropic and OpenAI both expose this via API; cached tokens cost 5-10x less and have a 5-minute TTL on Anthropic. Critical for cost when you reuse system prompts or RAG context across requests.
System PromptNLP
A special instruction at the start of an LLM conversation that sets role, behaviour, format, and constraints. Most "the model isnt doing what I want" problems are solved here, before reaching for RAG or fine-tuning.
TokenNLP
The unit an LLM operates on — roughly a word or piece of one. English averages around 4 characters per token. Tokens are the unit of computation, the unit of API billing, and the unit the context window is measured in.

Rate this article

How helpful did you find this?

Newsletter

Get new articles in your inbox

AI engineering, LLM systems, and software architecture — no filler.

No spam. Unsubscribe any time.

Discussion

Comments

Leave a note about the article, architecture choices, or what you would build next.

Comments are stored in Supabase and fetched per post slug.

Loading comments...