
2026-04-23 / 7 MIN READ

Prompt caching economics: when the 5-minute TTL pays rent

Field notes on Claude prompt caching economics. Real numbers on the 5-minute TTL, 1-hour option, and the looped workflows where caching is rounding-error.

2026-02-17

Pulled my Anthropic usage dashboard this morning and did a thing I had been putting off for a month: actually read the cache-hit numbers. I had been telling myself caching was "on" because I had flipped the flag. I had not checked whether the cache was actually getting hit.

Turned out roughly 40 percent of my workflow was paying the full uncached rate on content that should have been cache reads. The fix took two hours. The savings are maybe $30-60 a month, which sounds like nothing until you realize it is roughly 60 percent of the bill on a workflow I run 15 times a week.

2026-02-18

Numbers I re-derived so I would stop guessing. Claude Sonnet 4.5 pricing as of early 2026:

  • Input tokens, uncached: $3 per million
  • Cache writes (5-minute TTL): $3.75 per million (25 percent premium on first write)
  • Cache reads (5-minute TTL): $0.30 per million (10 percent of base)
  • Cache writes (1-hour TTL): $6 per million (100 percent premium)
  • Cache reads (1-hour TTL): $0.30 per million (still 10 percent of base)

The 5-minute TTL is the default. The 1-hour option exists for workflows that need to persist longer but is expensive on the write. For most agent loops, 5-minute is the right trade.
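Those four rates are the whole cost model. A throwaway sketch for pricing a single call each way (prices hard-coded from the list above; `call_cost` is my own helper, not anything from the SDK):

```python
# Sonnet 4.5 prices from the list above, in dollars per million input tokens.
PRICE = {
    "uncached": 3.00,
    "write_5m": 3.75,  # 25 percent premium on the first write
    "write_1h": 6.00,  # 100 percent premium
    "read": 0.30,      # same read price for both TTLs
}

def call_cost(tokens: int, mode: str) -> float:
    """Dollar cost of sending `tokens` input tokens under one pricing mode."""
    return tokens / 1_000_000 * PRICE[mode]

# A 40K-token prefix, one call each way:
print(round(call_cost(40_000, "uncached"), 4))  # 0.12
print(round(call_cost(40_000, "write_5m"), 4))  # 0.15
print(round(call_cost(40_000, "read"), 4))      # 0.012
```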

10-iteration loop: $0.258 per run

  • Iteration 1 (cache write): $0.1500
  • Iterations 2-10 (cache reads): $0.0120 each
  • Cumulative: $0.258, roughly 79 percent cheaper than uncached

2026-02-19

Ran the math for a loop I run daily. A Claude Code agent reading a 40K-token working set to do 10 iterations of edits.

  • Uncached cost: 40,000 input tokens times $3/M times 10 iterations = $1.20 per run.
  • Cached cost: first iteration is a cache write at $3.75/M (so $0.15). Next 9 are cache reads at $0.30/M (so $0.012 each, $0.108 total). Grand total $0.258 per run.

That is a 79 percent drop on a 10-iteration loop. Across 15 runs a week, the difference is roughly $14 a week, $56 a month. Not earth-shattering in isolation. Multiply by three parallel agents and five workflows and it stops being rounding-error.
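The same arithmetic as a sketch, so it can be re-run when prices or loop shapes change (function name and shape are mine; prices from the table above):

```python
# 5-minute-TTL loop cost: one cache write, then (iters - 1) cache reads.
def loop_cost(tokens: int, iters: int, cached: bool = True) -> float:
    per_m = tokens / 1_000_000
    if not cached:
        return per_m * 3.00 * iters
    return per_m * 3.75 + per_m * 0.30 * (iters - 1)

uncached = loop_cost(40_000, 10, cached=False)
cached = loop_cost(40_000, 10)
print(round(uncached, 3), round(cached, 3))            # 1.2 0.258
print(round((uncached - cached) / uncached * 100, 1))  # 78.5
```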

2026-02-20

The thing I kept getting wrong: a cache write costs more than an uncached call, but it pays for itself the first time you get a cache hit. If you write a cache and never hit it, you paid a 25 percent premium for nothing.

This is why the 5-minute TTL is the default. Within 5 minutes, if you are in a workflow that reuses the same context (agent loop, repeated-question Q&A, iterative build), you will hit the cache. If you are not, you probably should not be caching that content.

The 1-hour TTL is tempting but harder to justify. A 1-hour cache write costs 100 percent more than uncached on the first call, though against paying uncached every call it still breaks even by the second hit. The comparison that matters is against the 5-minute TTL: reads cost the same on both, so the extra $2.25 per million on the write only pays off when your calls are spaced more than 5 minutes apart and a 5-minute cache would keep expiring and forcing fresh writes.
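A quick brute-force check of the break-even against uncached (my own sketch; prices per million tokens from the table above):

```python
# 1-hour-TTL cost after one write and `hits` cache reads, versus paying
# uncached for the same (hits + 1) calls.
def cached_1h(hits: int) -> float:
    return 6.00 + 0.30 * hits

def all_uncached(calls: int) -> float:
    return 3.00 * calls

for hits in (1, 2, 3):
    print(hits, cached_1h(hits) < all_uncached(hits + 1))
# 1 False -> one hit does not yet cover the 100 percent write premium
# 2 True  -> from the second hit on, the 1-hour cache is ahead
# 3 True
```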

2026-02-23

What I cache now, specifically:

  • Long system prompts. If the system prompt is over 1K tokens and I am going to run more than 2 calls with it in 5 minutes, it gets cached. The prompt is at the front of the message, so caching there is straightforward.
  • Working-set files. When an agent loads 20K tokens of repo context to do iterative edits, those files get cached. Every subsequent edit in the same 5 minutes is a cache read.
  • Tool definition blocks. Sub-agent dispatch includes tool definitions. Those cache cleanly and get hit every time the sub-agent loops.

What I do not cache:

  • One-shot calls. A single question-and-answer interaction has nothing to reuse.
  • Rapidly changing content. If the working set is being rewritten every call (say, a document being drafted), caching fails because the cache key keeps changing.
  • Small prompts. Below about 1K tokens, Sonnet will not cache the block at all (the documented minimum cacheable prompt length is 1,024 tokens), so marking it buys nothing. Just pay the uncached rate.
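For the raw-SDK path, the opt-in lives on the content block itself. A sketch of the request shape (model id and prompt are placeholders; this mirrors the Messages API shape but is not run against the live API here):

```python
# Stable prefix (system prompt, working-set files) carries cache_control;
# the per-call user message varies and stays uncached.
long_system_prompt = "You are an editing agent. " * 400  # stand-in, >1K tokens

request = {
    "model": "claude-sonnet-4-5",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": long_system_prompt,
            "cache_control": {"type": "ephemeral"},  # 5-minute TTL
        }
    ],
    "messages": [
        {"role": "user", "content": "Apply the next edit."}  # varies per call
    ],
}
# Sent as client.messages.create(**request) with the anthropic SDK.
print(request["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```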

2026-02-25

Ran into an interesting edge case. Two agents sharing the same cached system prompt. They both benefit from the cache. Cache keys are tied to content, not to sessions. If Agent A caches a system prompt and Agent B makes a call with the identical prompt within 5 minutes, Agent B gets a cache hit.

This is relevant for parallel agent dispatch. Parallel versus sequential dispatch covers the mechanics. The caching side of it: the first parallel agent to start pays the cache-write premium. The others ride the cache for cache-read prices.

2026-02-28

Had a workflow where caching was not helping and I could not figure out why. Turned out I was rebuilding the context string each iteration with a timestamp embedded in it. Cache key was different every call. Cache miss every call. Spent twenty minutes debugging before I noticed the timestamp.

Rule I added to my checklist: if you think caching is on and you are not seeing the cost drop, check whether the content is actually identical across calls. Even one token of difference blows the cache.
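The failure reduces to a few lines. Hashing the prefix is a cheap identity check (the builder below is a toy stand-in for real context assembly):

```python
import hashlib
import time

def build_context(files: str, stamp: bool) -> str:
    # The bug: a per-call header makes every prefix unique.
    header = f"# generated {time.time()}\n" if stamp else ""
    return header + files

files = "def main():\n    ...\n"
key = lambda s: hashlib.sha256(s.encode()).hexdigest()

a = key(build_context(files, stamp=True))
time.sleep(0.01)
b = key(build_context(files, stamp=True))
print(a == b)  # False: timestamp in the prefix, cache miss every call

c = key(build_context(files, stamp=False))
d = key(build_context(files, stamp=False))
print(c == d)  # True: identical content, same cache key, cache hit
```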

2026-03-02

The 1-hour TTL question keeps coming up. Some workflows really do want longer persistence. Long-running research sessions, multi-step builds that pause for human review. For those, 1-hour makes sense when the gaps between calls run longer than 5 minutes, so a 5-minute cache would keep expiring and forcing fresh writes.

But I have found that many workflows I thought wanted 1-hour actually just want 5-minute-with-keepalive. A tiny bump call every 4 minutes keeps the cache warm: one call with the cached content, no other work, costing a cache read plus a minimal completion. The reads add up, though. At $0.30 per million per bump against the $2.25 per million one-time premium of a 1-hour write, keepalive only stays cheaper for gaps up to roughly half an hour; past seven or so bumps, the 1-hour write wins.
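The crossover is easy to pin down per million cached tokens (my arithmetic, ignoring the minimal completions): keepalive bumps are reads at $0.30 each, while the 1-hour write's premium over a 5-minute write is $2.25, paid once.

```python
WRITE_5M, WRITE_1H, READ = 3.75, 6.00, 0.30  # dollars per million tokens

def keepalive_cost(bumps: int) -> float:
    """One 5-minute write, then `bumps` warm-up reads (completions ignored)."""
    return WRITE_5M + READ * bumps

for bumps in (3, 7, 8, 14):
    print(bumps, round(keepalive_cost(bumps), 2), "vs", WRITE_1H)
# 3 4.65 vs 6.0   -> keepalive wins for short gaps
# 7 5.85 vs 6.0   -> still ahead at ~28 minutes of bumps
# 8 6.15 vs 6.0   -> past this point the 1-hour write is cheaper
# 14 7.95 vs 6.0  -> a full hour of keepalives costs more than the 1h write
```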

2026-03-05

The field note I want the version-of-me-from-a-year-ago to have read: caching is a bill optimization, not an architecture change. Turn it on, verify it is hitting, move on. The trap is optimizing cache hit rates before you have verified the cache is doing anything at all.

What I check now, in order:

  1. Is the cache flag set? (Sometimes I forget on new code paths.)
  2. Is the cache actually hitting? (The usage dashboard shows cache-read tokens separately from input tokens. If cache-reads are zero, something is wrong.)
  3. Is the content stable across calls? (Timestamps, session IDs, anything that varies per call will break the cache key.)
  4. Is the content over 1K tokens? (Sonnet will not cache prompts below its 1,024-token minimum; the marker is silently ignored.)
  5. Am I running more than 2 calls with this content in 5 minutes? (With no reuse inside the window, the write premium buys nothing.)
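Check #2 does not strictly need the dashboard: the Messages API reports the split on every response, as `cache_read_input_tokens` and `cache_creation_input_tokens` alongside plain `input_tokens` in the usage block. A small triage helper over that shape (function and wording are mine):

```python
def cache_report(usage: dict) -> str:
    reads = usage.get("cache_read_input_tokens") or 0
    writes = usage.get("cache_creation_input_tokens") or 0
    fresh = usage.get("input_tokens", 0)
    if reads == 0 and writes == 0:
        return "no caching at all: is cache_control set on this code path?"
    if reads == 0:
        return "writing but never hitting: is the content stable across calls?"
    return f"hitting: {reads} read / {writes} written / {fresh} uncached"

# Usage-block shapes as returned in response.usage (numbers made up):
print(cache_report({"input_tokens": 40_000}))
print(cache_report({"input_tokens": 12, "cache_creation_input_tokens": 40_000}))
print(cache_report({"input_tokens": 12, "cache_read_input_tokens": 40_000}))
```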

What the month taught me

Caching is one of the most leveraged changes you can make on a Claude Code bill. It is also one of the easiest to think you have done and actually not done. Measure cache hit rate before you optimize anything else about your workflow. The rest of the token budget math assumes you have the caching question answered.

For operators who run agent workflows professionally, the Claude Code skills pack ships with caching patterns pre-wired into the skills that benefit most from it. The broader agent handbook covers the full set of cost optimizations.

FAQ

Is prompt caching on by default in Claude Code?

For most workflows, yes. Claude Code auto-caches the working set and system prompt. If you are using the raw Anthropic SDK, you have to opt in per request by setting cache_control on the content blocks you want cached.

Can I cache tool definitions?

Yes. Tool definitions are part of the request payload and cache the same as any other prompt content. If you have a long tool list that does not change between calls, cache it.

What breaks the cache?

Any change to the cached content. Even a whitespace difference produces a new cache key and a cache miss. Keep cached content deterministic across calls. No timestamps, no session IDs, no user-dependent strings inside the cached block.

Is the 1-hour TTL worth it?

Only when calls are spaced more than about 5 minutes apart, so a 5-minute cache would keep expiring between them. For tight agent loops, the 5-minute TTL, kept warm with an occasional keepalive call if needed, is the cheaper shape.

How do I verify caching is working?

Check the Anthropic usage dashboard. Cache-read tokens are reported separately from uncached input tokens. If the cache-read line stays at zero, something is wrong with your request shape.


Let us talk

If something in here connected, feel free to reach out. No pitch deck, no intake form. Just a direct conversation.

Get in touch