Prompt Caching Is Everything: Applying Claude Code's Lessons to a Multi-Agent Investigation Pipeline
After a 7-agent investigation run came in at $21 and 67 minutes, I audited the pipeline against every recommendation in Thariq Shihipar's caching thread. Here's what changed.
The Article That Started This
On February 19, 2026, Thariq Shihipar from Anthropic's Claude Code team published a thread titled "Lessons from Building Claude Code: Prompt Caching Is Everything". The core thesis: prompt caching is prefix-matched, so static content must come first, dynamic content last, and you should never mutate the prefix mid-session. The Claude Code team runs alerts on cache hit rate and declares SEVs when it drops — that's how critical it is.
I'd been building a multi-agent investigation pipeline using the Claude Agent SDK — a system that orchestrates 7+ specialized sub-agents through a 6-phase workflow (intake, data collection, processing, analysis, report writing, verification). Each investigation produces a 50–70 page LaTeX report. The whole thing runs on Opus with a 1M context window.
After a run came in at $21 and 67 minutes, I decided to audit the pipeline against every recommendation in Thariq's article.
What We Found
The Cost Tracking Bug
Before we could measure improvements, we needed accurate baselines. Our cost tracker was reporting $67 per investigation — triple the real cost.
The root cause: the Claude Agent SDK's ResultMessage.total_cost_usd is
cumulative for the entire session, not per-exchange. Our tracker was summing
these values across 6 phases as if they were independent:
```
$1.06 + $5.20 + $7.68 + $12.35 + $19.92 + $21.04 = $66.25
# Real cost: $21.04 (the final cumulative value)
```
Interestingly, only total_cost_usd is cumulative. Other ResultMessage
fields like usage (token counts), duration_ms, and num_turns
are per-exchange. We discovered this empirically — our initial fix applied delta computation
to all fields, which produced negative token counts in the per-phase breakdown. The second
iteration corrected this to only delta-compute cost while using direct values for everything else.
The System Prompt Problem
The lead agent's system prompt was ~10KB and contained a mix of static and dynamic content:
```
[Static]  Role description, phase workflow instructions
[Dynamic] Investigation ID, file paths            ← breaks cache here
[Dynamic] Live BigQuery schema, GCS/S3 buckets    ← and here
[Static]  Sub-agent descriptions
[Static]  Coordination rules
```
Every investigation has a unique ID and output directory. The infrastructure schema changes whenever tables are added or buckets created. This meant the entire system prompt was unique per-run — zero cross-session cache reuse.
Per the article's rule: "Static content first, dynamic content last."
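For illustration, this is the kind of prompt assembly that produces the problem, reduced to a sketch (the function and parameter names here are hypothetical, not our actual code):

```python
# Anti-pattern sketch: per-run values interpolated into the system prompt.
# Every investigation gets a unique prefix, so nothing is reusable across
# runs, and a schema refresh would make the prefix unique all over again.
def build_lead_system_prompt(investigation_id: str, output_dir: str,
                             bigquery_schema: str) -> str:
    return f"""You are the lead investigator agent.
[... phase workflow instructions ...]

Investigation ID: {investigation_id}
Output directory: {output_dir}

Live BigQuery schema:
{bigquery_schema}

[... sub-agent descriptions and coordination rules ...]
"""
```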
Token Display Was Misleading
The summary showed "Total tokens: 33.5K" but that only counted
input_tokens + output_tokens. The actual token volume was 3.2M+ when you include
cache reads. Without seeing cache tokens in the totals, there was no way to evaluate caching
performance.
Things That Were Already Correct
The tool set was static and identically ordered across all agents (varying or reordering tools is a common cache-busting mistake, per the article). Sub-agent system prompts contained no dynamic interpolation. Agent model assignments were fixed (Opus for lead/analyst, Sonnet for collector, etc.). These were all confirmed safe — no changes needed.
The Changes
1. Static System Prompt + Dynamic Session Context
We split the lead agent's system prompt into two parts.
System prompt (fully static, ~8KB): Role description, sub-agent catalog,
phase workflow instructions, coordination rules. No config interpolation, no schema, no paths.
Just a note that says "the first user message contains a <session-context> block."
Session context (dynamic, prepended to first user message): Investigation ID,
all directory paths, full BigQuery schema, GCS bucket listing, S3 bucket listing. Wrapped in
<session-context> tags.
The system prompt now has the same content for every investigation. Combined with the static tool definitions, this gives us a substantial cacheable prefix: system prompt + tools is identical across all runs.
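A minimal sketch of the split (the names below are ours for illustration; only the <session-context> convention comes from the actual change):

```python
# Fully static system prompt: byte-identical for every investigation,
# so the system-prompt + tool-definition prefix caches across runs.
LEAD_SYSTEM_PROMPT = """You are the lead investigator agent.
[... phase workflow, sub-agent catalog, coordination rules ...]
The first user message contains a <session-context> block with the
investigation ID, directory paths, and live infrastructure schema.
"""

def build_first_user_message(task: str, investigation_id: str,
                             output_dir: str, bigquery_schema: str,
                             bucket_listing: str) -> str:
    # All per-run values live here, after the static prefix,
    # so they never make the cacheable prefix unique per run.
    session_context = (
        "<session-context>\n"
        f"Investigation ID: {investigation_id}\n"
        f"Output directory: {output_dir}\n"
        f"BigQuery schema:\n{bigquery_schema}\n"
        f"Buckets:\n{bucket_listing}\n"
        "</session-context>"
    )
    return f"{session_context}\n\n{task}"
```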
2. Delta-Based Cost Tracking
record_result() now stores the previous total_cost_usd and
subtracts it from the current value to get the per-phase increment. This produces accurate
per-phase cost breakdowns that sum to the true total.
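Here's a sketch of the logic, assuming a ResultMessage exposing the total_cost_usd, usage, duration_ms, and num_turns fields described earlier (the tracker class itself is simplified from ours):

```python
class CostTracker:
    def __init__(self) -> None:
        self._last_cumulative_cost = 0.0
        self.phases: list[dict] = []

    def record_result(self, phase: str, result) -> None:
        # total_cost_usd is cumulative for the session, so the per-phase
        # cost is the delta from the previous reading.
        phase_cost = result.total_cost_usd - self._last_cumulative_cost
        self._last_cumulative_cost = result.total_cost_usd

        # usage, duration_ms, and num_turns are already per-exchange,
        # so they are recorded directly (delta-computing them is what
        # produced the negative token counts in our first fix).
        self.phases.append({
            "phase": phase,
            "cost_usd": phase_cost,
            "usage": result.usage,
            "duration_ms": result.duration_ms,
            "num_turns": result.num_turns,
        })

    @property
    def total_cost_usd(self) -> float:
        # The true session total is the last cumulative value,
        # which now equals the sum of the per-phase deltas.
        return self._last_cumulative_cost
```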
3. Cache-Inclusive Token Display
The summary now shows all token types in the total count and computes a cache hit rate:
```
cache_hit_rate = cache_read / (cache_read + cache_write + input_tokens)
```
This gives immediate visibility into caching performance — the single best indicator of whether your prompt architecture is working.
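A sketch of the summary computation, given session totals for the four token categories the API reports:

```python
def format_token_summary(input_tokens: int, output_tokens: int,
                         cache_read: int, cache_write: int) -> str:
    # Count every category, not just input + output, so the total
    # reflects real volume (3.2M+ here rather than 33.5K).
    total_tokens = input_tokens + output_tokens + cache_read + cache_write
    cache_hit_rate = cache_read / (cache_read + cache_write + input_tokens)
    return (
        f"Total tokens: {total_tokens:,} "
        f"(input {input_tokens:,}, output {output_tokens:,}, "
        f"cache read {cache_read:,}, cache write {cache_write:,}) | "
        f"cache hit rate {cache_hit_rate:.1%}"
    )
```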
The Experiment
We re-ran the exact same investigation with the exact same prompt — a bug report about
document upload errors for a specific user, covering the same date range and data sources.
Both runs used Opus with context-1m-2025-08-07 beta, identical tool sets, and
the same max_turns=200 / max_budget_usd=50.0 configuration.
Results
| Metric | Before | After | Change |
|---|---|---|---|
| Total cost | ~$21 | $19.85 | −6% |
| Cache hit rate | ~85% | 98.7% | +14 pts |
| Wall-clock time | 67 min | 69.3 min | +3% (noise) |
| Report output | ~70 pages | 70 pages, 709 KB | Comparable |
| Tool calls | ~55 | 55, 0 errors | Same |
Per-Phase Cost Breakdown
| Phase | Cost | Tokens | Turns |
|---|---|---|---|
| Intake | $18.80 | 18.3K | 20 |
| Collect | $0.16 | ~1.0K | 3 |
| Process | $0.13 | ~1.0K | 3 |
| Analyze | $0.26 | ~1.3K | 6 |
| Report | $0.14 | ~1.1K | 2 |
| Verify | $0.37 | ~3.2K | 7 |
How We Measured Cache Performance
The Claude API returns token usage broken down into four categories per request:
- input_tokens — Non-cached input tokens (cache miss)
- output_tokens — Generated output tokens
- cache_read_input_tokens — Tokens served from cache (cache hit)
- cache_creation_input_tokens — Tokens written to cache (first-time cost)
Cache hit rate = cache_read / (cache_read + cache_write + input_tokens).
Our 98.7% hit rate means almost the entire system prompt + tool definition prefix was cached
across exchanges within the session. The ~1.3% miss is the new content in each exchange
(phase prompts, conversation history growth).
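The per-exchange usage counts, unlike total_cost_usd, can simply be summed into session totals. A sketch, assuming usage is a dict with the standard Anthropic usage keys:

```python
from collections import Counter

USAGE_KEYS = ("input_tokens", "output_tokens",
              "cache_read_input_tokens", "cache_creation_input_tokens")

def accumulate_usage(results) -> Counter:
    # usage is per-exchange, so session totals are a straight sum
    # (no delta computation, unlike total_cost_usd).
    totals: Counter = Counter()
    for result in results:
        usage = result.usage or {}
        for key in USAGE_KEYS:
            totals[key] += usage.get(key, 0)
    return totals
```

These totals are what feed the hit-rate formula and the cache-inclusive summary shown earlier.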
The $1.15 cost reduction (~6%) is modest but directionally correct. The bigger win is the 14-point cache hit rate improvement, which means we're paying cache-read pricing (90% discount vs. base input pricing) for a much larger fraction of our tokens. At higher volumes or with longer investigations, this compounds.
What We Didn't Change (Yet)
Conversation compaction is the next optimization. The lead agent's conversation grows across 6 phases. By the Report phase, it has accumulated all prior phase outputs. The article discusses cache-safe compaction (summarize but keep the same prefix). We use the 1M context beta, so this hasn't been an issue yet — but it's the logical next step if investigations grow larger.
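As a rough sketch of what we have in mind (our reading of the article's advice, untested, with message and role handling heavily simplified):

```python
def compact_cache_safely(history: list[str], summarize) -> list[str]:
    # Cache-safe compaction: never touch the static prefix (system prompt,
    # tools, and the first user message with its <session-context> block);
    # collapse only the older mid-conversation turns into a summary and
    # keep the most recent turns verbatim.
    if len(history) <= 4:
        return history                 # nothing worth compacting yet
    prefix = history[:1]               # first user message, byte-identical
    middle = history[1:-2]             # accumulated phase outputs to compress
    tail = history[-2:]                # recent turns the agent still needs
    return prefix + [f"[Summary of earlier phases]\n{summarize(middle)}"] + tail
```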
Sub-agent prompt optimization: Sub-agent prompts were already static (no dynamic interpolation). No changes needed.
Takeaways
Audit before optimizing. Our cost tracking bug would have made any A/B test meaningless. Fix your instrumentation first.
The system prompt split is the highest-leverage change. Moving dynamic content from system prompt to the first user message took ~30 minutes of refactoring and produced a 14-point cache hit rate improvement.
Not everything in ResultMessage behaves the same.
total_cost_usd is cumulative; token counts, turns, and duration are per-exchange.
Test your assumptions empirically.
Cache hit rate should be a first-class metric. If you're building agentic systems on Claude, display it prominently. It's the single best indicator of whether your prompt architecture is working.