Prompt Caching Is Everything: Applying Claude Code's Lessons to a Multi-Agent Investigation Pipeline

After a 7-agent investigation run came in at $21 and 67 minutes, I audited the pipeline against every recommendation in Thariq Shihipar's caching thread. Here's what changed.

The Article That Started This

On February 19, 2026, Thariq Shihipar from Anthropic's Claude Code team published a thread titled "Lessons from Building Claude Code: Prompt Caching Is Everything". The core thesis: prompt caching is prefix-matched, so static content must come first, dynamic content last, and you should never mutate the prefix mid-session. The Claude Code team runs alerts on cache hit rate and declares SEVs when it drops — that's how critical it is.
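To make "prefix-matched" concrete: the tool definitions and system prompt sit at the front of every request, and an identical prefix is what the cache can reuse. The Agent SDK manages caching for you, but here is a minimal sketch at the raw Messages API level with the anthropic Python SDK (the model ID and tool definition are illustrative):

import anthropic
client = anthropic.Anthropic()
# The static prefix: the same tool list (same order) and the same system text on every request.
static_tools = [{
    "name": "run_bigquery",
    "description": "Run a read-only BigQuery SQL query.",
    "input_schema": {"type": "object", "properties": {"sql": {"type": "string"}}, "required": ["sql"]},
}]
response = client.messages.create(
    model="claude-opus-4-20250514",  # illustrative model ID
    max_tokens=1024,
    tools=static_tools,
    system=[{
        "type": "text",
        "text": "You are the lead investigator coordinating sub-agents through six phases.",
        "cache_control": {"type": "ephemeral"},  # cache breakpoint: tools + system form the reusable prefix
    }],
    # Dynamic content goes last, in the user turn, so it never invalidates the prefix.
    messages=[{"role": "user", "content": "Investigate document upload errors for user 12345."}],
)
print(response.usage.cache_read_input_tokens, response.usage.cache_creation_input_tokens)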

I'd been building a multi-agent investigation pipeline using the Claude Agent SDK — a system that orchestrates 7+ specialized sub-agents through a 6-phase workflow (intake, data collection, processing, analysis, report writing, verification). Each investigation produces a 50–70 page LaTeX report. The whole thing runs on Opus with a 1M context window.

After a run came in at $21 and 67 minutes, I decided to audit the pipeline against every recommendation in Thariq's article.

What We Found

The Cost Tracking Bug

Before we could measure improvements, we needed accurate baselines. Our cost tracker was reporting $67 per investigation — triple the real cost.

The root cause: the Claude Agent SDK's ResultMessage.total_cost_usd is cumulative for the entire session, not per-exchange. Our tracker was summing these values across 6 phases as if they were independent:

$1.06 + $5.20 + $7.68 + $12.35 + $19.92 + $21.04 = $66.25
# Real cost: $21.04 (the final cumulative value)

Interestingly, only total_cost_usd is cumulative. Other ResultMessage fields like usage (token counts), duration_ms, and num_turns are per-exchange. We discovered this empirically — our initial fix applied delta computation to all fields, which produced negative token counts in the per-phase breakdown. The second iteration corrected this to only delta-compute cost while using direct values for everything else.
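The symptom that exposed the first bad fix is easy to reproduce: per-exchange token counts can shrink from one phase to the next, so delta-computing them goes negative (the values below are made up):

# Per-exchange token counts are not monotonically increasing, so deltas can go negative.
phase_3_input_tokens, phase_4_input_tokens = 7_680, 1_020  # illustrative per-exchange values
print(phase_4_input_tokens - phase_3_input_tokens)         # -6660: the bogus "per-phase" count we saw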

The System Prompt Problem

The lead agent's system prompt was ~10KB and contained a mix of static and dynamic content:

[Static] Role description, phase workflow instructions
[Dynamic] Investigation ID, file paths          ← breaks cache here
[Dynamic] Live BigQuery schema, GCS/S3 buckets  ← and here
[Static] Sub-agent descriptions
[Static] Coordination rules

Every investigation has a unique ID and output directory. The infrastructure schema changes whenever tables are added or buckets created. This meant the entire system prompt was unique per-run — zero cross-session cache reuse.
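For illustration, the old assembly looked roughly like this; the helper name and values below are hypothetical, but the shape is the problem: run-specific strings baked into the system prompt.

# Before: run-specific values interpolated directly into the system prompt.
# Every investigation produces a different string, so the prefix never matches
# a previous run's cache entry. (Names and values here are illustrative.)
def build_system_prompt(investigation_id: str, output_dir: str, bq_schema: str) -> str:
    return f"""You are the lead investigator coordinating sub-agents through six phases.
Investigation ID: {investigation_id}
Output directory: {output_dir}
Live BigQuery schema:
{bq_schema}
Sub-agent descriptions and coordination rules follow (static content stuck behind dynamic content)."""
system_prompt = build_system_prompt(
    investigation_id="inv-0142",
    output_dir="/investigations/inv-0142",
    bq_schema="events(ts TIMESTAMP, user_id STRING, action STRING)",
)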

Per the article's rule: "Static content first, dynamic content last."

Token Display Was Misleading

The summary showed "Total tokens: 33.5K" but that only counted input_tokens + output_tokens. The actual token volume was 3.2M+ when you include cache reads. Without seeing cache tokens in the totals, there was no way to evaluate caching performance.

Things That Were Already Correct

The tool set was static and identically ordered across all agents (a common cache-busting mistake per the article). Sub-agent system prompts contained no dynamic interpolation. Agent model assignments were fixed (Opus for lead/analyst, Sonnet for collector, etc.). These were all confirmed safe — no changes needed.

The Changes

1. Static System Prompt + Dynamic Session Context

We split the lead agent's system prompt into two parts.

System prompt (fully static, ~8KB): Role description, sub-agent catalog, phase workflow instructions, coordination rules. No config interpolation, no schema, no paths. Just a note that says "the first user message contains a <session-context> block."

Session context (dynamic, prepended to first user message): Investigation ID, all directory paths, full BigQuery schema, GCS bucket listing, S3 bucket listing. Wrapped in <session-context> tags.

The system prompt now has the same content for every investigation. Combined with the static tool definitions, this gives us a substantial cacheable prefix: the system prompt and tool definitions are identical across all runs.
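A sketch of the new shape (the helper names and placeholder values are ours, and the SDK call that consumes these strings is elided):

# After: the system prompt is a constant; everything run-specific is rendered into a
# <session-context> block and prepended to the first user message instead.
STATIC_SYSTEM_PROMPT = (
    "You are the lead investigator coordinating sub-agents through six phases.\n"
    "Sub-agent catalog, phase workflow, and coordination rules go here (all static).\n"
    "The first user message contains a <session-context> block with the investigation ID,\n"
    "directory paths, and live infrastructure schema."
)
def build_session_context(investigation_id: str, paths: dict, bq_schema: str, buckets: list) -> str:
    # Dynamic, per-run values live in the user turn, so they never touch the cached prefix.
    body = [f"investigation_id: {investigation_id}"]
    body += [f"{name}: {value}" for name, value in paths.items()]
    body += ["bigquery_schema:", bq_schema, "buckets: " + ", ".join(buckets)]
    return "<session-context>\n" + "\n".join(body) + "\n</session-context>"
first_user_message = build_session_context(
    investigation_id="inv-0142",  # hypothetical values
    paths={"output_dir": "/investigations/inv-0142"},
    bq_schema="events(ts TIMESTAMP, user_id STRING, action STRING)",
    buckets=["gs://raw-exports", "s3://final-reports"],
) + "\n\nInvestigate the document upload errors described in the attached bug report."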

2. Delta-Based Cost Tracking

record_result() now stores the previous total_cost_usd and subtracts it from the current value to get the per-phase increment. This produces accurate per-phase cost breakdowns that sum to the true total.
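A minimal sketch, assuming the ResultMessage fields named in this post (total_cost_usd, usage, num_turns, duration_ms); the tracker class and its shape are ours:

from dataclasses import dataclass, field
@dataclass
class CostTracker:
    # Only total_cost_usd is cumulative across the session, so it gets delta-computed.
    # usage, num_turns, and duration_ms are already per-exchange and are stored directly.
    _last_cumulative_cost: float = 0.0
    phases: list = field(default_factory=list)
    def record_result(self, phase: str, result) -> None:
        phase_cost = result.total_cost_usd - self._last_cumulative_cost  # per-phase increment
        self._last_cumulative_cost = result.total_cost_usd
        self.phases.append({
            "phase": phase,
            "cost_usd": phase_cost,
            "usage": result.usage,             # per-exchange: use as-is
            "turns": result.num_turns,         # per-exchange
            "duration_ms": result.duration_ms, # per-exchange
        })
    @property
    def total_cost_usd(self) -> float:
        # Matches the final cumulative value reported by the SDK.
        return self._last_cumulative_cost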

3. Cache-Inclusive Token Display

The summary now shows all token types in the total count and computes a cache hit rate:

cache_hit_rate = cache_read / (cache_read + cache_write + input_tokens)

This gives immediate visibility into caching performance — the single best indicator of whether your prompt architecture is working.
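A sketch of the aggregation, assuming per-exchange usage dicts carrying the four fields the API reports (listed under "How We Measured Cache Performance" below); the function name is ours:

def summarize_usage(per_exchange_usage: list[dict]) -> dict:
    # Sum all four token categories so cache traffic shows up in the total,
    # then compute the hit rate over input-side tokens only.
    totals = {"input_tokens": 0, "output_tokens": 0,
              "cache_read_input_tokens": 0, "cache_creation_input_tokens": 0}
    for usage in per_exchange_usage:
        for key in totals:
            totals[key] += usage.get(key, 0)
    cache_read = totals["cache_read_input_tokens"]
    cache_write = totals["cache_creation_input_tokens"]
    uncached = totals["input_tokens"]
    input_side = cache_read + cache_write + uncached
    totals["cache_hit_rate"] = cache_read / input_side if input_side else 0.0
    totals["total_tokens"] = input_side + totals["output_tokens"]  # cache-inclusive total
    return totals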

The Experiment

We re-ran the exact same investigation with the exact same prompt — a bug report about document upload errors for a specific user, covering the same date range and data sources. Both runs used Opus with context-1m-2025-08-07 beta, identical tool sets, and the same max_turns=200 / max_budget_usd=50.0 configuration.

Results

Metric            Before      After               Change
Total cost        ~$21        $19.85              −6%
Cache hit rate    ~85%        98.7%               +14 pts
Wall-clock time   67 min      69.3 min            +3% (noise)
Report output     ~70 pages   70 pages, 709 KB    Comparable
Tool calls        ~55         55, 0 errors        Same

Per-Phase Cost Breakdown

Phase      Cost      Tokens    Turns
Intake     $18.80    18.3K     20
Collect    $0.16     ~1.0K     3
Process    $0.13     ~1.0K     3
Analyze    $0.26     ~1.3K     6
Report     $0.14     ~1.1K     2
Verify     $0.37     ~3.2K     7

Why Intake dominates (~95% of cost): It's where the lead agent makes direct BigQuery queries and reasons about scope. Phases 2–5 are cheap because they delegate to sub-agents whose costs are tracked separately within the same session.

How We Measured Cache Performance

The Claude API returns token usage broken down into four categories per request:

  • input_tokens — Non-cached input tokens (cache miss)
  • output_tokens — Generated output tokens
  • cache_read_input_tokens — Tokens served from cache (cache hit)
  • cache_creation_input_tokens — Tokens written to cache (first-time cost)

Cache hit rate = cache_read / (cache_read + cache_write + input_tokens). Our 98.7% hit rate means almost the entire system prompt + tool definition prefix was cached across exchanges within the session. The ~1.3% miss is the new content in each exchange (phase prompts, conversation history growth).

The $1.15 cost reduction (~6%) is modest but directionally correct. The bigger win is the 14-point cache hit rate improvement, which means we're paying cache-read pricing (90% discount vs. base input pricing) for a much larger fraction of our tokens. At higher volumes or with longer investigations, this compounds.
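As a back-of-the-envelope check on why the hit rate matters more than the headline cost drop: with cache reads at roughly 10% of the base input price (the 90% discount above), the effective price of the input side falls sharply as the hit rate climbs. This ignores the cache-write premium and output tokens, so it is illustrative only:

def effective_input_price(hit_rate: float, base_price: float = 1.0) -> float:
    # Cached tokens at ~10% of base input price, uncached tokens at full price.
    return hit_rate * 0.10 * base_price + (1.0 - hit_rate) * base_price
before = effective_input_price(0.85)    # ~0.235x base price
after = effective_input_price(0.987)    # ~0.112x base price
print(f"input-side price: {before:.3f}x -> {after:.3f}x ({1 - after / before:.0%} lower)")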

What We Didn't Change (Yet)

Conversation compaction is the next optimization. The lead agent's conversation grows across 6 phases. By the Report phase, it has accumulated all prior phase outputs. The article discusses cache-safe compaction (summarize but keep the same prefix). We use the 1M context beta, so this hasn't been an issue yet — but it's the logical next step if investigations grow larger.

Sub-agent prompt optimization: Sub-agent prompts were already static (no dynamic interpolation). No changes needed.

Takeaways

Audit before optimizing. Our cost tracking bug would have made any A/B test meaningless. Fix your instrumentation first.

The system prompt split is the highest-leverage change. Moving dynamic content from system prompt to the first user message took ~30 minutes of refactoring and produced a 14-point cache hit rate improvement.

Not everything in ResultMessage behaves the same. total_cost_usd is cumulative; token counts, turns, and duration are per-exchange. Test your assumptions empirically.

Cache hit rate should be a first-class metric. If you're building agentic systems on Claude, display it prominently. It's the single best indicator of whether your prompt architecture is working.