Building an Automated Bug Investigation Agent with Claude
From bug report to 60-page PDF in an hour — how a 7-agent pipeline replaced days of manual investigation work.
The Problem
A user writes in: "I uploaded a document for review and got back an analysis of a completely different file." That's the entire bug report. Figuring out what actually happened requires:
- Finding the user across Firebase Auth and BigQuery activity logs
- Pulling their conversations and matching the reported timeframe
- Downloading document artifacts from Cloud Storage — what they uploaded vs. what was generated
- Tracing execution graphs from S3 to see how the processing pipeline handled the file
- Checking Cloud Logging for errors, 429s, timeouts
- Reading application source code to understand the failure path
A thorough investigation touches 6+ data sources, generates hundreds of pages of raw evidence, and takes an engineer 1–2 days. We needed it done in an hour, by people who aren't engineers.
Architecture: A 6-Phase Pipeline
The system is built on the Claude Agent SDK. A lead agent coordinates 7 specialist sub-agents through 6 sequential phases. Each phase delegates to the appropriate sub-agent, collects the result, and moves to the next.
Bug Report (natural language)
|
+-----v------+
| INTAKE | Lead (Sonnet) — scope & user lookup
+-----+------+
|
+-----v------+
| COLLECT | Collector — BQ, GCS, S3, logs
+-----+------+
|
+-----v------+
| PROCESS | Processor — diffs, timelines, chunking
+-----+------+
|
+-----v------+
| ANALYZE | Analyst (Opus) — root cause, code tracing
+-----+------+
|
+-----v------+
| REPORT | 4 Writers — skeleton, body, appendices, compile
+-----+------+
|
+-----v------+
| VERIFY | Lead — PDF validation & summary
+-----+------+
|
v
60-page PDF Report
The critical architectural decision: the lead agent runs on Sonnet, not Opus. It's a dispatcher — it decides which phase to run next, formats the handoff, and checks the result. It doesn't need deep reasoning. The analyst sub-agent, which does root cause analysis and code tracing, runs on Opus. Matching model capability to task complexity keeps costs down without sacrificing quality where it matters.
Sub-agents are defined as AgentDefinition instances dispatched via the SDK's Task tool. The lead treats them as function calls — hand off context, get back structured results. The report phase dispatches the body and appendix writers in parallel (same turn), since they're independent.
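A rough sketch of that structure, with plain dataclasses standing in for AgentDefinition. The agent names, prompts, and tool lists here are illustrative, not the real system's:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubAgent:
    # Mirrors the rough shape of an SDK AgentDefinition:
    # a system prompt, a tool list, and a model choice.
    name: str
    model: str   # "sonnet" for coordination, "opus" for deep analysis
    prompt: str
    tools: tuple

# Hypothetical registry; prompts and tool lists are invented for illustration.
AGENTS = {
    "collector": SubAgent("collector", "sonnet", "Collect raw evidence.",
                          ("bigquery", "gcs", "s3", "logging")),
    "processor": SubAgent("processor", "sonnet", "Diff, timeline, chunk.",
                          ("diff", "python_eval")),
    "analyst":   SubAgent("analyst", "opus", "Find the root cause.",
                          ("python_eval",)),
}

# The lead walks the phases in order and hands each one to its sub-agent.
PHASES = [("COLLECT", "collector"), ("PROCESS", "processor"), ("ANALYZE", "analyst")]

def plan():
    """Return (phase, model) pairs; only ANALYZE should land on Opus."""
    return [(phase, AGENTS[agent].model) for phase, agent in PHASES]
```

The point of the registry shape is that model selection lives next to each agent's definition, so "put Opus only where it matters" is a one-field decision.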
The Skill System: 41 Prompt Fragments
Each sub-agent's system prompt is composed from modular "skills" — numbered prompt fragments that each describe a specific capability. There are 41 numbered skill slots across four categories: collection (S01–S08), processing (S09–S13), analysis (S14–S22), and reporting (S24–S41); the gap at S23 is explained below. Each skill is a Python function that returns a prompt string.
This sounds like over-engineering until you see the payoff. Skills are versioned, auditable, and independently improvable. When I wanted the analyst to catch infrastructure failures — things like 429 quota errors or Cloud Run cold-start timeouts — I added a single skill that described the "infrastructure" bug category with a few example signals. The analyst immediately started flagging quota exhaustion events it had been ignoring. One prompt fragment, zero code changes, measurable impact.
The skill system also made it possible to diagnose a critical performance problem. Skill S23, "investigation scope determination," was supposed to help the lead agent plan its work. In practice, it caused the lead to spiral — 20+ turns of self-reflection before doing anything useful, at $18.80 per investigation just for the intake phase. Removing S23 collapsed intake cost to $0.27. The lead's phase instructions already contained adequate scope constraints. The skill was redundant, and its removal was a one-line change.
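A minimal sketch of how that composition can work. The skill numbers below match the analysis range, but the fragment text is invented for illustration:

```python
# Each skill is a plain function that returns a prompt fragment.
def s14_root_cause() -> str:
    return "Identify the most probable root cause and cite evidence for each claim."

def s22_infrastructure() -> str:
    # The late addition described above: one fragment, zero code changes.
    return ("Check for infrastructure failures: 429 quota errors, "
            "Cloud Run cold-start timeouts, exhausted rate limits.")

ANALYST_SKILLS = [s14_root_cause, s22_infrastructure]

def compose_prompt(base: str, skills) -> str:
    """Concatenate the base prompt with each skill's fragment.
    Removing a skill (as with S23) is deleting one list entry."""
    return "\n\n".join([base] + [skill() for skill in skills])
```

Because each fragment is a named function, adding or removing a capability is a diffable, reviewable change rather than an edit buried in one giant prompt string.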
The Tool Layer: 14 MCP Tools
Tools are served via Model Context Protocol using FastMCP. The server registers 14 tools across 7 categories:
- BigQuery (6 tools) — user lookup, conversation listing, turn fetching, batch turn fetching, activity summaries, raw SQL
- Cloud Storage (2 tools) — artifact version enumeration and download
- S3 (1 tool) — execution graph retrieval
- Cloud Logging (1 tool) — application error log queries
- Diff (1 tool) — unified diffs between document versions
- LaTeX (1 tool) — pdflatex + biber compilation pipeline
- Data processing (2 tools + python_eval) — turn search, log parsing, general-purpose in-process Python
The tool design reflects a deliberate split: bespoke fast-path tools for common operations, plus a general-purpose python_eval for the long tail. The bespoke tools (turn search, log parsing) are optimized Python functions that run in-process in milliseconds. python_eval handles ad-hoc data processing — anything the agent needs to compute that doesn't have a dedicated tool. It executes via exec() in the MCP server process, which gives the same flexibility as a Bash subprocess running Python, but at MCP speed (~milliseconds vs. ~120–240 seconds for subprocess spawning in the Claude Agent SDK).
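A minimal version of such a tool, assuming captured stdout is how results come back (the real tool's interface may differ):

```python
import contextlib
import io

def python_eval(code: str) -> str:
    """Run agent-supplied code in-process and return its captured stdout.
    Note: exec() grants full interpreter access; there is no sandbox here,
    so this is only acceptable when the server already holds the same
    credentials the code could reach."""
    buf = io.StringIO()
    namespace: dict = {}
    with contextlib.redirect_stdout(buf):
        exec(code, namespace)  # in-process: milliseconds, no subprocess spawn
    return buf.getvalue()
```

For example, `python_eval("print(sum(range(10)))")` returns `"45\n"`. The trade-off is deliberate: sandboxing is given up in exchange for eliminating process-spawn latency.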
Performance: The Optimization Story
The first investigation run cost $19.85 and took 69 minutes. Four optimization passes later, the same investigation cost $16.64 and took 60 minutes. Here's the progression, running the same benchmark prompt each time:
| Run | Lead Model | Intake | Total Cost | Time | Key Change |
|---|---|---|---|---|---|
| 1 | Opus | $18.80 | $19.85 | 69 min | Baseline (S23 in prompt, 10K thinking budget) |
| 2 | Sonnet | $15.65 | $16.89 | 62 min | Caching fixes applied, intake safety-net bug |
| 3 | Sonnet | $0.27 | $17.01 | 63 min | Intake fixed — 7 turns, safety net works |
| 4 | Sonnet | $0.30 | $16.79 | 61 min | All skill audit fixes. 96.2% cache hit rate |
| 5 | Sonnet | $0.39 | $16.64 | 60 min | Batch BQ: 11 queries (507s) → 1 query (47s) |
The Four Big Wins
Prompt caching was the foundation. Keep system prompts static, put session context in the first user message, and keep tool ordering stable. This got us to 92–96% cache hit rates — meaning 90%+ of input tokens are served at cache-read pricing.
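Sketched as a plain request-building helper shaped like an Anthropic Messages API call (the helper and model id are illustrative; `cache_control: {"type": "ephemeral"}` is the API's cache-breakpoint marker):

```python
def build_request(static_system: str, tools: list, session_ctx: str, user_msg: str) -> dict:
    """Order the request so the immutable prefix (system prompt + tools)
    is cacheable, while per-investigation context rides in the first
    user message instead of mutating the system prompt."""
    return {
        "model": "claude-sonnet-4-5",  # illustrative model id
        "system": [
            {
                "type": "text",
                "text": static_system,  # identical across investigations
                "cache_control": {"type": "ephemeral"},  # cache breakpoint
            }
        ],
        "tools": tools,  # same list, same order, every call
        "messages": [
            {"role": "user", "content": f"{session_ctx}\n\n{user_msg}"},
        ],
    }
```

Anything above the cache breakpoint must be byte-identical across calls for the cache to hit, which is exactly why the session-specific material moves into the first user message.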
Batch BigQuery queries eliminated serial data fetching. The collector agent originally fetched conversation turns one at a time — 11 separate BigQuery jobs totaling 507 seconds. BigQuery on-demand slots are shared; N concurrent queries each run ~Nx slower. A single WHERE IN UNNEST query fetching all conversations at once completed in 47 seconds. Same data, 10x faster.
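A sketch of the batched query construction, with invented table and column names:

```python
def batch_turns_query(table: str, conversation_ids: list) -> tuple:
    """Build one parameterized query replacing N per-conversation jobs.
    Table and column names here are illustrative, not the real schema."""
    sql = (
        f"SELECT conversation_id, turn_index, role, content\n"
        f"FROM `{table}`\n"
        "WHERE conversation_id IN UNNEST(@ids)\n"
        "ORDER BY conversation_id, turn_index"
    )
    return sql, {"ids": list(conversation_ids)}
```

With the google-cloud-bigquery client, the `@ids` parameter would be bound as an array query parameter, so one job scans for every conversation at once instead of paying per-job queuing and slot contention eleven times.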
In-process data tools replaced Bash subprocess calls. The processor agent was running Python scripts via Bash for turn searching and log parsing — each subprocess call took 120–240 seconds due to SDK overhead. Moving those operations into the MCP server process (Python functions called via MCP) dropped 481 seconds of wall time to under 1 second.
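The bespoke fast-path tools are ordinary functions. A sketch of the turn-search tool, with illustrative field names; on the real server it would be registered as an MCP tool (e.g. via a FastMCP tool decorator) rather than shelled out to:

```python
def search_turns(turns: list, pattern: str) -> list:
    """In-process turn search: a plain Python function exposed as an MCP
    tool, replacing a Bash-spawned script that paid 120-240s of overhead."""
    needle = pattern.lower()
    return [t for t in turns if needle in t.get("content", "").lower()]
```

The function itself is trivial; the win is entirely in where it runs.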
Model selection put intelligence where it matters. The lead agent switched from Opus to Sonnet with zero quality loss on coordination tasks. Only the analyst — which does root cause analysis, code path tracing, and impact assessment — runs on Opus. This cut lead-agent costs significantly while preserving the depth of analysis that makes the reports valuable.
Deployment: Three Services on Cloud Run
The system ships as three Cloud Run services built from the same Docker image with different entry points:
MCP service — an SSE endpoint that exposes the investigation tools to Claude Desktop and Claude Code. Team members connect via a JSON config with a bearer token. This is how non-engineers trigger investigations: they open Claude, paste a bug report, and the MCP tools handle the rest. Min-instance stays warm for responsiveness.
Job service — runs full 6-phase investigations as Cloud Run jobs with a 2-hour timeout. Writes phase progress to Firestore after each phase, uploads artifacts to Cloud Storage, and posts results back to Slack threads. Supports checkpoint save/resume for interrupted investigations.
Chat service — handles follow-up questions after an investigation completes. Downloads Agent SDK checkpoints from Cloud Storage, resumes the session with full context intact, processes the follow-up message, and re-uploads the checkpoint. This means you can ask "what would happen if we increased the rate limit?" three days after the investigation ran, and the agent has the complete investigation context available.
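A sketch of that checkpoint round trip, with a local directory standing in for Cloud Storage and a stub in place of the real resumed Agent SDK session (the checkpoint shape here is invented for illustration):

```python
import json
from pathlib import Path

def run_agent(state: dict) -> str:
    """Placeholder for resuming a real Agent SDK session from saved state."""
    return f"(answer grounded in {len(state['history'])} prior turns)"

def answer_followup(store: Path, session_id: str, question: str) -> dict:
    """The chat service's round trip: download checkpoint, resume with
    full context, append the exchange, re-upload the checkpoint."""
    ckpt = store / f"{session_id}.json"
    state = json.loads(ckpt.read_text())               # "download" checkpoint
    state["history"].append({"role": "user", "content": question})
    reply = run_agent(state)                           # resume with full context
    state["history"].append({"role": "assistant", "content": reply})
    ckpt.write_text(json.dumps(state))                 # "re-upload" checkpoint
    return {"answer": reply, "turns": len(state["history"])}
```

Because the checkpoint is re-uploaded after every exchange, a question asked three days later resumes from the latest state, not the original investigation snapshot.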
Who Uses It
The whole point was to make bug investigation accessible to people who can't query BigQuery. Designers, product leads, attorneys, and support staff all trigger investigations independently. They paste a bug report — sometimes just a forwarded email — into Claude, and get a 60-page PDF with root cause analysis, code references, impact assessment, and fix recommendations.
Multiple investigations run in parallel against different users and timeframes. The MCP service handles concurrent connections. Follow-up questions via checkpoint resume mean the investigation is a conversation, not a one-shot report — you can drill into specific findings, ask about alternative explanations, or request additional data collection.
What Surprised Us
The lead agent needs speed, not intelligence. Switching the lead from Opus to Sonnet produced no measurable quality difference in investigation output. The lead's job is dispatch — read the phase, pick the sub-agent, format the handoff, check the result. Sonnet handles this perfectly. I'd been paying for deep reasoning capacity that was doing shallow coordination work.
One skill removal was the biggest single optimization. Removing skill S23 ("investigation scope determination") collapsed intake cost from $18.80 to $0.27 and intake time from 67 minutes to 57 seconds. The skill was causing the lead to over-deliberate about what to investigate before investigating anything. The phase instructions already said "3–5 BigQuery calls, then move on." The skill contradicted that by asking the agent to think deeply about scope first.
The analyst genuinely finds unknown bugs. In one investigation triggered by a user reporting document upload errors, the analyst traced a WebSocket double-close cascade that nobody had reported. The user's symptom was a side effect of a deeper infrastructure issue. The agent found it by correlating Cloud Logging errors with execution graph timing — the kind of cross-source analysis that's tedious for humans but natural for an agent with access to all the data.
Missing staging data was a real blind spot. Early investigations only had production data access. The agent couldn't compare "what happened" against "what should have happened" because it couldn't see the staging environment where expected behavior was defined. Adding staging data access was one of the last changes and immediately improved the analyst's root cause assessments.
Eight Days of Shipping
The project went from initial commit to production in eight days:
| Day | Date | Milestone |
|---|---|---|
| 1 | Feb 20 | Initial commit. First successful investigation that same night. |
| 2–3 | Feb 21–22 | Performance optimization: prompt caching, batch BQ, in-process tools, model selection. |
| 4 | Feb 23 | Cloud Run deployment — all three services live. |
| 5–6 | Feb 24–25 | Firestore state tracking, Slack integration, checkpoint resume for follow-up questions. |
| 7 | Feb 26 | Sentry integration for error correlation. |
| 8 | Feb 28 | Staging data access, infrastructure bug category, final skill audit. |
18 commits, roughly 10,000 lines of Python, 3 Cloud Run services. The pipeline, skills, tools, deployment, and integrations were all built and shipped inside that window.
Takeaways
Multi-agent specialist beats generalist. A coordinator that dispatches focused sub-agents outperforms a single agent trying to do everything. The sub-agents have narrow system prompts, purpose-built tools, and appropriate model selection. The lead agent stays fast and cheap.
Prompt engineering is software engineering. Skills are versioned modules with defined interfaces. Adding, removing, or modifying a skill has predictable, testable effects. The skill system turned prompt iteration from "change some text and see what happens" into "deploy a specific capability and measure the result."
The hardest part is data access, not pipeline logic. Building the 6-phase pipeline took a day. Getting reliable access to BigQuery, Cloud Storage, S3, Cloud Logging, Firebase Auth, and Sentry — with correct permissions, schema discovery, and edge case handling — took the rest of the week.
Caching + batching + in-process execution. These three optimizations compose multiplicatively. Prompt caching cuts per-token cost by 90%. Batch queries cut data fetching time by 10x. In-process tools cut processing overhead by 500x. Together they turned a $20, 70-minute investigation into a $17, 60-minute one — and the ceiling is much lower once the remaining subprocess calls are eliminated.