Building an Automated Bug Investigation Agent with Claude

From bug report to 60-page PDF in an hour — how a 7-agent pipeline replaced days of manual investigation work.

The Problem

A user writes in: "I uploaded a document for review and got back an analysis of a completely different file." That's the entire bug report. Figuring out what actually happened requires:

  • Finding the user across Firebase Auth and BigQuery activity logs
  • Pulling their conversations and matching the reported timeframe
  • Downloading document artifacts from Cloud Storage — what they uploaded vs. what was generated
  • Tracing execution graphs from S3 to see how the processing pipeline handled the file
  • Checking Cloud Logging for errors, 429s, timeouts
  • Reading application source code to understand the failure path

A thorough investigation touches 6+ data sources, generates hundreds of pages of raw evidence, and takes an engineer 1–2 days. We needed it done in an hour, by people who aren't engineers.

Architecture: A 6-Phase Pipeline

The system is built on the Claude Agent SDK. A lead agent coordinates 7 specialist sub-agents through 6 sequential phases. Each phase delegates to the appropriate sub-agent, collects the result, and moves to the next.

  Bug Report (natural language)
        |
  +-----v------+
  |   INTAKE    |  Lead (Sonnet) — scope & user lookup
  +-----+------+
        |
  +-----v------+
  |   COLLECT   |  Collector — BQ, GCS, S3, logs
  +-----+------+
        |
  +-----v------+
  |   PROCESS   |  Processor — diffs, timelines, chunking
  +-----+------+
        |
  +-----v------+
  |   ANALYZE   |  Analyst (Opus) — root cause, code tracing
  +-----+------+
        |
  +-----v------+
  |   REPORT    |  4 Writers — skeleton, body, appendices, compile
  +-----+------+
        |
  +-----v------+
  |   VERIFY    |  Lead — PDF validation & summary
  +-----+------+
        |
        v
  60-page PDF Report

The critical architectural decision: the lead agent runs on Sonnet, not Opus. It's a dispatcher — it decides which phase to run next, formats the handoff, and checks the result. It doesn't need deep reasoning. The analyst sub-agent, which does root cause analysis and code tracing, runs on Opus. Matching model capability to task complexity keeps costs down without sacrificing quality where it matters.

Sub-agents are defined as AgentDefinition instances dispatched via the SDK's Task tool. The lead treats them as function calls — hand off context, get back structured results. The report phase dispatches body and appendix writers in parallel (same turn), since they're independent.
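Conceptually, the lead's coordination loop looks something like the sketch below — sequential phases that each hand a structured result forward, with the independent report writers dispatched in parallel. The stub functions stand in for SDK Task-tool dispatches; all names here are illustrative, not the real implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sub-agent stubs standing in for SDK Task-tool dispatches.
# Each takes the accumulated context dict and returns a structured result.
def collector(ctx):       return {"evidence": f"raw data for: {ctx['report']}"}
def processor(ctx):       return {"timeline": "normalized events"}
def analyst(ctx):         return {"root_cause": "hypothesis + code trace"}
def body_writer(ctx):     return {"body": "report body sections"}
def appendix_writer(ctx): return {"appendices": "evidence appendices"}

def run_pipeline(report: str) -> dict:
    ctx = {"report": report}
    # Sequential phases: each hands its result to the next via shared context.
    for phase in (collector, processor, analyst):
        ctx.update(phase(ctx))
    # Report phase: body and appendix writers are independent, so they are
    # dispatched in the same turn (modeled here as concurrent threads).
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(w, ctx) for w in (body_writer, appendix_writer)]
        for f in futures:
            ctx.update(f.result())
    return ctx

result = run_pipeline("wrong document analyzed")
```

The point of the shape, not the stubs: the lead never reasons about evidence itself — it only routes context between specialists and merges their outputs.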

The Skill System: 41 Prompt Fragments

Each sub-agent's system prompt is composed from modular "skills" — numbered prompt fragments, each describing a specific capability. There are 41 numbered skills across four categories: collection (S01–S08), processing (S09–S13), analysis (S14–S22), and reporting (S24–S41); the gap at S23 is explained below. Each skill is a Python function that returns a prompt string.
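In minimal form, the pattern is a registry of prompt-returning functions plus a composer. The skill IDs and wording below are illustrative placeholders, not the real fragments:

```python
# Minimal sketch of the skill-registry pattern: each skill is a plain
# function returning a prompt fragment; an agent's system prompt is
# composed from the subset of skill IDs assigned to it.
def s01_user_lookup() -> str:
    return "S01: Resolve the reporting user via Firebase Auth and BigQuery."

def s14_root_cause() -> str:
    return "S14: Form a root-cause hypothesis before requesting more data."

SKILLS = {"S01": s01_user_lookup, "S14": s14_root_cause}

def compose_system_prompt(base: str, skill_ids: list) -> str:
    fragments = [SKILLS[sid]() for sid in skill_ids]
    return "\n\n".join([base, *fragments])

prompt = compose_system_prompt("You are the analyst sub-agent.", ["S14"])
```

Because each fragment is an ordinary function, adding or removing a capability is a one-line change to an agent's skill list — which is exactly what made the S23 removal described below trivial to ship.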

This sounds like over-engineering until you see the payoff. Skills are versioned, auditable, and independently improvable. When I wanted the analyst to catch infrastructure failures — things like 429 quota errors or Cloud Run cold-start timeouts — I added a single skill that described the "infrastructure" bug category with a few example signals. The analyst immediately started flagging quota exhaustion events it had been ignoring. One prompt fragment, zero code changes, measurable impact.

The skill system also made it possible to diagnose a critical performance problem. Skill S23, "investigation scope determination," was supposed to help the lead agent plan its work. In practice, it caused the lead to spiral — 20+ turns of self-reflection before doing anything useful, at $18.80 per investigation just for the intake phase. Removing S23 collapsed intake cost to $0.27. The lead's phase instructions already contained adequate scope constraints. The skill was redundant, and its removal was a one-line change.

The Tool Layer: 14 MCP Tools

Tools are served via Model Context Protocol using FastMCP. The server registers 14 tools across 7 categories:

  • BigQuery (6 tools) — user lookup, conversation listing, turn fetching, batch turn fetching, activity summaries, raw SQL
  • Cloud Storage (2 tools) — artifact version enumeration and download
  • S3 (1 tool) — execution graph retrieval
  • Cloud Logging (1 tool) — application error log queries
  • Diff (1 tool) — unified diffs between document versions
  • LaTeX (1 tool) — pdflatex + biber compilation pipeline
  • Data processing (2 tools + python_eval) — turn search, log parsing, general-purpose in-process Python

The tool design reflects a deliberate split: bespoke fast-path tools for common operations, plus a general-purpose python_eval for the long tail. The bespoke tools (turn search, log parsing) are optimized Python functions that run in-process in milliseconds. python_eval handles ad-hoc data processing — anything the agent needs to compute that doesn't have a dedicated tool. It executes via exec() in the MCP server process, which gives the same flexibility as a Bash subprocess running Python, but at MCP speed (~milliseconds vs. ~120–240 seconds for subprocess spawning in the Claude Agent SDK).
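The core of such a python_eval tool can be sketched in a few lines — exec() with captured stdout, running inside the server process. This is a simplified illustration; a production version would need sandboxing, timeouts, and resource limits, all omitted here:

```python
import io
import contextlib

def python_eval(code, env=None):
    """Sketch of an in-process python_eval tool: run agent-supplied code
    via exec() inside the MCP server process and capture its stdout.
    (Real deployments need sandboxing and limits; omitted for brevity.)"""
    env = env if env is not None else {}
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, env)  # runs in-process: milliseconds, no subprocess spawn
    return buf.getvalue()

# Ad-hoc computation over data already loaded in the server process:
out = python_eval(
    "print(sum(t['tokens'] for t in turns))",
    {"turns": [{"tokens": 10}, {"tokens": 32}]},
)
```

Pre-seeding the exec environment with already-fetched data (`turns` above) is what makes the long tail cheap: the agent writes a one-liner instead of re-fetching or spawning an interpreter.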

Design invariant: The tool list is static and identically ordered across all agents and all investigations. This is critical for prompt caching — tools are part of the cacheable prefix. Changing tool order or definitions mid-session busts the cache.

Performance: The Optimization Story

The first investigation run cost $19.85 and took 69 minutes. Five iterations later, same investigation, same prompt — $16.64 and 60 minutes. Here's the progression, running the same benchmark prompt each time:

Run | Lead Model | Intake | Total Cost | Time   | Key Change
----|------------|--------|------------|--------|-----------
1   | Opus       | $18.80 | $19.85     | 69 min | Baseline (S23 in prompt, 10K thinking budget)
2   | Sonnet     | $15.65 | $16.89     | 62 min | Caching fixes applied; intake safety-net bug
3   | Sonnet     | $0.27  | $17.01     | 63 min | Intake fixed — 7 turns, safety net works
4   | Sonnet     | $0.30  | $16.79     | 61 min | All skill audit fixes; 96.2% cache hit rate
5   | Sonnet     | $0.39  | $16.64     | 60 min | Batch BQ: 11 queries (507s) → 1 query (47s)

The Four Big Wins

Prompt caching was the foundation. Static system prompts, session context in the first user message, static tool ordering. This got us to 92–96% cache hit rates — meaning 90%+ of input tokens are served at cache-read pricing.
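The request shape this implies, sketched against the Anthropic Messages API's cache_control blocks (field values here are illustrative): everything identical across investigations sits before the cache breakpoint, and per-investigation context rides in the first user message.

```python
# Cache-friendly request shape (Anthropic Messages API style; illustrative).
# The static prefix (system prompt + tools) is marked cacheable; dynamic
# session context goes in the first user message, after the cached prefix.
STATIC_SYSTEM = [{
    "type": "text",
    "text": "You are the collector sub-agent. <composed skills here>",
    "cache_control": {"type": "ephemeral"},  # cache breakpoint
}]

def build_request(session_context: str, bug_report: str) -> dict:
    return {
        "system": STATIC_SYSTEM,  # byte-identical across investigations
        "messages": [{
            "role": "user",
            # everything dynamic lives after the cached prefix
            "content": f"{session_context}\n\nBug report:\n{bug_report}",
        }],
    }

req = build_request("user_id=u_123, window=Feb 20-21",
                    "wrong document analyzed")
```

Anything that varies per investigation — user IDs, timeframes, the report text — must stay out of the system prompt, or every run pays full input-token price.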

Batch BigQuery queries eliminated serial data fetching. The collector agent originally fetched conversation turns one conversation at a time — 11 separate BigQuery jobs totaling 507 seconds. BigQuery on-demand slots are shared, so N concurrent queries each run roughly N× slower. A single WHERE IN UNNEST query fetching all conversations at once completed in 47 seconds. Same data, 10x faster.
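A sketch of the batched fetch — one query covering all conversation IDs instead of N per-conversation jobs. Table and column names are illustrative, not the real schema; in production you would pass the IDs as a query parameter rather than interpolating strings:

```python
# One batched query replaces N per-conversation queries.
# Schema names are hypothetical; real code should use query parameters
# (e.g. an ARRAY<STRING> parameter) instead of string interpolation.
def build_batch_turns_query(conversation_ids):
    ids = ", ".join(f"'{cid}'" for cid in conversation_ids)
    return f"""
    SELECT conversation_id, turn_index, role, content, created_at
    FROM `project.dataset.turns`
    WHERE conversation_id IN UNNEST([{ids}])
    ORDER BY conversation_id, turn_index
    """

sql = build_batch_turns_query(["c1", "c2", "c3"])
```

The win is twofold: one job's scheduling overhead instead of eleven, and no contention between your own queries for shared on-demand slots.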

In-process data tools replaced Bash subprocess calls. The processor agent was running Python scripts via Bash for turn searching and log parsing — each subprocess call took 120–240 seconds due to SDK overhead. Moving those operations into the MCP server process (Python functions called via MCP) dropped 481 seconds of wall time to under 1 second.
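What "in-process" means concretely: the bespoke tool is just a Python function over data the server already holds, exposed through MCP instead of shelled out to. A sketch of the turn-search fast path (function name and data shape are illustrative):

```python
# Sketch of a bespoke in-process tool: turn search as a plain function
# living in the MCP server process. No subprocess, no interpreter startup —
# a direct function call over data already in memory.
def search_turns(turns, pattern, limit=20):
    pattern_lower = pattern.lower()
    hits = [t for t in turns if pattern_lower in t.get("content", "").lower()]
    return hits[:limit]

turns = [
    {"turn": 1, "content": "User uploaded report.pdf"},
    {"turn": 2, "content": "Pipeline returned analysis of other.pdf"},
    {"turn": 3, "content": "HTTP 429: quota exceeded"},
]
hits = search_turns(turns, "429")
```

The same function previously ran as a script invoked via Bash; the logic is identical, only the 120–240 seconds of subprocess overhead per call disappears.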

Model selection put intelligence where it matters. The lead agent switched from Opus to Sonnet with zero quality loss on coordination tasks. Only the analyst — which does root cause analysis, code path tracing, and impact assessment — runs on Opus. This cut lead-agent costs significantly while preserving the depth of analysis that makes the reports valuable.

Deployment: Three Services on Cloud Run

The system ships as three Cloud Run services built from the same Docker image with different entry points:

MCP service — an SSE endpoint that exposes the investigation tools to Claude Desktop and Claude Code. Team members connect via a JSON config with a bearer token. This is how non-engineers trigger investigations: they open Claude, paste a bug report, and the MCP tools handle the rest. A minimum instance count keeps the service warm for responsiveness.

Job service — runs full 6-phase investigations as Cloud Run jobs with a 2-hour timeout. Writes phase progress to Firestore after each phase, uploads artifacts to Cloud Storage, and posts results back to Slack threads. Supports checkpoint save/resume for interrupted investigations.

Chat service — handles follow-up questions after an investigation completes. Downloads Agent SDK checkpoints from Cloud Storage, resumes the session with full context intact, processes the follow-up message, and re-uploads the checkpoint. This means you can ask "what would happen if we increased the rate limit?" three days after the investigation ran, and the agent has the complete investigation context available.
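Stripped of the storage layer, the resume flow is: load the checkpoint, append the follow-up, run the agent turn, save the checkpoint back. A local sketch with JSON files standing in for Cloud Storage objects (the actual checkpoint format is the Agent SDK's, not this):

```python
import json
import tempfile
from pathlib import Path

# Local sketch of checkpoint save/resume; the real service downloads and
# re-uploads these from Cloud Storage, and the checkpoint format belongs
# to the Agent SDK. This only illustrates the flow.
def save_checkpoint(path: Path, session: dict) -> None:
    path.write_text(json.dumps(session))

def resume_with_followup(path: Path, followup: str) -> dict:
    session = json.loads(path.read_text())   # full investigation context
    session["messages"].append({"role": "user", "content": followup})
    # ... the agent turn would run here, appending its reply ...
    save_checkpoint(path, session)           # persist for the next question
    return session

ckpt = Path(tempfile.gettempdir()) / "investigation_ckpt.json"
save_checkpoint(ckpt, {"messages": [
    {"role": "assistant", "content": "Root cause: ..."}]})
session = resume_with_followup(ckpt, "What if we increased the rate limit?")
```

Because the full message history round-trips through storage, a follow-up three days later is just another turn in the same session.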

Who Uses It

The whole point was to make bug investigation accessible to people who can't query BigQuery. Designers, product leads, attorneys, and support staff all trigger investigations independently. They paste a bug report — sometimes just a forwarded email — into Claude, and get a 60-page PDF with root cause analysis, code references, impact assessment, and fix recommendations.

Multiple investigations run in parallel against different users and timeframes. The MCP service handles concurrent connections. Follow-up questions via checkpoint resume mean the investigation is a conversation, not a one-shot report — you can drill into specific findings, ask about alternative explanations, or request additional data collection.

What Surprised Us

The lead agent needs speed, not intelligence. Switching the lead from Opus to Sonnet produced no measurable quality difference in investigation output. The lead's job is dispatch — read the phase, pick the sub-agent, format the handoff, check the result. Sonnet handles this perfectly. I'd been paying for deep reasoning capacity that was doing shallow coordination work.

One skill removal was the biggest single optimization. Removing skill S23 ("investigation scope determination") collapsed intake cost from $18.80 to $0.27 and intake time from 67 minutes to 57 seconds. The skill was causing the lead to over-deliberate about what to investigate before investigating anything. The phase instructions already said "3–5 BigQuery calls, then move on." The skill contradicted that by asking the agent to think deeply about scope first.

The analyst genuinely finds unknown bugs. In one investigation triggered by a user reporting document upload errors, the analyst traced a WebSocket double-close cascade that nobody had reported. The user's symptom was a side effect of a deeper infrastructure issue. The agent found it by correlating Cloud Logging errors with execution graph timing — the kind of cross-source analysis that's tedious for humans but natural for an agent with access to all the data.

Missing staging data was a real blind spot. Early investigations only had production data access. The agent couldn't compare "what happened" against "what should have happened" because it couldn't see the staging environment where expected behavior was defined. Adding staging data access was one of the last changes and immediately improved the analyst's root cause assessments.

Eight Days of Shipping

The project went from initial commit to production in eight days:

Day | Date      | Milestone
----|-----------|----------
1   | Feb 20    | Initial commit. First successful investigation that same night.
2–3 | Feb 21–22 | Performance optimization: prompt caching, batch BQ, in-process tools, model selection.
4   | Feb 23    | Cloud Run deployment — all three services live.
5–6 | Feb 24–25 | Firestore state tracking, Slack integration, checkpoint resume for follow-up questions.
7   | Feb 26    | Sentry integration for error correlation.
8   | Feb 28    | Staging data access, infrastructure bug category, final skill audit.

18 commits, roughly 10,000 lines of Python, 3 Cloud Run services. The pipeline, skills, tools, deployment, and integrations were all built and shipped inside that window.

Takeaways

Multi-agent specialists beat a generalist. A coordinator that dispatches focused sub-agents outperforms a single agent trying to do everything. The sub-agents have narrow system prompts, purpose-built tools, and appropriate model selection. The lead agent stays fast and cheap.

Prompt engineering is software engineering. Skills are versioned modules with defined interfaces. Adding, removing, or modifying a skill has predictable, testable effects. The skill system turned prompt iteration from "change some text and see what happens" into "deploy a specific capability and measure the result."

The hardest part is data access, not pipeline logic. Building the 6-phase pipeline took a day. Getting reliable access to BigQuery, Cloud Storage, S3, Cloud Logging, Firebase Auth, and Sentry — with correct permissions, schema discovery, and edge case handling — took the rest of the week.

Caching + batching + in-process execution compose multiplicatively. Prompt caching cuts per-token cost by 90%. Batch queries cut data-fetching time by 10x. In-process tools cut processing overhead by 500x. Together they turned a $20, 70-minute investigation into a $17, 60-minute one — and the ceiling is much lower once the remaining subprocess calls are eliminated.