Can You Build a Futures Market for GPU Compute?

Cloud GPU pricing varies wildly across providers, time of day, and GPU model. gpu-spot.com tracks that variance every 30 minutes to answer a specific question: is the spread wide enough and predictable enough to support a compute futures order book?

The Question

Compute is becoming a commodity. An H100 SXM on Vast.ai is fungible with an H100 SXM on TensorDock — same chip, same VRAM, same CUDA cores. But there's no futures market for it. No one is trading GPU-hours the way they trade oil barrels or wheat bushels.

The interesting thing is that the preconditions might already exist. Different cloud GPU providers charge different prices for the same hardware, and those prices move. An H100 SXM might be $2.40/hr on Vast.ai right now and $1.99/hr on TensorDock. An hour later those numbers shift. If those spreads are wide enough, persistent enough, and somewhat predictable, you could build an order book around them — a place where someone who needs compute next Tuesday can lock in a price today, and someone with idle GPUs can sell that commitment.

That's the hypothesis. But a hypothesis without data is just a blog post. gpu-spot.com is the data collection experiment.

What We're Tracking

The scraper runs every 30 minutes on a cron job, hitting the Vast.ai API and TensorDock's v2 locations endpoint. It collects spot prices for 8 GPU models across both platforms and stores every listing in a SQLite database — raw JSON blobs plus normalized fields for price, location, GPU count, vCPUs, and RAM.
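The storage path can be sketched roughly like this — table and field names here are illustrative, not the project's actual schema:

```python
import json
import sqlite3

# Hypothetical schema: one row per listing per snapshot, with the raw
# JSON blob preserved alongside the normalized fields.
SCHEMA = """
CREATE TABLE IF NOT EXISTS listings (
    scraped_at TEXT NOT NULL,
    platform   TEXT NOT NULL,
    gpu_model  TEXT NOT NULL,
    price_hr   REAL NOT NULL,
    location   TEXT,
    gpu_count  INTEGER,
    vcpus      INTEGER,
    ram_gb     REAL,
    raw_json   TEXT NOT NULL
)
"""

def store_listing(conn: sqlite3.Connection, platform: str, listing: dict) -> None:
    """Normalize a raw API listing and append it to the snapshot table."""
    conn.execute(
        "INSERT INTO listings VALUES (datetime('now'), ?, ?, ?, ?, ?, ?, ?, ?)",
        (
            platform,
            listing["gpu_name"],
            listing["price_per_hour"],
            listing.get("geolocation"),
            listing.get("num_gpus"),
            listing.get("cpu_cores"),
            listing.get("ram_gb"),
            json.dumps(listing),  # keep the raw blob for re-parsing later
        ),
    )
```

Keeping the raw JSON next to the normalized columns means a normalization bug can be fixed retroactively without re-scraping.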

GPU        Vast.ai   TensorDock
H100 SXM   yes       yes
H100 NVL   yes       no
H200       yes       no
H200 NVL   yes       no
RTX 4090   yes       yes
RTX 5090   yes       yes
B200       yes       no
A100 SXM   yes       yes

Vast.ai has the wider GPU selection. TensorDock has four of the eight models but tends to have tighter price ranges — fewer hosts, less variance. The cross-platform overlap (H100 SXM, RTX 4090, RTX 5090, A100 SXM) is where the arbitrage signal lives.

SQLite is the sole datastore. No Postgres, no managed database. For a 14-day experiment, the ops overhead of a database server is pure waste. The scraper writes, the dashboard reads (in read-only mode), and when you need to move the data around, the whole database is a single file you can scp.
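The read-only side is a one-liner with Python's sqlite3 URI mode; a minimal sketch:

```python
import sqlite3

def open_readonly(path: str) -> sqlite3.Connection:
    # URI mode lets the dashboard open the file read-only, so a bug in a
    # request handler can never corrupt the scraper's writes.
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)
```

Any write attempt through this connection raises sqlite3.OperationalError, which turns a potential data-corruption bug into a loud server error.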

The Dashboard

gpu-spot.com is a Flask app serving two main pages and a JSON API, backed by that same SQLite file. The overview page shows cross-platform spreads, intra-market price ranges, and a 7-day timeseries for each GPU. Per-GPU detail pages break down price history, distribution buckets, and time-of-day patterns.
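A minimal sketch of what one of the JSON API routes might look like, assuming a listings table with per-snapshot price rows — the route path, database filename, and column names are all assumptions, not the project's actual code:

```python
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)
DB_PATH = "gpu_spot.db"  # assumed filename

def query(sql: str, *args):
    # Read-only connection per request: the scraper is the only writer.
    conn = sqlite3.connect(f"file:{DB_PATH}?mode=ro", uri=True)
    try:
        return conn.execute(sql, args).fetchall()
    finally:
        conn.close()

@app.route("/api/prices/<gpu>")
def prices(gpu: str):
    rows = query(
        "SELECT platform, MIN(price_hr), AVG(price_hr), MAX(price_hr) "
        "FROM listings WHERE gpu_model = ? GROUP BY platform",
        gpu,
    )
    return jsonify([
        {"platform": p, "floor": lo, "avg": avg, "ceiling": hi}
        for p, lo, avg, hi in rows
    ])
```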

The design is dark mode only, monospace typography throughout, no gradients or shadows. All CSS is inline in the HTML files — no build step, no bundler. Chart.js is the only external dependency, self-hosted as a static file to avoid third-party cookies and render-blocking CDN requests. The whole frontend is vanilla JS with fetch() and DOM manipulation.

OG images are rendered server-side with Pillow. When you share a gpu-spot.com link in Slack or iMessage, the preview card shows the current best price signal for that GPU — live data, not a stale screenshot. They're cached for 30 minutes and regenerated on the next request.
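The caching policy is simple enough to sketch from the file's mtime — the Pillow rendering itself is omitted here, and the names are illustrative:

```python
import os
import time

CACHE_TTL = 30 * 60  # OG images regenerate at most every 30 minutes

def is_fresh(path: str, ttl: int = CACHE_TTL) -> bool:
    """Serve the cached PNG if it is younger than the TTL; otherwise the
    caller re-renders it with Pillow on this request."""
    try:
        return (time.time() - os.path.getmtime(path)) < ttl
    except FileNotFoundError:
        return False
```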

Viability Thresholds

The experiment has explicit pass/fail criteria. These aren't aspirational — they're the minimum conditions for a compute futures market to function. If the data doesn't hit these numbers after a few weeks of collection, the answer is "not yet" and the experiment is done.

Signal                     Green light        Red light
Cross-platform spread      > $0.30/hr         < $0.15/hr
Coefficient of variation   > 0.15             < 0.05
Time-of-day swing          > 10%
Supply pool                > 20 GPUs/model    < 5 GPUs

Cross-platform spread is the most important signal. If the same GPU costs $0.40/hr more on one platform than another, there's room for a market maker. Coefficient of variation measures whether prices actually move — a CV above 0.15 means the price fluctuates enough to make a futures contract meaningful. Time-of-day swings above 10% suggest predictable patterns that traders could exploit. And the supply pool needs to be deep enough that individual host outages don't dominate the signal.
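Coefficient of variation is a one-line computation; a sketch with Python's statistics module:

```python
from statistics import mean, pstdev

def coefficient_of_variation(prices: list[float]) -> float:
    """CV = standard deviation / mean: a scale-free measure of how much a
    GPU's price actually moves. Above ~0.15 the movement is worth trading;
    below ~0.05 the price is effectively flat."""
    return pstdev(prices) / mean(prices)
```

Because CV is normalized by the mean, a $0.10 wiggle on a $0.50/hr RTX 4090 and a $0.50 wiggle on a $2.50/hr H100 score the same — which is what makes one threshold usable across all eight GPU models.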

These thresholds are encoded directly in the daily analysis pipeline, so every morning the system renders machine-readable verdicts: is this GPU's spread actionable? Is its volatility tradeable? The signals array stores these as structured data, not prose.
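One way such a check might be encoded — the dictionary keys and field names here are assumptions, chosen to mirror the thresholds table above:

```python
# Hypothetical encoding of the green-light thresholds; the real
# pipeline's names and exact values may differ.
THRESHOLDS = {
    "spread": 0.30,     # $/hr cross-platform spread worth acting on
    "cv": 0.15,         # coefficient of variation worth trading
    "tod_swing": 0.10,  # intraday swing suggesting predictable patterns
}

def verdict(gpu: str, signal_type: str, value: float) -> dict:
    """Render one machine-readable verdict for the signals array."""
    threshold = THRESHOLDS[signal_type]
    return {
        "gpu": gpu,
        "type": signal_type,
        "value": value,
        "actionable": value > threshold,
        "reason": f"{signal_type}={value:.2f} vs threshold {threshold:.2f}",
    }
```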

The Daily Analysis Agent

Raw data is necessary but not sufficient. Looking at a table of spot prices and mentally computing cross-platform spreads, 24-hour deltas, and volatility coefficients is exactly the kind of work an LLM should do. So the system runs a daily analysis pipeline via the Claude Agent SDK at 07:00 UTC every morning.

The pipeline has three phases:

Phase 1 — Context scout

Sonnet searches the web for recent GPU market news — supply chain shifts, new GPU launches, pricing announcements. Every claim must have a source URL. No training-data hallucinations about market conditions. If the scout finds nothing relevant, it says so and the pipeline falls back to yesterday's market context.

Phase 2 — Data orchestrator

Sonnet calls 7 MCP tools to query the database directly:

  • get_current_prices — floor, average, ceiling, spread, host count per GPU per platform
  • get_price_changes — 24-hour and 7-day deltas versus historical averages
  • get_cross_deltas — TensorDock vs Vast spreads for overlapping GPUs
  • get_supply_metrics — host counts, available GPUs, 24-hour supply changes
  • get_volatility — standard deviation, coefficient of variation, tradeable threshold
  • get_tod_patterns — hourly price averages for intraday swing detection
  • get_price_floor_series — raw floor prices per snapshot for trend detection
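The shape of these tools is easy to sketch. Here is the kind of SQL a tool like get_cross_deltas might run, assuming a listings table with per-snapshot price rows — the table, column, and platform names are assumptions, not the project's actual schema:

```python
import sqlite3

# For each GPU listed on both platforms, compare the floor price on each
# side and rank by the absolute spread.
CROSS_DELTA_SQL = """
SELECT v.gpu_model,
       v.floor AS vast_floor,
       t.floor AS td_floor,
       ABS(v.floor - t.floor) AS spread
FROM (SELECT gpu_model, MIN(price_hr) AS floor FROM listings
      WHERE platform = 'vast' GROUP BY gpu_model) v
JOIN (SELECT gpu_model, MIN(price_hr) AS floor FROM listings
      WHERE platform = 'tensordock' GROUP BY gpu_model) t
  ON v.gpu_model = t.gpu_model
ORDER BY spread DESC
"""

def cross_deltas(conn: sqlite3.Connection) -> list[tuple]:
    return conn.execute(CROSS_DELTA_SQL).fetchall()
```

The inner join means GPUs listed on only one platform drop out automatically, so the tool only ever reports spreads that are actually arbitrageable.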

The output is structured via Pydantic models — not free-form prose. The orchestrator produces a DataAnalysisOutput with a lead signal (the single most interesting finding), a title and summary derived from that signal, four sections (market snapshot, movers or baselines, spread & volatility, market context), and an array of structured signal verdicts.

from pydantic import BaseModel

class Section(BaseModel):
    # Fields assumed from the description above: one titled block of prose.
    heading: str
    body: str

class Signal(BaseModel):
    gpu: str
    type: str
    value: float
    actionable: bool
    reason: str

class DataAnalysisOutput(BaseModel):
    lead_signal: str
    title: str
    summary: str
    sections: list[Section]
    signals: list[Signal]

The sections are conditional. If data spans two or more days, Phase 2 writes a "movers" section highlighting GPUs with significant price changes. If the database is too young for meaningful deltas, it writes a "baselines" section instead — establishing reference points for future comparison.
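The branch itself is simple; a sketch (not the pipeline's actual code):

```python
from datetime import datetime, timedelta

def section_mode(first_snapshot: datetime, now: datetime) -> str:
    """Pick 'movers' once the data spans two or more days, else fall back
    to 'baselines' while the database is still too young for deltas."""
    return "movers" if now - first_snapshot >= timedelta(days=2) else "baselines"
```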

Phase 3 — Context synthesis

Sonnet writes the "market context" section using only Phase 1's cited sources. This is the grounding constraint: the model can't inject macro commentary from its training data. Every claim in the market context traces to a URL the scout found that morning.

Cost: The full pipeline costs ~$0.80/day to run. Phase 1 (scout) is ~$0.20, Phase 2 (orchestrator) is ~$0.60, and Phase 3 (synthesis) is ~$0.02. Output goes into a daily_analysis table and is served at gpu-spot.com/analysis.

Agent SDK constraints

Building on the Claude Agent SDK surfaced a few sharp edges worth documenting, since they cost me time and aren't obvious from the docs:

  • AgentDefinition sub-agents break final output. The SDK's receive_response() does not surface the orchestrator's final TextBlock when sub-agents are spawned. Direct MCP tool calls work correctly and are cheaper.
  • permission_mode="bypassPermissions" fails as root. The workaround is a can_use_tool callback that auto-approves tool calls.
  • query() with can_use_tool requires streaming. All phases use structured_output via query() instead of manual ClaudeSDKClient sessions.

These are recorded as ADRs in the repo. The pipeline works, but the path from docs to working code had a few detours.

What It Ships With

The project went from initial commit to production in 8 days, across 21 merged PRs. That's not because it's small — it's because the scope was ruthlessly constrained.

CI/CD via GitHub Actions. Merging to main triggers pre-deploy checks (no CDN references, WCAG contrast validation, <main> landmark, Python syntax) followed by SCP deploy and smoke tests that hit live endpoints. Dashboard changes and scraper changes deploy independently to different paths on the server.
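The no-CDN check, for example, can be sketched as a small script the workflow runs before deploying — the domain list and file glob here are assumptions, and the real check may be stricter:

```python
import pathlib
import re

# Fail the build if any HTML file pulls assets from a CDN instead of
# self-hosting them (the site ships Chart.js as a local static file).
CDN_PATTERN = re.compile(r"(cdn\.jsdelivr\.net|cdnjs\.cloudflare\.com|unpkg\.com)")

def find_cdn_references(root: str) -> list[str]:
    offenders = []
    for path in pathlib.Path(root).rglob("*.html"):
        if CDN_PATTERN.search(path.read_text()):
            offenders.append(str(path))
    return sorted(offenders)
```

A check like this exists because the failure mode is silent: a CDN `<script>` tag works fine in development and only shows up later as a render-blocking request or third-party cookie.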

Architecture decisions documented as ADRs. Ten decisions recorded: SQLite as the sole datastore, inline CSS with no build step, Flask with gunicorn, GPU name normalization strategy, self-hosted Chart.js, CI/CD with SSH deploy, server-side OG images, no JavaScript framework, the Agent SDK analysis pipeline, and unified deploy for dashboard + scraper.

Ruff + mypy + pytest in CI. Linting and type checking run on every push. No noqa comments — if ruff complains, fix the code.

The whole thing runs on a single DigitalOcean droplet. nginx reverse-proxies to Flask via gunicorn with 2 workers. The scraper and analysis agent run on cron alongside it. For the expected traffic and the 14-day experiment window, this is the right amount of infrastructure.

What's Missing

The experiment is a week old. That's not enough data to answer the viability question.

Volatility signals need 2+ weeks to mature. Coefficient of variation requires enough snapshots to distinguish real price movement from noise. Time-of-day patterns need multiple full day cycles to be statistically meaningful. The analysis pipeline knows this — it switches from "baselines" to "movers" sections automatically once the data spans two or more days.

Two platforms isn't enough. Vast.ai and TensorDock are a start, but RunPod, Lambda, and CoreWeave all have spot or on-demand GPU pricing. More platforms means more cross-platform spreads, which means a richer signal for whether arbitrage opportunities exist. Adding a platform requires changes in four places (scraper targets, dashboard mappings, analysis mappings, navigation) — the complexity is manageable but it's a real cost.

The viability question isn't answered yet. That's the point. The experiment is designed to produce a definitive answer: either the spreads are wide and persistent enough to support an order book, or they're not. Both outcomes are useful. A "no" is still a concrete finding backed by data — it means GPU compute pricing is more efficient than the hypothesis assumed, which is interesting in its own right.

I'll have a real answer in 2–4 weeks. The data is accumulating. The analysis pipeline is running every morning. The thresholds are set. When there's enough data to cross-check the viability criteria, that'll be its own post.