Can You Build a Futures Market for GPU Compute?
Cloud GPU pricing varies wildly across providers, time of day, and GPU model. gpu-spot.com tracks that variance every 30 minutes to answer a specific question: is the spread wide enough and predictable enough to support a compute futures order book?
The Question
Compute is becoming a commodity. An H100 SXM on Vast.ai is fungible with an H100 SXM on TensorDock — same chip, same VRAM, same CUDA cores. But there's no futures market for it. No one is trading GPU-hours the way they trade oil barrels or wheat bushels.
The interesting thing is that the preconditions might already exist. Different cloud GPU providers charge different prices for the same hardware, and those prices move. An H100 SXM might be $2.40/hr on Vast.ai right now and $1.99/hr on TensorDock. An hour later those numbers shift. If those spreads are wide enough, persistent enough, and somewhat predictable, you could build an order book around them — a place where someone who needs compute next Tuesday can lock in a price today, and someone with idle GPUs can sell that commitment.
That's the hypothesis. But a hypothesis without data is just a blog post. gpu-spot.com is the data collection experiment.
What We're Tracking
The scraper runs every 30 minutes on a cron job, hitting the Vast.ai API and TensorDock's v2 locations endpoint. It collects spot prices for 8 GPU models across both platforms and stores every listing in a SQLite database — raw JSON blobs plus normalized fields for price, location, GPU count, vCPUs, and RAM.
| GPU | Vast.ai | TensorDock |
|---|---|---|
| H100 SXM | yes | yes |
| H100 NVL | yes | — |
| H200 | yes | — |
| H200 NVL | yes | — |
| RTX 4090 | yes | yes |
| RTX 5090 | yes | yes |
| B200 | yes | — |
| A100 SXM | yes | yes |
Vast.ai has the wider GPU selection. TensorDock has four of the eight models but tends to have tighter price ranges — fewer hosts, less variance. The cross-platform overlap (H100 SXM, RTX 4090, RTX 5090, A100 SXM) is where the arbitrage signal lives.
SQLite is the sole datastore. No Postgres, no managed database. For a 14-day experiment, the ops overhead of a database server is pure waste. The scraper writes, the dashboard reads (in read-only mode), and the file fits in a single SCP command when you need to move it around.
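To make the write/read split concrete, here's a minimal sketch, assuming a hypothetical `listings` table — the real schema's table and column names may differ:

```python
import json
import sqlite3

# Hypothetical schema sketch; table and column names are assumptions,
# not the project's actual DDL.
SCHEMA = """
CREATE TABLE IF NOT EXISTS listings (
    scraped_at  TEXT NOT NULL,     -- ISO-8601 snapshot timestamp
    platform    TEXT NOT NULL,     -- 'vast' or 'tensordock'
    gpu         TEXT NOT NULL,     -- normalized name, e.g. 'H100 SXM'
    price_hr    REAL NOT NULL,     -- USD per GPU-hour
    location    TEXT,
    gpu_count   INTEGER,
    vcpus       INTEGER,
    ram_gb      REAL,
    raw         TEXT               -- raw JSON blob from the provider API
);
"""

def write_listing(db_path: str, listing: dict) -> None:
    """Scraper side: append one normalized listing plus its raw blob."""
    with sqlite3.connect(db_path) as con:
        con.executescript(SCHEMA)
        con.execute(
            "INSERT INTO listings VALUES (?,?,?,?,?,?,?,?,?)",
            (listing["scraped_at"], listing["platform"], listing["gpu"],
             listing["price_hr"], listing.get("location"),
             listing.get("gpu_count"), listing.get("vcpus"),
             listing.get("ram_gb"), json.dumps(listing)),
        )

def read_floor(db_path: str, gpu: str):
    """Dashboard side: read-only connection, current floor price."""
    con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        row = con.execute(
            "SELECT MIN(price_hr) FROM listings WHERE gpu = ?", (gpu,)
        ).fetchone()
        return row[0]
    finally:
        con.close()
```

The `mode=ro` URI is what keeps the dashboard from ever contending with the scraper for the write lock.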
The Dashboard
gpu-spot.com is a Flask app serving two main pages and a JSON API, backed by that same SQLite file. The overview page shows cross-platform spreads, intra-market price ranges, and a 7-day timeseries for each GPU. Per-GPU detail pages break down price history, distribution buckets, and time-of-day patterns.
The design is dark mode only: monospace typography throughout, no gradients or shadows.
All CSS is inline in the HTML files — no build step, no bundler. Chart.js is the only
external dependency, self-hosted as a static file to avoid third-party cookies and
render-blocking CDN requests. The whole frontend is vanilla JS with fetch()
and DOM manipulation.
OG images are rendered server-side with Pillow. When you share a gpu-spot.com link in Slack or iMessage, the preview card shows the current best price signal for that GPU — live data, not a stale screenshot. They're cached for 30 minutes and regenerated on the next request.
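The rendering itself can be sketched in a few lines of Pillow — the sizes, colors, and function shape below are my assumptions, not the site's code:

```python
from io import BytesIO

from PIL import Image, ImageDraw

def render_og_card(gpu: str, floor_price: float) -> bytes:
    """Render a 1200x630 OG preview card as PNG bytes."""
    img = Image.new("RGB", (1200, 630), "#0d0d0d")  # dark background
    draw = ImageDraw.Draw(img)
    # The default bitmap font keeps this sketch dependency-free; a real
    # card would load a monospace TTF to match the site's typography.
    draw.text((60, 80), gpu, fill="#e6e6e6")
    draw.text((60, 140), f"floor ${floor_price:.2f}/hr", fill="#59d68c")
    buf = BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()
```

The Flask route would serve these bytes with `image/png` and a 30-minute cache header, regenerating on the first request after expiry.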
Viability Thresholds
The experiment has explicit pass/fail criteria. These aren't aspirational — they're the minimum conditions for a compute futures market to function. If the data doesn't hit these numbers after a few weeks of collection, the answer is "not yet" and the experiment is done.
| Signal | Green light | Red light |
|---|---|---|
| Cross-platform spread | > $0.30/hr | < $0.15/hr |
| Coefficient of variation | > 0.15 | < 0.05 |
| Time-of-day swing | > 10% | — |
| Supply pool | > 20 GPUs/model | < 5 GPUs |
Cross-platform spread is the most important signal. If the same GPU costs $0.40/hr more on one platform than another, there's room for a market maker. Coefficient of variation measures whether prices actually move — a CV above 0.15 means the price fluctuates enough to make a futures contract meaningful. Time-of-day swings above 10% suggest predictable patterns that traders could exploit. And the supply pool needs to be deep enough that individual host outages don't dominate the signal.
These thresholds are encoded directly in the daily analysis pipeline, so every morning the system renders machine-readable verdicts: is this GPU's spread actionable? Is its volatility tradeable? The signals array stores these as structured data, not prose.
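Encoded as code, using the thresholds from the table, the verdict logic might look roughly like this — the function names and dict shape are my sketch, not the pipeline's actual implementation:

```python
import statistics

# Green/red thresholds from the viability table above.
SPREAD_GREEN, SPREAD_RED = 0.30, 0.15
CV_GREEN = 0.15

def spread_signal(gpu: str, floor_a: float, floor_b: float) -> dict:
    """Cross-platform spread verdict for one GPU."""
    spread = abs(floor_a - floor_b)
    if spread > SPREAD_GREEN:
        reason = "spread clears the green-light threshold"
    elif spread < SPREAD_RED:
        reason = "spread is below the red-light threshold"
    else:
        reason = "spread is in the gray zone"
    return {"gpu": gpu, "type": "cross_platform_spread",
            "value": round(spread, 2),
            "actionable": spread > SPREAD_GREEN, "reason": reason}

def volatility_signal(gpu: str, floors: list[float]) -> dict:
    """Coefficient of variation: population stdev over mean."""
    cv = statistics.pstdev(floors) / statistics.mean(floors)
    return {"gpu": gpu, "type": "volatility", "value": round(cv, 3),
            "actionable": cv > CV_GREEN,
            "reason": "price moves enough to trade" if cv > CV_GREEN
                      else "price is too flat to trade"}
```

A $2.40 vs $1.99 H100 SXM floor, for example, yields a $0.41 spread — comfortably past the $0.30 green light.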
The Daily Analysis Agent
Raw data is necessary but not sufficient. Looking at a table of spot prices and mentally computing cross-platform spreads, 24-hour deltas, and volatility coefficients is exactly the kind of work an LLM should do. So the system runs a daily analysis pipeline via the Claude Agent SDK at 07:00 UTC every morning.
The pipeline has three phases:
Phase 1 — Context scout
Sonnet searches the web for recent GPU market news — supply chain shifts, new GPU launches, pricing announcements. Every claim must have a source URL. No training-data hallucinations about market conditions. If the scout finds nothing relevant, it says so and the pipeline falls back to yesterday's market context.
Phase 2 — Data orchestrator
Sonnet calls 7 MCP tools to query the database directly:
- `get_current_prices` — floor, average, ceiling, spread, host count per GPU per platform
- `get_price_changes` — 24-hour and 7-day deltas versus historical averages
- `get_cross_deltas` — TensorDock vs Vast spreads for overlapping GPUs
- `get_supply_metrics` — host counts, available GPUs, 24-hour supply changes
- `get_volatility` — standard deviation, coefficient of variation, tradeable threshold
- `get_tod_patterns` — hourly price averages for intraday swing detection
- `get_price_floor_series` — raw floor prices per snapshot for trend detection
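As a sense of what these tools compute under the hood, a `get_tod_patterns`-style query might group per-snapshot floor prices by UTC hour. This is a sketch assuming a hypothetical `listings` table, not the project's actual tool code:

```python
import sqlite3

def tod_pattern(db_path: str, gpu: str) -> dict:
    """Hourly average of floor prices, keyed by UTC hour ('00'..'23')."""
    con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = con.execute(
            """
            SELECT hour, AVG(floor) FROM (
                SELECT strftime('%H', scraped_at) AS hour,
                       MIN(price_hr) AS floor
                FROM listings
                WHERE gpu = ?
                GROUP BY scraped_at        -- floor per snapshot first
            )
            GROUP BY hour ORDER BY hour    -- then average per hour
            """,
            (gpu,),
        ).fetchall()
    finally:
        con.close()
    return dict(rows)

def intraday_swing(hourly: dict) -> float:
    """Peak-to-trough swing as a fraction of the cheapest hourly average."""
    lo, hi = min(hourly.values()), max(hourly.values())
    return (hi - lo) / lo
```

A swing above 0.10 from `intraday_swing` would correspond to the >10% time-of-day threshold in the viability table.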
The output is structured via Pydantic models — not free-form prose. The orchestrator
produces a DataAnalysisOutput with a lead signal (the single most interesting
finding), a title and summary derived from that signal, four sections (market snapshot,
movers or baselines, spread & volatility, market context), and an array of structured
signal verdicts.
```python
from pydantic import BaseModel

class Signal(BaseModel):
    gpu: str
    type: str
    value: float
    actionable: bool
    reason: str

class DataAnalysisOutput(BaseModel):
    lead_signal: str
    title: str
    summary: str
    sections: list[Section]  # Section is another BaseModel, defined elsewhere
    signals: list[Signal]
```
The sections are conditional. If data spans two or more days, Phase 2 writes a "movers" section highlighting GPUs with significant price changes. If the database is too young for meaningful deltas, it writes a "baselines" section instead — establishing reference points for future comparison.
Phase 3 — Context synthesis
Sonnet writes the "market context" section using only Phase 1's cited sources. This is the grounding constraint: the model can't inject macro commentary from its training data. Every claim in the market context traces to a URL the scout found that morning.
The finished analysis is written to the `daily_analysis` table and served at gpu-spot.com/analysis.
Agent SDK constraints
Building on the Claude Agent SDK surfaced a few sharp edges worth documenting, since they cost me time and aren't obvious from the docs:
- `AgentDefinition` sub-agents break final output. The SDK's `receive_response()` does not surface the orchestrator's final `TextBlock` when sub-agents are spawned. Direct MCP tool calls work correctly and are cheaper.
- `permission_mode="bypassPermissions"` fails as root. The workaround is a `can_use_tool` callback that auto-approves tool calls.
- `query()` with `can_use_tool` requires streaming. All phases use `structured_output` via `query()` instead of manual `ClaudeSDKClient` sessions.
These are recorded as ADRs in the repo. The pipeline works, but the path from docs to working code had a few detours.
What It Ships With
The project went from initial commit to production in 8 days, across 21 merged PRs. That's not because it's small — it's because the scope was ruthlessly constrained.
CI/CD via GitHub Actions. Merging to main triggers
pre-deploy checks (no CDN references, WCAG contrast validation, <main>
landmark, Python syntax) followed by SCP deploy and smoke tests that hit live endpoints.
Dashboard changes and scraper changes deploy independently to different paths on the
server.
Architecture decisions documented as ADRs. Ten decisions recorded: SQLite as the sole datastore, inline CSS with no build step, Flask with gunicorn, GPU name normalization strategy, self-hosted Chart.js, CI/CD with SSH deploy, server-side OG images, no JavaScript framework, the Agent SDK analysis pipeline, and unified deploy for dashboard + scraper.
Ruff + mypy + pytest in CI. Linting and type checking run on every push.
No noqa comments — if ruff complains, fix the code.
The whole thing runs on a single DigitalOcean droplet. nginx reverse-proxies to Flask via gunicorn with 2 workers. The scraper and analysis agent run on cron alongside it. For the expected traffic and the 14-day experiment window, this is the right amount of infrastructure.
What's Missing
The experiment is a week old. That's not enough data to answer the viability question.
Volatility signals need 2+ weeks to mature. Coefficient of variation requires enough snapshots to distinguish real price movement from noise. Time-of-day patterns need multiple full day cycles to be statistically meaningful. The analysis pipeline knows this — it switches from "baselines" to "movers" sections automatically once the data spans two or more days.
Two platforms isn't enough. Vast.ai and TensorDock are a start, but RunPod, Lambda, and CoreWeave all have spot or on-demand GPU pricing. More platforms means more cross-platform spreads, which means a richer signal for whether arbitrage opportunities exist. Adding a platform requires changes in four places (scraper targets, dashboard mappings, analysis mappings, navigation) — the complexity is manageable but it's a real cost.
The viability question isn't answered yet. That's the point. The experiment is designed to produce a definitive answer: either the spreads are wide and persistent enough to support an order book, or they're not. Both outcomes are useful. A "no" is still a concrete finding backed by data — it means GPU compute pricing is more efficient than the hypothesis assumed, which is interesting in its own right.
I'll have a real answer in 2–4 weeks. The data is accumulating. The analysis pipeline is running every morning. The thresholds are set. When there's enough data to cross-check the viability criteria, that'll be its own post.