Harness Engineering with NixOS, Terraform, and OrbStack
OpenAI calls it harness engineering — designing environments where agents do reliable work. Here's what that looks like for a solo developer using NixOS, Terraform, and OrbStack with both Claude Code and Codex.
When Agents Write the Code, What Do You Write?
OpenAI recently published "Harness Engineering: leveraging Codex in an agent-first world." The headline number: a million lines of code, zero written by hand, shipped as a beta product in five months. The team's first commit — repository structure, CI config, formatting rules, package manager setup — was generated by Codex CLI using GPT-5.
The interesting part isn't the code generation. It's the job description that emerges when you stop writing code. OpenAI's engineers spent their time designing environments, specifying intent, and building feedback loops. They call this harness engineering: the work of making agents productive. The scarce resource is human attention, not code output.
I'm a solo developer building 10kdiff, a diff tool for SEC 10-K filings. I use Claude Code and Codex side by side. My codebase is a few thousand lines, not a million. But I've been converging on the same idea from a completely different direction, and the stack I'm building is worth describing because it's small enough to see the whole thing at once.
What a Harness Looks Like at Small Scale
OpenAI has a dedicated team of harness engineers and a million-line codebase with enforced
architectural layers. I have a monorepo and a CLAUDE.md file. But the
same four principles show up at both scales:
- Declarative environments. The system is defined in config, not in someone's head. An agent can read it and reason about it.
- Repo-local knowledge. Everything the agent needs to know lives in the repo. ADRs, module configs, a doc index. If it's not in git, it doesn't exist.
- Mechanical enforcement. nix flake check validates the entire system — packages, modules, host configs. Ruff and pre-commit hooks catch code issues. The agent gets fast, unambiguous signals about what's broken.
- Agent legibility. Config files, directory structure, and docs are written for agents to read, not just humans. OpenAI calls this "from the agent's point of view, anything it can't access in-context while running effectively doesn't exist."
The harness doesn't require a dedicated team. It requires a discipline: every time I'm about to explain something to the agent in a chat message, I ask whether that information should live in the repo instead. If I'm repeating myself, it should be a file.
The Stack: NixOS + Terraform + OrbStack
Each piece of the stack exists because it makes the system more legible to an agent.
NixOS
NixOS is declarative, reproducible, and everything-in-the-repo. The production server
and the local dev VM use the same module system, defined in the same flake.nix.
Services are NixOS modules in nix/modules/ — tenk-api,
tenk-worker, tenk-migrate, temporal-dev,
tenk-web — composed into host configs in nix/hosts/.
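For readers unfamiliar with the module system, a service module in this style might look like the following sketch. The option names, package, and port are illustrative, not the actual 10kdiff config:

```nix
# nix/modules/tenk-api.nix — hypothetical sketch of a service module
{ config, lib, pkgs, ... }:
let
  cfg = config.services.tenk-api;
in {
  # The module declares its own interface up front.
  options.services.tenk-api = {
    enable = lib.mkEnableOption "the 10kdiff API service";
    port = lib.mkOption {
      type = lib.types.port;
      default = 8000;
      description = "Port the API listens on.";
    };
  };

  # And its runtime wiring: a systemd unit, fully specified in one place.
  config = lib.mkIf cfg.enable {
    systemd.services.tenk-api = {
      wantedBy = [ "multi-user.target" ];
      after = [ "network.target" ];
      serviceConfig = {
        ExecStart = "${pkgs.tenk-api}/bin/tenk-api --port ${toString cfg.port}";
        DynamicUser = true;
      };
    };
  };
}
```

Because the options and the systemd unit live in the same file, an agent reading the module sees the service's interface and its runtime behavior together, with no hidden state to trace.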
nix flake check is mechanical enforcement. It validates every package,
every module, every host configuration. When the agent modifies a Nix file, it runs
the check and gets an unambiguous pass/fail. No "it works on my machine" — if the
flake checks pass, the config is valid for every host that imports it.
The key property: an agent can read a NixOS module and know what it does. The module declares its options, its dependencies, its systemd services. There's no hidden state. Compare this to a set of shell scripts and Ansible playbooks — the agent would have to trace execution paths across multiple files and guess about runtime state. Nix is a language agents can reason about because it's purely functional. Same inputs, same outputs.
Terraform
Terraform is infrastructure-as-code for DigitalOcean droplets and Cloudflare DNS.
The agent doesn't SSH into servers — it writes .tf files and runs
terraform apply. Provisioning is a code change, not a manual operation.
This matters for the same reason NixOS matters: the agent can reason about infrastructure
from the repo. A droplet's size, region, and image are in a .tf file, not
in a DigitalOcean dashboard. DNS records are in Terraform state, not in someone's
Cloudflare account. The agent sees the full system from git clone.
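A minimal HCL sketch of that pairing, with names, sizes, and record values all illustrative (and the Cloudflare attribute as in the v3/v4 provider):

```hcl
# Hypothetical sketch — not the actual 10kdiff Terraform config.
resource "digitalocean_droplet" "api" {
  name   = "tenk-api"
  region = "nyc3"
  size   = "s-1vcpu-2gb"
  image  = "ubuntu-24-04-x64"
}

# DNS lives next to the droplet it points at, in the same repo.
resource "cloudflare_record" "api" {
  zone_id = var.cloudflare_zone_id
  name    = "api"
  type    = "A"
  value   = digitalocean_droplet.api.ipv4_address
}
```

The point is not the specific resources but the dependency edge: the DNS record references the droplet's IP directly, so the agent can read the full topology from one plan.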
OrbStack
OrbStack runs NixOS VMs on macOS with near-native
performance. I run a full-stack local dev environment — API, worker, Temporal, PostgreSQL —
inside an OrbStack VM. ./scripts/dev-nix.sh rebuilds the VM from the same
Nix modules that define production.
The agent gets a real environment to test against. Not a mocked service, not a Docker compose file that approximates production — the actual NixOS config running locally. When the agent makes a change and runs the dev script, it's testing against the same service definitions that will run in production.
The Worktree-First SDLC
The development flow is built around git worktrees, and every step is a skill the agent already knows:
/worktree fix/cache-unification # create branch + worktree
# ... agent works in isolated worktree ...
/test # version bump → PR → preview deploy → smoke tests
/deploy # merge to main → verify production → cleanup
Each worktree is a unit of work. The agent creates a branch, the local NixOS VM picks up the
changes, code ships, a PR opens, the frontend gets a Cloudflare Pages preview, tests run.
The agent never needs to remember how to push a branch or open a PR — those are encoded
as skills in .claude/commands/, and the agent invokes them by name.
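The worktree-per-task step itself is plain git. A self-contained sketch of the pattern, run here in a scratch repo (this is not the actual /worktree skill, just the shape of it):

```bash
#!/usr/bin/env bash
# Demonstration of the worktree-per-task pattern in a throwaway repo.
set -euo pipefail

repo="$(mktemp -d)"
cd "$repo"
git init -q
git -c user.email=a@b -c user.name=a commit -q --allow-empty -m "init"

branch="fix/cache-unification"
dir=".worktrees/${branch//\//-}"   # one branch, one isolated working directory

# Create the branch and its worktree in a single step.
git worktree add -q -b "$branch" "$dir"
git worktree list
```

Each worktree gets its own working directory, so an agent can run tests and builds there without disturbing the main checkout or any sibling task.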
This preview pipeline is now live. Pushing to a PR provisions an ephemeral DigitalOcean droplet, creates
api-pr-N.10kdiff.com in Cloudflare via Terraform, bootstraps NixOS if needed,
and deploys the API/worker/migrate stack to that host. The frontend still ships to a PR-scoped
Cloudflare Pages branch and points at the PR API URL.
Closing the PR tears down both sides: the API droplet + DNS record and the Pages preview deployments. The harness now manages resource lifecycle in both directions, not just deploy.
This is what OpenAI means by harness engineering. The developer's job isn't writing the code for cache unification or Nix module cleanup. The developer's job is building the system that lets the agent write that code, test it in a real environment, and ship it through a well-defined pipeline.
What I Shipped This Week
Concrete examples from the current development session, all done by agents working in worktrees:
Unified cache directory. The API and worker services were using
separate DiskCache paths. Now they share /var/cache/tenk via a shared
system group, defined in the NixOS module. This is infrastructure legibility — the
cache config is in one place, the permissions model is explicit, and an agent modifying
the cache behavior can see the full picture in a single module file.
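In NixOS terms, that shared-cache wiring fits in a few lines. A hypothetical sketch, with group and service names illustrative rather than the actual module:

```nix
# Hypothetical sketch of shared-cache wiring between two services.
{
  # One group owns the cache; both services join it.
  users.groups.tenk-cache = { };

  # Setgid directory so files created by either service stay group-accessible.
  systemd.tmpfiles.rules = [
    "d /var/cache/tenk 2770 root tenk-cache -"
  ];

  systemd.services.tenk-api.serviceConfig.SupplementaryGroups = [ "tenk-cache" ];
  systemd.services.tenk-worker.serviceConfig.SupplementaryGroups = [ "tenk-cache" ];
}
```

The permissions model is explicit in the config: an agent asked "who can write to the cache?" can answer from this fragment alone.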
Deleted legacy tenk.nix. An older config file was using
builtins.getFlake — an anti-pattern that bypasses the flake lock file
and makes builds non-reproducible. It was replaced by proper NixOS modules months ago
but never cleaned up. Removing it means there's one way to configure the system, not two.
Fewer paths through the codebase means fewer ways for an agent to get confused.
Deferred EDGAR fetch. Completed comparisons were re-fetching SEC filings from EDGAR on every page load. The fix checks the cache first and only hits EDGAR when the data isn't already stored. Straightforward, but the agent found it, diagnosed it, and fixed it in a worktree — exactly the kind of targeted fix that agents handle well when the environment gives them clear signals about what's wrong.
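The shape of that fix is the classic cache-first read. A hypothetical Python sketch; the real code uses DiskCache backed by /var/cache/tenk rather than an in-memory dict, and the names here are illustrative:

```python
# Hypothetical sketch of the cache-first EDGAR fetch.
_cache: dict[str, bytes] = {}  # stand-in for the shared DiskCache


def fetch_filing(accession: str, fetch_from_edgar) -> bytes:
    """Return a filing, hitting EDGAR only on a cache miss."""
    if accession not in _cache:
        # Miss: one network fetch, then store for every later page load.
        _cache[accession] = fetch_from_edgar(accession)
    # Hit: served from cache, no request to EDGAR.
    return _cache[accession]
```

Before the fix, every page load behaved like the miss branch; after it, only the first load does.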
PR-scoped API previews in CI. The PR pipeline now provisions a per-PR
droplet + DNS record, deploys Nix closures, writes a preview env file, runs
nixos-rebuild, and health-checks the preview API before deploying the frontend.
This moved preview testing from "frontend-only" to full-stack.
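The health-check gate reduces to a bounded retry around a probe. A sketch with the probe parameterized; in CI the probe would be a curl against the preview API URL:

```bash
#!/usr/bin/env bash
set -euo pipefail

# wait_healthy ATTEMPTS CMD...: retry a probe command a bounded number of times.
# Hypothetical sketch of the CI health gate, not the actual workflow step.
wait_healthy() {
  local attempts=$1; shift
  local i
  for i in $(seq 1 "$attempts"); do
    if "$@" > /dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 1
  done
  echo "probe never succeeded" >&2
  return 1
}

# Example probe. In CI this would be something like:
#   wait_healthy 30 curl -fsS "https://api-pr-N.10kdiff.com/health"
wait_healthy 3 true
```

An unambiguous exit code is the whole point: the frontend deploy step runs only if the gate returns 0.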
Bootstrap and rebuild hardening. Fresh preview droplets can start as Ubuntu,
so the workflow now auto-bootstraps NixOS with nixos-infect, adds swap on tiny
preview instances to avoid OOM kills, and retries nixos-rebuild once when first-switch
systemd units flap. This turned flaky first deploys into a repeatable path.
Token diagnostics and scope clarity. Cloudflare failures now fail loudly with explicit API diagnostics, and the required token shape is clear: DNS edit/read for zone automation plus Pages edit for frontend preview deploys.
Agent Legibility Is the Goal
OpenAI's insight that resonates most: "from the agent's point of view, anything it can't access in-context while running effectively doesn't exist." This is the design principle behind every infrastructure choice I've made.
NixOS configs are in the repo. Terraform state is in the repo. CLAUDE.md
is the map — it describes the monorepo structure, the dev workflow, the skill system,
the escalation rules. ADRs in .claude/docs/ record architectural decisions.
A doc index tells the agent where to find reference material before making assumptions.
The agent can reason about the full system from git alone. It doesn't need tribal knowledge, Slack history, or access to a dashboard. If the agent can't find something in the repo, that's a failure of the harness, not the agent.
Martin Fowler made a related observation about the OpenAI paper: the real work of harness
engineering is "designing environments, feedback loops, and control systems" — the rigor
extends far beyond documentation. That matches my experience. The CLAUDE.md
is the most visible piece, but the actual harness is the NixOS module system, the worktree
workflow, the pre-commit hooks, the nix flake check validation. Documentation
tells the agent what to do. Mechanical enforcement tells it when it's wrong.
What's Still Missing
I'm being honest about the gaps because the whole point of harness engineering is knowing what's built and what isn't.
Preview image strategy is still transitional. The workflow can bootstrap Ubuntu droplets into NixOS automatically, but the cleaner path is a true NixOS image ID as the default preview base. The fallback works; it just adds boot time and complexity.
Secrets are still manually managed. Preview env files are derived from production and patched for preview DB/CORS, but secret source-of-truth is still manual repo secret management. A SOPS/agenix path would make bootstrap safer and more declarative.
Credential ergonomics need stronger guardrails. Cloudflare token scope mismatches (DNS vs Pages vs account access) are now diagnosable, but policy-level validation before deploy would prevent rerun loops and reduce operator overhead.
Cost and lifecycle policy could be stricter. PR close tears previews down, but TTL-based janitor cleanup for abandoned branches would make the system more resilient.
These are all solvable. The point of listing them is that harness engineering is iterative.
Each gap closed makes the agent more autonomous. The unified cache and the tenk.nix
cleanup are small steps — they make the existing harness more legible, which makes the
next step easier to build.
Boring Tech Is Agent-Legible Tech
OpenAI reached the same conclusion from the opposite direction. They enforce architectural boundaries through "deterministic linters and structural tests." They constrain the solution space: "specific architectural patterns, enforced boundaries, standardized structures." They fight entropy with periodic agents that detect inconsistencies.
All of that is a fancy way of saying: boring tech. NixOS is boring — purely functional, declarative, reproducible. Terraform is boring — state file, plan, apply. Pre-commit hooks are boring — run ruff, run the formatter, fail the commit if anything's wrong. Git worktrees are boring — just branches with isolated working directories.
Boring tech is legible tech. An agent can reason about a flake.nix because
it's a pure function. An agent can reason about a Terraform plan because it's a diff.
An agent can reason about a pre-commit hook because it's a pass/fail signal with an
error message.
The clever stuff — the novel abstractions, the custom DSLs, the "elegant" solutions —
is exactly what makes a system illegible to agents. If I can't explain it in a
CLAUDE.md paragraph, an agent can't work with it reliably. Harness
engineering is the discipline of keeping things simple enough that agents can be trusted
with them.
That's the whole idea. Build the harness. Let the agent write the code.