openberth-infra
Branch feat/v0-gold · Tag v0-gold · 2026-04-21
Platform delivery brief · v0 · Gold

The controller is ready.
The host agent is waiting on a Linux box.

This is the brief for the upstream platform team that calls our controller. It says, plainly: which RPCs work end-to-end today, which ones stub and reject with a documented reason, and what the agent will do once it runs on a real Linux host. Nothing in this document is aspirational — every "delivered" line is backed by a green test or a passing smoke script on `feat/v0-gold`.

86 commits on v0-gold
75+ unit + integration tests
19 RPCs wired live
10 schema migrations
2 smoke scripts passing
01 · Orientation

What the platform team can call, today.

The controller binary runs three gRPC listeners: :50051 for frontpage + admin (TLS + bearer), :50052 for host agents (strict mTLS, split per the Task 68 hardening), and :50053 for one-shot enrollment. Every mutating RPC returns an async Operation handle; callers poll GetOperation or GetVM.current_operation. Idempotency is enforced in Postgres via UNIQUE (vm_id, request_id). Errors carry google.rpc.ErrorInfo with a closed Reason set.

VMService · :50051 · TLS + bearer
RegisterHost
Admin · registers a host, returns bootstrap enrollment token.
admin scope
Live
CreateVM
Provisions a VM; returns Operation; worker resumable from step_state; rolls to deleted on half-provisioned failure.
async op
Live
GetVM
Returns VM by id or workspace_id; exposes current_operation_id.
read
Live
ListVMs
Opaque forward cursor; supports updated_since= for 5-minute reconciliation backstop.
read · paginated
Live
UpdateVM
In-place mutable fields (including smtp_egress_allowed, persisted + audited — agent nft half is Task 73, deferred).
sync + audit
Live
SuspendVM
Suspends; edge Caddy serves suspended page from vms.state without reaching the VM.
async op
Live
RestoreVM
Restores from suspended. Restore from archived is explicitly rejected with restore_from_archived_not_wired — agent side is Task 54/80.
async op
Live
ArchiveVM
Snapshot-then-verify-then-wipe. Resumable worker — retry after crash will not duplicate snapshot rows.
async op
Live
DeleteVM
Not marked terminal until physical purge is verified.
async op
Live
GetOperation
Poll handle for any async op.
read
Live
ListOperations
Cursor-paginated; filter by VM, verb, status.
read · paginated
Live
ListSnapshots
Cursor-paginated; honors retention column.
read · paginated
Live
GetHost / ListHosts
Admin-gated; host capacity + heartbeat.
admin scope
Live
ScrubAuditEventData
GDPR lever. Zeroes audit_log.event_data for a VM — the only place upstream-supplied PII may land.
admin scope
Live
VMService · stubs that reject with documented Reason
ResizeVM
Externally one verb; in-place vs migrate path is internal. Rejects until Linux-host phase.
reason: not_implemented
Stub · rejects
MigrateVM
Distinct OperationVerb from RESIZE so operators can filter on verb.
reason: not_implemented
Stub · rejects
EvacuateHost / DrainHost / RetireHost
No auto-evacuation in v0; these operator-initiated verbs all reject for now.
reason: not_implemented
Stub · rejects
AddCustomDomain / RemoveCustomDomain
Default subdomain rows exist; custom domains + ACME DNS-01 are v1.
reason: not_implemented
Stub · rejects
EnrollmentService · :50053 · one-shot bootstrap
Enroll
Bootstrap-token → signs CSR → returns mTLS cert. ON CONFLICT (fingerprint) DO NOTHING tolerates replays.
bootstrap tokens
Live
AgentService · :50052 · strict mTLS (fingerprint + revocation)
Commands (bidi)
Controller→agent commands with command_id = operations.id; agent ACKs, runs async, returns CommandResult (dispatcher always returns one — no indefinite blocks).
bidi stream
Live
Events (bidi)
Agent→controller progress + inventory. Event buffer preserves seq on replay (fixed post-review).
bidi stream
Live
RotateAgentCert
Agent rotates at T-60d; daily ticker.
mTLS
Live
02 · Under the hood

The plumbing is done — and it's the part you'd build last.

The controller-side machinery that makes this boring at scale is in place: atomic operation pointers, resumable workers, idempotent enqueues, a PII-denylist invariant test, metric gauges, and a closed error-reason set that's already covered by a parser-based invariant test.

Schema · postgres

10 migrations, one source of truth, zero PII on vms.

regions · hosts · vms · vm_domains · operations · snapshots · bandwidth_ledger · audit_log · bootstrap_tokens + agent_certs · host-scope audit. Trigger-driven updated_at. Deferrable FK vms.current_operation_id → operations.id. A PII-denylist column invariant test asserts no PII column lands on vms / operations / snapshots / hosts / vm_domains / bandwidth_ledger.

trigger-driven updated_at · deferrable FK · PII-denylist test
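To make the PII-denylist invariant concrete, here is a minimal sketch of the check. It assumes the schema is available as a table-to-columns map (a real test would presumably read it from information_schema.columns). The denylist fragments and the function name `piiViolations` are hypothetical; the actual list lives in the repo, not in this brief.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// piiDenylist is a hypothetical set of column-name fragments.
var piiDenylist = []string{"email", "full_name", "phone", "ip_address"}

// guardedTables are the tables the brief says must never carry PII.
var guardedTables = []string{"vms", "operations", "snapshots", "hosts", "vm_domains", "bandwidth_ledger"}

// piiViolations returns "table.column" for every guarded column whose
// lower-cased name contains a denylisted fragment.
func piiViolations(schema map[string][]string) []string {
	var out []string
	for _, table := range guardedTables {
		for _, col := range schema[table] {
			lower := strings.ToLower(col)
			for _, frag := range piiDenylist {
				if strings.Contains(lower, frag) {
					out = append(out, table+"."+col)
					break
				}
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	schema := map[string][]string{
		"vms":       {"id", "workspace_id", "state", "current_operation_id"},
		"audit_log": {"id", "event_data"}, // not guarded: the one place PII may land
	}
	fmt.Println(len(piiViolations(schema)))
}
```

A test built this way fails the build as soon as a migration adds a matching column, so the rule does not depend on review.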
River workers · async ops

Resumable, idempotent, atomic pointer.

Every lifecycle worker — Create, Suspend, Restore, Archive, Delete — short-circuits on already-terminal status, loads step_state at entry, and stamps each step before advancing. Mid-flight crash → River retry resumes from the last committed step. Archive no longer duplicates snapshot rows on retry. StartOperation is unique by (vm_id, request_id); vms.current_operation_id is flipped in the same tx; MarkOperationTerminal clears it in one tx.

resumable · idempotent · no leaked capacity
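The resume behaviour can be sketched without River or Postgres. The example below is an in-memory illustration, not the worker code: `OpRecord`, `RunArchive`, and the status strings are hypothetical names, and the rollback-to-suspended path is left out. It shows the three moves the workers make: short-circuit on terminal status, skip steps already stamped in step_state, and stamp a step only after it succeeds.

```go
package main

import (
	"errors"
	"fmt"
)

// archiveSteps follows the order the brief gives for Archive.
var archiveSteps = []string{"snapshot", "verify", "wipe"}

var errSimulated = errors.New("simulated crash")

// OpRecord is a hypothetical stand-in for an operations row.
type OpRecord struct {
	Status    string          // "running", "succeeded", "failed" (assumed values)
	StepState map[string]bool // steps already committed
}

// RunArchive resumes from the last committed step. do performs one step;
// the step is stamped only after do returns nil.
func RunArchive(op *OpRecord, do func(step string) error) error {
	if op.Status == "succeeded" || op.Status == "failed" {
		return nil // already terminal: short-circuit
	}
	if op.StepState == nil {
		op.StepState = map[string]bool{}
	}
	for _, step := range archiveSteps {
		if op.StepState[step] {
			continue // committed by an earlier attempt
		}
		if err := do(step); err != nil {
			return fmt.Errorf("step %s: %w", step, err) // left for the queue to retry
		}
		op.StepState[step] = true // a real worker would persist this
	}
	op.Status = "succeeded"
	return nil
}

func main() {
	op := &OpRecord{Status: "running"}
	calls := map[string]int{}
	crashed := false
	do := func(step string) error {
		calls[step]++
		if step == "verify" && !crashed {
			crashed = true
			return errSimulated
		}
		return nil
	}
	fmt.Println(RunArchive(op, do)) // first attempt stops at verify
	fmt.Println(RunArchive(op, do)) // retry resumes after snapshot
	fmt.Println(calls["snapshot"], op.Status)
}
```

Because the snapshot step is stamped before verify runs, the retry never re-enters it, which is the property that keeps snapshot rows from duplicating.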
mTLS · agent plane

Enrollment split off; session auth by fingerprint + revocation.

Per Task 68 gold hardening, Enroll lives on :50053 (one-shot bootstrap token); :50052 carries Commands + Events bidi streams and RotateAgentCert under strict mTLS. Session authorisation does a fingerprint lookup with a revocation check (Task 71). The ComputeReconcile short-circuit on empty Inventory fixes the bug where every reconnect would have failed every running op.

split ports · fingerprint + revocation · empty-inventory safe
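A minimal sketch of the session check follows. It assumes the fingerprint is the hex SHA-256 of the certificate's DER bytes, which this brief does not state, and it uses an in-memory map where the controller would query agent_certs. `certRow` and `authorize` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

var (
	errUnknownCert = errors.New("unknown certificate fingerprint")
	errRevokedCert = errors.New("certificate revoked")
)

// certRow is a hypothetical stand-in for an agent_certs row.
type certRow struct {
	HostID  string
	Revoked bool
}

// fingerprint returns the hex SHA-256 of the DER bytes
// (x509.Certificate.Raw for a parsed peer certificate).
func fingerprint(der []byte) string {
	sum := sha256.Sum256(der)
	return hex.EncodeToString(sum[:])
}

// authorize maps a presented certificate to a host and rejects
// certificates that are unknown or revoked.
func authorize(store map[string]certRow, der []byte) (string, error) {
	row, ok := store[fingerprint(der)]
	if !ok {
		return "", errUnknownCert
	}
	if row.Revoked {
		return "", errRevokedCert
	}
	return row.HostID, nil
}

func main() {
	der := []byte("placeholder DER bytes")
	store := map[string]certRow{fingerprint(der): {HostID: "host-1"}}
	fmt.Println(authorize(store, der))
}
```

Checking revocation at session start means a revoked certificate is refused even though it still chains to the CA.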
Observability · ops + metrics

Prometheus + a hot-reloadable operator surface.

/metrics gauge updater refreshes vms_total{state,region,flavor} and host_capacity_used_ratio{host_id,resource} every 30s. /healthz. gRPC reflection gated behind OPENBERTH_REFLECTION=true. systemd unit ships in deploy/; ExecReload=kill -HUP hot-reloads bearer tokens without dropping sessions.

prom gauges · HUP reload · reflection gated
Events · upstream notify

Best-effort Cloudflare Queue, nil-safe when unconfigured.

Publisher has bounded retries; publishes vm.{created,suspended,restored,archived,deleted}. The contract: push is best-effort, and the platform reconciles every 5 minutes via ListVMs(updated_since=...) as the authoritative backstop. Tested across the retry matrix.

best-effort push · 5-min reconcile backstop · nil-safe
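One way to get "nil-safe, bounded retries, never fails the caller" is a pointer receiver that tolerates nil, sketched below. The `Publisher` fields, the `Send` hook, and the linear backoff are illustrative assumptions and not the shipped publisher.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTransient = errors.New("transient upstream error")

// Publisher pushes lifecycle events upstream. A nil *Publisher is valid
// and does nothing, which covers the unconfigured case.
type Publisher struct {
	Send        func(event string) error // hypothetical transport hook
	MaxAttempts int
	Backoff     time.Duration
}

// Publish tries up to MaxAttempts times and reports whether the event was
// delivered. It returns no error: a lost push must not fail an operation.
func (p *Publisher) Publish(event string) bool {
	if p == nil || p.Send == nil {
		return false
	}
	for attempt := 1; attempt <= p.MaxAttempts; attempt++ {
		if err := p.Send(event); err == nil {
			return true
		}
		if attempt < p.MaxAttempts {
			time.Sleep(p.Backoff * time.Duration(attempt)) // linear backoff, an arbitrary choice
		}
	}
	return false
}

func main() {
	var unconfigured *Publisher
	fmt.Println(unconfigured.Publish("vm.created"))

	failures := 2
	p := &Publisher{MaxAttempts: 3, Send: func(string) error {
		if failures > 0 {
			failures--
			return errTransient
		}
		return nil
	}}
	fmt.Println(p.Publish("vm.suspended"))
}
```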
Agent · darwin binary

SQLite dedup, buffered events, /22 IP allocator.

enroll + run subcommands. Local SQLite with a 7-day dedup cache (eviction at boot AND on daily ticker), event_buffer with correct seq round-trip on replay (regression test shipped), 10k cap with priority-drop, /22 IP allocator with 60s cooldown, dnsmasq per-VM dropins + metadata-hostname sinkhole base. Dispatcher populates dedup cache and always returns a CommandResult.

darwin verified · seq replay tested · linux libvirt pending
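The cooldown in the IP allocator exists so that a freed address is not handed to a new VM while stale state may still point at it. A minimal sketch of that idea follows, using net/netip and an example prefix. The `Allocator` type, the lowest-free-address policy, and the omission of gateway and broadcast reservations are this sketch's choices, not a description of the agent's allocator.

```go
package main

import (
	"errors"
	"fmt"
	"net/netip"
	"time"
)

const defaultCooldown = 60 * time.Second

var (
	errExhausted = errors.New("no free address")
	start        = time.Date(2026, time.April, 21, 0, 0, 0, 0, time.UTC) // fixed clock for the demo
)

// Allocator hands out addresses from one prefix and keeps released
// addresses out of circulation for a cooldown period.
type Allocator struct {
	prefix     netip.Prefix
	cooldown   time.Duration
	inUse      map[netip.Addr]bool
	releasedAt map[netip.Addr]time.Time
}

func NewAllocator(cidr string, cooldown time.Duration) (*Allocator, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return nil, err
	}
	return &Allocator{prefix: p.Masked(), cooldown: cooldown,
		inUse: map[netip.Addr]bool{}, releasedAt: map[netip.Addr]time.Time{}}, nil
}

// Allocate returns the lowest free address whose cooldown has expired at
// time now. Only the network address itself is skipped.
func (a *Allocator) Allocate(now time.Time) (netip.Addr, error) {
	for ip := a.prefix.Addr().Next(); a.prefix.Contains(ip); ip = ip.Next() {
		if a.inUse[ip] {
			continue
		}
		if t, ok := a.releasedAt[ip]; ok && now.Sub(t) < a.cooldown {
			continue // released too recently
		}
		delete(a.releasedAt, ip)
		a.inUse[ip] = true
		return ip, nil
	}
	return netip.Addr{}, errExhausted
}

// Release frees an address and starts its cooldown.
func (a *Allocator) Release(ip netip.Addr, now time.Time) {
	if a.inUse[ip] {
		delete(a.inUse, ip)
		a.releasedAt[ip] = now
	}
}

func main() {
	a, err := NewAllocator("10.20.0.0/22", defaultCooldown)
	if err != nil {
		panic(err)
	}
	first, _ := a.Allocate(start)
	a.Release(first, start)
	second, _ := a.Allocate(start.Add(10 * time.Second)) // first is still cooling down
	fmt.Println(first, second)
}
```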
03 · Waiting on hardware

What the Linux host will turn on.

These items require the bare-metal Linux agent host the operator hasn't provisioned yet. The controller side of each is complete. Each row below is the agent-side gap; libvirt, nftables, dnsmasq, Caddy, genisoimage, and an R2 bucket are what flip them to green.

Task 40
libvirt domain + cloud-init seed — the primitive that every lifecycle verb needs.
libvirt · genisoimage
Task 41
Caddy edge route client (agent-side) — the controller already persists edge intent.
caddy · admin api
Task 42
ProvisionVM handler wired to libvirt — the Create worker's terminal step.
libvirt
Task 43
Golden Debian 12 qcow2 prep script — base image for the fleet.
qemu · debian 12
Task 44
Edge state tracker — multi-VM Caddy config coherence across restarts.
caddy · sqlite
Task 45
R2 snapshot client — upload / download / checksum of archive artifacts.
r2 · s3-compat
Task 54
Agent RestoreFromSnapshot staging — controller explicitly returns restore_from_archived_not_wired until this ships.
r2 · libvirt
Task 60
Agent Inventory scan — agent sends a placeholder Inventory{} today; controller's reconcile explicitly tolerates it.
libvirt list
Task 73
Host-global nftables ruleset — SMTP egress flag is persisted + audited on the controller; agent nft set manager is the other half.
nftables
Task 80
Real RestoreFromSnapshot handler — the agent-side worker that reverses archive.
r2 · libvirt
Task 81
Periodic snapshot worker (controller scheduler) — snapshots table + retention column exist and are honored by code paths.
river scheduler
Task 83
OpenTelemetry tracing plumbing across gRPC + River.
otel sdk
Task 87
ACME DNS-01 via Cloudflare — wildcard certs for the agent Caddy.
acme · cf dns
Task 89
drill-archive-rollback.sh — controller rollback machinery exists (snapshot-then-verify-then-wipe enforced); the drill needs a Linux agent to actually break.
linux agent
04 · Out of scope for v0

Not in this release — and stubbed to reject, not silently succeed.

These are v1 work. The controller rejects them today with a documented reason so the platform client can switch on the reason code and surface the right message to its users.

ResizeVM · MigrateVM · EvacuateHost · DrainHost · RetireHost · AddCustomDomain · RemoveCustomDomain · Multi-region · Multi-provider · Per-VM IPv6 · HA controller / multi-replica · Bandwidth counter writers · scripts/demo.sh end-to-end
05 · Invariants preserved

The contract the platform can build against.

These are the cross-cutting shapes that do not change without a design-review conversation. Most have a corresponding invariant test on feat/v0-gold.

i · No PII on vms, ever. Only audit_log.event_data may hold upstream-supplied PII, scrubbable via ScrubAuditEventData.
ii · Archive is snapshot-then-verify-then-wipe. Mid-flight failure rolls back to suspended. Delete isn't terminal until physical purge is verified.
iii · No auto-evacuation, no auto-fallback to older snapshots, no auto-destroy of orphan VMs. Data-loss-adjacent actions are always operator-initiated.
iv · authorization metadata is never logged. Middleware + a unit test that asserts no log line carries the bearer prefix.
v · Admin-scoped RPCs check scope before any data access. Enforced by interceptor.
vi · Error reason is a closed, versioned set. Additive only; existing reasons never change meaning. Invariant test parses the entire api/ package.
vii · Pagination cursors are opaque, forward-only. No total count; no "go to page N". Covered by the cursor-walk test.
viii · Every mutating RPC is async. Returns an Operation; idempotent via UNIQUE (vm_id, request_id).
ix · vms.state holds only destinations: active / suspended / archived / deleted. Transient verbs live on operations.status.
x · Two IDs per VM. vms.id (our UUID, primary) and workspace_id (upstream's identifier, required + immutable at create).
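For invariant vii, "opaque" means the client passes the token back unchanged and never parses it. A minimal sketch of one possible encoding follows. The `cursor` fields and the base64-over-JSON format are assumptions; the real token layout is deliberately not part of the contract.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// cursor is the server-private position: the sort key of the last row
// returned. Clients only ever see the encoded token.
type cursor struct {
	UpdatedAt int64  `json:"u"` // timestamp of the last row
	ID        string `json:"i"` // tie-breaker for equal timestamps
}

var errBadCursor = errors.New("invalid page token")

func encodeCursor(c cursor) string {
	raw, _ := json.Marshal(c) // cannot fail for this struct
	return base64.RawURLEncoding.EncodeToString(raw)
}

func decodeCursor(token string) (cursor, error) {
	var c cursor
	if token == "" {
		return c, nil // empty token: start from the beginning
	}
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return c, errBadCursor
	}
	if err := json.Unmarshal(raw, &c); err != nil {
		return c, errBadCursor
	}
	return c, nil
}

func main() {
	token := encodeCursor(cursor{UpdatedAt: 1776729600, ID: "vm-42"})
	back, err := decodeCursor(token)
	fmt.Println(token != "", back.ID, err)
}
```

A cursor that carries only "the last row seen" has no notion of page numbers or totals, which is why forward-only falls out of the design.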
06 · Green light

Test + smoke status at v0-gold.

Snapshot of the tag. Re-runnable from the branch.

go test ./apps/... · PASS
go test -tags=integration ./apps/... · PASS
scripts/smoke-session.sh · PASS
scripts/smoke-mismatch.sh · PASS

smoke-session: RegisterHost → enroll → mTLS session with strict cert-fingerprint + revocation check → heartbeat recorded in hosts.last_heartbeat_at.
smoke-mismatch: region/provider mismatch correctly returns FailedPrecondition with region_provider_mismatch.

07 · For the platform team

Four things to expect when you integrate.

These are the shapes to code your client against. They will not shift in v0.x.

01 · async by default

Every mutation returns an Operation. Poll it.

Don't block on the RPC. Call the verb, take the Operation, and poll GetOperation or watch GetVM.current_operation. The server is authoritative; the poll is cheap.
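A polling loop for this can be as small as the sketch below. It is written against a hypothetical narrow interface, not the generated client; the `Operation` fields and the status strings are assumptions, so map them to the real message before use.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Operation mirrors only the fields this sketch needs.
type Operation struct {
	ID     string
	Status string // assumed values: "running", "succeeded", "failed"
}

// OperationGetter is a hypothetical narrow view of the VMService client.
type OperationGetter interface {
	GetOperation(ctx context.Context, id string) (Operation, error)
}

var errOpFailed = errors.New("operation failed")

// WaitForOperation polls until the operation is terminal or ctx is done.
func WaitForOperation(ctx context.Context, c OperationGetter, id string, every time.Duration) (Operation, error) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		op, err := c.GetOperation(ctx, id)
		if err != nil {
			return Operation{}, err
		}
		switch op.Status {
		case "succeeded":
			return op, nil
		case "failed":
			return op, errOpFailed
		}
		select {
		case <-ctx.Done():
			return op, ctx.Err()
		case <-ticker.C:
		}
	}
}

// fakeClient reports "running" for a number of polls, then a final status.
type fakeClient struct {
	remaining int
	final     string
}

func (f *fakeClient) GetOperation(_ context.Context, id string) (Operation, error) {
	if f.remaining > 0 {
		f.remaining--
		return Operation{ID: id, Status: "running"}, nil
	}
	return Operation{ID: id, Status: f.final}, nil
}

// pollDemo runs the loop against the fake with a one-second deadline.
func pollDemo(polls int, final string) (Operation, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return WaitForOperation(ctx, &fakeClient{remaining: polls, final: final}, "op-1", 5*time.Millisecond)
}

func main() {
	op, err := pollDemo(2, "succeeded")
	fmt.Println(op.Status, err)
}
```

Bounding the wait with a context deadline keeps a stuck operation from hanging the caller; the operation itself keeps running server-side and can be polled again later.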

02 · idempotent by request_id

Send the same request_id; get the same Operation.

Retries are free. We enforce UNIQUE (vm_id, request_id). If you retry after a network blip, you'll get the same operation handle, not a duplicate verb.
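On the client side, the practical rule is to generate the request_id once per logical action, before the first attempt, and reuse it on every retry. The sketch below pairs that with an in-memory fake that emulates the UNIQUE (vm_id, request_id) behaviour; `fakeController` and the hex ID format are illustrative assumptions.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newRequestID returns a random 128-bit hex string. Any unique string
// would do; the format is this sketch's choice.
func newRequestID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

type opKey struct{ VMID, RequestID string }

// fakeController emulates UNIQUE (vm_id, request_id): a repeated key
// returns the existing operation ID without starting a new one.
type fakeController struct {
	ops  map[opKey]string
	next int
}

func (f *fakeController) SuspendVM(vmID, requestID string) string {
	k := opKey{vmID, requestID}
	if id, ok := f.ops[k]; ok {
		return id
	}
	f.next++
	id := fmt.Sprintf("op-%d", f.next)
	f.ops[k] = id
	return id
}

func main() {
	ctl := &fakeController{ops: map[opKey]string{}}
	reqID := newRequestID()               // once per logical action
	first := ctl.SuspendVM("vm-1", reqID) // original call
	retry := ctl.SuspendVM("vm-1", reqID) // retry after a timeout: same operation
	other := ctl.SuspendVM("vm-1", newRequestID())
	fmt.Println(first == retry, first == other)
}
```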

03 · reason codes are stable

Switch on ErrorInfo.reason, never on the message.

Messages are for humans; reasons are for clients. The set is closed, additive-only. If you see restore_from_archived_not_wired today, that's us telling you exactly what's still in flight.
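A client-side switch might look like the sketch below. The local `ErrorInfo` struct stands in for the google.rpc.ErrorInfo detail (in a Go gRPC client it would typically be pulled from the status details), and the `Action` values are illustrative. The three reason strings are the ones named in this brief.

```go
package main

import "fmt"

// ErrorInfo mirrors the google.rpc.ErrorInfo fields this sketch uses.
type ErrorInfo struct {
	Reason   string
	Metadata map[string]string
}

// Action is what the client decides to do; the names are illustrative.
type Action int

const (
	ActionSurfaceGeneric     Action = iota // unknown reason: generic error, log the reason
	ActionFeatureUnavailable               // the verb exists but is not wired yet
	ActionFixRequest                       // the caller sent an invalid combination
)

// classify switches on the reason only, never on the message. Unknown
// reasons fall through to a generic action, so an additive change to the
// reason set does not break an older client.
func classify(info ErrorInfo) Action {
	switch info.Reason {
	case "not_implemented", "restore_from_archived_not_wired":
		return ActionFeatureUnavailable
	case "region_provider_mismatch":
		return ActionFixRequest
	default:
		return ActionSurfaceGeneric
	}
}

func main() {
	fmt.Println(classify(ErrorInfo{Reason: "restore_from_archived_not_wired"}) == ActionFeatureUnavailable)
	fmt.Println(classify(ErrorInfo{Reason: "some_future_reason"}) == ActionSurfaceGeneric)
}
```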

04 · reconcile, don't trust push

5-minute ListVMs(updated_since=…) is the backstop.

CF Queue pushes are best-effort. The authoritative source of truth for platform state is ListVMs with an updated_since cursor. Drive your sync loop from it.
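A reconcile pass can be sketched as "fetch everything changed since the watermark, upsert it, advance the watermark". The `lister` type below is a hypothetical wrapper that walks every page of ListVMs, and the sketch assumes updated_since is inclusive, which the brief does not specify; upserting by ID makes re-reading a row harmless either way.

```go
package main

import (
	"fmt"
	"sort"
)

// VM carries only the fields the sketch needs.
type VM struct {
	ID        string
	State     string
	UpdatedAt int64 // unix seconds, simplified
}

// lister is a hypothetical wrapper around paginated ListVMs(updated_since=since).
type lister func(since int64) []VM

// reconcile upserts every VM changed since the watermark into the local
// view and returns the new watermark.
func reconcile(local map[string]VM, list lister, watermark int64) int64 {
	changed := list(watermark)
	// Apply oldest first so the newest version of a row wins.
	sort.Slice(changed, func(i, j int) bool { return changed[i].UpdatedAt < changed[j].UpdatedAt })
	for _, vm := range changed {
		local[vm.ID] = vm
		if vm.UpdatedAt > watermark {
			watermark = vm.UpdatedAt
		}
	}
	return watermark
}

// fakeServer filters an in-memory slice the way the sketch assumes the API does.
func fakeServer(rows []VM) lister {
	return func(since int64) []VM {
		var out []VM
		for _, r := range rows {
			if r.UpdatedAt >= since {
				out = append(out, r)
			}
		}
		return out
	}
}

func main() {
	local := map[string]VM{}
	rows := []VM{{ID: "vm-1", State: "active", UpdatedAt: 100}, {ID: "vm-2", State: "suspended", UpdatedAt: 160}}
	watermark := reconcile(local, fakeServer(rows), 0)
	fmt.Println(watermark, len(local))
}
```

Run on a five-minute timer, a pass like this repairs any state a dropped queue push left stale.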

08 · Read the specs

Canonical sources for anything this page summarised.

Each spec opens with Status / Date / Scope / Depends on. The Out of scope section at the end of every spec is load-bearing — it's the explicit boundary with the siblings.