This is the brief for the upstream platform team that calls our controller. It says, plainly: which RPCs work end-to-end today, which ones stub and reject with a documented reason, and what the agent will do once it runs on a real Linux host. Nothing in this document is aspirational — every "delivered" line is backed by a green test or a passing smoke script on `feat/v0-gold`.
The controller binary runs three gRPC listeners: :50051 for
frontpage + admin (TLS + bearer), :50052 for host agents
(strict mTLS, split per the Task 68 hardening), and :50053 for
one-shot enrollment. Every mutating RPC returns an async
Operation handle; callers poll GetOperation or
GetVM.current_operation. Idempotency is enforced in
Postgres via UNIQUE (vm_id, request_id). Errors carry
google.rpc.ErrorInfo with a closed Reason set.
Operation; worker resumable from step_state; rolls to deleted on half-provisioned failure.
current_operation_id.
updated_since= for 5-minute reconciliation backstop.
smtp_egress_allowed, persisted + audited; agent nft half is Task 73, deferred).
vms.state without reaching the VM.
suspended. Restore from archived is explicitly rejected with restore_from_archived_not_wired; agent side is Task 54/80.
audit_log.event_data for a VM, the only place upstream-supplied PII may land.
OperationVerb from RESIZE so operators can filter on verb.
ON CONFLICT (fingerprint) DO NOTHING tolerates replays.
command_id = operations.id; agent ACKs, runs async, returns CommandResult (dispatcher always returns one; no indefinite blocks).

The controller-side machinery that makes this boring at scale is in place: atomic operation pointers, resumable workers, idempotent enqueues, a PII-denylist invariant test, metric gauges, and a closed error-reason set that's already covered by a parser-based invariant test.
vms.
regions · hosts · vms · vm_domains · operations · snapshots · bandwidth_ledger · audit_log · bootstrap_tokens + agent_certs · host-scope audit. Trigger-driven updated_at. Deferrable FK vms.current_operation_id → operations.id. A PII-denylist column invariant test asserts no PII column lands on vms / operations / snapshots / hosts / vm_domains / bandwidth_ledger.
Every lifecycle worker — Create, Suspend, Restore, Archive, Delete — short-circuits on already-terminal status, loads step_state at entry, and stamps each step before advancing. Mid-flight crash → River retry resumes from the last committed step. Archive no longer duplicates snapshot rows on retry. StartOperation is unique by (vm_id, request_id); vms.current_operation_id is flipped in the same tx; MarkOperationTerminal clears it in one tx.
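The resume behaviour described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the controller's worker: the dict-shaped operation, the step names, the status strings, and `run_operation` itself are hypothetical, and a real worker would persist `step_state` to the database after each step instead of mutating a dict.

```python
from typing import Callable, Dict, List, Tuple

# A step is a (name, action) pair. The names used in the usage note are hypothetical.
Step = Tuple[str, Callable[[], None]]

# Assumed terminal statuses; the real operations.status values may differ.
TERMINAL = ("succeeded", "failed")


def run_operation(op: Dict, steps: List[Step]) -> None:
    """Run the steps in order, skipping any already stamped in step_state."""
    if op["status"] in TERMINAL:
        return  # short-circuit: a retry of a finished operation is a no-op

    done = op.setdefault("step_state", [])  # loaded at entry
    for name, action in steps:
        if name in done:
            continue  # committed on a previous attempt
        action()  # may raise; the step is then left unstamped
        done.append(name)  # stamp the step before advancing
    op["status"] = "succeeded"
```

If the second of two steps raises, `step_state` holds only the first step's name. Calling `run_operation` again with the same operation skips that step and resumes at the one that failed.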
Per Task 68 gold hardening, Enroll lives on :50053 (one-shot bootstrap token); :50052 carries Commands + Events bidi streams and RotateAgentCert under strict mTLS. Session authorisation does a fingerprint lookup with a revocation check (Task 71). The ComputeReconcile short-circuit on empty Inventory fixes the bug where every reconnect would have failed every running op.
/metrics gauge updater refreshes vms_total{state,region,flavor} and host_capacity_used_ratio{host_id,resource} every 30s. /healthz. gRPC reflection gated behind OPENBERTH_REFLECTION=true. systemd unit ships in deploy/; ExecReload=kill -HUP hot-reloads bearer tokens without dropping sessions.
Publisher has bounded retries; publishes vm.{created,suspended,restored,archived,deleted}. The contract: push is best-effort, and the platform reconciles every 5 minutes via ListVMs(updated_since=...) as the authoritative backstop. Tested across the retry matrix.
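A best-effort publisher with bounded retries might look like the sketch below. The attempt count, the exponential backoff, and the `send` callable are assumptions made for illustration; the brief only states that retries are bounded, not how they are spaced.

```python
import time
from typing import Any, Callable, Dict


def publish_with_retries(
    send: Callable[[Dict[str, Any]], None],
    event: Dict[str, Any],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to publish an event a bounded number of times.

    Returns True on success and False once attempts are exhausted.
    It never raises on a send failure: the push is best-effort and the
    periodic reconciliation covers anything that was dropped.
    """
    for attempt in range(max_attempts):
        try:
            send(event)
            return True
        except Exception:
            # Back off before the next attempt, but not after the last one.
            if attempt + 1 < max_attempts:
                sleep(base_delay_s * 2 ** attempt)
    return False
```

Returning `False` instead of raising keeps a dead queue from failing the lifecycle operation that produced the event.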
enroll + run subcommands. Local SQLite with a 7-day dedup cache (eviction at boot AND on daily ticker), event_buffer with correct seq round-trip on replay (regression test shipped), 10k cap with priority-drop, /22 IP allocator with 60s cooldown, dnsmasq per-VM dropins + metadata-hostname sinkhole base. Dispatcher populates dedup cache and always returns a CommandResult.
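To make the allocator's cooldown rule concrete, here is a small sketch of a /22 pool that refuses to hand an address back out until 60 seconds after its release. The class, its method names, and the first-free scan order are hypothetical. The agent's real allocator may also reserve addresses (gateway, metadata) that this sketch does not model.

```python
import ipaddress
from typing import Dict, Optional, Set


class IPAllocator:
    """First-free allocator over a CIDR block with a release cooldown."""

    def __init__(self, cidr: str, cooldown_s: float = 60.0) -> None:
        # hosts() excludes the network and broadcast addresses.
        self._hosts = list(ipaddress.ip_network(cidr).hosts())
        self._cooldown_s = cooldown_s
        self._in_use: Set[ipaddress.IPv4Address] = set()
        self._released_at: Dict[ipaddress.IPv4Address, float] = {}

    def allocate(self, now: float) -> Optional[ipaddress.IPv4Address]:
        """Return the lowest free address not in cooldown, or None if exhausted."""
        for ip in self._hosts:
            if ip in self._in_use:
                continue
            released = self._released_at.get(ip)
            if released is not None and now - released < self._cooldown_s:
                continue  # still cooling down after release
            self._in_use.add(ip)
            self._released_at.pop(ip, None)
            return ip
        return None

    def release(self, ip: ipaddress.IPv4Address, now: float) -> None:
        """Free an address and start its cooldown clock."""
        self._in_use.discard(ip)
        self._released_at[ip] = now
```

Passing `now` explicitly keeps the sketch deterministic under test; a real agent would read a monotonic clock.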
These items require the bare-metal Linux agent host the operator
hasn't provisioned yet. The controller side of each is complete.
Each row below is the agent-side gap; libvirt, nftables, dnsmasq,
Caddy, genisoimage, and an R2 bucket are what flip them
to green.
restore_from_archived_not_wired until this ships.
Inventory{} today; controller's reconcile explicitly tolerates it.

These are v1 work. The controller rejects them today with a documented reason so the platform client can switch on the reason code and surface the right message to its users.
These are the cross-cutting shapes that do not change without a
design-review conversation. Most have a corresponding invariant test
on feat/v0-gold.
vms, ever.
Only audit_log.event_data may hold upstream-supplied PII, scrubbable via ScrubAuditEventData.
suspended. Delete isn't terminal until physical purge is verified.
authorization metadata is never logged.
Middleware + a unit test that asserts no log line carries the bearer prefix.
reason is a closed, versioned set.
Additive only; existing reasons never change meaning. Invariant test parses the entire api/ package.
Operation; idempotent via UNIQUE (vm_id, request_id).
vms.state holds only destinations: active / suspended / archived / deleted. Transient verbs live on operations.status.
vms.id (our UUID, primary) and workspace_id (upstream's identifier, required + immutable at create).

Snapshot of the tag. Re-runnable from the branch.
smoke-session: RegisterHost → enroll → mTLS session with strict
cert-fingerprint + revocation check → heartbeat recorded in
hosts.last_heartbeat_at.
smoke-mismatch: region/provider mismatch correctly returns
FailedPrecondition with region_provider_mismatch.
These are the shapes to code your client against. They will not shift in v0.x.
Don't block on the RPC. Call the verb, take the Operation, and poll GetOperation or watch GetVM.current_operation. The server is authoritative; the poll is cheap.
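A minimal sketch of the call-then-poll pattern. `get_operation` stands in for whatever wraps your GetOperation call. The status strings, the `poll_until_terminal` helper, and the interval and timeout defaults are assumptions for illustration, not the controller's actual enum values or recommended timings.

```python
import time
from typing import Callable, Dict

# Assumed terminal statuses; check the real Operation status enum.
TERMINAL = {"succeeded", "failed"}


def poll_until_terminal(
    get_operation: Callable[[str], Dict[str, str]],
    operation_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll a GetOperation-like callable until the operation is terminal."""
    deadline = time.monotonic() + timeout_s
    while True:
        op = get_operation(operation_id)
        if op["status"] in TERMINAL:
            return op
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"operation {operation_id} still {op['status']} after {timeout_s}s"
            )
        sleep(interval_s)
```

A timeout here means only that the client stopped waiting. The operation keeps running on the server and can be polled again later with the same operation ID.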
request_id; get the same Operation.
Retries are free. We enforce UNIQUE (vm_id, request_id). If you retry after a network blip, you'll get the same operation handle, not a duplicate verb.
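The guarantee can be modelled with a unique constraint and an insert that ignores conflicts. The sketch below uses an in-memory SQLite database as a stand-in for Postgres. The table layout and the `start_operation` helper are hypothetical, and `INSERT OR IGNORE` is SQLite syntax, not what the controller runs.

```python
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    """Create an in-memory table with the (vm_id, request_id) uniqueness rule."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE operations (
               id TEXT PRIMARY KEY,
               vm_id TEXT NOT NULL,
               request_id TEXT NOT NULL,
               verb TEXT NOT NULL,
               UNIQUE (vm_id, request_id)
           )"""
    )
    return db


def start_operation(db: sqlite3.Connection, vm_id: str, request_id: str, verb: str) -> str:
    """Insert once per (vm_id, request_id); a replay returns the original id."""
    with db:  # commit on success, roll back on error
        db.execute(
            "INSERT OR IGNORE INTO operations (id, vm_id, request_id, verb) "
            "VALUES (?, ?, ?, ?)",
            (str(uuid.uuid4()), vm_id, request_id, verb),
        )
        row = db.execute(
            "SELECT id FROM operations WHERE vm_id = ? AND request_id = ?",
            (vm_id, request_id),
        ).fetchone()
    return row[0]
```

On the client side the rule is simple: generate the `request_id` once per user intent and reuse it on every retry of that intent.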
ErrorInfo.reason, never on the message.
Messages are for humans; reasons are for clients. The set is closed, additive-only. If you see restore_from_archived_not_wired today, that's us telling you exactly what's still in flight.
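One way to structure the client side is a lookup keyed on the reason with a generic fallback. `ControllerError`, `USER_MESSAGES`, `user_message`, and the user-facing wording are hypothetical; only the two reason strings come from this brief. Extracting the reason from the gRPC status details is left out because it depends on your client library.

```python
class ControllerError(Exception):
    """Client-side error carrying the ErrorInfo reason next to the message."""

    def __init__(self, reason: str, message: str) -> None:
        super().__init__(message)
        self.reason = reason


# Keyed on the machine-readable reason, never on the server's message text.
USER_MESSAGES = {
    "restore_from_archived_not_wired": "Restoring an archived VM is not available yet.",
    "region_provider_mismatch": "The selected region and provider do not match.",
}


def user_message(err: ControllerError) -> str:
    """Map a reason to user-facing text, with a safe default."""
    # The reason set is additive, so an older client can meet a reason
    # it does not know yet and must degrade gracefully.
    return USER_MESSAGES.get(err.reason, "The request could not be completed.")
```

The fallback branch matters as much as the table, because new reasons can appear without existing ones changing meaning.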
ListVMs(updated_since=…) is the backstop.
CF Queue pushes are best-effort. The authoritative source of truth for platform state is ListVMs with an updated_since cursor. Drive your sync loop from it.
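A sync loop driven by that cursor can be sketched as follows. `list_vms` stands in for a ListVMs wrapper. The `id`, `state`, and `updated_at` keys, the numeric timestamps, and `reconcile_once` are assumptions for illustration. Whether `updated_since` is inclusive is not stated here, so the sketch applies state idempotently and tolerates seeing the same row twice.

```python
from typing import Callable, Dict, Iterable, Optional

VM = Dict[str, object]


def reconcile_once(
    list_vms: Callable[[Optional[float]], Iterable[VM]],
    local_state: Dict[str, str],
    cursor: Optional[float],
) -> Optional[float]:
    """Apply every VM changed since `cursor`; return the advanced cursor."""
    newest = cursor
    for vm in list_vms(cursor):
        # Overwrite local state: the server is authoritative.
        local_state[str(vm["id"])] = str(vm["state"])
        updated_at = float(vm["updated_at"])
        if newest is None or updated_at > newest:
            newest = updated_at
    return newest
```

Run it on a 5-minute timer and persist the returned cursor between runs. Queue pushes then only reduce latency; they are not needed for correctness.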
Each spec opens with Status / Date / Scope / Depends on. The Out of scope section at the end of every spec is load-bearing — it's the explicit boundary with the siblings.