01Overview
Autopilot encodes the on-call runbook as code. It observes Prometheus alerts, pod conditions, and PVC health, then drives remediation with an explicit, auditable state machine.
Every action is dry-run-able and emits a diff before applying. Engineers can promote a dry run to live remediation with a single Slack approval click.
02The Problem
Stateful services were eating 40% of on-call time, and the runbook lived in a Notion page that hadn't been updated since the last team rotation.
03Approach
- Modeled remediation as a state machine in Go with structured logging into Loki.
- Implemented a policy layer that classifies actions by blast radius and routes high-risk ones through Slack-based approval.
- Wrote a Helm-shipped CRD set so any team could adopt it incrementally per workload.
04Outcome
On-call paging volume cut by 62% across stateful services.
Mean time to recover dropped from 23 minutes to 4 minutes for the top five incident classes.