K8s Autopilot

01Overview

Autopilot encodes the on-call runbook as code. It observes Prometheus alerts, pod conditions, and PVC health, then drives remediation with an explicit, auditable state machine.

Every action is dry-run-able and emits a diff before applying. Engineers can promote a dry run to live remediation with a single Slack approval click.

02The Problem

Stateful services were eating 40% of on-call time, and the runbook lived in a Notion page that hadn't been updated since the last team rotation.

03Approach

Modeled remediation as a state machine in Go with structured logging into Loki.
Implemented a policy layer that classifies actions by blast radius and routes high-risk ones through Slack-based approval.
Wrote a Helm-shipped CRD set so any team could adopt it incrementally per workload.

04Outcome

On-call paging volume cut by 62% across stateful services.

Mean time to recover dropped from 23 minutes to 4 minutes for the top five incident classes.

DevForge

Synapse

QueryLens