resham_acharya.exe000%
Compiling experience...
All Projects
DevOps2024 · Lead Engineer

Self-healing Kubernetes operator for stateful workloads

01Overview

Autopilot encodes the on-call runbook as code. It observes Prometheus alerts, pod conditions, and PVC health, then drives remediation with an explicit, auditable state machine.

Every action is dry-run-able and emits a diff before applying. Engineers can promote a dry run to live remediation with a single Slack approval click.

02The Problem

Stateful services were eating 40% of on-call time, and the runbook lived in a Notion page that hadn't been updated since the last team rotation.

03Approach
  • Modeled remediation as a state machine in Go with structured logging into Loki.
  • Implemented a policy layer that classifies actions by blast radius and routes high-risk ones through Slack-based approval.
  • Wrote a Helm-shipped CRD set so any team could adopt it incrementally per workload.
04Outcome
On-call paging volume cut by 62% across stateful services.
Mean time to recover dropped from 23 minutes to 4 minutes for the top five incident classes.