Build resilience · Kubernetes chaos engineering

Chaos engineering with your RBAC

RBAC-native Kubernetes chaos engineering, driven from the API server

killing.yaml
kind: Killing
name: frontend-pod-killer
namespace: production
schedule:
  interval: 5m
  initialDelay: 30s
selector:
  labels:
    app: frontend
    tier: web
minAvailable: 2
dryRun: false

Rehearse the failure before production does it for you.

Every incident you ship to prod is a rehearsal you skipped. chaos_zookoo turns crash and recovery scenarios into versioned, reproducible YAML — so you can prove your workloads survive the fault in staging, on a schedule, with an auditable pass/fail signal, long before a pager wakes anyone up.

  1. Describe the failure

    Pick a workload, a fault kind (kill, mass kill, restart), a cadence, and a safety floor. One YAML doc per scenario, readable by anyone on the team.

  2. Run & observe

    Fire synthetic traffic during the disruption, then query Prometheus to assert the SLO held. Results land in your existing Grafana dashboards.

  3. Ship with evidence

    A green chaos_test_success is a reproducible signal that the workload recovers. Gate your release on it, not on hope.
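One way to surface a failed run is a Prometheus alerting rule on the metric named above. This is a minimal sketch, not the project's shipped configuration: it assumes chaos_test_success is exposed as a gauge that reads 1 after a passing run and 0 after a failing one, and the group name, alert name, and severity label are made up for illustration.

```yaml
# Prometheus rule file (standard rule-file format).
# Assumption: chaos_test_success is 1 on pass and 0 on fail.
groups:
  - name: chaos-gates                  # hypothetical group name
    rules:
      - alert: ChaosScenarioFailed     # hypothetical alert name
        expr: chaos_test_success == 0  # fires for any series reporting a failed run
        labels:
          severity: warning
        annotations:
          summary: "A chaos scenario reported a failed run"
```

A release pipeline could then refuse to promote a build while this alert is firing.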

Precision chaos, minimal footprint

A single long-running process authenticated as a ServiceAccount. No custom resources, no operator, no privileged components.

RBAC-native security

No cluster-admin required. The ServiceAccount RBAC is the security model — grant exactly the chaos permissions you intend, nothing more.
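As an illustration of "exactly the permissions you intend", here is a namespaced Role and RoleBinding that would let a pod-killing scenario list, delete, and evict pods in one namespace and nothing else. The resource names, verbs, and API groups are standard Kubernetes RBAC. The ServiceAccount name chaos-zookoo and the Role name are assumptions, so replace them with whatever your installation uses.

```yaml
# Grants pod discovery, deletion, and eviction in "production" only.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: chaos-pod-killer          # hypothetical Role name
  namespace: production
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]
  - apiGroups: [""]
    resources: ["pods/eviction"]  # the eviction subresource
    verbs: ["create"]
---
# Binds the Role to the ServiceAccount the chaos process runs as.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: chaos-pod-killer
  namespace: production
subjects:
  - kind: ServiceAccount
    name: chaos-zookoo            # assumed ServiceAccount name
    namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: chaos-pod-killer
```

Because the Role is namespaced, the same ServiceAccount has no rights in any namespace where a matching binding is absent.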

API-server only

Every disruption is a regular API call — EvictV1, Pods.Delete, Deployments.Patch. No privileged nodes, no DaemonSets, no sidecars.
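For reference, an eviction in Kubernetes is an Eviction object posted to the target pod's eviction subresource. The sketch below shows that request body with a hypothetical pod name. Whether chaos_zookoo sets any additional fields is not stated here.

```yaml
# Body of POST /api/v1/namespaces/production/pods/<pod-name>/eviction
apiVersion: policy/v1
kind: Eviction
metadata:
  name: frontend-7c9d8b6f5-abcde   # hypothetical pod name
  namespace: production
```

Unlike a plain pod delete, the API server rejects an eviction that would violate a PodDisruptionBudget, so existing budgets keep protecting the workload.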

YAML-driven config

Declare scenarios in familiar YAML. Each document maps to a module — Killing, GorillaKill, Rollout — with its own schedule and selectors.
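A sketch of how two scenarios might sit in one file, assuming documents are separated by the standard YAML `---` marker and that the field nesting matches the killing.yaml sample on this page. Only the Killing kind is shown because its fields appear here. The second scenario's name and the staging namespace are hypothetical.

```yaml
# Scenario 1: fields copied from the killing.yaml sample.
kind: Killing
name: frontend-pod-killer
namespace: production
schedule:
  interval: 5m
  initialDelay: 30s
selector:
  labels:
    app: frontend
    tier: web
minAvailable: 2
dryRun: false
---
# Scenario 2: same selector, rehearsed in staging (hypothetical values).
kind: Killing
name: frontend-pod-killer-rehearsal
namespace: staging
schedule:
  interval: 5m
  initialDelay: 30s
selector:
  labels:
    app: frontend
    tier: web
minAvailable: 2
dryRun: true   # presumably reports targets without disrupting them
```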

Composable middlewares

Wrap any module with synthetic HTTP load generation and post-run Prometheus assertions — without touching the module code.

Measurable resilience

Each passing run is a proof point, not an opinion. Accumulate a versioned track record of recovery and gate releases on that record.

Ready to break things safely?

Follow the installation guide and run your first chaos scenario in minutes.