
Metrics

chaos_zookoo exposes Prometheus metrics on METRICS_ADDR (default :9090) at /metrics. All metrics are prefixed with chaos_, so a single dashboard variable can select them all.

Module lifecycle metrics

These metrics are emitted by the orchestrator for every registered module, regardless of kind.

chaos_module_info

chaos_module_info{name, kind, namespace, schedule_type, schedule_value} = 1
Type: Gauge (static)

Always 1. Registered at startup from orch.Register(...). Useful as a join key in Grafana — filter by name to pull the module's label set into other queries.

schedule_type is "periodic", "cron", or "once". schedule_value is the interval string or cron expression.
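As a sketch of that join (label names as documented above; the 5m window is an arbitrary choice): multiplying by the info metric, whose value is always 1, copies its schedule labels onto the error-rate series without changing the values:

```promql
sum by (name) (rate(chaos_module_runs_total{status="error"}[5m]))
  * on (name) group_left (schedule_type, schedule_value)
chaos_module_info
```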

chaos_module_runs_total

chaos_module_runs_total{name, kind, namespace, status="success"|"error"}
Type: Counter

Incremented after each module.Run() call completes. status=error means Run returned a non-nil error.

chaos_module_last_run_timestamp_seconds

chaos_module_last_run_timestamp_seconds{name, kind, namespace}
Type: Gauge (Unix seconds)

Set to time.Now().Unix() at the start of each Run() call. Use time() - chaos_module_last_run_timestamp_seconds to detect stalled modules.

chaos_module_run_duration_seconds

chaos_module_run_duration_seconds_bucket{name, kind, namespace, le="..."}
Type: Histogram (default prometheus.DefBuckets)

Wall-clock duration of each Run() call. Useful to detect modules blocking shutdown or unexpectedly slow API calls.

chaos_pods_affected_total

chaos_pods_affected_total{name, kind, namespace}
Type: Counter

Incremented once per pod successfully removed (Killing) or deleted (GorillaKill). Rolls up across all ticks.

Middleware metrics

chaos_test_success

chaos_test_success{name="<module>"}
Type: Gauge
Owner: testkit

Outcome of the last testkit evaluation for the named module:

  • 1 — query succeeded and the operator/threshold assertion held.
  • 0 — query failed, no querier was configured, or the assertion failed.

The series is absent until the module has run at least once.

chaos_loading_http_active

chaos_loading_http_active{name="<module>", method="GET|POST", url="..."}
Type: Gauge
Owner: loadkit

1 while a load burst is firing for the module, 0 otherwise. Useful to overlay on graphs to see when load was active.

chaos_load_requests_total

chaos_load_requests_total{name, method, url, status="2xx|3xx|4xx|5xx|1xx|error"}
Type: Counter
Owner: loadkit

Total HTTP requests fired by a load burst, bucketed by status class. error covers network failures and context cancellation.

chaos_load_request_duration_seconds

chaos_load_request_duration_seconds_bucket{name, method, url, le="..."}
Type: Histogram (default prometheus.DefBuckets)
Owner: loadkit

Wall-clock latency of each load request. Histogram uses the default Prometheus buckets (5ms → 10s).

Suggested queries

Detect a stalled module (hasn't run in 15 min)

time() - chaos_module_last_run_timestamp_seconds > 900

Error rate per module over 1 hour

sum by (name, kind) (
increase(chaos_module_runs_total{status="error"}[1h])
)

Total pods killed today, by module

increase(chaos_pods_affected_total[24h])

p95 run duration by module

histogram_quantile(
0.95,
sum by (name, le) (rate(chaos_module_run_duration_seconds_bucket[30m]))
)

Success rate of chaos checks over the last 24h, by module

avg_over_time(chaos_test_success[24h])

RPS generated by a load burst, by status class

sum by (status) (rate(chaos_load_requests_total{name="kill-api"}[1m]))

p95 latency under load

histogram_quantile(
0.95,
sum by (le) (rate(chaos_load_request_duration_seconds_bucket[5m]))
)

5xx ratio during a load burst

sum(rate(chaos_load_requests_total{status="5xx"}[1m]))
/
sum(rate(chaos_load_requests_total[1m]))
and on() (max(chaos_loading_http_active) == 1)

(The max() aggregation matters: the left-hand ratio has no labels, so matching it with and on() against the raw per-module gauge fails with a many-to-one error once more than one series exists.)

Alerting patterns

  • chaos_test_success == 0 — the scenario ran, the SLO was breached. Page.
  • absent_over_time(chaos_test_success{name="..."}[2h]) — the scenario hasn't run in 2h. The agent is down or misconfigured.
  • increase(chaos_module_runs_total{status="error"}[15m]) > 0 — a run failed recently. Check the agent logs for the module. (Alert on increase(), not the raw counter — a counter stays above zero forever after a single failure.)
  • chaos_loading_http_active == 1 and on() <your 5xx alert> — cross-check: a 5xx spike only during a load burst is a discovery, not an outage.

Registering custom metrics

Metrics are registered in pkg/metrics/metrics.go in a package-level init(). To add a new signal:

  1. Declare the metric as a package-level var.
  2. prometheus.MustRegister(...) it in init().
  3. Import pkg/metrics from the module or middleware that owns the signal and update it inline.

Do not import github.com/prometheus/client_golang directly from modules — always go through pkg/metrics. This keeps the registry surface discoverable.
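A minimal sketch of steps 1 and 2, assuming the layout described above; the metric shown (chaos_example_events_total and the ExampleEvents var) is hypothetical, not an existing signal:

```go
// pkg/metrics/metrics.go (sketch)
package metrics

import "github.com/prometheus/client_golang/prometheus"

// 1. Declare the metric as a package-level var.
var ExampleEvents = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "chaos_example_events_total",
		Help: "Example signal; replace with your own.",
	},
	[]string{"name", "kind", "namespace"},
)

// 2. Register it in init() alongside the existing metrics.
func init() {
	prometheus.MustRegister(ExampleEvents)
}
```

For step 3, the owning module or middleware would import pkg/metrics and call something like metrics.ExampleEvents.WithLabelValues(name, kind, ns).Inc() inline.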