# Metrics

chaos_zookoo exposes Prometheus metrics on `METRICS_ADDR` (default
`:9090`) at `/metrics`. All metrics are prefixed with `chaos_`, so they
can all be selected with a single dashboard variable.
## Module lifecycle metrics
These metrics are emitted by the orchestrator for every registered module, regardless of kind.
### chaos_module_info

```
chaos_module_info{name, kind, namespace, schedule_type, schedule_value} = 1
```

| Type | Gauge (static) |
|---|---|

Always 1. Registered at startup from `orch.Register(...)`. Useful as a
join key in Grafana: filter by `name` to pull the module's label set
into other queries.

`schedule_type` is `"periodic"`, `"cron"`, or `"once"`. `schedule_value`
is the interval string or cron expression.
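For example, this illustrative query copies `schedule_type` from the info series onto the runs counter (adjust the range and grouping to taste):

```
sum by (name, schedule_type) (
  rate(chaos_module_runs_total[5m])
  * on (name) group_left (schedule_type)
    chaos_module_info
)
```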
### chaos_module_runs_total

```
chaos_module_runs_total{name, kind, namespace, status="success"|"error"}
```

| Type | Counter |
|---|---|

Incremented after each `module.Run()` call completes. `status="error"`
means `Run` returned a non-nil error.
### chaos_module_last_run_timestamp_seconds

```
chaos_module_last_run_timestamp_seconds{name, kind, namespace}
```

| Type | Gauge (Unix seconds) |
|---|---|

Set to `time.Now().Unix()` at the start of each `Run()` call. Use
`time() - chaos_module_last_run_timestamp_seconds` to detect stalled
modules.
### chaos_module_run_duration_seconds

```
chaos_module_run_duration_seconds_bucket{name, kind, namespace, le="..."}
```

| Type | Histogram (default `prometheus.DefBuckets`) |
|---|---|

Wall-clock duration of each `Run()` call. Useful to detect modules
blocking shutdown or unexpectedly slow API calls.
### chaos_pods_affected_total

```
chaos_pods_affected_total{name, kind, namespace}
```

| Type | Counter |
|---|---|

Incremented once per pod successfully removed (`Killing`) or deleted
(`GorillaKill`). Rolls up across all ticks.
## Middleware metrics
### chaos_test_success

```
chaos_test_success{name="<module>"}
```

| Type | Gauge |
|---|---|
| Owner | testkit |

Outcome of the last testkit evaluation for the named module:

- `1`: the query succeeded and the operator/threshold assertion held.
- `0`: the query failed, no querier was configured, or the assertion failed.

The series is absent until the module has run at least once.
### chaos_loading_http_active

```
chaos_loading_http_active{name="<module>", method="GET|POST", url="..."}
```

| Type | Gauge |
|---|---|
| Owner | loadkit |

`1` while a load burst is firing for the module, `0` otherwise. Useful
to overlay on graphs to see when load was active.
### chaos_load_requests_total

```
chaos_load_requests_total{name, method, url, status="2xx|3xx|4xx|5xx|1xx|error"}
```

| Type | Counter |
|---|---|
| Owner | loadkit |

Total HTTP requests fired by a load burst, bucketed by status class.
`error` covers network failures and context cancellation.
### chaos_load_request_duration_seconds

```
chaos_load_request_duration_seconds_bucket{name, method, url, le="..."}
```

| Type | Histogram (default `prometheus.DefBuckets`) |
|---|---|
| Owner | loadkit |

Wall-clock latency of each load request. The histogram uses the default
Prometheus buckets (5ms to 10s).
## Suggested queries

Detect a stalled module (hasn't run in 15 min):

```
time() - chaos_module_last_run_timestamp_seconds > 900
```

Error rate per module over 1 hour:

```
sum by (name, kind) (
  increase(chaos_module_runs_total{status="error"}[1h])
)
```

Total pods killed today, by module:

```
increase(chaos_pods_affected_total[24h])
```

p95 run duration by module:

```
histogram_quantile(
  0.95,
  sum by (name, le) (rate(chaos_module_run_duration_seconds_bucket[30m]))
)
```

Success rate of chaos checks over the last 24h, by module:

```
avg_over_time(chaos_test_success[24h])
```

RPS generated by a load burst, by status class:

```
sum by (status) (rate(chaos_load_requests_total{name="kill-api"}[1m]))
```

p95 latency under load:

```
histogram_quantile(
  0.95,
  sum by (le) (rate(chaos_load_request_duration_seconds_bucket[5m]))
)
```

5xx ratio during a load burst:

```
  sum(rate(chaos_load_requests_total{status="5xx"}[1m]))
/
  sum(rate(chaos_load_requests_total[1m]))
and on() (chaos_loading_http_active == 1)
```
## Alerting patterns

- `chaos_test_success == 0`: the scenario ran, the SLO was breached. Page.
- `absent_over_time(chaos_test_success{name="..."}[2h])`: the scenario hasn't run in 2h. The agent is down or misconfigured.
- `chaos_module_runs_total{status="error"} > 0`: at least one run failed. Check the agent logs for the module.
- `chaos_loading_http_active == 1 and on() <your 5xx alert>`: cross-check; a 5xx spike only during a load burst is a discovery, not an outage.
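The first two patterns can be sketched as Prometheus alerting rules (the group name, durations, severities, and the `kill-api` module name are illustrative):

```yaml
groups:
  - name: chaos_zookoo
    rules:
      # The scenario ran and the SLO assertion failed: page.
      - alert: ChaosCheckFailed
        expr: chaos_test_success == 0
        labels:
          severity: page
        annotations:
          summary: "Chaos scenario {{ $labels.name }} breached its SLO"
      # The scenario has not reported for 2h: agent down or misconfigured.
      - alert: ChaosScenarioStale
        expr: absent_over_time(chaos_test_success{name="kill-api"}[2h])
        labels:
          severity: ticket
        annotations:
          summary: "Chaos scenario kill-api has not run in 2h"
```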
## Registering custom metrics

Metrics are registered in `pkg/metrics/metrics.go` in a package-level
`init()`. To add a new signal:

- Declare the metric as a package-level `var`.
- `prometheus.MustRegister(...)` it in `init()`.
- Import `pkg/metrics` from the module or middleware that owns the signal and update it inline.

Do not import `github.com/prometheus/client_golang` directly from
modules; always go through `pkg/metrics`. This keeps the registry
surface discoverable.