
Metrics

chaos_zookoo exposes Prometheus metrics on METRICS_ADDR (default :9090) at /metrics. All metrics are prefixed with chaos_, so a single dashboard variable can select them all.

Module lifecycle metrics

These metrics are emitted by the orchestrator for every registered module, regardless of kind.

chaos_module_info

chaos_module_info{name, kind, namespace, schedule_type, schedule_value} = 1
Type: Gauge (static)

Always 1. Registered at startup from orch.Register(...). Useful as a join key in Grafana — filter by name to pull the module's label set into other queries.

schedule_type is "periodic", "cron", or "once". schedule_value is the interval string or cron expression.
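As a sketch of that join (label names as documented above; the 5m window is an arbitrary choice): multiplying by the info metric, whose value is always 1, copies its schedule labels onto the error-rate series without changing the values:

```promql
sum by (name) (rate(chaos_module_runs_total{status="error"}[5m]))
  * on (name) group_left (schedule_type, schedule_value)
chaos_module_info
```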

chaos_module_runs_total

chaos_module_runs_total{name, kind, namespace, status="success"|"error"}
Type: Counter

Incremented after each module.Run() call completes. status=error means Run returned a non-nil error.

chaos_module_last_run_timestamp_seconds

chaos_module_last_run_timestamp_seconds{name, kind, namespace}
Type: Gauge (Unix seconds)

Set to time.Now().Unix() at the start of each Run() call. Use time() - chaos_module_last_run_timestamp_seconds to detect stalled modules.

chaos_module_run_duration_seconds

chaos_module_run_duration_seconds_bucket{name, kind, namespace, le="..."}
Type: Histogram (default prometheus.DefBuckets)

Wall-clock duration of each Run() call. Useful to detect modules blocking shutdown or unexpectedly slow API calls.

chaos_pods_affected_total

chaos_pods_affected_total{name, kind, namespace}
Type: Counter

Incremented once per pod successfully removed (Killing) or deleted (GorillaKill). Rolls up across all ticks.

Middleware metrics

chaos_test_success

chaos_test_success{name="<module>"}
Type: Gauge
Owner: testkit

Outcome of the last testkit evaluation for the named module:

  • 1 — query succeeded and the operator/threshold assertion held.
  • 0 — query failed, no querier was configured, or the assertion failed.

The series is absent until the module has run at least once.

chaos_loading_http_active

chaos_loading_http_active{name="<module>", method="GET|POST", url="..."}
Type: Gauge
Owner: loadkit

1 while a load burst is firing for the module, 0 otherwise. Useful to overlay on graphs to see when load was active.

chaos_load_requests_total

chaos_load_requests_total{name, method, url, status="2xx|3xx|4xx|5xx|1xx|error"}
Type: Counter
Owner: loadkit

Total HTTP requests fired by a load burst, bucketed by status class. error covers network failures and context cancellation.

chaos_load_request_duration_seconds

chaos_load_request_duration_seconds_bucket{name, method, url, le="..."}
Type: Histogram (default prometheus.DefBuckets)
Owner: loadkit

Wall-clock latency of each load request. Histogram uses the default Prometheus buckets (5ms → 10s).

Suggested queries

Detect a stalled module (hasn't run in 15 min)

time() - chaos_module_last_run_timestamp_seconds > 900

Error rate per module over 1 hour

sum by (name, kind) (
increase(chaos_module_runs_total{status="error"}[1h])
)

Total pods killed today, by module

increase(chaos_pods_affected_total[24h])

p95 run duration by module

histogram_quantile(
0.95,
sum by (name, le) (rate(chaos_module_run_duration_seconds_bucket[30m]))
)

Success rate of chaos checks over the last 24h, by module

avg_over_time(chaos_test_success[24h])

RPS generated by a load burst, by status class

sum by (status) (rate(chaos_load_requests_total{name="kill-api"}[1m]))

p95 latency under load

histogram_quantile(
0.95,
sum by (le) (rate(chaos_load_request_duration_seconds_bucket[5m]))
)

5xx ratio during a load burst

sum(rate(chaos_load_requests_total{status="5xx"}[1m]))
/
sum(rate(chaos_load_requests_total[1m]))
and on() (max(chaos_loading_http_active) == 1)

(The max() aggregation matters: the left-hand ratio has no labels, so matching it with and on() against the raw per-module gauge fails with a many-to-one error once more than one series exists.)

Alerting patterns

  • chaos_test_success == 0 — the scenario ran, the SLO was breached. Page.
  • absent_over_time(chaos_test_success{name="..."}[2h]) — the scenario hasn't run in 2h. The agent is down or misconfigured.
  • increase(chaos_module_runs_total{status="error"}[15m]) > 0 — a run failed recently. Check the agent logs for the module. (Alert on increase(), not the raw counter — a counter stays above zero forever after a single failure.)
  • chaos_loading_http_active == 1 and on() <your 5xx alert> — cross-check: a 5xx spike only during a load burst is a discovery, not an outage.

Registering custom metrics

Metrics are registered in pkg/metrics/metrics.go in a package-level init(). To add a new signal:

  1. Declare the metric as a package-level var.
  2. prometheus.MustRegister(...) it in init().
  3. Import pkg/metrics from the module or middleware that owns the signal and update it inline.

Do not import github.com/prometheus/client_golang directly from modules — always go through pkg/metrics. This keeps the registry surface discoverable.
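A minimal sketch of steps 1 and 2, assuming the layout described above; the metric shown (chaos_example_events_total and the ExampleEvents var) is hypothetical, not an existing signal:

```go
// pkg/metrics/metrics.go (sketch)
package metrics

import "github.com/prometheus/client_golang/prometheus"

// 1. Declare the metric as a package-level var.
var ExampleEvents = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "chaos_example_events_total",
		Help: "Example signal; replace with your own.",
	},
	[]string{"name", "kind", "namespace"},
)

// 2. Register it in init() alongside the existing metrics.
func init() {
	prometheus.MustRegister(ExampleEvents)
}
```

For step 3, the owning module or middleware would import pkg/metrics and call something like metrics.ExampleEvents.WithLabelValues(name, kind, ns).Inc() inline.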