Setting Guardrail Metrics to Auto-Halt Experiments

This how-to is part of Experimentation & A/B Testing Guardrails. It solves a specific failure mode: an experiment’s treatment variant is actively harming a core metric — error rate climbing, latency spiking, checkout abandonment rising — but the experiment runs uninterrupted until someone manually notices. By the time the post-mortem runs, tens of thousands of users have experienced a degraded product.

The fix is a guardrail monitor: a lightweight comparison loop that checks treatment metrics against control metrics on a short interval and disables the experiment flag automatically when a threshold is breached. This how-to covers defining the thresholds, wiring the comparison, triggering an automatic halt, and alerting the team. It does not cover the statistical framework for choosing success metrics or sample sizes — those are in the experimentation guardrails overview.

Metric threshold breach leading to auto-halt sequence A monitor queries per-variant metrics, computes relative delta against the threshold, triggers a flag halt when the threshold is exceeded, and sends an alert. Monitor query every 2 min Compare delta treatment vs control vs threshold breach Halt flag force control variant + record reason Alert PagerDuty / Slack + metric values
The monitor queries per-variant metrics every two minutes; a threshold breach triggers an immediate flag halt and routes an alert with the offending metric values.

Prerequisites

Step-by-Step Procedure

Step 1 — Define guardrail metrics and thresholds

Before the experiment starts, write down every metric you are not allowed to harm and the maximum acceptable degradation. Commit these as structured metadata on the flag object so they are in the audit trail and are visible to anyone who reads the flag definition.

# flagd flag definition with guardrail metadata
flags:
  checkout.payments.express-pay:
    state: ENABLED
    variants:
      "treatment": true
      "control": false
    defaultVariant: "control"
    targeting:
      fractionalEvaluation:
        - { "var": "targetingKey" }
        - ["treatment", 50]
        - ["control",  50]
    metadata:
      owner: "payments-team"
      expiry: "2026-07-20"
      experiment_type: "ab"
      guardrails:
        - metric: "error_rate"
          comparison: "relative_increase"
          threshold: 0.01          # halt if treatment error rate > control + 1%
        - metric: "p95_latency_ms"
          comparison: "relative_increase"
          threshold: 0.10          # halt if p95 > control + 10%
        - metric: "cart_abandonment_rate"
          comparison: "relative_increase"
          threshold: 0.02          # halt if abandonment > control + 2pp

Do not set thresholds tighter than your normal day-to-day metric variance. If your error rate fluctuates ±0.5% without any experiment, a 0.3% threshold will fire constantly on noise. Baseline your metrics for at least a week before setting guardrail values.

Step 2 — Wire a monitor that compares variant metrics

Build a monitor that queries per-variant aggregates on a short, fixed interval. Two minutes is a reasonable starting point — short enough to catch a fast-moving regression, long enough for the aggregation window to smooth out single-request spikes.

# guardrail_monitor.py
import asyncio, httpx, os, sys, json, math

FLAG_KEY = "checkout.payments.express-pay"
FLAG_API = os.environ["FLAG_API_URL"]
FLAG_TOKEN = os.environ["FLAG_API_TOKEN"]
METRICS_API = os.environ["METRICS_API_URL"]
ALERT_WEBHOOK = os.environ["ALERT_WEBHOOK_URL"]

GUARDRAILS = [
    {"metric": "error_rate",           "threshold": 0.01},
    {"metric": "p95_latency_ms",       "threshold": 0.10},
    {"metric": "cart_abandonment_rate","threshold": 0.02},
]
MIN_OBSERVATIONS = 100  # do not fire before this many events per variant per metric


async def fetch_metric(client: httpx.AsyncClient, metric: str, variant: str) -> dict:
    resp = await client.get(
        f"{METRICS_API}/query",
        params={"flag": FLAG_KEY, "variant": variant, "metric": metric, "window": "5m"},
    )
    resp.raise_for_status()
    return resp.json()  # {"value": float, "count": int}


async def halt_experiment(client: httpx.AsyncClient, reason: str) -> None:
    await client.patch(
        f"{FLAG_API}/v1/flags/{FLAG_KEY}",
        json={"defaultVariant": "control", "state": "DISABLED"},
        headers={"Authorization": f"Bearer {FLAG_TOKEN}"},
    )
    await client.post(
        ALERT_WEBHOOK,
        json={"text": f"AUTO-HALT: {FLAG_KEY}{reason}"},
    )
    print(f"AUTO-HALT triggered: {reason}", file=sys.stderr)


async def run_once() -> None:
    async with httpx.AsyncClient(timeout=10) as client:
        for g in GUARDRAILS:
            control   = await fetch_metric(client, g["metric"], "control")
            treatment = await fetch_metric(client, g["metric"], "treatment")

            if control["count"] < MIN_OBSERVATIONS or treatment["count"] < MIN_OBSERVATIONS:
                continue  # too few observations — skip this round

            if control["value"] == 0:
                continue  # avoid division-by-zero on zero baseline

            relative_delta = (treatment["value"] - control["value"]) / control["value"]
            if relative_delta > g["threshold"]:
                reason = (
                    f"{g['metric']}: treatment={treatment['value']:.4f} "
                    f"control={control['value']:.4f} "
                    f"delta={relative_delta:.1%} > threshold={g['threshold']:.1%}"
                )
                await halt_experiment(client, reason)
                return  # halt fires once; subsequent checks are moot


if __name__ == "__main__":
    asyncio.run(run_once())

Run this as a cron job every two minutes, or embed it in your existing monitoring service as a recurring task.

Pitfall: comparing treatment to a time-shifted baseline (yesterday’s control numbers) rather than the concurrent control arm confounds the experiment with time-of-day variation. Always compare treatment and control groups measured in the same time window.

Step 3 — Trigger an automatic flag halt on breach

The halt in the monitor above disables the flag and forces the control variant. Verify that your flag management API accepts a PATCH that sets both state: DISABLED and defaultVariant: "control" atomically. If not, use a force-variant API call instead — the emergency kill-switch runbook shows the pattern for forcing a specific variant through the control plane without disabling the flag object.

# Verify the halt is wired correctly — dry-run without a live experiment
curl -s -X PATCH "${FLAG_API}/v1/flags/checkout.payments.express-pay" \
  -H "Authorization: Bearer ${FLAG_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"state": "DISABLED", "defaultVariant": "control"}' \
  | jq '{state, defaultVariant}'
# expect: {"state": "DISABLED", "defaultVariant": "control"}

Step 4 — Alert with enough context to act immediately

The alert must carry the offending metric, both raw values, the relative delta, and a link to the flag definition. An alert that says “experiment halted” without these values forces the on-call engineer to reconstruct the context from scratch.

# Alert payload — include all values needed for immediate action
alert_body = {
    "text": f":rotating_light: AUTO-HALT: `{FLAG_KEY}`",
    "attachments": [{
        "color": "danger",
        "fields": [
            {"title": "Guardrail metric",   "value": g["metric"],               "short": True},
            {"title": "Treatment value",    "value": f"{treatment['value']:.4f}","short": True},
            {"title": "Control value",      "value": f"{control['value']:.4f}",  "short": True},
            {"title": "Relative delta",     "value": f"{relative_delta:.1%}",    "short": True},
            {"title": "Threshold",          "value": f"{g['threshold']:.1%}",    "short": True},
            {"title": "Flag key",           "value": FLAG_KEY,                   "short": False},
        ],
    }],
}

Verification Step

Simulate a regression and confirm the auto-halt fires before trusting it in production:

# 1. Inject a synthetic error_rate spike for the treatment variant into your metrics store
curl -s -X POST "${METRICS_API}/inject" \
  -H 'Content-Type: application/json' \
  -d '{"flag":"checkout.payments.express-pay","variant":"treatment","metric":"error_rate","value":0.04,"count":500}'

# 2. Run the monitor once (not as a cron; invoke directly)
python guardrail_monitor.py

# 3. Confirm the flag is now disabled
flagctl get checkout.payments.express-pay --env staging -o json | jq '{state, defaultVariant}'
# expect: {"state":"DISABLED","defaultVariant":"control"}

# 4. Confirm the alert webhook received the payload
curl -s "${ALERT_WEBHOOK}/last" | jq '.text'
# expect the AUTO-HALT message with metric values

# 5. Restore for the real experiment
flagctl set checkout.payments.express-pay --state ENABLED --env staging

Gotchas & Edge Cases

Troubleshooting & FAQ

The monitor halted the experiment but I cannot see which metric triggered it — where do I look?

The reason string from the halt call should appear in both the alert payload and the flag’s audit trail. If your flag management API does not accept a reason field on the PATCH call, write the reason to a structured log event alongside the halt at the time the monitor fires. See building audit trails for compliance for the full audit log pattern.

How often should the monitor run?

Every 1–5 minutes is the practical range. Below 1 minute, you risk firing on single-interval metric noise and your flag API receives halt calls before the aggregation window is stable. Above 5 minutes, a fast-moving regression (a cache poisoning bug, a sudden upstream timeout cascade) can affect a significant user population before the halt fires.

Can I set different guardrail thresholds for the first hour versus steady state?

Yes. Add a warmup_minutes field to the guardrail metadata and skip threshold checks in the monitor until now - experiment_start > warmup_minutes. The first minutes of an experiment typically have high variance from cache warming and first-visit effects; a fixed warmup window avoids premature halts during that period.