Experimentation & A/B Testing Guardrails

This guide is part of the Feature Flag Architecture & Lifecycle Management series. A/B experiments run behind feature flags give you the same traffic-control mechanism as a progressive delivery rollout, but with a different success criterion: instead of ramping to 100%, you hold a fixed split long enough to accumulate statistical evidence, then make a permanent decision. Guardrails are the safety net — the set of pre-defined metrics that halt the experiment automatically when the treatment variant is causing harm, before the full exposure period ends.

Without guardrails, a broken variant can silently drain conversion, increase error rate, or push latency past SLO limits for days before anyone notices. This guide covers how to define guardrail metrics and thresholds, how to wire the comparison loop that fires an auto-halt, how to ensure your sample size is large enough before reading results, how to attribute variants without blocking I/O, and how to isolate experiments that run concurrently. It does not cover the mechanics of stable bucketing — see percentage-based rollout with sticky bucketing — or how to build an audit trail for the flag changes an experiment generates, which is in building audit trails for compliance.

[Figure: guardrail control loop. Assign variant (hash → bucket) → Measure (error rate, p95, conversion per variant) → Compare to guardrail threshold (treatment vs control) → on breach: Auto-halt (force control variant + alert); if healthy: Continue until minimum sample.]
The guardrail loop assigns traffic, measures metrics per variant, compares to thresholds, and either auto-halts on breach or continues to the minimum sample size.

Prerequisites

Core Concept & Architecture

Guardrail metrics are a pre-defined list of health indicators that the experiment must not regress, regardless of whether the primary success metric moves. They are distinct from the success metric: the success metric answers “does this variant win?”; guardrail metrics answer “is this variant safe enough to keep running?”. Common guardrails:

| Guardrail metric | Breach threshold | Why it matters |
| --- | --- | --- |
| Service error rate | > 1% relative increase | Catches exceptions caused by the treatment |
| p95 latency | > 10% relative increase | Detects performance regressions |
| Cart abandonment | > 2% relative increase | Catches UX harm not visible in errors |
| Downstream timeout rate | > 0.5% absolute | Detects cascading dependency failures |

Guardrails are always relative or absolute comparisons between the treatment and control variant, measured on the same user population over the same time window. Comparing a treatment metric to a historical baseline (before the experiment) conflates the treatment effect with time-of-day and seasonal variation.
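The difference between a relative and an absolute threshold can be sketched as a small comparison helper. This is a minimal illustration rather than a prescribed implementation: the function name `breaches_guardrail` and its `mode` parameter are hypothetical, and both inputs are assumed to come from the same time window on concurrently running variants.

```python
def breaches_guardrail(control: float, treatment: float,
                       limit: float, mode: str = "relative") -> bool:
    """Return True when treatment regresses past the limit versus control.

    Hypothetical helper: both values must be measured over the same
    window on concurrently running variants.
    """
    delta = treatment - control
    if mode == "absolute":
        # Absolute: difference in the metric's own units (e.g. rate points)
        return delta > limit
    if mode == "relative":
        if control <= 0:
            # Relative change is undefined at a zero baseline; treat any
            # increase as a breach so the case is not silently ignored.
            return delta > 0
        return delta / control > limit
    raise ValueError(f"unknown mode: {mode}")


# p95 latency: 200 ms control vs 230 ms treatment is a 15% relative increase
print(breaches_guardrail(200.0, 230.0, limit=0.10))  # True
# Timeout rate: 0.2% vs 0.4% is a 0.2-point absolute increase
print(breaches_guardrail(0.002, 0.004, limit=0.005, mode="absolute"))  # False
```

Treating a zero control baseline as a breach on any increase is one possible policy; pairing relative guardrails with an absolute floor is another.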

Step-by-Step Implementation

Step 1 — Define the experiment flag with a fixed split

Hold the split constant for the experiment’s duration. Unlike a progressive delivery ramp, you are not trying to reach 100% — you are trying to accumulate enough observations at a stable split to support a statistical conclusion.

# flagd experiment definition
flags:
  checkout.payments.express-pay:
    state: ENABLED
    variants:
      "treatment": true
      "control": false
    defaultVariant: "control"
    targeting:
      fractional:  # "fractionalEvaluation" is the deprecated name for this operator
        - { "var": "targetingKey" }
        - ["treatment", 50]
        - ["control", 50]

Pitfall: changing the split mid-experiment invalidates observations collected under the old split. If you must resize, restart the experiment and discard prior data.

Step 2 — Attribute variant to every outcome event without blocking I/O

Every business event that feeds your guardrail or success metrics must carry the variant the user saw. Record the variant at evaluation time and carry it forward on the request context; do not re-evaluate the flag at the point of the outcome event.

import { OpenFeature } from '@openfeature/server-sdk';

const client = OpenFeature.getClient();

async function handleCheckout(req: Request): Promise<void> {
  const ctx = buildEvalContext(req);
  // Evaluate once. The flag is boolean, so read the variant name from the
  // evaluation details rather than from the resolved value.
  const details = await client.getBooleanDetails(
    'checkout.payments.express-pay', false, ctx
  );
  // Fall back to 'control' when the provider reports no variant
  req.locals.flagVariant = details.variant ?? 'control';

  // Later, when the outcome fires: no second evaluation
  metrics.increment('checkout.completed', { variant: req.locals.flagVariant });
}

Pitfall: calling the flag again at the outcome event introduces a race condition if the flag changed between request start and outcome — the attributed variant will not match the one the user experienced.
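The same evaluate-once pattern can be sketched in Python with `contextvars`, which carries the variant along with the request without passing it through every function call. This is an illustrative sketch only: `flag_variant`, `record_outcome`, `handle_checkout`, and the `emitted` list standing in for a metrics client are all hypothetical names.

```python
import contextvars

# Variant recorded at evaluation time for the current request.
# "unassigned" marks outcomes that fired without a prior evaluation.
flag_variant = contextvars.ContextVar("flag_variant", default="unassigned")

emitted = []  # stand-in for a real metrics client


def record_outcome(event: str) -> None:
    # Reads the stored variant; the flag is not evaluated again here
    emitted.append((event, {"variant": flag_variant.get()}))


def handle_checkout(evaluated_variant: str) -> None:
    # Store the result of the single flag evaluation for this request
    flag_variant.set(evaluated_variant)
    # ... request handling happens here ...
    record_outcome("checkout.completed")


# Run each request in its own context so variants never leak between them
contextvars.copy_context().run(handle_checkout, "treatment")
contextvars.copy_context().run(handle_checkout, "control")
print(emitted)
```

Using a sentinel default such as "unassigned" keeps unattributed events visible instead of silently counting them toward control.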

Step 3 — Wire guardrail monitors with automatic halt

Set up a monitor that queries per-variant metric aggregates on a short interval (1–5 minutes) and compares treatment to control. On breach, the monitor calls a webhook that sets the experiment flag to force the control variant.

# guardrail_monitor.py: runs as a cron job or in a monitoring service
import os
import sys

import httpx

FLAG_KEY = "checkout.payments.express-pay"
FLAG_API_TOKEN = os.environ["FLAG_API_TOKEN"]
GUARDRAILS = [
    {"metric": "error_rate", "relative_increase_limit": 0.01},
    {"metric": "p95_latency_ms", "relative_increase_limit": 0.10},
]

async def check_guardrails(metrics_client):
    control_metrics = await metrics_client.query_variant(FLAG_KEY, "control")
    treatment_metrics = await metrics_client.query_variant(FLAG_KEY, "treatment")

    for g in GUARDRAILS:
        control_val = control_metrics[g["metric"]]
        treatment_val = treatment_metrics[g["metric"]]
        # A relative delta is undefined when the control value is zero, so
        # that case is skipped here; cover it with an absolute threshold.
        if control_val > 0:
            relative_delta = (treatment_val - control_val) / control_val
            if relative_delta > g["relative_increase_limit"]:
                await trigger_halt(g["metric"], relative_delta)
                return

async def trigger_halt(metric: str, delta: float):
    async with httpx.AsyncClient() as http:
        response = await http.patch(
            f"https://flags.internal/v1/flags/{FLAG_KEY}",
            json={"state": "DISABLED"},
            headers={"Authorization": f"Bearer {FLAG_API_TOKEN}"},
        )
        # Fail loudly if the flag service rejected the halt request
        response.raise_for_status()
    print(f"AUTO-HALT: {FLAG_KEY}: {metric} regressed by {delta:.1%}", file=sys.stderr)

Pitfall: halting by setting state: DISABLED reverts to the in-code default, which may not match the safe control variant. Prefer forcing defaultVariant: "control" explicitly, or use a kill-switch pattern that forces the variant rather than disabling the flag.
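One way to express "force the variant" is to rewrite the flag definition so that targeting is removed and the default variant is pinned to control, while the flag itself stays enabled. The sketch below works on a flagd-style definition held as a Python dict. The helper name `force_control` is hypothetical, and how the result is submitted depends on your flag management API.

```python
import copy


def force_control(flag_definition: dict, control_variant: str = "control") -> dict:
    """Return a copy of a flagd-style flag definition pinned to control.

    Hypothetical helper: the flag stays ENABLED, so evaluation still
    resolves through the provider instead of falling back to the
    in-code default.
    """
    if control_variant not in flag_definition.get("variants", {}):
        raise KeyError(f"variant {control_variant!r} is not defined")
    halted = copy.deepcopy(flag_definition)
    halted["state"] = "ENABLED"
    halted["defaultVariant"] = control_variant
    # Without targeting rules, every evaluation resolves to defaultVariant
    halted.pop("targeting", None)
    return halted


experiment = {
    "state": "ENABLED",
    "variants": {"treatment": True, "control": False},
    "defaultVariant": "control",
    "targeting": {"fractional": [["treatment", 50], ["control", 50]]},
}
print(force_control(experiment))
```

Returning a copy leaves the original definition available, which makes it straightforward to restore the experiment after the regression has been investigated.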

Step 4 — Calculate minimum sample size before starting

Reading results before the planned sample size is reached produces unreliable conclusions: an underpowered test misses real effects and overstates the ones it does find. Calculate the required sample size per variant before running:

from scipy.stats import norm
import math

def min_sample_per_variant(
    baseline_rate: float,   # e.g. 0.05 for 5% conversion
    min_detectable_effect: float,  # e.g. 0.01 for 1pp lift
    alpha: float = 0.05,    # significance level
    power: float = 0.80,    # 1 - beta
) -> int:
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_effect
    pooled = (p1 + p2) / 2
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    n = (z_alpha * math.sqrt(2 * pooled * (1 - pooled)) +
         z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2 / (p2 - p1) ** 2
    return math.ceil(n)

required = min_sample_per_variant(baseline_rate=0.05, min_detectable_effect=0.01)
print(f"Minimum {required} users per variant before reading results")

Pitfall: “peeking” — checking results before minimum sample is reached and stopping early if you like what you see — inflates the false-positive rate. Commit the sample size before starting and record it in the flag metadata as experiment_min_sample.
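The committed sample size can also be enforced in code, so that analysis tooling refuses to report a result early. A minimal sketch, assuming the flag metadata is available as a dict carrying the `experiment_min_sample` key mentioned above; the function name `ready_to_read` and the shape of `observed` are hypothetical.

```python
def ready_to_read(flag_metadata: dict, observed: dict) -> bool:
    """Return True only when every variant has reached the committed sample.

    `observed` maps variant name to the number of users seen so far.
    """
    min_sample = int(flag_metadata["experiment_min_sample"])
    # An empty mapping means nothing has been observed yet
    return bool(observed) and all(
        count >= min_sample for count in observed.values()
    )


metadata = {"experiment_min_sample": 8000}  # illustrative value
print(ready_to_read(metadata, {"control": 8200, "treatment": 7900}))  # False
print(ready_to_read(metadata, {"control": 8200, "treatment": 8100}))  # True
```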

Verification & Testing

Simulate a regression to confirm the auto-halt fires before running the experiment in production:

# Inject artificial metric values above the guardrail threshold into the metrics store
curl -s -X POST http://metrics.internal/inject \
  -H 'Content-Type: application/json' \
  -d '{"flag":"checkout.payments.express-pay","variant":"treatment","error_rate":0.03}'

# Run the guardrail monitor once and confirm it halts the flag
python guardrail_monitor.py --dry-run=false

# Verify the flag is now disabled or forced to control
flagctl get checkout.payments.express-pay --env prod -o json | jq '.state'
# expect "DISABLED" or defaultVariant "control"

Restore to ENABLED after confirming the halt fires correctly, and clear the injected metrics.

Troubleshooting & FAQ

The guardrail fired but the p-value on the success metric is significant — should I trust the result?

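No. Guardrails are evaluated separately from the success metric, and a breach overrides it: the experiment must halt regardless of whether the primary metric is moving in your favour, because the variant is harming a metric you committed to protecting. A halted experiment has usually also stopped short of its committed sample size, so a significant p-value at that point carries the same peeking bias as any other early read. Investigate the guardrail regression before re-running.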

My experiment has very low traffic — the guardrail is firing on noise. What should I do?

Widen the guardrail threshold to account for variance at low sample sizes, or delay guardrail evaluation until a minimum number of observations (e.g. 100 per variant per metric) have accumulated. Firing on two observations is not informative. Alternatively, reduce the number of concurrent experiments to concentrate traffic on the one being evaluated.
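Delaying evaluation until enough observations exist can be sketched as a three-state check that returns "skip" instead of a verdict when either variant is too thin. The function name `evaluate_error_rate` and the `(error_count, request_count)` input shape are assumptions for illustration; the default of 100 mirrors the example figure above.

```python
def evaluate_error_rate(control: tuple, treatment: tuple,
                        relative_limit: float,
                        min_observations: int = 100) -> str:
    """Return "skip", "ok", or "breach" for one guardrail check.

    Each tuple is (error_count, request_count) for a variant.
    """
    control_errors, control_total = control
    treatment_errors, treatment_total = treatment
    # Require at least one observation so the rates below are defined
    floor = max(min_observations, 1)
    if min(control_total, treatment_total) < floor:
        return "skip"
    control_rate = control_errors / control_total
    treatment_rate = treatment_errors / treatment_total
    if control_rate == 0:
        # No relative delta exists at a zero baseline; flag any errors
        return "breach" if treatment_rate > 0 else "ok"
    relative_delta = (treatment_rate - control_rate) / control_rate
    return "breach" if relative_delta > relative_limit else "ok"


print(evaluate_error_rate((1, 2), (2, 2), relative_limit=0.01))          # skip
print(evaluate_error_rate((10, 1000), (30, 1000), relative_limit=0.01))  # breach
print(evaluate_error_rate((10, 1000), (10, 1000), relative_limit=0.01))  # ok
```

A "skip" result should leave the experiment running and be logged, so that a monitor that never accumulates enough data is itself noticed.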

How do I isolate two concurrent experiments so one does not contaminate the other?

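Assign each experiment a disjoint bucket range within a shared hash space. If experiment A uses buckets 0–49 and experiment B uses buckets 50–99 of the same namespace, the populations never overlap. This only holds when both experiments bucket on the same hash (a shared namespace or salt). If each flag hashes the targetingKey together with its own flag key, the same user lands in unrelated buckets for different flags and the two populations overlap. Independent hashing still spreads one experiment's treatment evenly across the other's variants, which is acceptable when the two changes cannot interact, but it is not isolation. Document the experiment namespace and reserved ranges in your flag registry to prevent accidental overlap.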
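Disjoint ranges in a shared hash space can be sketched as follows. This is an assumption-laden illustration, not the bucketing used by any particular flag provider: the SHA-256 hash, the `RESERVED_RANGES` registry, and the function names are all hypothetical.

```python
import hashlib

# Hypothetical registry: each experiment reserves a half-open bucket range
# within one shared namespace of 100 buckets.
RESERVED_RANGES = {
    "experiment-a": range(0, 50),
    "experiment-b": range(50, 100),
}


def namespace_bucket(namespace: str, targeting_key: str) -> int:
    """Map a user to a stable bucket in [0, 100) for the whole namespace."""
    digest = hashlib.sha256(f"{namespace}:{targeting_key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def experiment_for(namespace: str, targeting_key: str):
    """Return the single experiment this user is eligible for, if any."""
    bucket = namespace_bucket(namespace, targeting_key)
    for name, buckets in RESERVED_RANGES.items():
        if bucket in buckets:
            return name
    return None


print(experiment_for("checkout", "user-123"))
```

Because the hash input is the namespace rather than the flag key, a user's bucket is the same for every experiment in that namespace, which is what makes the reserved ranges mutually exclusive. Each experiment then splits its own range into control and treatment.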

Performance & Scale Considerations

Guardrail monitoring is an out-of-band process: it runs on aggregated metrics, not on the hot evaluation path. The evaluation call itself is identical to any other flag call, a deterministic hash and a threshold comparison that typically completes in microseconds. The only additional cost is the variant attribute attached to outcome events, which is a string tag on a metrics counter or a structured log field. Because the tag takes only as many values as the experiment has variants, it adds very little cardinality to the metrics store.

At high traffic volumes, aggregate metrics per variant in your existing telemetry pipeline rather than querying raw event logs in the guardrail monitor. Pre-aggregated counters (error count per variant per minute) are fast to compare and easy to alert on. The backend evaluation guides cover how to attach variant context to traces without adding latency.
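Per-variant, per-minute counters can be sketched with an in-memory `Counter`. In practice the aggregation would live in the telemetry pipeline, so the `counters` store and the `record` and `error_rate` functions here are hypothetical stand-ins that only show the shape of the data.

```python
from collections import Counter

# Keys are (variant, minute, outcome); values are event counts.
counters: Counter = Counter()


def record(variant: str, epoch_seconds: int, is_error: bool) -> None:
    """Increment the counter for this variant in the event's minute."""
    minute = epoch_seconds // 60
    counters[(variant, minute, "error" if is_error else "ok")] += 1


def error_rate(variant: str, minutes: range) -> float:
    """Error rate for a variant over a window of minute indexes."""
    errors = sum(counters[(variant, m, "error")] for m in minutes)
    total = errors + sum(counters[(variant, m, "ok")] for m in minutes)
    return errors / total if total else 0.0


# Nine successes and one error for treatment, all within minute 0
for second in range(9):
    record("treatment", second, is_error=False)
record("treatment", 30, is_error=True)
print(error_rate("treatment", range(0, 1)))  # 0.1
```

The guardrail monitor then reads a handful of counters per check instead of scanning raw events, so its cost does not grow with traffic.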