Emergency Kill-Switch & Instant-Rollback Runbook

This how-to is part of Managing Flag Deprecation & Cleanup. It is the runbook you reach for when a release behind a flag is actively causing an incident and you need the blast radius gone in seconds — not after a revert, rebuild, and redeploy.

The scenario: a flagged feature (checkout.payments.express-pay) is throwing errors in production, error budget is burning, and you are on call. A kill switch flips the feature to its safe variant for every request, bypassing all targeting logic. Because the change lives in the control plane rather than in your deploy pipeline, it takes effect within the propagation window of your flag sync instead of after a build and deploy. This runbook covers flipping it, proving it propagated, and recovering cleanly.

[Figure: Kill-switch state transitions. Three states: Normal (targeting rules evaluated), Kill switch ON (forced safe variant), Recovered (targeting restored). An incident moves the flag from Normal to Kill switch ON; a verified fix moves it from Kill switch ON to Recovered.]
The kill switch forces a safe variant during the incident; targeting is restored only after the underlying fix is verified.

Prerequisites

Step-by-Step Procedure

Step 1 — Identify the flag and its safe variant

Confirm the exact flag key and the variant that disables the failing behavior before touching anything.

flagctl get checkout.payments.express-pay --env prod -o json | jq '{state, defaultVariant, variants}'

The defaultVariant is normally your target state, but confirm that it actually disables the failing behavior before you force it. If the safe value isn’t obvious, the flag taxonomy metadata should record which variant is fail-safe.

Step 2 — Force the safe variant for all traffic

Override targeting entirely so every evaluation returns the safe variant, regardless of context.

flagctl set checkout.payments.express-pay \
  --env prod --force-variant off --reason "INC-4821 express-pay 5xx" --actor "$USER"

Forcing the variant (rather than disabling the flag) keeps the flag object intact for audit and makes recovery a single inverse command. The --reason and --actor land in the audit trail.
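Because the audit trail depends on --reason being filled in, it can help to wrap the Step 2 command so it refuses to run without one. The following is a minimal sketch, not part of flagctl: the kill_switch function name and the DRY_RUN variable are hypothetical, and the only flags passed through are the ones already shown in Step 2.

```shell
# Sketch: guard the Step 2 command behind a mandatory reason.
# kill_switch and DRY_RUN are hypothetical names used for illustration only.
kill_switch() {
  # usage: kill_switch <flag-key> <incident-id and reason>
  local flag=${1:-} reason=${2:-}
  if [ -z "$flag" ] || [ -z "$reason" ]; then
    echo "usage: kill_switch <flag-key> <incident-id and reason>" >&2
    return 2
  fi
  # Same flags as the Step 2 command; nothing new is passed to flagctl.
  set -- flagctl set "$flag" --env prod --force-variant off \
    --reason "$reason" --actor "${USER:-unknown}"
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"   # print the command (unquoted, for eyeballing) instead of running it
  else
    "$@"
  fi
}
```

Running it once with DRY_RUN=1 during a rehearsal shows the exact command the on-call engineer would issue, without touching production.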

Step 3 — Confirm propagation across every replica

A kill switch you can’t confirm is a guess. Query each replica until they all report the safe variant.

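# Print "<variant> <host>" per replica so stragglers sort together
while read -r host; do
  variant=$(curl -s --max-time 5 "$host/debug/flags/checkout.payments.express-pay" | jq -r '.variant')
  printf '%s %s\n' "${variant:-unreachable}" "$host"
done < replicas.txt | sort     # expect every line to start with "off"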

If some replicas lag, you are watching your sync transport’s propagation window — streaming clears in under a second, polling within one interval.
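To make "query until they all report the safe variant" something a script can gate on, the check can be split into a pure filter and a polling loop with a deadline. This is a sketch under assumptions: the input format ("<host> <variant>" per line), the function names, and the one-second retry are illustrative choices, not part of any flag tooling.

```shell
# Sketch: gate on propagation instead of eyeballing the output.
# Input format "<host> <variant>" per line is an assumption for illustration.
lagging_replicas() {
  # usage: lagging_replicas <expected-variant> < lines
  # Prints hosts not yet on the expected variant.
  # Fails if any host lags or if no input arrived at all.
  awk -v want="$1" '
    { seen++ }
    $2 != want { print $1; lag++ }
    END { if (seen == 0 || lag > 0) exit 1 }
  '
}

wait_for_propagation() {
  # usage: wait_for_propagation <expected-variant> <timeout-seconds> <probe command...>
  # The probe command must print one "<host> <variant>" line per replica.
  local want=$1 deadline=$(( $(date +%s) + $2 )); shift 2
  until "$@" | lagging_replicas "$want" >/dev/null; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 1
  done
}

# Example wiring (hypothetical; reuses the debug endpoint from Step 3):
#   probe() { while read -r h; do
#     printf '%s %s\n' "$h" "$(curl -s "$h/debug/flags/checkout.payments.express-pay" | jq -r .variant)"
#   done < replicas.txt; }
#   wait_for_propagation off 30 probe || echo "replicas still lagging" >&2
```

Treating an empty probe result as a failure is deliberate: a probe that reaches no replicas should not be read as confirmation.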

Verification Step

Confirm the error signal actually stops. Watch the service’s 5xx rate or the failing metric for one full propagation window plus a safety margin:

# Error rate should fall to baseline within the propagation window
watch -n 5 "curl -sG http://metrics.internal/q --data-urlencode 'expr=rate_5xx{service=\"checkout\"}' | jq .value"

The incident is mitigated, not resolved, once the metric returns to baseline. Recovery (restoring normal targeting) happens only after the root-cause fix ships and is verified in staging.
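"Returns to baseline" is easier to act on when it is a rule rather than a judgment call. A minimal sketch, assuming you already have a stream of numeric samples (one per line) and have picked a baseline value and a required streak length for your service; both numbers and the at_baseline name are assumptions, not values from this runbook.

```shell
# Sketch: decide "back to baseline" from a stream of samples.
# The baseline and the required streak are per-service assumptions.
at_baseline() {
  # usage: at_baseline <baseline> <required-consecutive-samples> < one numeric sample per line
  # Succeeds as soon as <required> samples in a row are at or below <baseline>.
  awk -v base="$1" -v need="$2" '
    ($1 + 0) <= (base + 0) {
      streak++
      if (streak >= (need + 0)) { ok = 1; exit }
      next
    }
    { streak = 0 }   # any sample above baseline resets the streak
    END { if (!ok) exit 1 }
  '
}
```

With samples taken every 5 seconds, a streak of 12 corresponds to one quiet minute; choose the streak so it covers the propagation window plus your safety margin.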

Gotchas & Edge Cases

Troubleshooting & FAQ

The kill switch fired but errors continue on a few hosts — why?

Those hosts are still serving a cached or pre-switch rule set. Check their resolved variant directly (Step 3). If they lag, the cause is either a local cache TTL that has not yet expired or a dropped sync connection. The dropped connection is the usual culprit, so verify each replica’s connection state first.

Should I disable the flag or force a variant?

Force the safe variant. Disabling reverts to the in-code default, which is not guaranteed to match the control-plane safe state, and it loses the explicit intent in the audit log.

How do I make sure a kill switch is always fast enough?

Keep flags that guard incident-prone features on a streaming transport with a tight fallback poll, and rehearse the runbook so propagation latency is a known number before an incident, not a discovery during one.
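One way to turn a rehearsal into that known number is to record when the switch was flipped and when each replica first reported the safe variant, then take the slowest replica. The sketch below assumes you have collected "<host> <epoch-seconds>" lines yourself; the input format and the propagation_latency name are illustrative assumptions.

```shell
# Sketch: compute kill-switch latency from a rehearsal.
# Each input line is "<host> <epoch-seconds>", the moment that replica
# first reported the safe variant (format is an assumption).
propagation_latency() {
  # usage: propagation_latency <flip-epoch-seconds> < lines
  # Prints the slowest replica's lag in seconds; fails on empty input.
  awk -v flip="$1" '
    { lag = $2 - flip; if (NR == 1 || lag > worst) worst = lag }
    END { if (NR == 0) exit 1; print worst }
  '
}
```

The slowest replica, not the average, is the figure to track, because the incident is only mitigated once the last replica has switched.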