Emergency Kill-Switch & Instant-Rollback Runbook
This how-to is part of Managing Flag Deprecation & Cleanup. It is the runbook you reach for when a release behind a flag is actively causing an incident and you need the blast radius gone in seconds — not after a revert, rebuild, and redeploy.
The scenario: a flagged feature (checkout.payments.express-pay) is throwing errors in production, error budget is burning, and you are on call. A kill switch flips the feature to its safe variant for every request instantly, bypassing all targeting logic, because the change lives in the control plane rather than in your deploy pipeline. This runbook covers flipping it, proving it propagated, and recovering cleanly.
Prerequisites
defaultVariantknown and documented in the flag metadataflagctl(or your provider CLI) authenticated against production- polling vs streaming
Step-by-Step Procedure
Step 1 — Identify the flag and its safe variant
Confirm the exact flag key and the variant that disables the failing behavior before touching anything.
flagctl get checkout.payments.express-pay --env prod -o json | jq '{state, defaultVariant, variants}'
The defaultVariant is your target state. If the safe value isn’t obvious, the flag taxonomy metadata should record which variant is fail-safe.
Step 2 — Force the safe variant for all traffic
Override targeting entirely so every evaluation returns the safe variant, regardless of context.
flagctl set checkout.payments.express-pay \
--env prod --force-variant off --reason "INC-4821 express-pay 5xx" --actor "$USER"
Forcing the variant (rather than disabling the flag) keeps the flag object intact for audit and makes recovery a single inverse command. The --reason and --actor land in the audit trail.
Step 3 — Confirm propagation across every replica
A kill switch you can’t confirm is a guess. Query each replica until they all report the safe variant.
for host in $(cat replicas.txt); do
printf '%s ' "$host"; curl -s "$host/debug/flags/checkout.payments.express-pay" | jq -r '.variant'
done | sort | uniq -c # expect every line to read "off"
If some replicas lag, you are watching your sync transport’s propagation window — streaming clears in under a second, polling within one interval.
Verification Step
Confirm the error signal actually stops. Watch the service’s 5xx rate or the failing metric for one full propagation window plus a safety margin:
# Error rate should fall to baseline within the propagation window
watch -n 5 'curl -s http://metrics.internal/q?expr=rate_5xx{service="checkout"} | jq .value'
The incident is mitigated — not resolved — once the metric returns to baseline. Recovery (Step: restore targeting) happens only after the root-cause fix ships and is verified in staging.
Gotchas & Edge Cases
- Forced variant vs. flag disable: disabling a flag falls back to the SDK-supplied code default, which may differ from the control-plane
defaultVariant. Force the explicit safe variant so behavior is deterministic. - Cached evaluations: a long local TTL can keep a node serving the old variant past the propagation window. Confirm per-replica (Step 3) rather than trusting a single healthy node.
- Restoring too early: flipping targeting back before the fix is verified re-triggers the incident and burns trust in the kill switch. Gate recovery on a verified fix, and record it in the audit trail.
Troubleshooting & FAQ
The kill switch fired but errors continue on a few hosts — why?
Those hosts are still serving a cached or pre-switch rule set. Check their resolved variant directly (Step 3); if they lag, your local cache TTL or a dropped streaming connection is the cause. A failed sync connection is the usual culprit — verify each replica’s connection state.
Should I disable the flag or force a variant?
Force the safe variant. Disabling reverts to the in-code default, which is not guaranteed to match the control-plane safe state, and it loses the explicit intent in the audit log.
How do I make sure a kill switch is always fast enough?
Keep the failing-feature flags on a streaming transport with a tight fallback poll, and rehearse the runbook so propagation latency is a known number before an incident, not a discovery during one.