Emergency Kill-Switch & Instant-Rollback Runbook

Q: The kill switch fired but errors continue on a few hosts — why?

Those hosts are serving a cached or pre-switch rule set. Check each replica's resolved variant directly; a long local cache TTL or a dropped streaming connection is the usual cause.

Q: How do I make sure a kill switch is always fast enough?

Keep failing-feature flags on a streaming transport with a tight fallback poll, and rehearse the runbook so propagation latency is a known number before an incident.

This how-to is part of Managing Flag Deprecation & Cleanup. It is the runbook you reach for when a release behind a flag is actively causing an incident and you need the blast radius gone in seconds — not after a revert, rebuild, and redeploy.

The scenario: a flagged feature (checkout.payments.express-pay) is throwing errors in production, error budget is burning, and you are on call. A kill switch flips the feature to its safe variant for every request instantly, bypassing all targeting logic, because the change lives in the control plane rather than in your deploy pipeline. This runbook covers flipping it, proving it propagated, and recovering cleanly.

The kill switch forces a safe variant during the incident; targeting is restored only after the underlying fix is verified.

Prerequisites

Write access to the production flag control plane (or a break-glass credential)
The flag’s safe defaultVariant known and documented in the flag metadata
flagctl (or your provider CLI) authenticated against production
A way to query each replica’s resolved variant (a debug endpoint or trace query)
Sync transport propagation budget known — see polling vs streaming

Step-by-Step Procedure

Step 1 — Identify the flag and its safe variant

Confirm the exact flag key and the variant that disables the failing behavior before touching anything.

flagctl get checkout.payments.express-pay --env prod -o json | jq '{state, defaultVariant, variants}'

The defaultVariant is your target state. If the safe value isn’t obvious, the flag taxonomy metadata should record which variant is fail-safe.

Step 2 — Force the safe variant for all traffic

Override targeting entirely so every evaluation returns the safe variant, regardless of context.

flagctl set checkout.payments.express-pay \
  --env prod --force-variant off --reason "INC-4821 express-pay 5xx" --actor "$USER"

Forcing the variant (rather than disabling the flag) keeps the flag object intact for audit and makes recovery a single inverse command. The --reason and --actor land in the audit trail.

Step 3 — Confirm propagation across every replica

A kill switch you can’t confirm is a guess. Query each replica until they all report the safe variant.

for host in $(cat replicas.txt); do
  printf '%s ' "$host"; curl -s "$host/debug/flags/checkout.payments.express-pay" | jq -r '.variant'
done | sort | uniq -c     # expect every line to read "off"

If some replicas lag, you are watching your sync transport’s propagation window — streaming clears in under a second, polling within one interval.

Verification Step

Confirm the error signal actually stops. Watch the service’s 5xx rate or the failing metric for one full propagation window plus a safety margin:

# Error rate should fall to baseline within the propagation window
watch -n 5 'curl -s http://metrics.internal/q?expr=rate_5xx{service="checkout"} | jq .value'

The incident is mitigated — not resolved — once the metric returns to baseline. Recovery (Step: restore targeting) happens only after the root-cause fix ships and is verified in staging.

Gotchas & Edge Cases

Forced variant vs. flag disable: disabling a flag falls back to the SDK-supplied code default, which may differ from the control-plane defaultVariant. Force the explicit safe variant so behavior is deterministic.
Cached evaluations: a long local TTL can keep a node serving the old variant past the propagation window. Confirm per-replica (Step 3) rather than trusting a single healthy node.
Restoring too early: flipping targeting back before the fix is verified re-triggers the incident and burns trust in the kill switch. Gate recovery on a verified fix, and record it in the audit trail.

Troubleshooting & FAQ

The kill switch fired but errors continue on a few hosts — why?

Those hosts are still serving a cached or pre-switch rule set. Check their resolved variant directly (Step 3); if they lag, your local cache TTL or a dropped streaming connection is the cause. A failed sync connection is the usual culprit — verify each replica’s connection state.

Should I disable the flag or force a variant?

Force the safe variant. Disabling reverts to the in-code default, which is not guaranteed to match the control-plane safe state, and it loses the explicit intent in the audit log.

How do I make sure a kill switch is always fast enough?

Keep the failing-feature flags on a streaming transport with a tight fallback poll, and rehearse the runbook so propagation latency is a known number before an incident, not a discovery during one.