Automated flag cleanup scripts for stale toggles
Orphaned feature flags degrade system observability and inflate configuration drift. This guide delivers a deterministic, CI/CD-integrated workflow for identifying and safely removing stale toggles. Execution follows strict quarantine protocols to minimize MTTR and prevent production regressions.
Symptom Identification: Detecting Stale Toggles in Production
Accurate detection requires correlating evaluation telemetry with active code references. You must isolate flags that show zero evaluation traffic, orphaned SDK references, and mismatched rollout percentages across environments. Log aggregation and direct API polling provide the necessary signal-to-noise ratio for automated triage.
Execute the following diagnostic sequence:
- Query feature flag evaluation logs for flags with
0hits over a 30-day rolling window using your log aggregation platform. - Cross-reference flag keys against active codebase branches using AST parsing or regex-based
grepto confirm SDK removal. - Identify flags stuck in
partial rolloutstates with no recent targeting rule updates or assigned owners.
When establishing baseline metrics for flag health, align your detection thresholds with established practices in Managing Flag Deprecation and Cleanup to ensure SLA compliance before triggering automated workflows.
Root Cause Analysis: Why Toggles Become Orphaned
Abandoned flags typically stem from lifecycle breakdowns during development and promotion cycles. Common vectors include abandoned pull requests, missing deprecation metadata, environment promotion gaps, and manual override bypasses. Tracing these failures prevents recurrence and enforces strict metadata tagging at creation time.
Run these root cause diagnostics:
- Audit commit history for flag key removals versus actual SDK deprecation calls to identify incomplete refactors.
- Check CI/CD pipelines for missing pre-merge validation of flag usage in production branches.
- Analyze environment promotion logs to detect flags promoted to prod without corresponding cleanup tickets or sunset dates.
Understanding these failure modes requires mapping them to the broader Feature Flag Architecture & Lifecycle Management framework to prevent recurrence and standardize metadata tagging at creation time.
Immediate Mitigation: Safe Flag Quarantine & Dry-Run Validation
Never execute destructive cleanup without a non-destructive quarantine phase. Isolating stale toggles first prevents accidental feature regression and preserves audit trails for compliance reviews. Shadow evaluation modes allow you to validate hypothetical states without impacting end-user traffic.
Implement these validation controls:
- Configure shadow evaluation mode to log hypothetical flag states without altering user experience.
- Run dry-run scripts against staging to verify no hidden dependencies trigger fallback logic.
- Validate rollback thresholds and alert routing for unexpected evaluation spikes post-quarantine.
import requests
import datetime
FLAG_API = 'https://api.flags.example.com/v1'
STALE_THRESHOLD_DAYS = 30
def quarantine_stale_flags():
flags = requests.get(f'{FLAG_API}/flags').json()
stale = []
for f in flags:
last_eval = datetime.datetime.fromisoformat(f['last_evaluated'])
if (datetime.datetime.now() - last_eval).days > STALE_THRESHOLD_DAYS:
# Quarantine: disable targeting, keep key active for audit
requests.patch(f'{FLAG_API}/flags/{f["id"]}', json={'status': 'quarantined', 'targeting': False})
stale.append(f['key'])
return stale
Long-Term Resolution: Building the Automated Cleanup Pipeline
Sustainable flag hygiene requires a scheduled, idempotent automation pipeline integrated into your CI/CD workflow. The pipeline must safely delete confirmed stale toggles, enforce strict naming conventions, and generate compliance-ready audit trails. All deletion events must route to centralized logging for forensic rollback capability.
Deploy these pipeline safeguards:
- Schedule cron-based execution with idempotent API calls to prevent duplicate deletions.
- Implement pre-deletion validation: verify zero traffic, confirm no active targeting rules, check cross-environment sync.
- Route deletion events to centralized logging for compliance and rollback capability.
#!/bin/bash
set -euo pipefail
QUARANTINED_FLAGS=$(curl -s -H "Authorization: Bearer $FLAG_API_TOKEN" \
"https://api.flags.example.com/v1/flags?status=quarantined&traffic=0" | jq -r '.[].key')
for flag in $QUARANTINED_FLAGS; do
echo "Deleting stale flag: $flag"
curl -X DELETE -H "Authorization: Bearer $FLAG_API_TOKEN" \
"https://api.flags.example.com/v1/flags/$flag" \
--header "X-Audit-Reason: automated-stale-cleanup"
done
echo "Cleanup complete. Audit logs updated."
Edge Case Handling: Cross-Environment Dependencies & Rollback Safeguards
Shared flags across multiple environments introduce configuration drift risks during bulk deletion. You must preserve manual overrides and respect compliance constraints before executing final removal. Automated rollback triggers act as your primary safety net against evaluation anomalies.
Apply these edge case controls:
- Map environment-specific flag variants before deletion to prevent configuration drift.
- Implement a 72-hour grace period with automated rollback triggers if evaluation anomalies occur.
- Validate RBAC constraints to ensure only authorized service accounts execute deletion endpoints.