How to Prevent Flag Sprawl in Microservices
This how-to is part of Designing a Scalable Flag Taxonomy. Sprawl is the predictable consequence of skipping that taxonomy: flags accumulate across services with no owner, no expiry, and no automated path to removal. When a microservice carries more than 15 active flags, evaluation overhead climbs, targeting logic becomes a maze, and incident responders waste minutes identifying which flag is relevant. This guide gives you a concrete procedure to measure sprawl, stop new flags from making it worse, and remove the ones that are already stale.
Prerequisites
- ajv-cli (or equivalent JSON Schema validator) available in CI
- owner, expiry, and state populated per the flag taxonomy
Step-by-Step Procedure
Step 1 — Measure your current sprawl baseline
Before you can stop sprawl, you need to know its shape. Count active flags per service, identify zero-evaluation flags over a rolling window, and flag any entries with missing metadata.
#!/usr/bin/env bash
# sprawl-audit.sh — generates a per-service flag density report
set -euo pipefail
REGISTRY="flags/registry.json"
EVAL_METRICS_URL="http://prometheus.internal/api/v1/query"
WINDOW="7d"
echo "=== Flag density by service ==="
jq -r 'to_entries[] | select(.value.metadata.state == "active") | "\(.value.metadata.service // "unknown") \(.key)"' "$REGISTRY" \
| sort | awk '{counts[$1]++} END {for (s in counts) print counts[s], s}' \
| sort -rn
echo ""
echo "=== Flags missing mandatory metadata ==="
jq -r 'to_entries[]
| select(.value.metadata.owner == null or .value.metadata.expiry == null)
| .key' "$REGISTRY"
echo ""
echo "=== Zero-evaluation flags (last ${WINDOW}) ==="
# Query Prometheus for flags with zero evaluations
curl -sG "$EVAL_METRICS_URL" \
--data-urlencode "query=sum by (flag_key)(increase(flag_evaluations_total[${WINDOW}])) == 0" \
| jq -r '.data.result[].metric.flag_key'
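Any service reporting more than 15 active flags warrants an immediate cleanup sprint. Zero-evaluation flags are safe removal candidates once verified in staging. Note that the Prometheus query only returns flags that reported the flag_evaluations_total metric during the window; a flag that never emitted the metric produces no result at all, so also compare the output against the full list of registry keys.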
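The per-service count is easy to reproduce outside the shell when you want it in a test or a dashboard job. The following is a minimal sketch, assuming the registry has been loaded into a dict with the same metadata.service and metadata.state fields the audit script reads; the function name and the example registry are hypothetical.

```python
from collections import Counter

AUDIT_THRESHOLD = 15  # the audit prompt used in this guide; tune per service


def services_over_threshold(registry, threshold=AUDIT_THRESHOLD):
    """Return {service: active_flag_count} for services above the threshold."""
    counts = Counter(
        flag.get("metadata", {}).get("service", "unknown")
        for flag in registry.values()
        if flag.get("metadata", {}).get("state") == "active"
    )
    return {svc: n for svc, n in counts.items() if n > threshold}


# Hypothetical registry: 16 active checkout flags and 1 deprecated api flag.
example = {
    f"checkout.cart.flag{i}": {"metadata": {"service": "checkout", "state": "active"}}
    for i in range(16)
}
example["api.auth.old"] = {"metadata": {"service": "api", "state": "deprecated"}}

print(services_over_threshold(example))
```

Only active flags are counted, so deprecated and archived entries do not push a service over the line.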
Cross-link: if you are also seeing latency above 50ms on flag evaluations, the root cause is usually rule-engine overhead from too many active rules — see optimizing rule engine performance.
Step 2 — Block new sprawl at the CI gate
Sprawl prevention starts at flag creation. A CI check that requires owner, expiry, and a valid key format stops poorly structured flags from ever reaching production.
// validate-flags.js — run as a CI pre-merge check (Node.js)
const Ajv = require('ajv').default;
const addFormats = require('ajv-formats');
const { readFileSync } = require('fs');
const ajv = new Ajv({ allErrors: true });
addFormats(ajv);
const schema = {
type: 'object',
required: ['owner', 'type', 'created', 'expiry', 'state', 'defaultVariant'],
properties: {
owner: { type: 'string', minLength: 2 },
type: { enum: ['release', 'experiment', 'ops', 'kill'] },
created: { type: 'string', format: 'date' },
expiry: { type: 'string', format: 'date' },
state: { enum: ['draft', 'active', 'deprecated', 'archived'] },
defaultVariant: { type: 'string' },
}
};
const validate = ajv.compile(schema);
const KEY_REGEX = /^(kill|exp|ops|[a-z][a-z0-9]*)(\.[a-z][a-z0-9-]*)(\.[a-z][a-z0-9-]*)$/;
const registry = JSON.parse(readFileSync('flags/registry.json', 'utf8'));
let failed = false;
for (const [key, flag] of Object.entries(registry)) {
if (!KEY_REGEX.test(key)) {
console.error(`FAIL key format: ${key}`);
failed = true;
}
const meta = flag.metadata ?? {};
if (!validate(meta)) {
console.error(`FAIL metadata for ${key}:`, validate.errors.map(e => e.message).join(', '));
failed = true;
}
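// Deprecated and archived flags are exempt, so moving an expired flag to "deprecated" clears this failure.
if (!['deprecated', 'archived'].includes(meta.state) && meta.expiry && new Date(meta.expiry) < new Date()) {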
console.error(`FAIL past expiry: ${key} expired ${meta.expiry}`);
failed = true;
}
}
if (failed) process.exit(1);
console.log('All flags passed validation.');
Wire this script into your CI pipeline as a required check on any PR that modifies flags/. It enforces the key schema from naming conventions for feature flag keys without requiring manual review.
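To see what the key pattern accepts before wiring it into CI, it helps to try it against a few keys. This sketch ports KEY_REGEX from validate-flags.js to Python unchanged; the sample keys are made up for illustration.

```python
import re

# Same pattern as KEY_REGEX in validate-flags.js, ported to Python.
KEY_REGEX = re.compile(
    r"^(kill|exp|ops|[a-z][a-z0-9]*)(\.[a-z][a-z0-9-]*)(\.[a-z][a-z0-9-]*)$"
)


def is_valid_key(key):
    """True when the key has exactly three lowercase, dot-separated segments."""
    return KEY_REGEX.fullmatch(key) is not None


# Hypothetical keys, chosen to exercise each rule in the pattern.
samples = [
    "checkout.cart.new-summary",  # three segments, hyphen allowed after the first
    "kill.payments.provider",     # reserved prefix
    "Checkout.cart.summary",      # uppercase is rejected
    "checkout.summary",           # only two segments
    "checkout.cart.summary.v2",   # four segments
    "api-gateway.auth.token",     # hyphen not allowed in the first segment
]
for key in samples:
    print(f"{key}: {'ok' if is_valid_key(key) else 'rejected'}")
```

The pattern requires exactly three segments, so both shorter and longer keys fail, and hyphens are allowed only in the second and third segments.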
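Pitfall: adding this check retroactively to an existing repo will fail on every legacy key on day one. Run it in a warn-only mode for two weeks while teams fix violations, then flip to hard-fail. The script above has no --warn option, so you would need to add one that logs failures but exits with status 0.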
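The warn-then-enforce rollout comes down to one decision: whether failures change the exit code. Below is a minimal sketch of that decision in Python. The --warn flag and both function names are assumptions for illustration, not part of validate-flags.js, and the same logic would apply in a Node.js version.

```python
import argparse


def exit_code(failures, warn_only):
    """Map validation failures to a process exit code."""
    if failures and not warn_only:
        return 1  # hard-fail mode: any failure blocks the merge
    return 0      # warn mode, or nothing failed


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Flag validation wrapper (sketch)")
    # Hypothetical flag: report failures but let the pipeline pass.
    parser.add_argument("--warn", action="store_true", help="report failures but exit 0")
    return parser.parse_args(argv)


args = parse_args(["--warn"])
failures = ["FAIL key format: LegacyFlag"]  # made-up failure for the demo
for line in failures:
    print("WARN" if args.warn else "ERROR", line)
print("exit code:", exit_code(failures, args.warn))
```

Keeping the failure messages identical in both modes means teams see exactly what will block them once the check flips to hard-fail.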
Step 3 — Identify and quarantine zero-evaluation flags
Flags that receive no evaluations over 7 days are candidates for immediate deprecation. Before removing them, verify they are not guarding live code paths that only run on rare events, such as scheduled jobs or incident handling.
#!/usr/bin/env python3
"""quarantine-stale.py: marks zero-eval flags as deprecated and reports their owners."""
import json
import sys
from datetime import datetime, timezone
from pathlib import Path

REGISTRY_PATH = Path("flags/registry.json")
registry = json.loads(REGISTRY_PATH.read_text())

# zero_eval_keys comes from your metrics query (Prometheus, ClickHouse, etc.).
# This script reads whitespace-separated flag keys from stdin.
zero_eval_keys = set(sys.stdin.read().split()) if not sys.stdin.isatty() else set()

today = datetime.now(timezone.utc).date().isoformat()
changes = []
for key, flag in registry.items():
    meta = flag.get("metadata", {})
    # Kill switches are expected to be idle, so never auto-deprecate them.
    if key in zero_eval_keys and meta.get("state") == "active" and meta.get("type") != "kill":
        meta["state"] = "deprecated"
        meta["deprecated_at"] = today
        changes.append((key, meta.get("owner", "unknown")))

# Rewrite the registry only when something actually changed.
if changes:
    REGISTRY_PATH.write_text(json.dumps(registry, indent=2) + "\n")

for key, owner in changes:
    print(f"DEPRECATED {key} (owner: {owner}): zero evaluations in last 7d")
    # In practice: POST to Slack/PagerDuty/Jira with owner and remediation link
Pitfall: a flag guarding a quarterly billing job may show zero evaluations for weeks. Cross-reference type: "ops" and type: "experiment" flags against a calendar of expected evaluation events before mass-deprecating.
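One way to apply that cross-reference is to hold back any flag whose expected evaluation cadence is longer than the metrics window. The sketch below assumes a hypothetical cadence_days mapping (flag key to expected days between evaluations) that you would maintain alongside the calendar; it is not a field defined by the taxonomy.

```python
def deprecation_candidates(registry, zero_eval_keys, cadence_days, window_days=7):
    """Split active zero-eval flags into (candidates, held_back)."""
    candidates, held_back = [], []
    for key in sorted(zero_eval_keys):
        meta = registry.get(key, {}).get("metadata", {})
        if meta.get("state") != "active":
            continue  # already deprecated/archived, or not in the registry
        # Hold back kill switches and flags that evaluate less often than the window.
        if meta.get("type") == "kill" or cadence_days.get(key, 0) > window_days:
            held_back.append(key)
        else:
            candidates.append(key)
    return candidates, held_back


# Hypothetical registry and calendar data.
registry = {
    "ops.billing.quarterly-close": {"metadata": {"state": "active", "type": "ops"}},
    "checkout.cart.old-banner": {"metadata": {"state": "active", "type": "release"}},
    "kill.payments.provider": {"metadata": {"state": "active", "type": "kill"}},
}
cadence_days = {"ops.billing.quarterly-close": 90}

candidates, held_back = deprecation_candidates(registry, set(registry), cadence_days)
print("deprecate:", candidates)
print("review manually:", held_back)
```

Held-back flags still deserve a look, but by a person rather than by the quarantine script.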
Step 4 — Enforce namespace ownership to prevent cross-service conflicts
The most common cause of cross-team flag collisions is a shared namespace with no ownership boundary. Add a namespace ownership file and a CI check that prevents any team from writing into another team’s namespace.
# .service-owners.json — maps each namespace to the repo that owns it
# {
# "checkout": "org/checkout-service",
# "api": "org/api-gateway",
# "ops": ["org/platform", "org/sre"], <-- shared: array of authorized repos
# "exp": ["org/growth", "org/platform"]
# }
# namespace-check.sh — run in CI on any PR touching flags/
set -euo pipefail
OWNERS_FILE=".service-owners.json"
CHANGED_FILES=$(git diff --name-only HEAD~1 HEAD | grep '^flags/' || true)
for file in $CHANGED_FILES; do
namespaces=$(jq -r 'keys[]' "$file" | cut -d'.' -f1 | sort -u)
for ns in $namespaces; do
# Use -c (not -r) so string owners stay valid JSON for the jq type check below
authorized=$(jq -c --arg ns "$ns" '.[$ns] // empty' "$OWNERS_FILE")
if [ -z "$authorized" ]; then
echo "ERROR: namespace '$ns' has no registered owner in $OWNERS_FILE"; exit 1
fi
# Handle both string and array values
if ! echo "$authorized" | jq -e --arg repo "$GITHUB_REPOSITORY" \
'if type == "array" then any(. == $repo) else . == $repo end' > /dev/null 2>&1; then
echo "ERROR: $GITHUB_REPOSITORY is not authorized for namespace '$ns'"; exit 1
fi
done
done
echo "Namespace ownership check passed."
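The string-or-array rule in this check is the part most worth unit testing, and that is easier with a small pure function than with shell. Here is a sketch in Python using the sample mapping from the comment block above; the function name is_authorized is hypothetical.

```python
def is_authorized(owners, namespace, repo):
    """True when repo may write flags under namespace."""
    entry = owners.get(namespace)
    if entry is None:
        return False  # unregistered namespaces are rejected outright
    if isinstance(entry, list):
        return repo in entry  # shared namespace: any listed repo is allowed
    return entry == repo      # single-owner namespace


# Sample mapping taken from the .service-owners.json comment above.
owners = {
    "checkout": "org/checkout-service",
    "api": "org/api-gateway",
    "ops": ["org/platform", "org/sre"],
    "exp": ["org/growth", "org/platform"],
}

print(is_authorized(owners, "checkout", "org/checkout-service"))
print(is_authorized(owners, "ops", "org/sre"))
print(is_authorized(owners, "checkout", "org/api-gateway"))
```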
Verification
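After all four steps are in place, run the audit script from Step 1 again and compare the result with the report you saved before starting (the commands below assume you saved it as sprawl-report-before.txt):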
bash sprawl-audit.sh 2>/dev/null | tee sprawl-report-$(date +%F).txt
diff sprawl-report-before.txt sprawl-report-$(date +%F).txt
Success looks like: flag count per service trending down or flat; zero flags with missing owner or expiry; zero flags past their expiry in active state; CI blocking every new flag that fails schema validation.
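Two of these criteria can be checked mechanically against the registry itself. The sketch below assumes the registry is loaded as a dict and that expiry is an ISO date string, as the CI schema requires; the function name and sample flags are hypothetical.

```python
from datetime import date


def verification_failures(registry, today=None):
    """List flags that violate the metadata and expiry success criteria."""
    today = today or date.today()
    problems = []
    for key, flag in registry.items():
        meta = flag.get("metadata", {})
        if not meta.get("owner") or not meta.get("expiry"):
            problems.append(f"{key}: missing owner or expiry")
            continue
        expired = date.fromisoformat(meta["expiry"]) < today
        if meta.get("state") == "active" and expired:
            problems.append(f"{key}: active but expired {meta['expiry']}")
    return problems


# Hypothetical registry covering each case.
example = {
    "checkout.cart.a": {"metadata": {"owner": "checkout", "expiry": "2025-06-01", "state": "active"}},
    "checkout.cart.b": {"metadata": {"expiry": "2025-06-01", "state": "active"}},
    "checkout.cart.c": {"metadata": {"owner": "checkout", "expiry": "2024-12-31", "state": "active"}},
    "checkout.cart.d": {"metadata": {"owner": "checkout", "expiry": "2024-12-31", "state": "deprecated"}},
}
for problem in verification_failures(example, today=date(2025, 1, 15)):
    print(problem)
```

An empty result means both criteria hold; the per-service trend and the CI gate still need to be confirmed from the audit report and pipeline history.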
Gotchas & Edge Cases
- Shared experiment flags: exp. prefix flags for A/B tests typically span multiple services. Track them under the team running the experiment, not the services evaluating them, and include an analysis_window end date rather than an arbitrary TTL.
- Kill switches are long-lived by design: kill. prefix flags are operational safety valves. Do not deprecate them just because they have no traffic; they should be evaluated only during an incident. Exclude type: "kill" from zero-eval scanning.
- Forked evaluation paths in tests: integration test suites often evaluate flags in contexts that never reach production metrics. Do not count test-harness evaluations as evidence that a flag is active; filter by environment label in your metrics query.
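The last two points can be folded into the zero-evaluation scan. The sketch below builds the Step 1 query with an environment filter and drops kill switches from the result by key prefix. The label name environment and its value production are assumptions about how your metrics are labelled, and both function names are hypothetical.

```python
def zero_eval_query(window="7d", environment="production"):
    """Build the Step 1 PromQL query, restricted to one environment label."""
    return (
        "sum by (flag_key)"
        f'(increase(flag_evaluations_total{{environment="{environment}"}}[{window}])) == 0'
    )


def drop_kill_switches(flag_keys):
    """Remove kill. prefix flags, which are expected to sit idle."""
    return [key for key in flag_keys if not key.startswith("kill.")]


print(zero_eval_query())
# Hypothetical query result.
print(drop_kill_switches(["kill.payments.provider", "checkout.cart.old-banner"]))
```

Filtering kill switches after the query keeps the PromQL simple and avoids escaping a regex matcher inside the query string.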
Troubleshooting & FAQ
The CI metadata check is rejecting flags that were valid last week — why?
Most likely the expiry date crossed into the past, which the validator rejects as a past-expiry error. The flag owner must either extend the expiry (if the feature is still in rollout) or transition the flag to deprecated and open a code-removal PR. Check meta.expiry against today’s date and act on whichever path applies.
How do we handle flags owned by engineers who left the company?
The owner field should always be a team name, not an individual. If you have legacy flags with individual owners, re-assign them to the team that owns the service during your baseline cleanup sprint. Add a CI check that rejects any owner value not present in your team registry.
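A check of that kind can be very small. The following sketch assumes the team registry is available as a set of team names; TEAM_REGISTRY, the function name, and the sample flags are all hypothetical.

```python
# Hypothetical team registry; in practice this might come from a CODEOWNERS-style file.
TEAM_REGISTRY = {"checkout", "platform", "growth"}


def unknown_owners(registry, teams=TEAM_REGISTRY):
    """Return {flag_key: owner} for owners that are missing or not a known team."""
    bad = {}
    for key, flag in registry.items():
        owner = flag.get("metadata", {}).get("owner")
        if owner not in teams:
            bad[key] = owner
    return bad


# Made-up flags: one team-owned, one owned by an individual, one with no owner.
example = {
    "checkout.cart.new-summary": {"metadata": {"owner": "checkout"}},
    "checkout.cart.old-banner": {"metadata": {"owner": "jane.doe"}},
    "api.auth.token-refresh": {"metadata": {}},
}
print(unknown_owners(example))
```

In CI, a non-empty result would fail the build in the same way as the metadata check in Step 2.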
Is 15 flags per service a hard limit?
It’s a calibrated threshold, not a law. High-traffic services with active A/B test programs may legitimately carry 20–30 flags. The metric that actually matters is zero-evaluation flags — those are unambiguous waste. Use 15 as a prompt to audit, not as an automatic trigger for deletion.