How to Prevent Flag Sprawl in Microservices

This how-to is part of Designing a Scalable Flag Taxonomy. Sprawl is the predictable consequence of skipping that taxonomy: flags accumulate across services with no owner, no expiry, and no automated path to removal. When a microservice carries more than 15 active flags, evaluation overhead climbs, targeting logic becomes a maze, and incident responders waste minutes identifying which flag is relevant. This guide gives you a concrete procedure to measure sprawl, stop new flags from making it worse, and remove the ones that are already stale.

[Figure: Flag sprawl prevention pipeline. PR / CI gate (schema lint, namespace check) -> Flag registry (owner, expiry, state, type) -> Nightly scanner (zero-eval, expired) -> Cleanup workflow (deprecate, audit, code removal PR). Flags that fail the gate are rejected; the scanner passes a stale flags report to the cleanup workflow.]
Every new flag passes through a CI gate that checks schema and namespace; a nightly scanner surfaces stale candidates; a cleanup workflow retires them with an audit record.

Prerequisites

Step-by-Step Procedure

Step 1 — Measure your current sprawl baseline

Before you can stop sprawl, you need to know its shape. Count active flags per service, identify flags with zero evaluations over a rolling window, and list any entries with missing metadata.

#!/usr/bin/env bash
# sprawl-audit.sh — generates a per-service flag density report
set -euo pipefail

REGISTRY="flags/registry.json"
EVAL_METRICS_URL="http://prometheus.internal/api/v1/query"
WINDOW="7d"

echo "=== Flag density by service ==="
jq -r 'to_entries[] | "\(.value.metadata.service // "unknown") \(.key)"' "$REGISTRY" \
  | sort | awk '{counts[$1]++} END {for (s in counts) print counts[s], s}' \
  | sort -rn

echo ""
echo "=== Flags missing mandatory metadata ==="
jq -r 'to_entries[]
  | select(.value.metadata.owner == null or .value.metadata.expiry == null)
  | .key' "$REGISTRY"

echo ""
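echo "=== Zero-evaluation flags (last ${WINDOW}) ==="
# Query Prometheus for flags with zero evaluations over the window.
# Note: a flag that never emitted the metric has no series and will not appear here;
# compare this list against the registry to catch those.
# -f makes curl exit non-zero on HTTP errors so pipefail can surface them.
curl -sfG "$EVAL_METRICS_URL" \
  --data-urlencode "query=sum by (flag_key)(increase(flag_evaluations_total[${WINDOW}])) == 0" \
  | jq -r '.data.result[].metric.flag_key'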

Any service reporting more than 15 active flags warrants an audit, and usually a cleanup sprint. Zero-evaluation flags are removal candidates once you have verified in staging that nothing depends on them.
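The threshold check can also be automated. The following is a minimal Python sketch, assuming the registry shape implied by the audit script (each flag key maps to an object whose `metadata` block carries `service` and `state`). The function name `services_over_threshold` and the sample data are illustrative only. Unlike the shell report, this sketch counts only flags whose state is `active`.

```python
from collections import Counter

def services_over_threshold(registry, threshold=15):
    """Return {service: active_flag_count} for services above the threshold."""
    counts = Counter(
        flag.get("metadata", {}).get("service", "unknown")
        for flag in registry.values()
        if flag.get("metadata", {}).get("state") == "active"
    )
    return {service: n for service, n in counts.items() if n > threshold}

# Hypothetical sample: 16 active checkout flags, 1 active and 1 archived search flag.
sample = {
    f"checkout.cart.flag-{i}": {"metadata": {"service": "checkout", "state": "active"}}
    for i in range(16)
}
sample["search.ranking.boost"] = {"metadata": {"service": "search", "state": "active"}}
sample["search.ranking.old"] = {"metadata": {"service": "search", "state": "archived"}}

print(services_over_threshold(sample))  # {'checkout': 16}
```

Treat the result as a list of services to audit, not as a list of flags to delete.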

Cross-link: if you are also seeing latency above 50ms on flag evaluations, the root cause is usually rule-engine overhead from too many active rules — see optimizing rule engine performance.

Step 2 — Block new sprawl at the CI gate

Sprawl prevention starts at flag creation. A CI check that requires an owner, an expiry date, and a valid key format stops poorly structured flags from ever reaching production.

// validate-flags.js — run as a CI pre-merge check (Node.js)
const Ajv = require('ajv').default;
const addFormats = require('ajv-formats');
const { readFileSync } = require('fs');

const ajv = new Ajv({ allErrors: true });
addFormats(ajv);

const schema = {
  type: 'object',
  required: ['owner', 'type', 'created', 'expiry', 'state', 'defaultVariant'],
  properties: {
    owner:          { type: 'string', minLength: 2 },
    type:           { enum: ['release', 'experiment', 'ops', 'kill'] },
    created:        { type: 'string', format: 'date' },
    expiry:         { type: 'string', format: 'date' },
    state:          { enum: ['draft', 'active', 'deprecated', 'archived'] },
    defaultVariant: { type: 'string' },
  }
};

const validate = ajv.compile(schema);
const KEY_REGEX = /^(kill|exp|ops|[a-z][a-z0-9]*)(\.[a-z][a-z0-9-]*)(\.[a-z][a-z0-9-]*)$/;

const registry = JSON.parse(readFileSync('flags/registry.json', 'utf8'));
let failed = false;

for (const [key, flag] of Object.entries(registry)) {
  if (!KEY_REGEX.test(key)) {
    console.error(`FAIL key format: ${key}`);
    failed = true;
  }
  const meta = flag.metadata ?? {};
  if (!validate(meta)) {
    console.error(`FAIL metadata for ${key}:`, validate.errors.map(e => e.message).join(', '));
    failed = true;
  }
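  // Deprecated and archived flags are already on the removal path, so only
  // draft/active flags fail on expiry. ISO dates (YYYY-MM-DD) compare correctly
  // as strings; a flag that expires today (UTC) still passes.
  const today = new Date().toISOString().slice(0, 10);
  const retired = ['deprecated', 'archived'].includes(meta.state);
  if (!retired && meta.expiry && meta.expiry < today) {
    console.error(`FAIL past expiry: ${key} expired ${meta.expiry}`);
    failed = true;
  }
}

// --warn reports violations without failing the build (useful while migrating legacy flags).
const warnOnly = process.argv.includes('--warn');
if (failed && !warnOnly) process.exit(1);
console.log(failed ? 'Validation finished with warnings.' : 'All flags passed validation.');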

Wire this script into your CI pipeline as a required check on any PR that modifies flags/. It enforces the key schema from naming conventions for feature flag keys without requiring manual review.
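To see what the key pattern accepts and rejects, here is a small sketch that ports `KEY_REGEX` from the validator to Python unchanged. The helper name `is_valid_key` and the example keys are made up for illustration.

```python
import re

# Same pattern as KEY_REGEX in validate-flags.js: exactly three dot-separated
# lowercase segments; hyphens are allowed in the second and third segments only.
KEY_REGEX = re.compile(
    r"^(kill|exp|ops|[a-z][a-z0-9]*)(\.[a-z][a-z0-9-]*)(\.[a-z][a-z0-9-]*)$"
)

def is_valid_key(key):
    """True if the flag key matches the three-segment naming pattern."""
    return KEY_REGEX.fullmatch(key) is not None

# Hypothetical keys, chosen to exercise each rule.
examples = [
    "checkout.payments.new-flow",   # valid
    "kill.api.rate-limiter",        # valid
    "Checkout.payments.flow",       # uppercase letter
    "check-out.payments.flow",      # hyphen in the first segment
    "checkout.flow",                # only two segments
    "checkout.payments.flow.v2",    # four segments
]
for key in examples:
    print(f"{key}: {'ok' if is_valid_key(key) else 'rejected'}")
```

Only the first two keys pass; the rest are rejected for the reason noted beside each.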

Pitfall: adding this check retroactively to an existing repo will fail on every legacy key on day one. Run it in a warn-only mode (for example, a --warn switch that logs violations but still exits 0) for two weeks while teams fix violations, then flip to hard-fail.

Step 3 — Identify and quarantine zero-evaluation flags

Flags that receive no evaluations over 7 days are candidates for deprecation. Before removing them, verify they are not guarding live code paths that run only on rare events.

#!/usr/bin/env python3
"""quarantine-stale.py: marks zero-eval flags as deprecated and notifies owners."""
import json
import sys
from datetime import datetime, timezone
from pathlib import Path

REGISTRY_PATH = Path("flags/registry.json")
registry = json.loads(REGISTRY_PATH.read_text())

# zero_eval_keys comes from your metrics query (Prometheus, ClickHouse, etc.).
# Pipe the keys in on stdin, whitespace-separated; with no piped input nothing is deprecated.
zero_eval_keys = set(sys.stdin.read().split()) if not sys.stdin.isatty() else set()

today = datetime.now(timezone.utc).date().isoformat()
changes = []
for key, flag in registry.items():
    meta = flag.get("metadata", {})
    if key in zero_eval_keys and meta.get("state") == "active":
        meta["state"] = "deprecated"
        meta["deprecated_at"] = today
        # Legacy flags may lack an owner; report them instead of crashing.
        changes.append((key, meta.get("owner", "unknown")))

# Rewrite the registry only when something changed, to avoid no-op diffs.
if changes:
    REGISTRY_PATH.write_text(json.dumps(registry, indent=2) + "\n")

for key, owner in changes:
    print(f"DEPRECATED {key} (owner: {owner}): zero evaluations in last 7d")
    # In practice: POST to Slack/PagerDuty/Jira with owner and remediation link

Pitfall: a flag guarding a quarterly billing job may show zero evaluations for weeks. Cross-reference type: "ops" and type: "experiment" flags against a calendar of expected evaluation events before mass-deprecating.
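One way to encode that cross-reference is to record how often each flag is expected to be evaluated and to skip flags whose cadence is longer than the scan window. The sketch below assumes a hypothetical metadata field, `expected_cadence_days`, which is not part of the schema shown in Step 2. The function name `safe_to_deprecate` is likewise illustrative.

```python
def safe_to_deprecate(meta, window_days=7):
    """True only if a zero-evaluation window is meaningful for this flag."""
    if meta.get("state") != "active":
        return False
    cadence = meta.get("expected_cadence_days")  # hypothetical field
    if cadence is None:
        # No declared cadence: hold ops and experiment flags for manual review.
        return meta.get("type") not in {"ops", "experiment"}
    # A flag expected to fire less often than the window can look idle while healthy.
    return cadence <= window_days

# Hypothetical flags
release_flag = {"state": "active", "type": "release"}
billing_flag = {"state": "active", "type": "ops", "expected_cadence_days": 90}
undeclared_ops = {"state": "active", "type": "ops"}

print(safe_to_deprecate(release_flag))    # True
print(safe_to_deprecate(billing_flag))    # False
print(safe_to_deprecate(undeclared_ops))  # False
```

With a guard like this, the quarterly billing flag stays active until someone confirms it is unused, while ordinary release flags still flow through the automated path.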

Step 4 — Enforce namespace ownership to prevent cross-service conflicts

The most common cause of cross-team flag collisions is a shared namespace with no ownership boundary. Add a namespace ownership file and a CI check that prevents any team from writing into another team’s namespace.

# .service-owners.json — maps each namespace to the repo that owns it
# {
#   "checkout": "org/checkout-service",
#   "api":      "org/api-gateway",
#   "ops":      ["org/platform", "org/sre"],   <-- shared: array of authorized repos
#   "exp":      ["org/growth", "org/platform"]
# }

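#!/usr/bin/env bash
# namespace-check.sh: run in CI on any PR touching flags/
set -euo pipefail
OWNERS_FILE=".service-owners.json"
# GITHUB_REPOSITORY (owner/repo) is set automatically by GitHub Actions; fail clearly elsewhere.
: "${GITHUB_REPOSITORY:?GITHUB_REPOSITORY must be set}"
# --diff-filter=d skips deleted files, which jq could not read.
CHANGED_FILES=$(git diff --name-only --diff-filter=d HEAD~1 HEAD | grep '^flags/.*\.json$' || true)

for file in $CHANGED_FILES; do
  namespaces=$(jq -r 'keys[]' "$file" | cut -d'.' -f1 | sort -u)
  for ns in $namespaces; do
    # -c keeps the value as JSON (a quoted string or an array) so the next jq call can parse it;
    # with -r a plain string owner would be emitted unquoted and always fail the check.
    authorized=$(jq -c --arg ns "$ns" '.[$ns] // empty' "$OWNERS_FILE")
    if [ -z "$authorized" ]; then
      echo "ERROR: namespace '$ns' has no registered owner in $OWNERS_FILE"; exit 1
    fi
    # Handle both string and array values
    if ! echo "$authorized" | jq -e --arg repo "$GITHUB_REPOSITORY" \
        'if type == "array" then any(. == $repo) else . == $repo end' > /dev/null; then
      echo "ERROR: $GITHUB_REPOSITORY is not authorized for namespace '$ns'"; exit 1
    fi
  done
done
echo "Namespace ownership check passed."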

Verification

After all four steps are in place, rerun the audit script from Step 1 and compare the output with your baseline. The commands below assume you saved the Step 1 output as sprawl-report-before.txt:

bash sprawl-audit.sh 2>/dev/null | tee sprawl-report-$(date +%F).txt
diff sprawl-report-before.txt sprawl-report-$(date +%F).txt

Success looks like: flag count per service trending down or flat; zero flags with missing owner or expiry; zero flags past their expiry in active state; CI blocking every new flag that fails schema validation.
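The registry-level criteria in that list can be checked mechanically. This is a minimal sketch, assuming the same registry shape as the earlier scripts. The function name `verify_registry`, the sample flags, and the fixed date are illustrative.

```python
from datetime import date

def verify_registry(registry, today=None):
    """Return a list of success-criteria violations; an empty list means pass."""
    today_iso = (today or date.today()).isoformat()
    problems = []
    for key, flag in registry.items():
        meta = flag.get("metadata", {})
        for field in ("owner", "expiry"):
            if not meta.get(field):
                problems.append(f"{key}: missing {field}")
        expiry = meta.get("expiry")
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if meta.get("state") == "active" and expiry and expiry < today_iso:
            problems.append(f"{key}: active but expired {expiry}")
    return problems

# Hypothetical registry evaluated as of 2025-01-15.
sample = {
    "checkout.cart.new-ui": {"metadata": {"owner": "team-checkout", "expiry": "2025-06-01", "state": "active"}},
    "api.auth.legacy-token": {"metadata": {"owner": "team-api", "expiry": "2024-12-01", "state": "active"}},
    "ops.db.read-replica": {"metadata": {"expiry": "2025-03-01", "state": "active"}},
    "exp.search.old-rank": {"metadata": {"owner": "team-growth", "expiry": "2024-10-01", "state": "deprecated"}},
}
for problem in verify_registry(sample, today=date(2025, 1, 15)):
    print(problem)
```

For this sample the sketch reports the expired active flag and the flag with no owner, and it ignores the deprecated flag whose expiry has already passed.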

Gotchas & Edge Cases

Troubleshooting & FAQ

The CI metadata check is rejecting flags that were valid last week — why?

Most likely the expiry date crossed into the past, which the validator rejects as a past-expiry error. The flag owner must either extend the expiry (if the feature is still in rollout) or transition the flag to deprecated and open a code-removal PR. Check meta.expiry against today’s date and act on whichever path applies.

How do we handle flags owned by engineers who left the company?

The owner field should always be a team name, not an individual. If you have legacy flags with individual owners, re-assign them to the team that owns the service during your baseline cleanup sprint. Add a CI check that rejects any owner value not present in your team registry.
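A check of that kind can be very small. The sketch below assumes the team registry is available as a set of team names. The names `unknown_owners` and `TEAMS`, and the sample values, are hypothetical.

```python
def unknown_owners(registry, teams):
    """Map flag key -> owner for every owner not found in the team registry."""
    flagged = {}
    for key, flag in registry.items():
        owner = flag.get("metadata", {}).get("owner")
        if owner not in teams:
            flagged[key] = owner  # None means the owner field is missing
    return flagged

# Hypothetical team registry and flags
TEAMS = {"team-checkout", "team-platform"}
sample = {
    "checkout.cart.new-ui": {"metadata": {"owner": "team-checkout"}},
    "api.auth.legacy-token": {"metadata": {"owner": "jane.doe"}},
    "ops.db.read-replica": {"metadata": {}},
}

for key, owner in unknown_owners(sample, TEAMS).items():
    print(f"FAIL owner for {key}: {owner!r} is not a registered team")
```

In CI, a non-empty result would be turned into a non-zero exit code so the PR is blocked until the flag is reassigned to a team.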

Is 15 flags per service a hard limit?

It’s a calibrated threshold, not a law. High-traffic services with active A/B test programs may legitimately carry 20–30 flags. The metric that actually matters is zero-evaluation flags — those are unambiguous waste. Use 15 as a prompt to audit, not as an automatic trigger for deletion.