Designing a Scalable Flag Taxonomy
This guide is part of the Feature Flag Architecture & Lifecycle Management series. Without a deliberate taxonomy, a flag system degrades quickly: keys accumulate with no owner, targeting rules clash across services, and cleanup becomes archaeology. A taxonomy is the schema and governance layer that keeps every flag findable, attributable, and disposable — at any scale.
This guide covers the key schema, required metadata fields, lifecycle states, and the CI enforcement that makes the rules stick. It does not cover SDK integration patterns (see server-side SDK integration) or the transport layer that delivers updates to replicas.
Problem Framing
Growing engineering organizations accumulate flags the same way they accumulate dead code: faster than anyone cleans them up. A service with no naming standard ends up with keys like enable_new_checkout, newCheckoutFlow, and USE_CHECKOUT_V2 all in flight simultaneously — owned by no one, never removed, each one a trap for the next engineer who reads the codebase. Ownership disputes delay cleanup; missing expiry dates mean no automated system can ever know a flag is safe to remove; duplicate keys across microservices cause targeting logic in one team’s flags to shadow another’s.
This guide does not cover progressive delivery workflows, SDK initialization, or cache topology — it focuses exclusively on what a flag record is and the rules that govern it.
Prerequisites
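- A JSON Schema validator (ajv-cli or equivalent)
- A namespace ownership map (CODEOWNERS or a .service-owners.json)
- Custom metadata fields: confirm that owner, expiry, type, and state are supported by your provider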
Core Concept & Architecture
The namespace.service.feature Key Schema
A three-segment key is the minimum viable structure. Each segment answers a distinct question:
- namespace: the bounded context or product domain (checkout, api, web, ops)
- service: the specific microservice or team that owns the flag (payments, search, auth)
- feature: the capability being gated, in kebab-case (express-pay, semantic-rerank, new-nav)
The full key checkout.payments.express-pay is unambiguous at a glance. The namespace prefix routes it to the right team in any registry query; the service segment enforces ownership; the feature segment describes intent without abbreviation.
Reserved prefixes impose global semantics without bespoke tooling:
| Prefix | Meaning | Example |
|---|---|---|
| kill. | Emergency kill-switch; streaming transport required | kill.payments.express-pay |
| exp. | Experiment / A/B test; has analysis window metadata | exp.checkout.one-click-upsell |
| ops. | Operational toggle; indefinite lifetime permitted | ops.infra.maintenance-mode |
| (none) | Standard release flag; max 90-day TTL | checkout.payments.express-pay |
See naming conventions for feature flag keys for the regex lint rule and CI enforcement steps.
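Because the first segment carries the prefix semantics, a tool can derive a flag's type and TTL policy from the key alone. The sketch below is a minimal Python illustration of that lookup, not the linter referenced above; PREFIX_POLICY, KeyInfo, and classify_key are hypothetical names, and the 90-day TTL for exp. flags is an assumption (the table does not state one for experiments).

```python
from typing import NamedTuple, Optional, Tuple

# Hypothetical policy table mirroring the reserved-prefix table.
# max_ttl_days=None means an indefinite lifetime is permitted.
PREFIX_POLICY = {
    "kill": ("kill", None),
    "exp": ("experiment", 90),  # assumption: experiments inherit the 90-day default
    "ops": ("ops", None),
}
DEFAULT_POLICY = ("release", 90)  # no reserved prefix: standard release flag


class KeyInfo(NamedTuple):
    flag_type: str
    max_ttl_days: Optional[int]
    segments: Tuple[str, ...]


def classify_key(key: str) -> KeyInfo:
    """Split a flag key into its three segments and look up its prefix policy."""
    segments = tuple(key.split("."))
    if len(segments) != 3 or not all(segments):
        raise ValueError(f"{key!r} is not a three-segment namespace.service.feature key")
    flag_type, max_ttl_days = PREFIX_POLICY.get(segments[0], DEFAULT_POLICY)
    return KeyInfo(flag_type, max_ttl_days, segments)
```

For example, classify_key("kill.payments.express-pay") reports a kill flag with no TTL, while checkout.payments.express-pay falls through to the release policy.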
Metadata Schema
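Every flag record must carry five mandatory metadata fields (owner, type, created, expiry, state) alongside the flag-level defaultVariant; optional fields such as ticket extend the record for compliance: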
```yaml
# flagd-format definition with full taxonomy metadata
flags:
  checkout.payments.express-pay:
    state: ENABLED
    variants:
      "on": true
      "off": false
    defaultVariant: "off"  # the safe fallback, used by kill-switch runbooks
    targeting:
      if:
        - { "==": [ { var: "tenantTier" }, "enterprise" ] }
        - "on"
        - "off"
    # taxonomy metadata (stored in a sidecar or provider custom fields)
    metadata:
      owner: "payments-team"   # team, not individual
      type: "release"          # release | experiment | ops | kill
      created: "2026-06-01"
      expiry: "2026-08-01"     # hard deadline; CI blocks past-expiry flags
      state: "active"          # draft | active | deprecated | archived
      ticket: "PAY-1234"       # links flag to the work item
```
The defaultVariant field is load-bearing: it is the variant an emergency kill-switch forces when an incident requires instant rollback, so it must always be the safe state.
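Since a wrong defaultVariant only surfaces during an incident, it is worth checking in CI. The sketch below assumes boolean flags in the flagd-style shape shown above, where false is the feature-off state; default_variant_is_safe is a hypothetical helper, and multi-variant flags would need their own per-flag definition of "safe".

```python
def default_variant_is_safe(flag: dict) -> bool:
    """Return True when defaultVariant names an existing variant whose value is False.

    Assumes a boolean flag where False is the safe (feature-off) state.
    """
    variants = flag.get("variants", {})
    default = flag.get("defaultVariant")
    return default in variants and variants[default] is False


# The express-pay flag from the definition above, as a parsed dict
express_pay = {
    "state": "ENABLED",
    "variants": {"on": True, "off": False},
    "defaultVariant": "off",
}
```

The check also fails when defaultVariant points at a variant that does not exist, which would otherwise leave a kill-switch with nothing to force.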
Lifecycle States
Flags move through four states; transitions must be automated, not manual:
- draft — the schema is valid but the flag is not yet live; used during code review and staging validation.
- active — the flag is live in production; targeting rules are evaluated on every matching request.
- deprecated — the feature code has shipped unconditionally and the flag is pending removal; new evaluations are logged as warnings. This is the entry point for automated cleanup workflows.
- archived — the flag definition is retained for audit history but the evaluation engine ignores it.
An audit trail entry should be written on every state transition, recording the actor, timestamp, and reason.
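One way to make transitions both automated and auditable is to route every state change through a single function that validates the move and appends the audit entry. The sketch below is a hypothetical in-memory model; ALLOWED_TRANSITIONS and FlagLifecycle are illustrative names, and the strictly forward draft, active, deprecated, archived ordering is an assumption, since the guide does not list backward transitions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set

# Assumption: states only move forward; no backward transitions are defined.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "draft": {"active"},
    "active": {"deprecated"},
    "deprecated": {"archived"},
    "archived": set(),
}


@dataclass
class FlagLifecycle:
    key: str
    state: str = "draft"
    audit_log: List[dict] = field(default_factory=list)

    def transition(self, new_state: str, actor: str, reason: str) -> None:
        """Move to new_state if allowed, writing one audit entry per transition."""
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"{self.key}: illegal transition {self.state} -> {new_state}")
        self.audit_log.append({
            "key": self.key,
            "from": self.state,
            "to": new_state,
            "actor": actor,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "reason": reason,
        })
        self.state = new_state
```

A rejected transition raises before anything is written, so the audit log only ever records moves that actually happened.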
Step-by-Step Implementation
Step 1 — Define the key schema and publish it
Write the naming standard in a single authoritative document and enforce it with a lint rule before any code review conversation is needed.
```yaml
# .flaglint.yaml: add this to the repo root
key_pattern: '^(kill|exp|ops|[a-z][a-z0-9]*)(\.[a-z][a-z0-9-]*)(\.[a-z][a-z0-9-]*)$'
max_segments: 3
segment_case: kebab
reserved_prefixes:
  kill: { transport: streaming, max_ttl_days: null }
  exp: { requires_fields: [analysis_window, hypothesis] }
  ops: { max_ttl_days: null }
default_max_ttl_days: 90
```

```bash
# CI step: lint all flag keys before merge
npx flaglint --config .flaglint.yaml ./flags/**/*.yaml \
  || { echo "Flag key lint failed"; exit 1; }
```
Pitfall: introducing the schema mid-flight without a migration plan leaves a graveyard of legacy keys that the linter rejects but teams are afraid to rename. Run the linter in warn-only mode for one sprint to inventory violations, then enforce on a fixed date.
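For the warn-only inventory sprint, the same key_pattern can be checked by a small script while the real linter is rolled out. The sketch below is a hypothetical Python stand-in, not the flaglint CLI: lint_keys reuses the regex from .flaglint.yaml and returns a CI exit code that stays 0 in warn-only mode.

```python
import re
from typing import Iterable

# Same expression as key_pattern in .flaglint.yaml
KEY_PATTERN = re.compile(
    r"^(kill|exp|ops|[a-z][a-z0-9]*)(\.[a-z][a-z0-9-]*)(\.[a-z][a-z0-9-]*)$"
)


def lint_keys(keys: Iterable[str], warn_only: bool = False) -> int:
    """Print every key that violates the schema and return a CI exit code.

    In warn-only mode violations are still reported, but the exit code
    stays 0 so the inventory sprint never blocks a merge.
    """
    violations = [key for key in keys if not KEY_PATTERN.fullmatch(key)]
    level = "WARN" if warn_only else "ERROR"
    for key in violations:
        print(f"{level}: {key!r} does not match namespace.service.feature")
    return 0 if (warn_only or not violations) else 1
```

In CI this might be wrapped as sys.exit(lint_keys(keys, warn_only=True)) during the inventory sprint, with warn_only dropped on the enforcement date; the printed WARN lines double as the violation export.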
Step 2 — Enforce mandatory metadata in JSON Schema
A schema check at CI time is cheaper than a post-mortem about a flag with no owner.
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "FlagMetadata",
  "type": "object",
  "required": ["owner", "type", "created", "expiry", "state", "defaultVariant"],
  "properties": {
    "owner": { "type": "string", "minLength": 2 },
    "type": { "enum": ["release", "experiment", "ops", "kill"] },
    "created": { "type": "string", "format": "date" },
    "expiry": { "anyOf": [ { "type": "string", "format": "date" }, { "type": "null" } ] },
    "state": { "enum": ["draft", "active", "deprecated", "archived"] },
    "defaultVariant": { "type": "string" },
    "ticket": { "type": "string" }
  },
  "if": { "properties": { "type": { "enum": ["release", "experiment"] } } },
  "then": { "properties": { "expiry": { "type": "string" } } }
}
```
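```bash
# Validate every flag definition file against the schema.
# Ajv v7+ ships without built-in formats; load ajv-formats so "format": "date" is recognized.
npx ajv-cli validate -c ajv-formats -s ./flags/schema.json -d './flags/**/*.json' \
  || { echo "Flag metadata schema validation failed"; exit 1; }
```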
Pitfall: expiry as a free-form string lets teams write "soon" or "Q3". Enforce ISO 8601 (format: "date") and add a CI step that rejects any flag whose expiry is in the past.
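JSON Schema cannot compare a date against today, so the past-expiry rejection needs a small script. The sketch below is one hypothetical Python version: it also applies the 90-day default TTL from .flaglint.yaml and the null-expiry exemption for ops and kill flags; expiry_errors and INDEFINITE_TYPES are illustrative names, and measuring the TTL from created to expiry is an assumption.

```python
from datetime import date
from typing import List, Optional

DEFAULT_MAX_TTL_DAYS = 90           # default_max_ttl_days in .flaglint.yaml
INDEFINITE_TYPES = {"ops", "kill"}  # the only types allowed "expiry": null


def expiry_errors(key: str, meta: dict, today: Optional[date] = None) -> List[str]:
    """Return expiry violations for one flag's metadata (empty list = compliant)."""
    today = today or date.today()
    flag_type = meta.get("type")
    expiry = meta.get("expiry")
    if expiry is None:
        if flag_type in INDEFINITE_TYPES:
            return []
        return [f"{key}: null expiry is only allowed for ops and kill flags"]
    try:
        expiry_date = date.fromisoformat(expiry)
    except (TypeError, ValueError):
        return [f"{key}: expiry {expiry!r} is not an ISO 8601 date"]

    errors = []
    if expiry_date < today:
        errors.append(f"{key}: expiry {expiry_date} is in the past")
    created = meta.get("created")
    if created and flag_type not in INDEFINITE_TYPES:
        ttl_days = (expiry_date - date.fromisoformat(created)).days
        if ttl_days > DEFAULT_MAX_TTL_DAYS:
            errors.append(f"{key}: TTL of {ttl_days} days exceeds {DEFAULT_MAX_TTL_DAYS}")
    return errors
```

Passing today explicitly keeps the function deterministic in tests; in CI it defaults to the current date.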
Step 3 — Automate expiry alerting and state transitions
Metadata only works if something acts on it. A nightly job that compares expiry dates to today and transitions flags to deprecated closes the loop without manual tracking.
```python
#!/usr/bin/env python3
"""Nightly flag expiry checker: transitions active flags to deprecated when past expiry."""
import json
from datetime import date, timedelta
from pathlib import Path

REGISTRY = Path("flags/registry.json")
WARN_DAYS = 14  # send an alert when a flag is within 14 days of expiry

flags = json.loads(REGISTRY.read_text())
today = date.today()

for key, meta in flags.items():
    # Only active flags are candidates; ops and kill flags may carry a null expiry
    if meta["state"] != "active" or meta.get("expiry") is None:
        continue
    expiry = date.fromisoformat(meta["expiry"])
    if expiry < today:
        # Transition to deprecated and notify the owning team
        meta["state"] = "deprecated"
        print(f"DEPRECATED: {key} (expired {expiry}). Owner: {meta['owner']}")
        # POST to the flag API here, or rely on the registry write-back below
    elif expiry - today <= timedelta(days=WARN_DAYS):
        print(f"WARN: {key} expires in {(expiry - today).days}d. Owner: {meta['owner']}")

REGISTRY.write_text(json.dumps(flags, indent=2))
```
This feeds directly into the workflow described in managing flag deprecation and cleanup: flags that reach the deprecated state are queued for automated cleanup, which removes the flag definition and its in-code references together.
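Removing in-code references starts with finding them. The sketch below is a deliberately simple, hypothetical scanner: find_flag_references and SOURCE_SUFFIXES are illustrative names, and it uses a plain substring match, so a key that is a prefix of another key (express-pay vs express-pay-v2) will over-report. That is acceptable for building a review queue but not for an unattended rewrite.

```python
from pathlib import Path
from typing import Dict, Iterable, List

SOURCE_SUFFIXES = {".py", ".ts", ".js", ".go", ".java", ".kt"}  # adjust per repo


def find_flag_references(root: Path, keys: Iterable[str]) -> Dict[str, List[str]]:
    """Map each flag key to the source files that contain it as a literal substring."""
    keys = list(keys)
    hits: Dict[str, List[str]] = {key: [] for key in keys}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for key in keys:
            if key in text:
                hits[key].append(path.relative_to(root).as_posix())
    return hits
```

Fed with the deprecated keys from the nightly job (for example [k for k, m in flags.items() if m["state"] == "deprecated"]), the result lists, per flag, the files a cleanup PR has to touch; an empty list means only the definition remains.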
Step 4 — Gate namespace ownership at the CI boundary
A namespace ownership check prevents team B from accidentally creating a flag in team A’s namespace, which is the root cause of most cross-service flag conflicts.
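```yaml
# .github/workflows/flag-namespace.yaml
name: Flag namespace ownership check
on: [pull_request]
jobs:
  namespace-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2  # the PR merge commit plus its base-branch parent, so HEAD~1 exists
      - name: Check namespace ownership
        run: |
          # Changed (not deleted) flag files, skipping the JSON Schema itself
          for file in $(git diff --name-only --diff-filter=d HEAD~1 HEAD -- 'flags/*.json' | grep -v 'schema.json'); do
            # Namespace = first segment of each flag key (flagd files nest keys under .flags)
            namespaces=$(jq -r '(.flags // .) | keys[]' "$file" | cut -d'.' -f1 | sort -u)
            for ns in $namespaces; do
              # An owner entry is one repository or, for shared namespaces, an allowlist array
              if ! jq -e --arg ns "$ns" --arg repo "$GITHUB_REPOSITORY" \
                  '.[$ns] | (if type == "array" then . else [.] end) | any(. == $repo)' \
                  .service-owners.json > /dev/null; then
                echo "ERROR: namespace '$ns' is not owned by $GITHUB_REPOSITORY"
                exit 1
              fi
            done
          done
```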
Pitfall: shared namespaces like ops need an explicit allowlist in .service-owners.json that names multiple authorized repositories, or they will block every team from creating operational flags.
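The ownership rule, including an allowlist for shared namespaces, is small enough to unit-test outside the workflow. The sketch below is a hypothetical Python model of it: namespace_allowed, ownership_violations, and the acme/... repository names are illustrative, and it assumes each .service-owners.json entry is either one repository string or a list of repositories.

```python
from typing import Dict, Iterable, List, Union

Owners = Dict[str, Union[str, List[str]]]


def namespace_allowed(owners: Owners, namespace: str, repo: str) -> bool:
    """True if repo may define flags in namespace; unregistered namespaces fail."""
    entry = owners.get(namespace)
    if entry is None:
        return False
    allowed = entry if isinstance(entry, list) else [entry]
    return repo in allowed


def ownership_violations(owners: Owners, keys: Iterable[str], repo: str) -> List[str]:
    """Return the namespaces (first key segment) that repo is not allowed to use."""
    namespaces = sorted({key.split(".", 1)[0] for key in keys})
    return [ns for ns in namespaces if not namespace_allowed(owners, ns, repo)]


# Hypothetical .service-owners.json contents
OWNERS: Owners = {
    "checkout": "acme/checkout-service",
    "api": "acme/search-service",
    "ops": ["acme/infra", "acme/checkout-service", "acme/search-service"],
}
```

Because the reserved prefixes also occupy the first segment, kill and exp would most likely need allowlist entries of their own under this model.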
Verification & Testing
After publishing the schema, run a full audit of existing flags to measure compliance before enforcing hard failures:
```bash
# Count flags missing mandatory metadata (a null expiry is permitted only for ops and kill flags)
jq '[.[] | select(.owner == null or (.expiry == null and .type != "ops" and .type != "kill"))] | length' \
  flags/registry.json

# List active flags past their expiry date (candidates for immediate deprecation)
jq --arg today "$(date +%F)" \
  '[to_entries[] | select(.value.state == "active" and .value.expiry != null and .value.expiry < $today) | .key]' \
  flags/registry.json
```
A passing baseline: zero flags missing an owner or a required expiry; zero active flags past their expiry date; every key matches the lint regex.
Troubleshooting & FAQ
How do I handle flags that genuinely have no expiry?
Only ops. and kill. prefix flags are permitted to have null expiry. Set "expiry": null and document the reason in the flag’s ticket field. Everything else must have a date. Treat any release flag with null expiry as a metadata error and reject it at CI time.
Our flag keys are already inconsistent across 30 services — where do we start?
Start with the linter in warn-only mode to inventory violations without blocking anyone. Export the violation list, assign cleanup tickets to owning teams sorted by flag age, and set a 6-week enforcement deadline. For keys that cannot be renamed without a multi-repo refactor, introduce a legacy_key alias field in the metadata and handle the rename as a two-step migration: add the new key, migrate call sites, then archive the old key.
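During the middle step of that migration, the legacy_key field can redirect lookups from the old key to the new one and log each hit, so remaining call sites surface on their own. The sketch below is a hypothetical illustration: build_alias_map, resolve_key, and the checkout.web.new-checkout key are invented for the example, and where the lookup hooks into your SDK client is left open.

```python
from typing import Dict


def build_alias_map(registry: Dict[str, dict]) -> Dict[str, str]:
    """Map each legacy key to the new key that declares it via legacy_key."""
    return {
        meta["legacy_key"]: key
        for key, meta in registry.items()
        if meta.get("legacy_key")
    }


def resolve_key(key: str, aliases: Dict[str, str]) -> str:
    """Return the canonical key, warning when a call site still uses a legacy one."""
    new_key = aliases.get(key)
    if new_key is None:
        return key
    print(f"WARN: legacy key {key!r} is deprecated; migrate to {new_key!r}")
    return new_key


# Hypothetical registry entry for a renamed flag
registry = {
    "checkout.web.new-checkout": {
        "owner": "checkout-team",
        "state": "active",
        "legacy_key": "enable_new_checkout",
    },
}
aliases = build_alias_map(registry)
```

Once the warnings stop appearing, the alias can be dropped and the old key archived.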
Why track lifecycle state in metadata when the SDK already has an ENABLED/DISABLED toggle?
The SDK toggle is operational — it controls evaluation today. Lifecycle state is governance — it records intent. A deprecated flag might still be ENABLED in the SDK while the code removal PR is in review. Tracking both lets you query “all flags pending code removal” without assuming DISABLED means the same thing.
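With flagd-style records like the one in the Metadata Schema section, that query reads the lifecycle state under metadata and ignores the SDK state entirely. The sketch below uses a hypothetical pending_code_removal helper and made-up flag data.

```python
from typing import Dict, List


def pending_code_removal(flags: Dict[str, dict]) -> List[str]:
    """Keys whose lifecycle state is deprecated, regardless of the SDK state."""
    return sorted(
        key
        for key, flag in flags.items()
        if flag.get("metadata", {}).get("state") == "deprecated"
    )


flags = {
    # Still ENABLED in the SDK while the code-removal PR is in review
    "checkout.payments.express-pay": {"state": "ENABLED", "metadata": {"state": "deprecated"}},
    "api.search.semantic-rerank": {"state": "DISABLED", "metadata": {"state": "active"}},
    "web.auth.new-nav": {"state": "DISABLED", "metadata": {"state": "deprecated"}},
}
```

Filtering on state == "DISABLED" instead would miss express-pay and wrongly include semantic-rerank, which is exactly the conflation the lifecycle field avoids.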