Cache Invalidation Strategies for Flag Updates
This how-to is part of Distributed Caching for Flag Evaluations.
A flag changed in the control plane but cached evaluations keep returning the old variant — sometimes for minutes. When that old variant represents a percentage rollout that was just reduced, or a feature that was just disabled via a kill switch, the propagation gap directly expands blast radius. The problem is choosing an invalidation model that meets your propagation latency budget without generating a thundering herd of simultaneous refreshes, and then verifying it actually delivers changes within that budget.
Prerequisites
Step 1 — Choose an invalidation model: TTL, event-driven, or versioned keys
Three models cover the design space for cache invalidation. TTL-only expiry sets a fixed lifetime on every cache entry; changes propagate at most TTL seconds late. It requires no external coordination and is trivial to operate, but propagation latency is bounded only by that TTL value. Event-driven pub/sub deletes or overwrites the cache key immediately when the control plane emits a flag-change event; propagation can be sub-second but the mechanism depends on reliable event delivery. Versioned keys embed a monotonic version stamp in the key name itself — for example flags:checkout.payments.express-pay:v42 — so a version increment makes old keys unreachable without explicit deletes, which is useful when you cannot guarantee atomic deletes across replicas.
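The versioned-key model can be illustrated with an in-memory map standing in for the cache. This is a minimal sketch rather than a production store: the `versionedKey` helper and the `Map` are hypothetical, and the key layout simply follows the `flags:<flagKey>:v<version>` example above.

```typescript
// Hypothetical helper: builds a key in the flags:<flagKey>:v<version> format.
function versionedKey(flagKey: string, version: number): string {
  if (!Number.isInteger(version) || version < 0) {
    throw new RangeError(`version must be a non-negative integer, got ${version}`);
  }
  return `flags:${flagKey}:v${version}`;
}

// An in-memory Map stands in for the shared cache (assumption for illustration).
const cache = new Map<string, string>();

// A writer caches the flag at version 42.
cache.set(
  versionedKey('checkout.payments.express-pay', 42),
  JSON.stringify({ variant: 'on' }),
);

// After the control plane bumps the version to 43, readers build the v43 key.
// The v42 entry still exists but is never looked up again, so no delete is
// needed; in a real cache it would age out through its TTL.
const afterBump = cache.get(versionedKey('checkout.payments.express-pay', 43));
console.log(afterBump === undefined); // true: readers miss and re-fetch
```

The trade-off is that readers must learn the current version from somewhere, which is why this model is usually paired with a small version meta-key.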
All three can be layered: event-driven for fast invalidation, versioned keys for stale-write detection, and TTL as the backstop that catches anything the event layer misses. The decision table below captures the tradeoffs concisely.
# Invalidation model selection guide
# Flag key used throughout: checkout.payments.express-pay
models:
  ttl_only:
    propagation_latency: "≤ TTL (e.g. 30–60 s)"
    operational_complexity: low
    failure_mode: "stale entries until TTL expires"
    choose_when: "propagation budget is relaxed (> 30 s) and ops simplicity is paramount"
  event_driven_pubsub:
    propagation_latency: "< 1 s typical"
    operational_complexity: medium
    failure_mode: "missed events leave stale keys indefinitely; requires TTL backstop"
    choose_when: "tight propagation budget (< 5 s) or kill-switch scenarios"
  versioned_keys:
    propagation_latency: "bounded by version-meta TTL (e.g. 60 s)"
    operational_complexity: medium
    failure_mode: "version meta-key can itself go stale"
    choose_when: "you need stale-write detection across replicas without reliable pub/sub"
  combined:
    propagation_latency: "< 1 s typical; TTL as backstop"
    operational_complexity: medium-high
    failure_mode: "highest resilience: events + TTL + version guards all active"
    choose_when: "production kill-switch paths or propagation budget < 5 s"
Choosing event-driven alone without a TTL backstop means a missed or dropped event leaves a stale key in cache indefinitely. Configure a TTL as a failsafe even when pub/sub is your primary invalidation path. For the transport layer that generates these events, see Polling vs Streaming Flag Synchronization — pub/sub invalidation is the caching analogue of streaming synchronization.
Step 2 — Wire pub/sub invalidation on flag-change events
Subscribe to control-plane change events and delete the cache entry when a flag changes. Include the flag key in the pub/sub message payload so the subscriber knows exactly which cache key to invalidate without pattern-matching or scanning. The publisher side is typically a webhook handler inside your flag management service: on each write, it publishes a JSON message containing the flag key and the new version number to the flag_updates channel.
// subscriber.ts: Redis pub/sub invalidation handler (ioredis)
import Redis from 'ioredis';
// In-process LRU cache cleared alongside Redis
import { inProcessCache } from './in-process-cache';

// A connection in subscriber mode cannot issue regular commands,
// so DEL and PUBLISH go through a second connection.
const subscriber = new Redis({ host: 'cache.internal', port: 6379 });
const writer = new Redis({ host: 'cache.internal', port: 6379 });

subscriber.subscribe('flag_updates', (err) => {
  if (err) throw new Error(`Subscribe failed: ${err.message}`);
});

subscriber.on('message', async (_channel: string, raw: string) => {
  try {
    const { flagKey, version } = JSON.parse(raw) as { flagKey: string; version: number };
    const redisKey = `flags:${flagKey}`;
    // Delete from Redis; the next read re-fetches from the control plane
    await writer.del(redisKey);
    // Also clear the in-process tier if present
    inProcessCache.delete(flagKey);
    console.info(`Invalidated ${redisKey} at version ${version}`);
  } catch (err) {
    // A malformed message or failed DEL must not crash the subscriber;
    // the TTL backstop still bounds staleness for this key.
    console.error('Invalidation failed', err);
  }
});

// Publisher side (control-plane webhook handler)
async function onFlagChanged(flagKey: string, newVersion: number): Promise<void> {
  const message = JSON.stringify({
    flagKey, // e.g. "checkout.payments.express-pay"
    version: newVersion,
  });
  await writer.publish('flag_updates', message);
}
If the subscriber process crashes or restarts, it misses events that arrived during the gap. On reconnect, perform a full flush of all flag cache keys rather than assuming the cached state is current. Resuming from an unknown gap is unsafe, so treat a reconnect the same as a cold start. Note that Memcached has no native pub/sub mechanism; if your backing store is Memcached, event-driven invalidation must be handled entirely at the application layer. For a direct comparison of what each store supports, see Redis vs Memcached for Feature Flag Caching.
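The reconnect rule above can be sketched as a single flush routine. The `FlagKeyStore` interface, `flushAllFlagKeys`, and the in-memory store are hypothetical names introduced for illustration; a Redis-backed implementation would presumably list keys with an incremental SCAN rather than a blocking KEYS call.

```typescript
// Minimal store interface (hypothetical), kept small so the sketch stays
// independent of any particular Redis client.
interface FlagKeyStore {
  listKeys(prefix: string): Promise<string[]>;
  deleteKeys(keys: string[]): Promise<number>;
}

// Treat every reconnect like a cold start: drop both cache tiers.
async function flushAllFlagKeys(
  store: FlagKeyStore,
  inProcess: Map<string, unknown>,
  prefix = 'flags:',
): Promise<number> {
  inProcess.clear(); // in-process tier first, so it cannot mask the flush
  const keys = await store.listKeys(prefix);
  return keys.length > 0 ? store.deleteKeys(keys) : 0;
}

// In-memory stand-in, used only to make the sketch runnable.
class InMemoryStore implements FlagKeyStore {
  private readonly data = new Map<string, string>();

  set(key: string, value: string): void {
    this.data.set(key, value);
  }

  has(key: string): boolean {
    return this.data.has(key);
  }

  async listKeys(prefix: string): Promise<string[]> {
    return Array.from(this.data.keys()).filter((k) => k.startsWith(prefix));
  }

  async deleteKeys(keys: string[]): Promise<number> {
    let removed = 0;
    for (const k of keys) {
      if (this.data.delete(k)) removed++;
    }
    return removed;
  }
}
```

With ioredis, a plausible place to call this is the subscriber connection's `ready` event handler, before normal message handling resumes. Only keys under the `flags:` prefix are removed, so the version meta-key survives the flush.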
Step 3 — Add a version stamp to detect and reject stale writes
TTL expiry and pub/sub deletion both remove stale entries reactively. Version stamps add a proactive check: before trusting a cached entry, confirm it is not behind the current control-plane version. The control plane writes a lightweight meta:flag-version key on every flag change: a single integer that any reader can fetch cheaply. On cache read, fetch this meta key (it carries its own short TTL, e.g. 60 s) and compare it against the version embedded in the cached value. A mismatch is treated as a cache miss. Because the example uses one global counter, a change to any flag makes every cached entry compare as stale until it is re-fetched.
// flag-reader.ts: version-stamped cache read
import Redis from 'ioredis';
import { fetchFromControlPlane } from './control-plane-client';

const redis = new Redis({ host: 'cache.internal', port: 6379 });

interface CachedFlag {
  flagConfig: unknown;
  version: number;
}

async function getFlag(flagKey: string): Promise<unknown> {
  // Read the cached entry and the version meta-key in parallel
  const [rawEntry, rawMeta] = await Promise.all([
    redis.get(`flags:${flagKey}`),
    redis.get('meta:flag-version'),
  ]);
  // A missing or expired meta-key yields null, which falls through to a re-fetch
  const currentVersion = rawMeta ? parseInt(rawMeta, 10) : null;
  if (rawEntry && currentVersion !== null) {
    const cached = JSON.parse(rawEntry) as CachedFlag;
    if (cached.version >= currentVersion) {
      return cached.flagConfig; // cache hit: version is current
    }
    // Version mismatch: treat as miss
    console.warn(`Stale cache entry for ${flagKey}: cached v${cached.version} < current v${currentVersion}`);
  }
  // Cache miss or stale: re-fetch from control plane
  const { flagConfig, version } = await fetchFromControlPlane(flagKey);
  // 60 s base TTL plus 0–11 s of jitter
  const ttl = 60 + Math.floor(Math.random() * 12);
  await redis.set(`flags:${flagKey}`, JSON.stringify({ flagConfig, version }), 'EX', ttl);
  return flagConfig;
}
The version meta-key is itself a cached value with a TTL, which means it can go stale if the control plane misses a write. Use the version stamp as a staleness signal and an optimization, not as the sole authority. The TTL backstop from Step 1 remains the final guard if both pub/sub and version checks fail.
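One way to apply "signal, not sole authority" is to distinguish a version mismatch from a missing meta-key. The sketch below assumes that an entry whose freshness cannot be verified may still be served until its own TTL expires; the reader shown earlier in this step takes the stricter route and treats that case as a miss. The names `Freshness`, `classifyEntry`, and `shouldRefetch` are hypothetical.

```typescript
type Freshness = 'fresh' | 'stale' | 'unverified';

// Classifies a cached entry against the version meta-key.
// currentVersion is null when the meta-key is missing or has expired.
function classifyEntry(cachedVersion: number, currentVersion: number | null): Freshness {
  if (currentVersion === null || Number.isNaN(currentVersion)) {
    return 'unverified'; // no signal available; the entry's own TTL still applies
  }
  return cachedVersion >= currentVersion ? 'fresh' : 'stale';
}

// Assumed policy: serve 'fresh' and 'unverified' entries, re-fetch 'stale' ones.
function shouldRefetch(freshness: Freshness): boolean {
  return freshness === 'stale';
}

console.log(classifyEntry(42, 42));   // fresh
console.log(classifyEntry(41, 42));   // stale
console.log(classifyEntry(42, null)); // unverified
```

Which policy fits depends on the cost of a control-plane fetch versus the cost of serving an unverified entry; for kill-switch flags the stricter behavior is probably the safer default.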
Step 4 — Bound TTL as a backstop for missed events
A TTL on every cache entry is non-optional. Even with event-driven invalidation wired correctly, network partitions, subscriber crashes, and deploy gaps will produce intervals where no pub/sub message arrives. The TTL is the safety net that bounds how long a stale entry can survive without any active invalidation signal.
Choose the TTL to match your propagation latency budget, not the shortest value that feels safe. A 5 s TTL on a deployment with 500 cache-reading replicas, each refreshing its own copy, generates roughly 100 requests per second per cached flag to the control plane at steady state (500 replicas ÷ 5 s) from TTL expiry alone, before any flag actually changes. Set the TTL to match the budget (e.g. 60 s if “changes visible within 60 s” is acceptable), and add per-entry jitter to prevent synchronized expiry across replicas from creating a thundering herd. Additive jitter lengthens the worst-case lifetime, so if the budget is a hard limit, lower the base TTL until base plus maximum jitter still fits within it.
// cache-writer.ts: TTL with jitter to spread expiry
import Redis from 'ioredis';

const redis = new Redis({ host: 'cache.internal', port: 6379 });

async function writeFlag(
  flagKey: string, // e.g. "checkout.payments.express-pay"
  flagConfig: unknown,
  version: number,
  baseTtlSeconds = 60,
): Promise<void> {
  // Additive jitter of up to 15% of the base TTL spreads expiry across replicas
  const jitter = Math.floor(Math.random() * Math.ceil(baseTtlSeconds * 0.15));
  const ttl = baseTtlSeconds + jitter; // 60–68 s for the default 60 s base
  await redis.set(
    `flags:${flagKey}`,
    JSON.stringify({ flagConfig, version }),
    'EX',
    ttl,
  );
}
Synchronizing TTL expiry across all replicas — no jitter — causes every replica to miss simultaneously, sending a spike of simultaneous control-plane fetches every TTL seconds. Jitter spreads that load across the TTL window. For the broader context of how this TTL strategy fits into the full cache topology, see Distributed Caching for Flag Evaluations.
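The load arithmetic in this step can be checked with two small helpers. This is a sketch under a simplifying assumption: every replica refreshes each cached flag once per TTL. The function names are hypothetical, and the range calculation mirrors the additive jitter formula used in the cache writer (floor of a random value times the rounded-up 15% span).

```typescript
// Steady-state control-plane fetches per second caused by TTL expiry alone,
// assuming each replica refreshes each cached flag once per TTL.
function steadyStateFetchRate(replicas: number, ttlSeconds: number, flags = 1): number {
  if (ttlSeconds <= 0) throw new RangeError('ttlSeconds must be positive');
  return (replicas * flags) / ttlSeconds;
}

// Smallest and largest TTL that additive jitter can produce when
// jitter = floor(random * ceil(base * fraction)) and random is in [0, 1).
function jitteredTtlRange(
  baseTtlSeconds: number,
  maxJitterFraction = 0.15,
): { min: number; max: number } {
  const span = Math.ceil(baseTtlSeconds * maxJitterFraction);
  return { min: baseTtlSeconds, max: baseTtlSeconds + Math.max(span - 1, 0) };
}

console.log(steadyStateFetchRate(500, 5));  // 100
console.log(steadyStateFetchRate(500, 60)); // roughly 8.33
console.log(jitteredTtlRange(60));          // { min: 60, max: 68 }
```

Moving from a 5 s to a 60 s TTL cuts the steady-state load by a factor of twelve, which is the main reason to size the TTL to the budget rather than below it.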
Verification
Flip a flag and confirm the cache refreshes within the propagation budget.
# 1. Record the current version (assumes the flip increments it by exactly 1)
VERSION=$(redis-cli GET "meta:flag-version")
TARGET=$((VERSION + 1))
# 2. Flip the flag in the control plane
flagctl set checkout.payments.express-pay --variant off --env prod
# 3. Poll Redis until the cached entry reflects the new version or is absent
for i in $(seq 1 20); do
  RAW=$(redis-cli GET "flags:checkout.payments.express-pay")
  # An empty reply means the key is absent, i.e. the pub/sub delete fired
  [ -z "$RAW" ] && echo "PASS: key absent (invalidated)" && break
  CACHED=$(printf '%s' "$RAW" \
    | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['version'])" 2>/dev/null)
  echo "Attempt $i: cached version=$CACHED target=$TARGET"
  [ "$CACHED" = "$TARGET" ] && echo "PASS: refreshed within budget" && break
  sleep 0.5
done
Expected outcome: the key is absent (pub/sub delete fired) or carries the incremented version within your propagation budget — for example, within 5 s for a tight budget. If the loop exhausts all 20 attempts (10 s), the pub/sub subscriber likely missed the event or the in-process cache layer is masking the Redis update.
Gotchas and edge cases
- Fan-out cost of pub/sub at scale. If 500 replicas each subscribe to the same pub/sub channel, a single flag change can trigger 500 near-simultaneous re-fetches from the control plane. Stagger re-fetches with per-replica jitter, or use a write-back pattern where one designated process re-fetches and writes the new value to Redis while other replicas read from the refreshed cache key rather than going to the control plane directly.
- In-process cache shadowing external invalidation. If you layer an in-process LRU on top of Redis, deleting the Redis key does not clear the in-process copy. That in-process copy will continue serving the stale variant until its own TTL expires. Your pub/sub handler must explicitly clear both tiers on every invalidation event, as shown in Step 2.
- Version rollback on control-plane redeploy. A control-plane rollback can decrease the version counter, making cached entries appear “newer” than what the control plane reports. Guard against this by storing a wall-clock timestamp alongside the version number and treating any version mismatch, not just an increase, as a cache miss; under this guard the >= comparison in Step 3 becomes a strict equality check. Do not assume version numbers are strictly monotonically increasing across all deployment scenarios.
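The rollback guard from the last item can be sketched as a strict comparison. The source does not specify how the timestamp is used, so this sketch assumes it distinguishes a version number that was reused after a rollback; `VersionStamp`, `updatedAtMs`, and `isEntryCurrent` are hypothetical names.

```typescript
interface VersionStamp {
  version: number;
  updatedAtMs: number; // wall-clock time of the control-plane write (assumed field)
}

// An entry is current only when both fields match exactly.
// A lower, higher, or reused version number is treated as a miss.
function isEntryCurrent(cached: VersionStamp, current: VersionStamp): boolean {
  return cached.version === current.version && cached.updatedAtMs === current.updatedAtMs;
}

const current: VersionStamp = { version: 40, updatedAtMs: 1_700_000_100_000 };

// Cached before a rollback from v42 to v40: looks "newer" but must be a miss.
console.log(isEntryCurrent({ version: 42, updatedAtMs: 1_700_000_000_000 }, current)); // false

// Same version number written at a different time: also a miss.
console.log(isEntryCurrent({ version: 40, updatedAtMs: 1_699_999_000_000 }, current)); // false

// Exact match: safe to serve.
console.log(isEntryCurrent({ version: 40, updatedAtMs: 1_700_000_100_000 }, current)); // true
```

Strict equality costs some extra re-fetches when the meta-key lags slightly behind, which is usually an acceptable price for not serving a post-rollback stale variant.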
FAQ
The pub/sub event fired but some replicas are still serving the old variant — why?
Those replicas either missed the pub/sub event (the subscriber was restarting or briefly disconnected) or they have an in-process cache layer that was not cleared by the invalidation handler. Check subscriber connection state with redis-cli CLIENT LIST and confirm that your handler explicitly calls inProcessCache.delete(flagKey) in addition to writer.del(redisKey). A reconnect without a full cache flush is the most common root cause, so ensure your reconnect logic triggers a complete flush of all flag keys before resuming normal operation.
How do I test invalidation without disrupting production traffic?
Use a flag that targets a synthetic user context not present in real traffic, for example a targeting rule that matches environment: "cache-invalidation-test". Flip it in the control plane and measure propagation time in a staging or shadow environment before relying on the mechanism for production kill-switch paths. Synthetic contexts make it safe to flip repeatedly without touching real user sessions.
My TTL-only setup meets the latency budget 95% of the time. Do I need pub/sub?
Only if your budget is tight enough that the 5% tail matters. If a kill switch needs to propagate in under 2 s, a 30 s TTL alone cannot guarantee that: an entry cached just before the change stays stale for up to the full 30 s, so pub/sub is necessary. If your propagation budget is “within 60 s” and your kill-switch runbook can tolerate up to a 60 s window, a 30 s TTL with jitter is usually sufficient without the additional operational complexity of a pub/sub subscriber process. Quantify the worst-case tail against your actual incident budget before committing to either model. For context on how control-plane transport choices affect propagation latency, see Polling vs Streaming Flag Synchronization.
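The worst-case comparison in this answer can be written down directly. The sketch assumes a single cache tier with additive jitter; an in-process tier on top would add its own TTL to the worst case. Both function names are hypothetical.

```typescript
// Worst-case staleness for TTL-only invalidation: an entry written just before
// the flag changed lives for the full base TTL plus the maximum jitter.
function worstCaseStalenessSeconds(baseTtlSeconds: number, maxJitterSeconds = 0): number {
  return baseTtlSeconds + maxJitterSeconds;
}

// True when TTL expiry alone is guaranteed to deliver a change within the budget.
function ttlOnlyMeetsBudget(
  baseTtlSeconds: number,
  maxJitterSeconds: number,
  budgetSeconds: number,
): boolean {
  return worstCaseStalenessSeconds(baseTtlSeconds, maxJitterSeconds) <= budgetSeconds;
}

console.log(ttlOnlyMeetsBudget(30, 4, 2));  // false: a 2 s kill-switch budget needs pub/sub
console.log(ttlOnlyMeetsBudget(30, 4, 60)); // true: a 60 s budget is covered by TTL alone
```

A check like this answers the guarantee question only; the 95th-percentile behavior depends on where in its lifetime each entry happens to be when the flag changes.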