Distributed Caching for Flag Evaluations

This guide is part of the Backend Evaluation & Server-Side SDKs series.

Local, in-process evaluation resolves flags in sub-millisecond reads against a pre-compiled rule set held in heap memory. That approach works well on a single server: the rule set is loaded once, kept current by a background synchronization thread, and evaluated without any network round-trip. The problem appears the moment you scale horizontally. Each replica independently caches flag state, and a cache miss on any one of them collapses to a synchronous call to the control plane. Under load — a rolling deploy, an autoscale event, a sudden traffic spike — those simultaneous misses can overwhelm the control plane and introduce latency spikes that propagate back to your users.

The solution is a shared backing cache that sits between the in-process layer and the control plane. Misses that the in-process cache cannot serve hit the shared layer instead of the control plane, and the control plane is queried only when both local and shared caches are cold or when the provider receives a push notification through a streaming channel (see Polling vs Streaming Flag Synchronization). This guide covers how to wire that two-level topology, how to protect it against stampede conditions, and how to configure a last-known-good fallback so a control-plane outage does not degrade evaluation latency for your users.

What this guide covers: in-process versus external cache topologies, TTL strategy and consistency trade-offs, stampede protection through request coalescing and probabilistic early expiry, and last-known-good fallback for control-plane outages.

What this guide does not cover: choosing the right backing store (Redis versus Memcached — see Redis vs Memcached for Feature Flag Caching) or propagating flag changes to cached nodes after a publish event (see Cache Invalidation Strategies for Flag Updates).

Core concept and architecture

Three layers handle every evaluation request, ordered by latency and coupling:

Layer 1 — In-process LRU cache. Each replica holds a small, bounded cache in heap memory. Lookups are pure in-process reads: no serialization, no network, no locking under normal conditions. The TTL is short — 5 to 15 seconds — because this layer trades freshness for speed. The vast majority of evaluation requests terminate here.

Layer 2 — Shared external cache. When the in-process TTL expires, the miss is forwarded to a shared cache (typically Redis) that all replicas read from and write to. The TTL here is longer — 60 to 300 seconds — because the cost of a miss is a network round-trip to the cache, not to the control plane. This layer absorbs the coordination work that would otherwise require every replica to independently fetch from the authoritative source.

Layer 3 — Control plane. Queried only on a shared-cache miss, or when the provider receives a push notification from a streaming channel. The response is written back to both the shared cache and the in-process cache before being returned to the caller.

The diagram below illustrates how traffic flows across all three layers and where TTL boundaries sit.

[Figure: Distributed cache topology for feature flag evaluations. Two service replicas, each with an in-process cache (TTL 5–15 s), read from a shared Redis cache (TTL 60–300 s, shared across all replicas). The shared cache points to the control plane, the authoritative source, which is queried on a cache miss only.]
Cache topology: in-process LRU per replica feeds from a shared Redis cluster, which falls back to the control plane only on a true miss.

The rule engine performance guide covers how the in-process rule set itself is compiled and stored — this guide assumes that layer is already in place and focuses on the caching structure above it.

Step-by-step implementation

Step 1 — Wire a two-level cache (in-process and shared)

The read-aside pattern is the simplest correct approach: check the in-process cache first, then the shared cache, then the provider, and populate both layers on the way back up.

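import { OpenFeature } from "@openfeature/server-sdk";
import { createClient } from "redis";
import { LRUCache } from "lru-cache";
import { Counter } from "prom-client"; // assumption: any metrics client exposing inc(labels) works

// Hit/miss counter used below; the metric name is illustrative
const flagCacheCounter = new Counter({
  name: "flag_cache_lookups_total",
  help: "Flag cache lookups by layer and result",
  labelNames: ["layer", "result"],
});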

const FLAG_KEY = "api.catalog.new-search-ranking";

// In-process layer: 500-entry LRU, 10-second TTL
const localCache = new LRUCache<string, boolean>({
  max: 500,
  ttl: 10_000, // milliseconds
});

// Shared layer: Redis connection (inject from your DI container)
const redis = createClient({ url: process.env.REDIS_URL });
await redis.connect();

const SHARED_TTL_SECONDS = 120;

async function evaluateFlag(
  flagKey: string,
  defaultValue: boolean,
  context: Record<string, string>
): Promise<boolean> {
  const cacheKey = `flags:${flagKey}:${context.userId ?? "anon"}`;

  // Layer 1: in-process LRU
  const local = localCache.get(cacheKey);
  if (local !== undefined) {
    flagCacheCounter.inc({ layer: "local", result: "hit" });
    return local;
  }
  flagCacheCounter.inc({ layer: "local", result: "miss" });

  // Layer 2: shared Redis cache
  const shared = await redis.get(cacheKey);
  if (shared !== null) {
    const value = shared === "true";
    localCache.set(cacheKey, value); // backfill in-process layer
    flagCacheCounter.inc({ layer: "shared", result: "hit" });
    return value;
  }
  flagCacheCounter.inc({ layer: "shared", result: "miss" });

  // Layer 3: provider fetch
  const client = OpenFeature.getClient();
  const value = await client.getBooleanValue(flagKey, defaultValue, context);

  // Populate both layers
  await redis.setEx(cacheKey, SHARED_TTL_SECONDS, String(value));
  localCache.set(cacheKey, value);

  return value;
}

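Pitfall: Setting the in-process TTL equal to or longer than the shared-cache TTL undermines the two-level design. By the time a local entry expires, the shared entry written alongside it has usually expired too, so the miss falls through to the control plane. Keep the in-process TTL at 5–15 seconds and the shared-cache TTL at 60–300 seconds. The gap between them is what lets Redis absorb local misses before they reach the control plane. Note the unit difference in the code above: the LRU TTL is in milliseconds, the Redis TTL in seconds.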
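Because the two TTLs are expressed in different units, a misconfiguration is easy to miss in review. The following is a minimal sketch of a startup guard, assuming a hypothetical `CacheTtlConfig` shape; the interface and function names are illustrative and not part of any SDK.

```typescript
// Hypothetical config shape; field names are assumptions for illustration.
interface CacheTtlConfig {
  localTtlMs: number;       // in-process LRU TTL, in milliseconds
  sharedTtlSeconds: number; // shared-cache TTL, in seconds
}

// Returns a list of problems; an empty list means the ordering is sane.
function validateTtlConfig(config: CacheTtlConfig): string[] {
  const problems: string[] = [];
  const sharedTtlMs = config.sharedTtlSeconds * 1000; // compare in one unit

  if (config.localTtlMs <= 0 || sharedTtlMs <= 0) {
    problems.push("TTLs must be positive");
  }
  if (config.localTtlMs >= sharedTtlMs) {
    problems.push("in-process TTL must be shorter than the shared-cache TTL");
  }
  return problems;
}
```

Calling `validateTtlConfig({ localTtlMs: 10_000, sharedTtlSeconds: 120 })` at startup and refusing to boot on a non-empty result is one way to keep the pitfall out of production.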

Step 2 — Add stampede protection with request coalescing

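A cold cache entry, especially just after a deploy or a TTL expiry, triggers a stampede: every concurrent request that misses the cache races to fetch the same value from the provider. Coalescing resolves this by tracking in-flight fetches per cache key and having subsequent callers wait on the same Promise rather than issuing their own upstream call. The in-flight map lives in process memory, so it bounds upstream calls to one per key per replica, not one per key across the fleet; the shared cache is what limits duplication across replicas.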

const inFlight = new Map<string, Promise<boolean>>();

async function evaluateFlagCoalesced(
  flagKey: string,
  defaultValue: boolean,
  context: Record<string, string>
): Promise<boolean> {
  const cacheKey = `flags:${flagKey}:${context.userId ?? "anon"}`;

  // Fast path: in-process hit
  const local = localCache.get(cacheKey);
  if (local !== undefined) return local;

  // Shared-cache hit
  const shared = await redis.get(cacheKey);
  if (shared !== null) {
    const value = shared === "true";
    localCache.set(cacheKey, value);
    return value;
  }

  // Already fetching? Join the existing Promise.
  const existing = inFlight.get(cacheKey);
  if (existing) return existing;

  // First caller: own the fetch (named `pending` to avoid shadowing the global fetch)
  const pending = (async () => {
    try {
      const client = OpenFeature.getClient();
      const value = await client.getBooleanValue(flagKey, defaultValue, context);
      await redis.setEx(cacheKey, SHARED_TTL_SECONDS, String(value));
      localCache.set(cacheKey, value);
      return value;
    } finally {
      inFlight.delete(cacheKey); // always clean up, even on error
    }
  })();

  inFlight.set(cacheKey, pending);
  return pending;
}

Pitfall: If you do not delete the in-flight Promise in a finally block, a failed fetch leaves a settled rejection in the map permanently. Every subsequent caller for that key is handed the same rejected Promise, and the fetch is never retried. The finally block is non-negotiable.
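The overview lists probabilistic early expiry next to coalescing as a stampede defence. A minimal sketch of one common formulation follows: each reader independently decides to refresh slightly before the real expiry, with a probability that rises as expiry approaches, so refreshes spread out instead of piling up at the TTL boundary. The `TimedEntry` shape, the `beta` tuning knob, and the injectable `random` source are assumptions for illustration; the code in this guide does not store fetch durations or absolute expiry times, so you would need to persist them alongside the cached value.

```typescript
// Hypothetical metadata stored next to a cached flag value.
interface TimedEntry {
  expiresAtMs: number;     // absolute expiry time of the cached value
  fetchDurationMs: number; // how long the last upstream fetch took
}

// Returns true when this caller should refresh the entry ahead of expiry.
// beta > 1 refreshes earlier on average; beta < 1 refreshes later.
function shouldRefreshEarly(
  entry: TimedEntry,
  nowMs: number,
  beta = 1.0,
  random: () => number = Math.random
): boolean {
  // Map [0, 1) to (0, 1] so Math.log never receives 0.
  const u = 1 - random();
  // -ln(u) is a non-negative, exponentially distributed head start factor.
  const headStartMs = entry.fetchDurationMs * beta * -Math.log(u);
  return nowMs + headStartMs >= entry.expiresAtMs;
}
```

On a shared-cache hit, a caller for which `shouldRefreshEarly` returns true could return the cached value immediately and trigger the coalesced fetch in the background, so that only a few readers ever see a true miss.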

Step 3 — Configure a last-known-good fallback

When both cache layers are cold and the control plane is unreachable (network partition, deployment, overload), the provider call fails. Instead of returning the hard-coded SDK default (which may be the wrong safe value for your context), capture the most recent successfully resolved value and serve it.

const lastKnownGood = new Map<
  string,
  { value: boolean; resolvedAt: number }
>();

const LAST_KNOWN_GOOD_MAX_AGE_MS = 10 * 60 * 1000; // 10-minute hard cap

async function evaluateFlagWithFallback(
  flagKey: string,
  defaultValue: boolean,
  context: Record<string, string>
): Promise<boolean> {
  const cacheKey = `flags:${flagKey}:${context.userId ?? "anon"}`;

  // Fast path: in-process
  const local = localCache.get(cacheKey);
  if (local !== undefined) {
    lastKnownGood.set(cacheKey, { value: local, resolvedAt: Date.now() });
    return local;
  }

  // Shared cache
  const shared = await redis.get(cacheKey).catch(() => null);
  if (shared !== null) {
    const value = shared === "true";
    localCache.set(cacheKey, value);
    lastKnownGood.set(cacheKey, { value, resolvedAt: Date.now() });
    return value;
  }

  // Provider fetch. OpenFeature clients do not throw on provider errors; they
  // return defaultValue with errorCode set, so inspect the details object.
  try {
    const client = OpenFeature.getClient();
    const details = await client.getBooleanDetails(flagKey, defaultValue, context);
    if (details.errorCode) {
      throw new Error(details.errorMessage ?? String(details.errorCode));
    }

    const value = details.value;
    // A failed cache write must not discard a freshly resolved value
    await redis
      .setEx(cacheKey, SHARED_TTL_SECONDS, String(value))
      .catch(() => undefined);
    localCache.set(cacheKey, value);
    lastKnownGood.set(cacheKey, { value, resolvedAt: Date.now() });
    return value;
  } catch {
    const lkg = lastKnownGood.get(cacheKey);
    const age = lkg ? Date.now() - lkg.resolvedAt : Infinity;

    if (lkg && age < LAST_KNOWN_GOOD_MAX_AGE_MS) {
      lkgServedCounter.inc({ flag: flagKey }); // alert when this fires
      return lkg.value;
    }

    // LKG too old or never populated: fall back to hard default
    return defaultValue;
  }
}

Pitfall: Without a maximum age cap, last-known-good values can grow arbitrarily stale during a prolonged outage. A service that was deployed with a flag enabled six hours ago and whose control plane has been unreachable since will continue serving the stale value indefinitely. Bound the age (10 minutes is a reasonable starting point), emit a metric or log when the cap is exceeded, and set an alert on that signal so on-call engineers know to investigate.

Verification and testing

Verify that the two-level cache is operating correctly by checking hit rates and validating fallback behavior under simulated failure conditions.

Cache hit rate. Instrument hit and miss counters per layer (for example flag.cache.hit and flag.cache.miss with a layer label). On a steady-state service that has been running for more than one shared-cache TTL cycle, expect a local hit rate above 90% and, for the lookups that miss the local layer, a shared-cache hit rate above 99%. Rates well below those thresholds usually mean the TTL is too short relative to the interval between requests for the same key, or the flag key space is larger than the in-process cache capacity.

Stampede coalescing. Flush one shared-cache entry, wait for the in-process TTL to lapse so the local copy is gone too, and send a burst of concurrent requests while observing upstream provider call counts:

# Flush a single flag entry from the shared cache
redis-cli DEL "flags:api.catalog.new-search-ranking:user123"

# Observe provider call count: expect exactly 1 upstream fetch per replica per burst
# (the in-flight map is per process), regardless of how many concurrent callers arrived
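The same property can be checked in a unit test without Redis or a real provider. The sketch below mirrors the in-flight map from Step 2 in a generic helper and counts calls to a stand-in fetch function; `createCoalescer`, `fakeProviderFetch`, and `runBurst` are hypothetical names introduced only for this example.

```typescript
// Generic single-flight helper mirroring the inFlight map from Step 2.
function createCoalescer<T>() {
  const inFlight = new Map<string, Promise<T>>();
  return (key: string, load: () => Promise<T>): Promise<T> => {
    const existing = inFlight.get(key);
    if (existing) return existing; // join the fetch already in progress
    const pending = load().finally(() => inFlight.delete(key));
    inFlight.set(key, pending);
    return pending;
  };
}

// Stand-in for the provider: counts calls and resolves after 20 ms.
let upstreamCalls = 0;
const fakeProviderFetch = (): Promise<boolean> => {
  upstreamCalls += 1;
  return new Promise<boolean>((resolve) => setTimeout(() => resolve(true), 20));
};

// Fires `size` concurrent lookups for one key and reports the upstream call count.
async function runBurst(
  size: number
): Promise<{ results: boolean[]; calls: number }> {
  upstreamCalls = 0;
  const coalesce = createCoalescer<boolean>();
  const results = await Promise.all(
    Array.from({ length: size }, () =>
      coalesce("flags:demo:user123", fakeProviderFetch)
    )
  );
  return { results, calls: upstreamCalls };
}
```

A burst of any size should report a single upstream call while every caller still receives the resolved value. This only exercises one process; cross-replica behaviour still needs the Redis-level check above.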

Last-known-good fallback. Block the provider endpoint (or bring it down in a local test environment) and verify that evaluation still returns the previously cached value, not the hard-coded default:

# In your integration test: shut down the mock provider server
# then invoke evaluateFlagWithFallback and assert the result equals
# the last value the provider returned, not the defaultValue argument

Add an integration test that records a last-known-good entry, advances the clock past the hard cap (or backdates the entry's resolvedAt by more than the cap), and asserts that the hard default is returned rather than the expired last-known-good value.
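The age check is easier to test when the clock is passed in rather than read from `Date.now()` inside the function. A minimal sketch of that decision as a pure function follows; `resolveFromLastKnownGood` and the `source` field are hypothetical and exist only to make the branch observable in a test.

```typescript
interface LkgEntry {
  value: boolean;
  resolvedAt: number; // epoch milliseconds when the value was recorded
}

// Decides between the last-known-good value and the hard default.
// nowMs is injected so tests can move time without sleeping.
function resolveFromLastKnownGood(
  entry: LkgEntry | undefined,
  defaultValue: boolean,
  maxAgeMs: number,
  nowMs: number
): { value: boolean; source: "lkg" | "default" } {
  if (entry && nowMs - entry.resolvedAt < maxAgeMs) {
    return { value: entry.value, source: "lkg" };
  }
  // Entry missing or older than the cap: serve the hard default.
  return { value: defaultValue, source: "default" };
}
```

With a ten-minute cap, an entry recorded at time `t` is served up to `t + 599_999` ms and rejected from `t + 600_000` ms onward, which is the boundary the integration test should pin down.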

Troubleshooting & FAQ

Why does p99 evaluation latency spike during a rolling deploy?

New pods start with cold in-process caches. If several replicas restart simultaneously — as happens during a rolling deploy with a small replica count — the initial evaluation requests from all of them miss the local layer and hit the shared cache at once. If the shared cache is also cold (a full restart or a cache flush), all those misses escalate to the control plane simultaneously.

Mitigate this by pre-warming the shared cache before scaling up: run a background job that evaluates the full set of active flag taxonomy entries against representative contexts and writes the results to Redis. Alternatively, stagger pod restarts using maxUnavailable: 1 in your Kubernetes rollout strategy so that at most one replica is cold at a time, limiting the miss volume to a fraction of total traffic.
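A pre-warm job along those lines can be sketched as a loop over flags and representative contexts. The version below takes its dependencies as functions so it runs without Redis or a provider; `PrewarmDeps`, `prewarmSharedCache`, and the sequential strategy are assumptions for illustration. In a real job, `evaluate` would wrap the provider call and `write` would wrap the shared-cache write.

```typescript
type FlagContext = Record<string, string>;

// Hypothetical dependency seam so the job can be tested with stubs.
interface PrewarmDeps {
  evaluate: (flagKey: string, context: FlagContext) => Promise<boolean>;
  write: (cacheKey: string, value: boolean) => Promise<void>;
}

// Evaluates each flag against each context and writes the result to the
// shared cache. Runs sequentially so the warm-up does not itself stampede
// the control plane.
async function prewarmSharedCache(
  flagKeys: string[],
  contexts: FlagContext[],
  deps: PrewarmDeps
): Promise<{ written: number; failed: number }> {
  let written = 0;
  let failed = 0;
  for (const flagKey of flagKeys) {
    for (const context of contexts) {
      // Same key format as the evaluation functions in this guide.
      const cacheKey = `flags:${flagKey}:${context.userId ?? "anon"}`;
      try {
        const value = await deps.evaluate(flagKey, context);
        await deps.write(cacheKey, value);
        written += 1;
      } catch {
        failed += 1; // keep going; one bad flag should not abort the warm-up
      }
    }
  }
  return { written, failed };
}
```

The returned counts give the deploy pipeline something to gate on, for example refusing to scale up if `failed` is non-zero.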

How do I verify the in-process cache is actually being hit?

Add explicit counters to the cache-check path — a flag.cache.hit{layer="local"} increment when localCache.get returns a defined value, and a flag.cache.miss{layer="local"} increment otherwise. Expose these through your metrics endpoint (Prometheus, StatsD, or CloudWatch) and watch the ratio in a dashboard.

A hit rate below roughly 90% on a steady-state service typically indicates one of three conditions: the in-process TTL is shorter than the average interval between requests for the same key-context pair (in which case raise the TTL to 10–15 seconds); the flag key space combined with the context cardinality is too large for the allocated LRU capacity (either increase max or reduce context cardinality by pre-hashing high-cardinality fields); or context enrichment is producing different cache key strings for semantically identical requests — normalize context fields before constructing the cache key.
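The third condition, inconsistent key strings for equivalent contexts, can be addressed with a small key builder. This is a sketch under stated assumptions: `buildCacheKey` is a hypothetical helper, it treats the listed fields as case-insensitive (which may not hold for your identifiers), and it produces a `field=value` key format that differs from the simpler format used earlier in this guide.

```typescript
// Builds a deterministic cache key from selected context fields.
// Assumption: field values are safe to trim and lowercase.
function buildCacheKey(
  flagKey: string,
  context: Record<string, string>,
  keyFields: readonly string[] = ["userId"]
): string {
  const parts = [...keyFields]
    .sort() // field order must not affect the key
    .map((field) => {
      const raw: string | undefined = context[field];
      const normalized = raw === undefined ? "" : raw.trim().toLowerCase();
      return `${field}=${normalized === "" ? "anon" : normalized}`;
    });
  return `flags:${flagKey}:${parts.join("&")}`;
}
```

Only fields that actually influence targeting should appear in `keyFields`; every extra field multiplies the key space and lowers the hit rate.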

What should I do when last-known-good values expire?

When a last-known-good value exceeds the maximum age cap, the code path falls back to the hard-coded SDK default. The correct response depends on what the flag controls. For flags that gate a non-critical enhancement, the hard default (typically false) is safe. For flags that control a critical path that was previously enabled, falling back to false may break the feature entirely.

Document the intended fallback behavior in the flag’s metadata so on-call engineers do not have to reverse-engineer intent during an incident. Emit a distinguishable log line and metric when the hard default is served due to LKG expiry — treat it as a circuit-breaker signal, not a silent fallback. If your service level objective cannot tolerate the hard default, configure an alert on the LKG-expired counter and include a runbook entry for the control-plane recovery procedure.

Performance and scale

In-process cache sizing. A typical flag evaluation payload — the resolved boolean or string value plus associated metadata — is 500 bytes to 2 KB serialized. At 1 KB per entry, a 10 MB heap allocation holds roughly 10,000 cache entries. For services that evaluate many distinct flag-context combinations (high user cardinality), measure the actual payload size in your environment and size the LRU accordingly. Over-allocating wastes heap; under-allocating turns the in-process cache into a constant churn layer that provides no hit rate benefit.
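The sizing arithmetic above can be written down as a small helper. This is a rough sketch: `estimateMaxEntries` is a hypothetical function, and it ignores per-entry overhead in the runtime (object headers, key strings, LRU bookkeeping), so treat its result as an upper bound.

```typescript
// Upper-bound estimate of how many entries fit in a heap budget.
function estimateMaxEntries(
  heapBudgetBytes: number,
  avgEntryBytes: number
): number {
  if (heapBudgetBytes <= 0 || avgEntryBytes <= 0) {
    throw new RangeError("budget and entry size must be positive");
  }
  return Math.floor(heapBudgetBytes / avgEntryBytes);
}

const TEN_MB = 10 * 1024 * 1024;

estimateMaxEntries(TEN_MB, 1024); // 10240 entries at 1 KB each
estimateMaxEntries(TEN_MB, 2048); // 5120 entries at 2 KB each
estimateMaxEntries(TEN_MB, 500);  // 20971 entries at 500 bytes each
```

The result is a candidate for the LRU `max` option. If entry sizes vary widely, a size-based limit may fit better than an entry count; check whether the LRU library version you use supports one.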

Shared cache connection pooling. Each service replica should maintain a persistent connection pool to the Redis cluster rather than opening connections per request. A pool of 5 to 20 connections per replica is appropriate for most flag-evaluation traffic patterns. Too few connections serialize reads and create queuing latency; too many exhaust Redis’s file descriptor limit across all replicas. Monitor redis_connected_clients against your Redis instance’s maxclients setting and adjust the per-replica pool ceiling accordingly.

Hot key mitigation. A single heavily evaluated flag, particularly a global kill switch or a top-level feature gate, can generate enough Redis reads to saturate the one node that owns that key's hash slot in a Redis Cluster deployment. Two mitigations work well in combination: add a small random jitter (±10–20%) to each entry's TTL so that expiries do not synchronize across replicas, and keep the in-process TTL long enough to absorb most traffic before it reaches Redis at all. If a single flag still generates pathological Redis traffic, consider a local short-circuit: skip the shared-cache round-trip entirely for the first several seconds after any successful in-process fetch for that key.
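TTL jitter is a one-line change at write time. The sketch below assumes a hypothetical `jitteredTtlSeconds` helper with a default jitter of ±15%, inside the range suggested above; the random source is injectable only so the function can be tested deterministically.

```typescript
// Returns the base TTL shifted by a uniform random offset within
// +/- jitterFraction, rounded to whole seconds and never below 1.
function jitteredTtlSeconds(
  baseTtlSeconds: number,
  jitterFraction = 0.15,
  random: () => number = Math.random
): number {
  const offset = (random() * 2 - 1) * jitterFraction; // uniform in [-f, +f)
  return Math.max(1, Math.round(baseTtlSeconds * (1 + offset)));
}
```

In the write path, the fixed TTL argument would be replaced with the jittered one, for example `redis.setEx(cacheKey, jitteredTtlSeconds(SHARED_TTL_SECONDS), String(value))`. With a 120-second base, entries then expire somewhere between 102 and 138 seconds after being written.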

Benchmark targets. These are reasonable targets for a well-tuned two-level cache setup: in-process hit under 0.1 ms end-to-end; shared cache hit under 2 ms (dominated by network RTT to Redis); provider fetch 10–50 ms depending on control-plane deployment topology. Measure at the 99th percentile under realistic concurrency — median latency will flatter any implementation.