Reducing Flag Evaluation Latency to Under 5ms
This how-to is part of Optimizing Rule Engine Performance. It targets the final kilometre: you already have a working server-side SDK and a local rule set, but p99 is drifting past 5 ms under load. The interventions below are ordered by impact-to-effort ratio; apply them in sequence and re-benchmark after each.
The problem is almost never the network. Once a server-side SDK evaluates in-process, the remaining latency comes from three sources: re-parsing rule definitions on each call, oversized context payloads that inflate serialisation and attribute traversal time, and garbage-collector pauses caused by per-evaluation heap allocation. All three are avoidable.
Prerequisites
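- Flag keys that follow the namespace.service.feature convention
- Prometheus scraping of the flag_eval_duration_seconds histogram with sub-millisecond buckets
- A profiler for your runtime (go tool pprof, Node.js --prof, or py-spy for Python)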
Step 1 — Measure, then isolate the evaluation span
Do not optimise before you know which phase dominates. Standard SDK telemetry often aggregates initialisation and network time with evaluation time, masking where the latency actually lives.
import { OpenFeature } from '@openfeature/server-sdk';

const client = OpenFeature.getClient();

async function measureEval(flagKey: string, ctx: Record<string, unknown>): Promise<boolean> {
  const start = process.hrtime.bigint();
  const result = await client.getBooleanValue(flagKey, false, ctx);
  const ns = Number(process.hrtime.bigint() - start);
  if (ns > 5_000_000) {
    console.warn('eval breach', { flagKey, latencyMs: (ns / 1e6).toFixed(3) });
  }
  return result;
}
Run this wrapper in production for one traffic window, then group the breach logs by flagKey. A flag with disproportionate latency typically has a complex rule, a regex predicate, or per-call parse overhead. Fix the outliers first: they move p99 without touching the median.
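Grouping the breach logs can be done with a few lines of scripting. The sketch below assumes the breach records have been exported as one JSON object per line with flagKey and latencyMs fields. That export format, and the summarise_breaches name, are assumptions for illustration, not something the wrapper above produces by itself.

```python
import json
from collections import defaultdict


def summarise_breaches(log_lines):
    """Group breach records by flagKey.

    Returns a list of (flagKey, breach_count, worst_latency_ms) tuples,
    sorted so the flag with the most breaches comes first.
    """
    stats = defaultdict(lambda: [0, 0.0])  # flagKey -> [count, worst latency in ms]
    for line in log_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not JSON
        if not isinstance(record, dict) or "flagKey" not in record:
            continue
        entry = stats[record["flagKey"]]
        entry[0] += 1
        entry[1] = max(entry[1], float(record.get("latencyMs", 0)))
    return sorted(
        ((key, count, worst) for key, (count, worst) in stats.items()),
        key=lambda row: row[1],
        reverse=True,
    )
```

Flags at the top of the returned list are the first candidates for the rule-level fixes in the following steps.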
Once you know which phase dominates, the rule engine optimisation guide covers AST compilation, short-circuit logic, and regex elimination in depth.
Step 2 — Pre-compile rules; eliminate per-call JSON parsing
Re-parsing a JsonLogic document on every evaluation is the single largest avoidable cost. Move parsing to provider initialisation and store the compiled tree keyed by flag key + version.
from openfeature import api
from openfeature.evaluation_context import EvaluationContext
import json

# Module-level compiled rule cache: populated at startup, updated on flag change
_compiled: dict[str, object] = {}


def _compile_rule(rule_json: str, version: int) -> object:
    """Parse once; store by key+version. Called only at init or on flag update."""
    rule = json.loads(rule_json)
    # compile to your AST or closure here; see precompiling guide for full pattern
    # (_build_ast is your own compiler and is not defined in this snippet)
    return _build_ast(rule)


def load_flags(flag_defs: list[dict]) -> None:
    for f in flag_defs:
        cache_key = f"{f['key']}@{f['version']}"
        if cache_key not in _compiled:
            _compiled[cache_key] = _compile_rule(f["targeting"], f["version"])
The evaluation function then performs only a hash-map lookup — no JSON operations on the hot path. For the full incremental-update pattern, see precompiling targeting rules into an AST.
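To make the "compile once, look up on the hot path" idea concrete, here is a minimal sketch that compiles a small JsonLogic subset into nested Python closures. The supported operators (var, ==, in, and, or), the simplified semantics (strict equality, boolean results), and the names build_closure, load_rule, and evaluate are all assumptions for illustration. It is not the compiler from the precompiling guide.

```python
import json
from typing import Any, Callable

Rule = Callable[[dict], Any]


def build_closure(node: Any) -> Rule:
    """Compile a JsonLogic-style node into a closure; no parsing happens at call time."""
    if not isinstance(node, dict):
        return lambda ctx: node  # literal (string, number, list of literals)
    (op, args), = node.items()  # each operator node has exactly one key
    if op == "var":
        name = args if isinstance(args, str) else args[0]
        return lambda ctx: ctx.get(name)
    operands = [build_closure(a) for a in (args if isinstance(args, list) else [args])]
    if op == "==":
        left, right = operands
        return lambda ctx: left(ctx) == right(ctx)
    if op == "in":
        needle, haystack = operands
        return lambda ctx: needle(ctx) in haystack(ctx)
    if op == "and":
        return lambda ctx: all(f(ctx) for f in operands)  # short-circuits
    if op == "or":
        return lambda ctx: any(f(ctx) for f in operands)  # short-circuits
    raise ValueError(f"unsupported operator: {op}")


_compiled_rules: dict[str, Rule] = {}


def load_rule(flag_key: str, version: int, rule_json: str) -> None:
    """Parse and compile once, at init or on flag update."""
    _compiled_rules[f"{flag_key}@{version}"] = build_closure(json.loads(rule_json))


def evaluate(flag_key: str, version: int, ctx: dict) -> bool:
    """Hot path: one dictionary lookup plus the closure call."""
    return bool(_compiled_rules[f"{flag_key}@{version}"](ctx))
```

Unsupported operators fail loudly at load time rather than during evaluation, which keeps surprises off the request path.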
Step 3 — Trim the context payload to only targeting keys
Every attribute in the evaluation context is examined during rule traversal. A context carrying 40 keys where the rule reads only 3 wastes iteration on every call.
# Declare the exact keys each rule reads; enforced at context build time
TARGETING_KEYS = frozenset({"userId", "region", "tenantTier", "environment"})


def build_lean_context(raw: dict) -> dict:
    """Strip non-targeting keys before they reach the rule engine."""
    return {k: raw[k] for k in TARGETING_KEYS if k in raw}

// TypeScript variant: strip at the API layer, not inside the engine
const TARGETING_KEYS = new Set(['userId', 'region', 'tenantTier', 'environment']);

function buildContext(req: Request): Record<string, unknown> {
  const raw = extractAttributes(req);
  return Object.fromEntries(
    Object.entries(raw).filter(([k]) => TARGETING_KEYS.has(k))
  );
}
Aim for ≤ 10 context keys. Benchmarks consistently show 30–45% latency reduction when context payloads drop from 40+ keys to the targeting subset.
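Hand-maintained key lists tend to drift from the rules they describe. Assuming the targeting rules are JsonLogic documents, the set of keys can instead be derived by walking each rule and collecting its var references. The sketch below illustrates the idea; the function names are hypothetical, and dotted paths such as user.plan are reduced to their top-level key.

```python
import json


def collect_vars(node, found=None):
    """Recursively gather every attribute name referenced by a {"var": ...} node."""
    if found is None:
        found = set()
    if isinstance(node, dict):
        for op, args in node.items():
            if op == "var":
                # "var" takes either a name or a [name, default] pair
                name = args[0] if isinstance(args, list) and args else args
                if isinstance(name, str) and name:
                    found.add(name.split(".")[0])  # keep the top-level key only
            else:
                collect_vars(args, found)
    elif isinstance(node, list):
        for item in node:
            collect_vars(item, found)
    return found


def targeting_keys(rule_docs):
    """Union of the context keys read by all of the given rule documents."""
    keys = set()
    for doc in rule_docs:
        keys |= collect_vars(json.loads(doc))
    return frozenset(keys)
```

Running this at flag-load time and comparing the result with the declared key set is one way to catch a rule that starts reading an attribute the context builder strips.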
Step 4 — Add an in-process LRU cache for stable context hashes
Many requests carry identical evaluation contexts — the same user, same region, same tier. A small in-process LRU cache keyed on a deterministic context hash short-circuits the rule engine entirely for repeated contexts.
import hashlib
from collections import OrderedDict

_CACHE_MAX = 5_000  # tune to your working-set size
_lru: OrderedDict[str, bool] = OrderedDict()


def evaluate_with_cache(flag_key: str, ctx: dict) -> bool:
    lean = build_lean_context(ctx)
    ctx_hash = hashlib.sha256(
        (flag_key + str(sorted(lean.items()))).encode()
    ).hexdigest()[:16]

    if ctx_hash in _lru:
        _lru.move_to_end(ctx_hash)  # maintain LRU order
        return _lru[ctx_hash]

    result = _run_compiled_rule(flag_key, lean)
    if len(_lru) >= _CACHE_MAX:
        _lru.popitem(last=False)  # evict least-recently-used
    _lru[ctx_hash] = result
    return result
Pitfall: a cache keyed on the raw context (before trimming) will miss far more often than one keyed on the targeting-key subset. Trim first, then hash. Also invalidate the LRU whenever load_flags runs — stale cache entries from a previous rule version can serve the wrong variant.
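One way to keep stale entries from being served after a rule change, without tracking which entries belong to which flag, is to fold the rule version into the cache key. The helper below is a sketch under that assumption; versioned_cache_key is a hypothetical name, and the caller is expected to pass the already-trimmed context.

```python
import hashlib


def versioned_cache_key(flag_key: str, version: int, lean_ctx: dict) -> str:
    """Hash the flag key, rule version, and trimmed context into one cache key."""
    # Sorting the items makes the key independent of dict insertion order.
    material = f"{flag_key}@{version}|{sorted(lean_ctx.items())!r}"
    return hashlib.sha256(material.encode()).hexdigest()[:16]
```

With this scheme, entries written under an old version are never looked up again and simply age out of the LRU.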
For distributed caching across multiple replicas, the same key scheme applies to Redis or Memcached, but the in-process LRU should remain the primary layer — remote cache calls add network jitter.
Verification Step
Run a Go benchmark (or your language’s equivalent) comparing the optimised path against the baseline:
// go test -bench=BenchmarkEval -benchtime=10s -count=3
func BenchmarkEvalOptimised(b *testing.B) {
	ctx := map[string]interface{}{
		"tenantTier": "enterprise",
		"region":     "us-east-1",
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		evaluateWithCache("api.search.semantic-rerank", ctx)
	}
}
Target: median (p50) ≤ 0.5 ms, p99 ≤ 2 ms in the benchmark. In production, confirm with the flag_eval_duration_seconds histogram — the 0.005-second bucket should capture ≥ 99% of observations.
# Check the 99th percentile in Prometheus
curl -sg 'http://prometheus:9090/api/v1/query' \
--data-urlencode 'query=histogram_quantile(0.99, rate(flag_eval_duration_seconds_bucket[5m]))' \
| jq '.data.result[].value[1]'
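The benchmark targets above are percentiles, whereas a plain benchmark loop such as go test -bench reports a mean time per operation. To check p50 and p99 offline, record one sample per call and compute the cut points yourself. The sketch below shows this in Python, using only the standard library; measure_percentiles and percentiles are hypothetical helper names, and fn stands for whatever zero-argument callable wraps your evaluation.

```python
import statistics
import time


def percentiles(samples_ms):
    """Return (p50, p99) for a list of latency samples in milliseconds."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[98]  # 50th and 99th percentile cut points


def measure_percentiles(fn, iterations=10_000):
    """Time fn() repeatedly and return (p50_ms, p99_ms)."""
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter_ns()
        fn()
        samples_ms.append((time.perf_counter_ns() - start) / 1e6)
    return percentiles(samples_ms)
```

A call such as measure_percentiles(lambda: evaluate_with_cache("api.search.semantic-rerank", ctx)) exercises mostly the cache-hit path when ctx is constant, so vary the context if you also want to measure misses.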
Gotchas & Edge Cases
- Cold-start evals: The first evaluation after a pod restart hits the compiled-rule path before the LRU is warm. Warm the cache with a synthetic context sweep at startup if cold-start latency is in your SLO.
- Flag update invalidation: When load_flags recompiles a rule, any LRU entries for that flag’s previous version become stale. The safest approach is to include the rule version in the cache key; mismatches naturally fall through to a fresh evaluation.
- Regex in custom hooks: If your evaluation context uses a custom targeting hook that calls an external service or performs regex matching, no amount of AST optimisation will save you, because that hook runs on the hot path. Move external lookups to context enrichment at the API boundary, not inside the hook.
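For the cold-start case, a synthetic context sweep can be sketched as a Cartesian product over the low-cardinality targeting attributes. The warm_cache helper and its parameters are assumptions for illustration: evaluate is any callable with the (flag_key, ctx) shape of the cached evaluation function, and dimensions maps attribute names to the values worth pre-warming. High-cardinality attributes such as userId cannot usefully be swept this way.

```python
from itertools import product


def warm_cache(evaluate, flag_keys, dimensions):
    """Call evaluate(flag_key, ctx) for every combination of attribute values.

    Returns the number of evaluations performed.
    """
    names = sorted(dimensions)
    warmed = 0
    for flag_key in flag_keys:
        for values in product(*(dimensions[name] for name in names)):
            evaluate(flag_key, dict(zip(names, values)))
            warmed += 1
    return warmed
```

The sweep size is the product of the value counts times the number of flags, so keep the dimensions small enough that startup time stays acceptable.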
Troubleshooting & FAQ
Pre-compiled rules are set up but p99 is still high — what else could it be?
After ruling out re-parsing, look for regex predicates in targeting rules, context payloads over 10 keys, or synchronous I/O inside a custom hook. Run a CPU profile (py-spy top, go tool pprof, or Node.js --prof) against a load-test and check which functions appear on the hot path.
The LRU cache is hitting nearly 100% — is that right?
A high hit rate is exactly the goal, but confirm you are not serving stale variants after a flag change. Add a flag_cache_evictions_total counter that increments whenever load_flags clears entries for a given key, and correlate it with flag update events from your control plane.
How do I set a p99 alert without noisy false positives?
Use a 5-minute rate window for the histogram and for: 2m in the alert rule so transient GC spikes do not page you. Gate the alert on a minimum request rate (e.g. rate(flag_eval_duration_seconds_count[5m]) > 10) so a quiet pod does not fire.