Exponential Backoff for SDK Reconnection
This how-to is part of Server-Side SDK Integration Patterns. It solves one specific problem: when a server-side SDK loses its connection to the control plane, naively retrying immediately — or on a fixed interval — causes every replica in your fleet to hammer the control plane simultaneously the moment it comes back online. That thundering herd can delay recovery, push the control plane into another failure mode, and extend the window in which your flags serve stale state.
The fix is exponential backoff with full jitter: each replica waits a randomized, growing delay before each retry attempt. The delays spread across the recovery window so connection requests arrive as a gentle ramp rather than a synchronized spike.
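To make the difference concrete, the following is a small simulation sketch rather than production code. It compares how many retries land in the same 100 ms bucket when every replica waits the same fixed delay versus a full-jitter delay. The fleet size, the cap, and the seeded `mulberry32` generator are illustrative assumptions chosen so the run is repeatable.

```typescript
// Small deterministic PRNG (mulberry32) so the simulation is repeatable.
// Production code can simply use Math.random().
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Largest number of retries that land in the same time bucket.
function peakArrivals(delaysMs: number[], bucketMs: number = 100): number {
  const counts: { [bucket: number]: number } = {};
  let peak = 0;
  for (const delay of delaysMs) {
    const bucket = Math.floor(delay / bucketMs);
    counts[bucket] = (counts[bucket] || 0) + 1;
    if (counts[bucket] > peak) peak = counts[bucket];
  }
  return peak;
}

const REPLICAS = 1_000;     // hypothetical fleet size
const RETRY_CAP_MS = 8_000; // hypothetical cap for the simulated attempt
const nextRandom = mulberry32(42);

const fixedDelays: number[] = [];
const jitteredDelays: number[] = [];
for (let i = 0; i < REPLICAS; i++) {
  fixedDelays.push(RETRY_CAP_MS);                   // fixed interval: everyone retries together
  jitteredDelays.push(nextRandom() * RETRY_CAP_MS); // full jitter: uniform in [0, cap)
}

const fixedPeak = peakArrivals(fixedDelays);       // all replicas share one bucket
const jitteredPeak = peakArrivals(jitteredDelays); // spread over 80 buckets
console.log({ fixedPeak, jitteredPeak });
```

With a fixed delay the peak equals the fleet size; with full jitter the same retries are spread across the whole cap window, so the per-bucket peak is a small fraction of the fleet.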
Prerequisites
- A provider that emits connection lifecycle events (PROVIDER_RECONNECTING and PROVIDER_READY)
- A flag.sdk.reconnect_attempt counter metric already wired to your observability stack
- Polling vs streaming decision made: these patterns apply to streaming connections; polling replicas should jitter their poll interval instead
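For polling replicas, one way to jitter the poll interval is sketched below. The function name `jitteredPollIntervalMs` and the default spread of 20% are assumptions for illustration, not settings from any particular provider.

```typescript
// Returns a poll interval drawn uniformly from [base - spread*base, base + spread*base).
// `random` is injectable so the function can be tested deterministically.
function jitteredPollIntervalMs(
  baseMs: number,
  spread: number = 0.2,
  random: () => number = Math.random,
): number {
  const offset = (random() * 2 - 1) * spread * baseMs;
  return Math.max(0, baseMs + offset);
}

// Example: a 30s poll interval becomes something between 24s and 36s.
const nextPollMs = jitteredPollIntervalMs(30_000);
console.log(Math.round(nextPollMs));
```

Recomputing the interval before every poll, rather than once at startup, keeps replicas from drifting back into step over time.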
Step 1 — Detect a dropped connection
Listen to the provider’s lifecycle events. Most OpenFeature-compatible providers emit events when the connection drops, so you can track when a replica enters a degraded state and when it returns to healthy.
import { OpenFeature, ProviderEvents } from '@openfeature/server-sdk';
import { metrics } from './observability';

const client = OpenFeature.getClient('api');

OpenFeature.addHandler(ProviderEvents.Stale, () => {
  metrics.increment('flag.sdk.connection_lost');
  // Provider is now serving last-known-good; reconnection starts automatically
});

// Note: the standard server SDK events are Ready, Error, ConfigurationChanged, and Stale.
// Reconnecting and details.attemptNumber are assumed to be provider-specific extensions;
// if your provider lacks them, count attempts inside the retry loop in Step 2 instead.
OpenFeature.addHandler(ProviderEvents.Reconnecting, (details) => {
  metrics.increment('flag.sdk.reconnect_attempt', {
    attempt: String(details?.attemptNumber ?? 0),
  });
});

OpenFeature.addHandler(ProviderEvents.Ready, () => {
  metrics.gauge('flag.sdk.connected', 1);
});
Pitfall: if your provider does not emit Stale events, you will not detect a silent connection drop. To check, block the control-plane port for 10s and verify that flag.sdk.connection_lost increments.
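To track when a replica enters and leaves the degraded state, a small tracker can sit behind the event handlers. This is a sketch that is independent of any SDK: the `ConnectionTracker` class and its method names are hypothetical, and the clock is injectable so the logic can be tested without waiting.

```typescript
type ConnectionState = 'connected' | 'stale';

class ConnectionTracker {
  private state: ConnectionState = 'connected';
  private staleSince: number | null = null;
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Call from the Stale handler.
  markStale(): void {
    if (this.state === 'stale') return; // keep the timestamp of the first drop
    this.state = 'stale';
    this.staleSince = this.now();
  }

  // Call from the Ready handler; returns how long the replica was degraded, in ms.
  markReady(): number {
    const degradedMs = this.staleSince === null ? 0 : this.now() - this.staleSince;
    this.state = 'connected';
    this.staleSince = null;
    return degradedMs;
  }

  isStale(): boolean {
    return this.state === 'stale';
  }
}
```

The value returned by `markReady()` is a natural candidate for a histogram metric of outage duration per replica.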
Step 2 — Retry with exponential backoff and full jitter
Implement the backoff loop manually if the provider does not include one, or configure the built-in parameters if it does. Full jitter picks the actual delay uniformly at random between zero and the exponential cap for that attempt.
// reconnect.ts
import { metrics } from './observability';

const BASE_MS = 1_000; // 1s base
const MAX_MS = 60_000; // 60s ceiling
const MAX_ATTEMPTS = 10;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

function backoffDelayMs(attempt: number): number {
  // Exponential cap: base * 2^attempt, capped at MAX_MS
  const cap = Math.min(MAX_MS, BASE_MS * Math.pow(2, attempt));
  // Full jitter: random value in [0, cap)
  return Math.random() * cap;
}

async function reconnectWithBackoff(
  reconnect: () => Promise<void>,
  onExhausted: () => void,
): Promise<void> {
  for (let attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
    const delay = backoffDelayMs(attempt);
    await sleep(delay);
    try {
      await reconnect();
      return; // success: exit loop
    } catch (err) {
      // Count the failed attempt; the next iteration retries with a larger cap
      metrics.increment('flag.sdk.reconnect_attempt', { attempt: String(attempt) });
      if (attempt === MAX_ATTEMPTS - 1) onExhausted();
    }
  }
}
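The same delay calculation in Go. The snippet covers only the delay function; call it from the retry loop in your resilience layer, for example alongside gobreaker:
// backoff.go
package backoff // package name is illustrative

import (
	"math"
	"math/rand"
	"time"
)

const (
	baseMs     = 1_000  // 1s base
	maxMs      = 60_000 // 60s ceiling
	maxRetries = 10
)

// backoffDelay returns a full-jitter delay for the given zero-based attempt.
// Before Go 1.20 the global math/rand source is deterministic unless seeded,
// so seed it once at startup on older toolchains.
func backoffDelay(attempt int) time.Duration {
	ceiling := math.Min(maxMs, float64(baseMs)*math.Pow(2, float64(attempt)))
	jittered := rand.Float64() * ceiling // full jitter: [0, ceiling)
	return time.Duration(jittered) * time.Millisecond
}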
Pitfall: decorrelated jitter (sleep = min(cap, random_between(base, previous_sleep * 3))) is an alternative that can spread arrivals slightly better, but full jitter is simpler and sufficient for flag SDK reconnection volumes.
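For comparison, decorrelated jitter as it is usually described draws each delay between the base and three times the previous delay, then clamps to the ceiling. The sketch below assumes the same 1s base and 60s ceiling as the full-jitter code; the function name `decorrelatedDelayMs` is hypothetical.

```typescript
// Next delay: uniform in [base, previous * 3), clamped to maxMs.
// Unlike full jitter, the result depends on the previous delay, not the attempt number.
function decorrelatedDelayMs(
  previousMs: number,
  baseMs: number = 1_000,
  maxMs: number = 60_000,
  random: () => number = Math.random,
): number {
  const upper = Math.max(baseMs, previousMs * 3);
  return Math.min(maxMs, baseMs + random() * (upper - baseMs));
}

// Usage: start from the base and feed each result into the next call.
let delayMs = 1_000;
const schedule: number[] = [];
for (let attempt = 0; attempt < 5; attempt++) {
  delayMs = decorrelatedDelayMs(delayMs);
  schedule.push(Math.round(delayMs));
}
console.log(schedule);
```

Note that this variant never waits less than the base delay, whereas full jitter can return values close to zero.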
Step 3 — Cap the maximum delay
Set MAX_MS to a value that balances recovery time against control-plane protection. A 60-second ceiling means each replica can lag up to a minute behind the control plane's recovery in the worst case, on top of the outage itself; dropping the ceiling to 30s halves that lag but doubles the reconnection request rate at the high-attempt end. Document the ceiling in your runbook so on-call engineers know the worst-case propagation window for a kill switch.
# flagd-provider-config.yaml
reconnect:
  base_delay_ms: 1000
  max_delay_ms: 60000
  max_attempts: 10
  jitter: full # "full" | "equal" | "none"
If the provider config exposes these as environment variables, set them at the pod level so they are visible to operators without a code change.
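If you implement the loop yourself, the same settings can be read from the environment. The sketch below is one way to do that with validation and defaults; the variable names (`FLAG_RECONNECT_BASE_DELAY_MS` and so on) are hypothetical and not taken from any provider's documentation.

```typescript
interface ReconnectConfig {
  baseDelayMs: number;
  maxDelayMs: number;
  maxAttempts: number;
}

const RECONNECT_DEFAULTS: ReconnectConfig = { baseDelayMs: 1_000, maxDelayMs: 60_000, maxAttempts: 10 };

// Parses a positive integer, falling back when the variable is unset or malformed.
function positiveIntOr(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === '') return fallback;
  const value = Number(raw);
  return isFinite(value) && Math.floor(value) === value && value > 0 ? value : fallback;
}

// Pass process.env in a Node.js service; a plain object works in tests.
function loadReconnectConfig(env: { [key: string]: string | undefined }): ReconnectConfig {
  const config: ReconnectConfig = {
    baseDelayMs: positiveIntOr(env['FLAG_RECONNECT_BASE_DELAY_MS'], RECONNECT_DEFAULTS.baseDelayMs),
    maxDelayMs: positiveIntOr(env['FLAG_RECONNECT_MAX_DELAY_MS'], RECONNECT_DEFAULTS.maxDelayMs),
    maxAttempts: positiveIntOr(env['FLAG_RECONNECT_MAX_ATTEMPTS'], RECONNECT_DEFAULTS.maxAttempts),
  };
  if (config.baseDelayMs > config.maxDelayMs) {
    throw new Error('base delay must not exceed max delay');
  }
  return config;
}
```

Falling back to defaults on malformed input keeps a typo in a pod spec from disabling reconnection, while the base-versus-max check fails fast on a configuration that cannot be right.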
Step 4 — Resync the full flag state on reconnect
After a successful reconnect, pull the full rule set from the control plane rather than only applying the most recent delta. During the disconnection window your replica may have missed multiple flag changes; applying only the latest event leaves the rule set permanently inconsistent.
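// `provider` is the instance registered via OpenFeature.setProvider()
let wasStale = false;
OpenFeature.addHandler(ProviderEvents.Stale, () => { wasStale = true; });
OpenFeature.addHandler(ProviderEvents.Ready, async () => {
  metrics.gauge('flag.sdk.connected', 1);
  if (!wasStale) return; // first boot, or a Ready event triggered by the resync itself
  wasStale = false;
  // Full resync: re-download and compile the complete rule set
  await provider.initialize(OpenFeature.getContext());
  metrics.increment('flag.sdk.resync');
});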
Cross-check against distributed caching for flag evaluations: if you have a second-level shared cache, invalidate or refresh the relevant partition on resync so the shared layer does not serve flags that the local layer already knows are stale.
Step 5 — Surface a connection metric for alerting
Export the connection state as a numeric gauge (1 = connected, 0 = disconnected) and alert when it stays at zero for longer than two maximum backoff intervals (2 × MAX_MS, or 120s with the values above). This surfaces silent connection failures that would otherwise go unnoticed until an on-call engineer notices wrong variants in traces.
// health.ts: expose for Prometheus scrape
import { OpenFeature, ProviderEvents } from '@openfeature/server-sdk';
import { register, Gauge } from 'prom-client';

const sdkConnected = new Gauge({
  name: 'flag_sdk_connected',
  help: '1 if the flag provider is connected to the control plane, 0 otherwise',
  labelNames: ['provider'],
});

OpenFeature.addHandler(ProviderEvents.Ready, () => sdkConnected.set({ provider: 'flagd' }, 1));
OpenFeature.addHandler(ProviderEvents.Stale, () => sdkConnected.set({ provider: 'flagd' }, 0));
OpenFeature.addHandler(ProviderEvents.Error, () => sdkConnected.set({ provider: 'flagd' }, 0));
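The alert condition itself would normally live in your alerting system, but the logic is simple enough to sketch in process. The helper below assumes "two backoff intervals" means twice the delay ceiling; that reading, and the name `shouldAlert`, are assumptions for illustration.

```typescript
// Returns true when the replica has been disconnected for longer than
// two delay ceilings. `disconnectedSinceMs` is null while connected.
function shouldAlert(
  disconnectedSinceMs: number | null,
  nowMs: number,
  maxDelayMs: number = 60_000,
): boolean {
  if (disconnectedSinceMs === null) return false;
  return nowMs - disconnectedSinceMs > 2 * maxDelayMs;
}

// Example: disconnected for 3 minutes with a 60s ceiling.
console.log(shouldAlert(0, 180_000)); // true
```

Tying the threshold to the ceiling means the alert window moves with the backoff configuration instead of being a separate number to keep in sync.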
Verification
Simulate a control-plane outage and confirm the backoff curve behaves as expected:
# Block egress to the control plane on a single replica
iptables -A OUTPUT -p tcp --dport 8013 -j DROP
# Watch the reconnect_attempt counter in real time
watch -n 2 'curl -s http://localhost:9090/metrics | grep flag_sdk_reconnect_attempt'
# Inspect the gaps between attempts in the log (each gap is random, but bounded by a growing cap)
docker logs --since 2m my-api-pod | grep "flag.sdk.reconnect_attempt" | \
awk '{print $1, $NF}' # timestamp + attempt number
# Lift the block and confirm a full resync fires
iptables -D OUTPUT -p tcp --dport 8013 -j DROP
watch -n 1 'curl -s http://localhost:9090/metrics | grep flag_sdk_connected'
# expect: flag_sdk_connected{provider="flagd"} 1 within MAX_MS of the block lift
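Because full jitter draws each delay at random, the gaps between logged attempts will not grow strictly; what should hold is that each gap stays under the cap for that attempt. The checker below is a sketch of that test for timestamps extracted from the log. The function name `gapsWithinCaps` and the 5s tolerance, which stands in for the time a failed connection attempt takes, are assumptions.

```typescript
// attemptTimestampsMs[i] is the time at which attempt i was logged as failed.
// The gap before attempt i is one jittered delay (below base * 2^i, capped at maxMs)
// plus however long the connection attempt itself took (covered by toleranceMs).
function gapsWithinCaps(
  attemptTimestampsMs: number[],
  baseMs: number = 1_000,
  maxMs: number = 60_000,
  toleranceMs: number = 5_000,
): boolean {
  for (let i = 1; i < attemptTimestampsMs.length; i++) {
    const gap = attemptTimestampsMs[i] - attemptTimestampsMs[i - 1];
    const cap = Math.min(maxMs, baseMs * Math.pow(2, i));
    if (gap < 0 || gap > cap + toleranceMs) return false;
  }
  return true;
}

// Example: gaps of 1.5s, 3s, and 7s fit under the caps of 2s, 4s, and 8s.
console.log(gapsWithinCaps([0, 1_500, 4_500, 11_500])); // true
```

A failure here usually points to a retry path that bypasses the backoff function, or to a connection timeout longer than the tolerance.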
Gotchas & Edge Cases
- Reconnect while serving traffic: the provider should keep serving last_known_good variants during the backoff loop. Confirm your SDK does not set the state to Error and start returning empty defaults while retrying; that would be worse than stale.
- Max attempts exhausted: if the loop hits MAX_ATTEMPTS, decide whether to crash the pod (triggering Kubernetes restart logic and a fresh connection) or log an alert and continue on stale state indefinitely. Crashing is usually safer for stateless services.
- Correlated seeds on containerized hosts: Math.random() is normally seeded from the OS entropy pool, but a deterministic or low-entropy seed produces correlated delays in containers with identical images and identical start times. Go's math/rand, for example, uses a fixed default seed before Go 1.20 unless you seed it. If you see replicas retrying at suspiciously similar intervals, switch to a hardware-entropy source (crypto.randomInt in Node.js, crypto/rand in Go) rather than seeding from the clock, which is nearly identical across replicas that start together.
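The "max attempts exhausted" decision can be made explicit in the onExhausted callback passed to the retry loop. The sketch below takes the policy and its side effects as parameters so both branches can be tested; the names `ExhaustionPolicy` and `makeOnExhausted` are hypothetical.

```typescript
type ExhaustionPolicy = 'crash' | 'stay-stale';

interface ExhaustionDeps {
  log: (message: string) => void;
  exit: (code: number) => void;
}

// Builds an onExhausted callback for the chosen policy.
function makeOnExhausted(policy: ExhaustionPolicy, deps: ExhaustionDeps): () => void {
  return () => {
    deps.log('flag SDK reconnect attempts exhausted; policy=' + policy);
    if (policy === 'crash') {
      // Non-zero exit lets the orchestrator restart the pod with a fresh connection.
      deps.exit(1);
    }
    // 'stay-stale': keep serving last-known-good values and rely on the alert.
  };
}

// In a Node.js service this might be wired as:
//   makeOnExhausted('crash', { log: console.error, exit: (code) => process.exit(code) })
```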
Troubleshooting & FAQ
The replicas are all retrying at the same time despite jitter
Full jitter in [0, cap) is uniform but still has a nonzero chance of many replicas picking small values simultaneously, especially at attempt 0 where cap is small. Add a fixed per-pod offset derived from the pod name or IP: delay = backoffDelayMs(attempt) + hashToBucket(podName, 500). This spreads the first retry attempt across a 500ms window.
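The hashToBucket helper is not defined elsewhere in this how-to; a minimal sketch follows, assuming a 32-bit FNV-1a hash over the pod name's UTF-16 code units. Any stable hash would do, as long as the same pod name always maps to the same offset.

```typescript
// Maps a stable identifier (pod name, IP) to an integer offset in [0, buckets).
function hashToBucket(key: string, buckets: number): number {
  let hash = 0x811c9dc5; // FNV-1a 32-bit offset basis
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV-1a 32-bit prime
  }
  return (hash >>> 0) % buckets;
}

// Example: a per-pod offset in milliseconds, added to the jittered delay.
const podOffsetMs = hashToBucket('my-api-pod-7d9f8', 500);
console.log(podOffsetMs);
```

The offset is deterministic per pod, so it guarantees spread between replicas even when their random draws happen to be close.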
Reconnects succeed but some flags are wrong after recovery
The provider reconnected but did not resync the full state — it applied only the last delta. Check Step 4: provider.initialize() must be called on every Ready event after a Stale transition, not only on the first boot. If the provider SDK does not expose this call, file an issue or fetch the rule-set snapshot manually via the control-plane REST API.
How do I set MAX_ATTEMPTS correctly for my SLA?
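Estimate the maximum acceptable outage duration for the control plane, then add up the per-attempt delays rather than dividing by MAX_MS. Dividing (10 * 60 / 60 = 10 attempts for a 10-minute SLA with a 60s ceiling) overestimates coverage, because early attempts have caps well below MAX_MS and full jitter averages half of each cap. With the defaults above, the ten caps are 1, 2, 4, 8, 16, 32, 60, 60, 60, and 60 seconds: 303s in the worst case and about 150s on average, excluding the time each connection attempt takes. Covering 10 minutes on average takes about 25 attempts with these values. Adjust BASE_MS, MAX_MS, and MAX_ATTEMPTS together so the last attempt lands near the end of your acceptable window.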
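To check a candidate MAX_ATTEMPTS against an outage window, it helps to sum the per-attempt caps directly. The sketch below assumes the full-jitter schedule from Step 2 (cap of base × 2^attempt, limited by the ceiling, with an average delay of half the cap) and ignores the time each connection attempt takes. The function names are hypothetical.

```typescript
// Total time spent waiting across `attempts` retries: worst case and average.
function backoffCoverageMs(
  attempts: number,
  baseMs: number = 1_000,
  maxMs: number = 60_000,
): { worstCaseMs: number; expectedMs: number } {
  let worstCaseMs = 0;
  for (let attempt = 0; attempt < attempts; attempt++) {
    worstCaseMs += Math.min(maxMs, baseMs * Math.pow(2, attempt));
  }
  // Full jitter is uniform in [0, cap), so the mean delay is half the cap.
  return { worstCaseMs, expectedMs: worstCaseMs / 2 };
}

// Smallest number of attempts whose average total wait reaches targetMs.
function attemptsToCover(targetMs: number, baseMs: number = 1_000, maxMs: number = 60_000): number {
  if (baseMs <= 0 || maxMs <= 0) throw new Error('delays must be positive');
  let attempts = 0;
  let expectedMs = 0;
  while (expectedMs < targetMs) {
    expectedMs += Math.min(maxMs, baseMs * Math.pow(2, attempts)) / 2;
    attempts++;
  }
  return attempts;
}

console.log(backoffCoverageMs(10));    // { worstCaseMs: 303000, expectedMs: 151500 }
console.log(attemptsToCover(600_000)); // 25
```

With a 1s base and a 60s ceiling, ten attempts wait about two and a half minutes on average, so a ten-minute window needs a larger attempt budget or a higher ceiling.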