Polling vs Streaming Flag Synchronization

This guide is part of the Backend Evaluation & Server-Side SDKs series. Once you commit to local, in-process evaluation, the only open question is how the local rule set stays fresh: a server-side SDK either polls the control plane on an interval or holds a streaming connection that pushes changes as they happen. That single choice sets your propagation latency, your connection budget, and how fast a kill switch actually reaches every node.

Problem Framing: When the Transport Choice Matters

Transport is invisible until it isn’t. Polling on a 60-second interval means a flag flip — including an emergency rollback — can take up to a full minute to reach a given replica, and a fleet of 500 pods polling independently produces a steady drumbeat of requests against the control plane. Streaming collapses propagation to sub-second but holds a long-lived connection per process and demands disciplined reconnection logic.

This guide covers the decision and the wiring for both transports under OpenFeature. It does not cover the cache topology behind the SDK (see distributed caching for flag evaluations) or rule compilation (see optimizing rule engine performance).

Figure: Polling versus streaming propagation timelines. Polling: change applied at next tick (staleness window). Streaming: change pushed immediately (sub-second propagation).
Polling applies a change only at the next interval tick; streaming pushes it the moment the control plane records it.

Prerequisites

Core Concept & Architecture

Both transports converge on the same local state — a compiled rule set the rule engine reads in-process. They differ only in how that state is refreshed. The decision matrix:

| Dimension | Polling | Streaming (SSE) |
|---|---|---|
| Propagation latency | Up to one interval | Sub-second |
| Control-plane load | Requests × replicas ÷ interval | One open connection per replica |
| Resilience to blips | Trivially stateless | Needs reconnect + resync |
| Firewall friendliness | High (plain HTTP) | Lower (long-lived connection) |
| Best for | Large fleets, relaxed SLAs | Kill switches, fast canaries |

A robust setup is rarely pure: stream for low-latency propagation, and keep a slow background poll as a safety net that heals missed events during a reconnect gap.

Step-by-Step Implementation

Step 1 — Configure streaming as the primary transport

Point the provider at the control plane and select the streaming resolver so changes arrive over a persistent connection.

# provider-sync.yaml — flagd sync configuration
sync:
  selector: "core"
  provider: streaming          # primary: push-based updates
  uri: "grpc://flagd.internal:8013"
  poll_interval_ms: 30000       # fallback poll heals missed events
cache:
  local_ttl: 0                  # 0 = trust the stream; no independent expiry
  fallback: last_known_good

Pitfall: setting a nonzero local_ttl alongside streaming creates two sources of truth — the stream says “current” while the TTL silently expires entries. Let the stream own freshness and use the poll only as a backstop.

Step 2 — Wire reconnection with backoff and resync

A streaming connection will drop. On reconnect, do a full resync rather than assuming you only missed the latest event.

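import { OpenFeature, ProviderEvents } from '@openfeature/server-sdk';

// Assumptions: `metrics` is your own metrics client and `provider` is the
// instance you registered with OpenFeature.setProvider(). resyncFlags() is a
// hypothetical full-state pull; it is not part of the OpenFeature specification.

// Stale: the provider can no longer vouch for its cached state (e.g. stream lost).
OpenFeature.addHandler(ProviderEvents.Stale, () => metrics.increment('flag.stream.reconnect'));
OpenFeature.addHandler(ProviderEvents.Ready, async () => {
  await provider.resyncFlags();           // pull full state, not just a delta
  metrics.gauge('flag.stream.connected', 1);
});
OpenFeature.addHandler(ProviderEvents.Error, () => metrics.gauge('flag.stream.connected', 0));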

Pitfall: applying only the delta after a gap leaves a node permanently stale for any flag changed during the disconnect. Always resync the full set. See exponential backoff for SDK reconnection for the backoff curve.
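The reconnect-then-resync loop can be sketched roughly as follows. This is a minimal illustration, not any SDK's actual API: the StreamClient interface, its connect() and fetchSnapshot() methods, and the 250 ms base and 30 s cap are all assumptions chosen for the example. The delay uses exponential backoff with full jitter.

```typescript
// Hypothetical shapes for illustration; neither comes from a real SDK.
type FlagSnapshot = Record<string, string>;

interface StreamClient {
  connect(): Promise<void>;               // resolves once the stream is open
  fetchSnapshot(): Promise<FlagSnapshot>; // returns full state, not a delta
}

// Exponential backoff with full jitter:
// a uniform delay in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs: number = 250,
  capMs: number = 30000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(rand() * ceiling);
}

// Retries until the stream is open and a full snapshot has been applied.
// Returns the number of failed attempts that preceded the success.
async function reconnectAndResync(
  client: StreamClient,
  apply: (snapshot: FlagSnapshot) => void,
  maxAttempts: number = 10,
): Promise<number> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await client.connect();
      apply(await client.fetchSnapshot()); // replace local state wholesale
      return attempt;
    } catch {
      const delay = backoffDelayMs(attempt);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error(`stream did not recover after ${maxAttempts} attempts`);
}
```

Applying the snapshot wholesale, rather than merging it, is what guarantees that a flag changed or deleted during the gap cannot linger on the node.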

Step 3 — Fall back to polling where streaming is impractical

Behind strict proxies or in serverless runtimes that recycle connections, a short poll is more reliable than a stream that constantly re-establishes.

# serverless-sync.yaml — polling-only profile
sync:
  provider: polling
  uri: "https://flagd.internal/flags"
  poll_interval_ms: 5000        # tighten interval to shrink the staleness window
cache:
  local_ttl: 5s
  fallback: last_known_good

Pitfall: every replica polling on the same fixed interval synchronizes into a thundering herd. Add jitter (±20%) to the interval so requests spread across the window.
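One way to apply that jitter, if your SDK does not already do it for you, is to reschedule each poll with a freshly randomized delay. The sketch below is an assumption-laden illustration: jitteredIntervalMs and startPolling are hypothetical helper names, and poll stands in for whatever function fetches the rule set.

```typescript
// Scales the base interval by a uniform factor in [1 - spread, 1 + spread].
// With spread = 0.2 this is the ±20% jitter described above.
function jitteredIntervalMs(
  baseMs: number,
  spread: number = 0.2,
  rand: () => number = Math.random,
): number {
  const factor = 1 - spread + rand() * 2 * spread;
  return Math.round(baseMs * factor);
}

// Uses setTimeout instead of setInterval so every gap is re-jittered.
// Returns a function that stops the loop.
function startPolling(poll: () => Promise<void>, baseMs: number): () => void {
  let stopped = false;
  let timer: ReturnType<typeof setTimeout> | undefined;

  const tick = async (): Promise<void> => {
    if (stopped) return;
    try {
      await poll();
    } catch {
      // A failed poll leaves the last known good rule set in place.
    }
    if (!stopped) timer = setTimeout(tick, jitteredIntervalMs(baseMs));
  };

  timer = setTimeout(tick, jitteredIntervalMs(baseMs));
  return () => {
    stopped = true;
    if (timer !== undefined) clearTimeout(timer);
  };
}
```

With a 5000 ms base interval, each gap lands somewhere between 4000 ms and 6000 ms, so replicas that started together drift apart within a few cycles.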

Verification & Testing

Prove propagation latency rather than assuming it. Flip a canary flag and measure the time until every replica reports the new variant.

# Flip the flag, then poll each replica's evaluation endpoint until consistent
flagctl set web.dashboard.new-nav --variant on
for host in $(cat replicas.txt); do
  curl -s "$host/debug/flags/web.dashboard.new-nav" | jq -r '.variant'
done | sort -u   # expect a single line "on" within your latency budget

For streaming, assert reconnect behavior by killing the connection (block the port for 5s) and confirming a full resync fires on recovery.
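To turn the check into a number you can compare against a budget, a small harness can time how long the fleet takes to converge. The following is only a sketch: VariantReader, allConverged, and measurePropagationMs are hypothetical names, and the reader you pass in is assumed to query each replica's debug endpoint in whatever way suits your environment.

```typescript
// Hypothetical reader: returns one replica's current variant for a flag.
type VariantReader = (host: string, flagKey: string) => Promise<string>;

// True only when every replica reports the expected variant.
function allConverged(variants: string[], expected: string): boolean {
  return variants.length > 0 && variants.every((v) => v === expected);
}

// Polls all replicas until they agree and returns the elapsed milliseconds.
// Throws if the fleet has not converged within budgetMs.
async function measurePropagationMs(
  hosts: string[],
  flagKey: string,
  expected: string,
  budgetMs: number,
  read: VariantReader,
  pauseMs: number = 200,
): Promise<number> {
  const start = Date.now();
  while (Date.now() - start <= budgetMs) {
    // A replica that errors counts as not yet converged.
    const variants = await Promise.all(
      hosts.map((host) => read(host, flagKey).catch(() => "")),
    );
    if (allConverged(variants, expected)) return Date.now() - start;
    await new Promise<void>((resolve) => setTimeout(resolve, pauseMs));
  }
  throw new Error(`not converged within ${budgetMs} ms`);
}
```

Start the timer immediately after flipping the flag. The measured value has a resolution of roughly pauseMs, so choose a pause well below the latency you are trying to prove.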

Troubleshooting & FAQ

Why is a flag change not reaching some replicas?

On streaming, the most likely cause is that those replicas lost the connection and either never reconnected or resynced from a stale snapshot; check the flag.stream.connected gauge on each one. On polling, either the next tick has not arrived yet or the replica’s poll is failing silently; verify the poll response status.

How small can I make the polling interval?

Small enough to meet your propagation budget, large enough that replicas ÷ interval stays within the control plane’s request budget. Below ~1s, the request volume usually justifies switching to streaming instead.

Do I still need a cache if I’m streaming?

Yes — the local rule set the stream maintains is the cache. The stream keeps it fresh; the last_known_good fallback keeps evaluation working through a control-plane outage. Caching strategy is covered in distributed caching for flag evaluations.

Performance & Scale Considerations

Budget propagation as part of your rollout SLA, not as an afterthought. For a fleet of N replicas, polling costs roughly N ÷ interval requests per second against the control plane; streaming costs N persistent connections plus a burst of resyncs whenever the control plane redeploys. Above a few hundred replicas, prefer streaming with a generous fallback poll and stagger replica restarts so resyncs don’t arrive as one spike. Keep the evaluation path itself local so transport hiccups never add latency to a request — that separation is the whole point of backend evaluation.
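As a rough worked example of the N ÷ interval estimate, the sketch below computes poll load and the smallest interval that fits a request budget. The function names and the 50 requests-per-second budget are illustrative assumptions; the 500-replica fleet and the 60 s and 5 s intervals are the figures used earlier in this guide.

```typescript
// Steady-state poll load on the control plane, in requests per second,
// assuming each replica issues one request per interval.
function pollRequestsPerSecond(replicas: number, intervalSeconds: number): number {
  return replicas / intervalSeconds;
}

// Smallest poll interval (seconds) that keeps the fleet within a request budget.
function minPollIntervalSeconds(replicas: number, budgetRps: number): number {
  return replicas / budgetRps;
}

const fleet = 500;
const loadAt60s = pollRequestsPerSecond(fleet, 60); // relaxed interval
const loadAt5s = pollRequestsPerSecond(fleet, 5);   // tightened interval
const floorFor50Rps = minPollIntervalSeconds(fleet, 50); // hypothetical budget

console.log(loadAt60s.toFixed(1), loadAt5s, floorFor50Rps);
```

For these inputs, 500 replicas polling every 60 s produce about 8.3 requests per second, the same fleet at 5 s produces 100, and a 50 requests-per-second budget implies an interval of at least 10 s.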