Polling vs Streaming Flag Synchronization

This guide is part of the Backend Evaluation & Server-Side SDKs series. Once you commit to local, in-process evaluation, the only open question is how the local rule set stays fresh: a server-side SDK either polls the control plane on an interval or holds a streaming connection that pushes changes as they happen. That single choice sets your propagation latency, your connection budget, and how fast a kill switch actually reaches every node.

Problem Framing: When the Transport Choice Matters

Transport is invisible until it isn’t. Polling on a 60-second interval means a flag flip, including an emergency rollback, can take up to a full minute to reach a given replica, and a fleet of 500 pods polling independently sends roughly 8 requests per second (500 ÷ 60) to the control plane around the clock. Streaming cuts propagation to sub-second but holds a long-lived connection per process and demands disciplined reconnection logic.

This guide covers the decision and the wiring for both transports under OpenFeature. It does not cover the cache topology behind the SDK (see distributed caching for flag evaluations) or rule compilation (see optimizing rule engine performance).

Polling applies a change only at the next interval tick; streaming pushes it the moment the control plane records it.

Prerequisites

OpenFeature server SDK ≥ 1.x with a provider that supports both transports (flagd does)
Network egress from every service replica to the control plane endpoint
A health/readiness probe wired to the provider’s connection state
Flag metadata fields owner and expiry populated per your flag taxonomy
An agreed propagation-latency budget (e.g. “rollback visible within 2s”)

Core Concept & Architecture

Both transports converge on the same local state — a compiled rule set the rule engine reads in-process. They differ only in how that state is refreshed. The decision matrix:

Dimension	Polling	Streaming (SSE)
Propagation latency	Up to one interval	Sub-second
Control-plane load	Replicas ÷ interval requests per second	One open connection per replica
Resilience to blips	Trivially stateless	Needs reconnect + resync
Firewall friendliness	High (plain HTTP)	Lower (long-lived connection)
Best for	Large fleets, relaxed SLAs	Kill switches, fast canaries

A robust setup is rarely pure: stream for low-latency propagation, and keep a slow background poll as a safety net that heals missed events during a reconnect gap.
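
One way to let the stream and the fallback poll feed the same local state is to guard every update with a version check. The sketch below assumes the control plane stamps each full rule set with a monotonically increasing `version`; that field, the `FlagSnapshot` shape, and the `LocalRuleSet` class are illustrative names, not part of OpenFeature or flagd.

```typescript
// Hypothetical shape: a full rule set stamped with a monotonically
// increasing version by the control plane (an assumption of this sketch).
interface FlagSnapshot {
  version: number;
  flags: Record<string, string>; // flag key -> variant
}

class LocalRuleSet {
  private current: FlagSnapshot = { version: 0, flags: {} };

  // Called by both the stream handler and the fallback poll.
  // Returns true only when the snapshot is newer than local state.
  apply(snapshot: FlagSnapshot): boolean {
    if (snapshot.version <= this.current.version) return false;
    this.current = snapshot;
    return true;
  }

  variant(key: string, fallback: string): string {
    const value = this.current.flags[key];
    return value === undefined ? fallback : value;
  }

  get version(): number {
    return this.current.version;
  }
}

const ruleSet = new LocalRuleSet();

// The stream delivers version 1.
const fromStream = ruleSet.apply({ version: 1, flags: { 'web.dashboard.new-nav': 'off' } });

// Version 2 is missed during a reconnect gap; the fallback poll heals it.
const fromPoll = ruleSet.apply({ version: 2, flags: { 'web.dashboard.new-nav': 'on' } });

// A late, stale stream message is ignored instead of rolling state back.
const fromLateStream = ruleSet.apply({ version: 1, flags: { 'web.dashboard.new-nav': 'off' } });
```

Because both transports go through `apply`, it does not matter which one delivers a given version first, and an out-of-order message cannot overwrite newer state.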

Step-by-Step Implementation

Step 1 — Configure streaming as the primary transport

Point the provider at the control plane and select the streaming resolver so changes arrive over a persistent connection.

# provider-sync.yaml — flagd sync configuration
sync:
  selector: "core"
  provider: streaming          # primary: push-based updates
  uri: "grpc://flagd.internal:8013"
  poll_interval_ms: 30000       # fallback poll heals missed events
cache:
  local_ttl: 0                  # 0 = trust the stream; no independent expiry
  fallback: last_known_good

Pitfall: setting a nonzero local_ttl alongside streaming creates two sources of truth — the stream says “current” while the TTL silently expires entries. Let the stream own freshness and use the poll only as a backstop.

Step 2 — Wire reconnection with backoff and resync

A streaming connection will drop. On reconnect, do a full resync rather than assuming you only missed the latest event.

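import { OpenFeature } from '@openfeature/server-sdk';

// Assumes `provider` is the instance registered via OpenFeature.setProvider()
// and `metrics` is your metrics client. The event names and resyncFlags() are
// provider-specific stand-ins; check your provider's API for the equivalents.
provider.on('reconnecting', () => metrics.increment('flag.stream.reconnect'));
provider.on('ready', async () => {
  try {
    await provider.resyncFlags();           // pull full state, not just a delta
    metrics.gauge('flag.stream.connected', 1);
  } catch {
    metrics.gauge('flag.stream.connected', 0); // resync failed: state may be stale
  }
});
provider.on('error', () => metrics.gauge('flag.stream.connected', 0));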

Pitfall: applying only the delta after a gap leaves a node permanently stale for any flag changed during the disconnect. Always resync the full set. See exponential backoff for SDK reconnection for the backoff curve.
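
The staleness described in this pitfall can be reproduced with a tiny in-memory model. This is a sketch under stated assumptions: `FlagState`, `FlagDelta`, `applyDelta`, and `fullResync` are hypothetical helpers, and `checkout.kill-switch` is an invented flag key used only for illustration.

```typescript
type FlagState = Record<string, string>; // flag key -> variant

interface FlagDelta {
  key: string;
  variant: string;
}

// Delta path: correct only if no event was ever missed.
function applyDelta(state: FlagState, delta: FlagDelta): FlagState {
  return { ...state, [delta.key]: delta.variant };
}

// Resync path: discard local state and adopt the control plane's full set.
function fullResync(authoritative: FlagState): FlagState {
  return { ...authoritative };
}

// Both flags start "off" and the replica is in sync.
let controlPlane: FlagState = {
  'checkout.kill-switch': 'off',
  'web.dashboard.new-nav': 'off',
};
const replica: FlagState = fullResync(controlPlane);

// The replica disconnects; the kill switch flips while it is offline,
// so the replica never sees this event.
controlPlane = applyDelta(controlPlane, { key: 'checkout.kill-switch', variant: 'on' });

// The replica reconnects and receives only the next delta.
const nextDelta: FlagDelta = { key: 'web.dashboard.new-nav', variant: 'on' };
controlPlane = applyDelta(controlPlane, nextDelta);
const deltaOnly = applyDelta(replica, nextDelta); // kill switch is still "off"

// A full resync on reconnect repairs the missed change.
const resynced = fullResync(controlPlane); // kill switch is "on"
```

The delta-only replica stays wrong until someone changes that flag again, which for a kill switch may be never; the resynced replica is correct immediately.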

Step 3 — Fall back to polling where streaming is impractical

Behind strict proxies or in serverless runtimes that recycle connections, a short poll is more reliable than a stream that constantly re-establishes.

# serverless-sync.yaml — polling-only profile
sync:
  provider: polling
  uri: "https://flagd.internal/flags"
  poll_interval_ms: 5000        # tighten interval to shrink the staleness window
cache:
  local_ttl: 5s
  fallback: last_known_good

Pitfall: every replica polling on the same fixed interval synchronizes into a thundering herd. Add jitter (±20%) to the interval so requests spread across the window.
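
If your provider does not apply jitter itself, a wrapper along these lines spreads polls across the window. It is a minimal sketch: `jitteredIntervalMs` and `schedulePoll` are hypothetical helpers, and `poll` stands in for whatever function fetches the flag set.

```typescript
// Returns baseMs shifted by a uniform random offset within ±jitterRatio.
// `random` is injectable so the function can be tested deterministically.
function jitteredIntervalMs(
  baseMs: number,
  jitterRatio: number = 0.2,
  random: () => number = Math.random,
): number {
  const offset = (random() * 2 - 1) * jitterRatio; // in [-jitterRatio, +jitterRatio)
  return Math.round(baseMs * (1 + offset));
}

// Re-arms a timeout after each poll so every tick draws fresh jitter.
// Returns a function that stops the loop.
function schedulePoll(poll: () => Promise<void>, baseMs: number): () => void {
  let stopped = false;
  let timer: ReturnType<typeof setTimeout> | undefined;

  const tick = (): void => {
    if (stopped) return;
    timer = setTimeout(() => {
      // Swallow poll errors here so one failure does not end the loop;
      // real code should log or count them.
      poll().catch(() => undefined).then(tick);
    }, jitteredIntervalMs(baseMs));
  };

  tick();
  return () => {
    stopped = true;
    if (timer !== undefined) clearTimeout(timer);
  };
}

const shortest = jitteredIntervalMs(5000, 0.2, () => 0);   // 4000
const midpoint = jitteredIntervalMs(5000, 0.2, () => 0.5); // 5000
```

Using `setTimeout` per tick rather than a single `setInterval` matters: a fixed interval with a one-time random start still lets replicas that restarted together stay aligned, while per-tick jitter keeps them drifting apart.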

Verification & Testing

Prove propagation latency rather than assuming it. Flip a canary flag and measure the time until every replica reports the new variant.

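# Flip the flag, then poll each replica's evaluation endpoint until consistent
flagctl set web.dashboard.new-nav --variant on
start=$(date +%s)
until [ "$(for host in $(cat replicas.txt); do
    curl -s "$host/debug/flags/web.dashboard.new-nav" | jq -r '.variant'
  done | sort -u)" = "on" ]; do
  sleep 0.2   # keeps looping while any replica reports another variant or fails
done
echo "converged in $(( $(date +%s) - start ))s"   # whole seconds; compare to your budget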

For streaming, assert reconnect behavior by killing the connection (block the port for 5s) and confirming a full resync fires on recovery.
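
If you record timestamped readings per replica while sampling, the fleet-wide propagation latency is the slowest replica's first sighting of the new variant. The sketch below shows that calculation; `Observation` and `propagationLatencyMs` are hypothetical names, and the timestamps are made-up sample data.

```typescript
interface Observation {
  host: string;
  atMs: number;    // when the reading was taken
  variant: string; // what the replica reported
}

// Returns the worst-case time from the flip until a replica first reported
// the expected variant, or null if any replica never did.
function propagationLatencyMs(
  flipAtMs: number,
  hosts: string[],
  observations: Observation[],
  expected: string,
): number | null {
  let worst = 0;
  for (const host of hosts) {
    const sightings = observations
      .filter((o) => o.host === host && o.variant === expected && o.atMs >= flipAtMs)
      .map((o) => o.atMs)
      .sort((a, b) => a - b);
    if (sightings.length === 0) return null; // this replica never converged
    worst = Math.max(worst, sightings[0] - flipAtMs);
  }
  return worst;
}

// Made-up sample: flag flipped at t=1000 ms.
const samples: Observation[] = [
  { host: 'replica-a', atMs: 1200, variant: 'on' },
  { host: 'replica-b', atMs: 1200, variant: 'off' },
  { host: 'replica-b', atMs: 1800, variant: 'on' },
];

const latency = propagationLatencyMs(1000, ['replica-a', 'replica-b'], samples, 'on'); // 800
const incomplete = propagationLatencyMs(1000, ['replica-a', 'replica-c'], samples, 'on'); // null
```

The result is only as precise as the sampling interval, so sample several times faster than the budget you are trying to prove.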

Troubleshooting & FAQ

Why is a flag change not reaching some replicas?

On streaming, those replicas almost certainly lost the connection and resynced from a stale snapshot, or never reconnected — check your flag.stream.connected gauge. On polling, the change simply hasn’t reached the next tick yet, or the replica’s poll is failing silently; verify the poll response status.

How small can I make the polling interval?

Small enough to meet your propagation budget, large enough that replicas ÷ interval stays within the control plane’s request budget. Below ~1s, the request volume usually justifies switching to streaming instead.

Do I still need a cache if I’m streaming?

Yes — the local rule set the stream maintains is the cache. The stream keeps it fresh; the last_known_good fallback keeps evaluation working through a control-plane outage. Caching strategy is covered in distributed caching for flag evaluations.

Performance & Scale Considerations

Budget propagation as part of your rollout SLA, not as an afterthought. For a fleet of N replicas, polling costs roughly N ÷ interval requests per second against the control plane; streaming costs N persistent connections plus a burst of resyncs whenever the control plane redeploys. Above a few hundred replicas with a tight propagation budget, prefer streaming with a generous fallback poll, and stagger replica restarts so resyncs don’t arrive as one spike. Keep the evaluation path itself local so transport hiccups never add latency to a request; that separation is the whole point of backend evaluation.
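
The polling arithmetic is simple enough to check before you pick an interval. This sketch works through the 500-replica fleet used earlier; the function names and the 50 requests-per-second budget are assumptions chosen for illustration.

```typescript
// Steady-state polling load on the control plane: N ÷ interval.
function pollingRequestsPerSecond(replicas: number, intervalSeconds: number): number {
  if (replicas < 0 || intervalSeconds <= 0) {
    throw new RangeError('replicas must be >= 0 and intervalSeconds must be > 0');
  }
  return replicas / intervalSeconds;
}

// Smallest interval that keeps the fleet within a request budget.
function minPollIntervalSeconds(replicas: number, requestBudgetPerSecond: number): number {
  if (replicas < 0 || requestBudgetPerSecond <= 0) {
    throw new RangeError('replicas must be >= 0 and requestBudgetPerSecond must be > 0');
  }
  return replicas / requestBudgetPerSecond;
}

const loadAt60s = pollingRequestsPerSecond(500, 60); // about 8.33 requests/s
const loadAt5s = pollingRequestsPerSecond(500, 5);   // 100 requests/s
const floorSeconds = minPollIntervalSeconds(500, 50); // 10 s under a 50 req/s budget
```

Jitter changes when requests arrive, not how many there are, so these averages hold with or without it. If the floor this gives you is longer than your propagation budget, that gap is the signal to stream instead.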