cloudflare/pingora

Health checks and discovery

The pingora-load-balancing crate provides backend health checking and a small service-discovery interface. Both plug into a LoadBalancer<S>, where S is the selection strategy.

The user guide entry is docs/user_guide/failover.md.

How they fit together

graph TD
    discover[Discovery::discover<br/>periodic or static]
    update[LoadBalancer::update]
    backends[Backends — full set]
    hc[HealthCheck task<br/>per backend]
    healthy[Healthy subset]
    pick[LoadBalancer::select]
    proxy[Proxy::upstream_peer]

    discover --> update
    update --> backends
    backends --> hc
    hc --> healthy
    healthy --> pick
    pick --> proxy

Both health-check and discovery loops run inside a BackgroundService (pingora-load-balancing/src/background.rs). The user wraps a LoadBalancer with background_service("name", lb) and adds the resulting service to the Server.

Health checks

Two implementations live in pingora-load-balancing/src/health_check.rs:

| Type | Behavior |
| --- | --- |
| `TcpHealthCheck` | Opens a TCP connection. If TLS settings are configured, also completes the TLS handshake. |
| `HttpHealthCheck` | Issues an HTTP request and asserts on the response status code(s). |

Configuration:

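use pingora_load_balancing::{health_check::TcpHealthCheck, selection::RoundRobin, LoadBalancer};
use std::time::Duration;

// Round-robin balancer over two static backends.
let mut upstreams = LoadBalancer::<RoundRobin>::try_from_iter(["1.1.1.1:443", "1.0.0.1:443"])?;
let hc = TcpHealthCheck::new();
upstreams.set_health_check(hc);
upstreams.health_check_frequency = Some(Duration::from_secs(1)); // probe once per second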

Backends start "presumed healthy." Each tick, the health-check task probes every backend and updates the healthy bitmap. Selection (LoadBalancer::select) walks the backend ring/list and skips the unhealthy ones.
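To make the "skip the unhealthy ones" behavior concrete, here is a minimal std-only sketch of a round-robin picker that passes over backends marked unhealthy. `RoundRobinPicker` and its string-keyed unhealthy set are hypothetical stand-ins for illustration, not pingora's selection code or its health representation.

```rust
use std::collections::HashSet;

/// Hypothetical round-robin picker; not pingora's selection code.
pub struct RoundRobinPicker {
    backends: Vec<String>,
    next: usize,
}

impl RoundRobinPicker {
    pub fn new(backends: Vec<String>) -> Self {
        RoundRobinPicker { backends, next: 0 }
    }

    /// Return the next backend that is not in `unhealthy`,
    /// trying each backend at most once per call.
    pub fn select(&mut self, unhealthy: &HashSet<String>) -> Option<&str> {
        let n = self.backends.len();
        for _ in 0..n {
            let idx = self.next % n;
            // Advance the cursor whether or not this backend is usable.
            self.next = (self.next + 1) % n;
            if !unhealthy.contains(&self.backends[idx]) {
                return Some(self.backends[idx].as_str());
            }
        }
        // Every backend is unhealthy, or the list is empty.
        None
    }
}
```

With backends a, b, c and b marked unhealthy, successive calls in this sketch yield a, c, a. When every backend is unhealthy the caller gets `None` and has to decide how to fail the request.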

Two thresholds are tunable on the HealthCheck impl:

- consecutive_success: how many successful probes in a row flip an unhealthy backend back to healthy
- consecutive_failure: how many failed probes in a row flip a healthy backend to unhealthy
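The two thresholds amount to a small hysteresis state machine per backend. The sketch below illustrates that idea under stated assumptions: `HealthState` and `observe` are hypothetical names, and the streak is assumed to reset whenever a probe agrees with the current status. The crate's actual bookkeeping may differ.

```rust
/// Hypothetical per-backend health state; not pingora's internal type.
#[derive(Debug)]
pub struct HealthState {
    healthy: bool,
    /// Consecutive probe results that disagree with `healthy`.
    streak: usize,
}

impl HealthState {
    /// Backends start presumed healthy.
    pub fn new() -> Self {
        HealthState { healthy: true, streak: 0 }
    }

    pub fn is_healthy(&self) -> bool {
        self.healthy
    }

    /// Record one probe result. Returns true if the status flipped.
    pub fn observe(
        &mut self,
        probe_ok: bool,
        consecutive_success: usize,
        consecutive_failure: usize,
    ) -> bool {
        if probe_ok == self.healthy {
            // The probe agrees with the current status: reset the streak.
            self.streak = 0;
            return false;
        }
        self.streak += 1;
        // Recovering uses the success threshold, degrading uses the failure threshold.
        let threshold = if probe_ok { consecutive_success } else { consecutive_failure };
        if self.streak >= threshold {
            self.healthy = probe_ok;
            self.streak = 0;
            true
        } else {
            false
        }
    }
}
```

With `consecutive_failure = 3`, two failed probes followed by a success leave the backend healthy. Only three failures in a row take it out of rotation, which keeps a single dropped probe from causing flapping.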

The Mar 2026 commit b370102 added discovery_duration and build_duration to the LoadBalancer::update path so operators can see how long the discovery and build steps of an update each take. Both are exposed through the standard Prometheus integration.
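As an illustration of what timing the two phases separately means, the following std-only sketch wraps a discovery closure and a build closure with `Instant`. `UpdateTimings`, `timed`, and `timed_update` are hypothetical names. Only the two field names come from the text above, and how the crate records or exports the values is not shown here.

```rust
use std::time::{Duration, Instant};

/// Hypothetical container; the field names mirror the metric names in the text.
#[derive(Debug)]
pub struct UpdateTimings {
    pub discovery_duration: Duration,
    pub build_duration: Duration,
}

/// Run `f` and report how long it took.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

/// Time the discovery phase and the build phase of an update separately.
pub fn timed_update<D, B, X, Y>(discover: D, build: B) -> (Y, UpdateTimings)
where
    D: FnOnce() -> X,
    B: FnOnce(X) -> Y,
{
    let (found, discovery_duration) = timed(discover);
    let (built, build_duration) = timed(|| build(found));
    (built, UpdateTimings { discovery_duration, build_duration })
}
```

Splitting the measurement this way shows whether a slow update comes from the discovery source (for example a slow DNS or API call) or from rebuilding the selection structure.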

Discovery

Discovery is the ServiceDiscovery trait, which has one method:

#[async_trait]
pub trait ServiceDiscovery {
    async fn discover(&self) -> Result<(BTreeSet<Backend>, HashMap<u64, bool>)>;
}

The default is Static, which returns the same set every time. Custom implementations might:

- Resolve DNS SRV records
- Query a Kubernetes API for endpoints
- Read a config file that's edited at runtime
- Subscribe to a service-mesh control plane
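As a sketch of the config-file option, the function below covers only the parsing half: it turns file contents into an ordered, de-duplicated set of addresses. The one-address-per-line format with `#` comments and the name `parse_backends` are assumptions made for illustration. Converting the addresses into Backend values and returning them from discover() is left out.

```rust
use std::collections::BTreeSet;
use std::net::{AddrParseError, SocketAddr};

/// Parse one "ip:port" per line. Blank lines and lines starting with '#'
/// are ignored. Duplicates collapse because the result is a set.
pub fn parse_backends(text: &str) -> Result<BTreeSet<SocketAddr>, AddrParseError> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .map(|line| line.parse::<SocketAddr>())
        .collect() // stops at the first line that fails to parse
}
```

Returning an error on a malformed line, instead of silently dropping it, lets the caller keep the previous backend set when the file is edited badly.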

The discovery loop polls the trait at a configurable interval (update_frequency on the load balancer). Each discover() call produces a fresh BTreeSet<Backend>, plus a HashMap<u64, bool> that can optionally mark individual backends as enabled or disabled. The load balancer diffs the old set against the new one and updates the ring or list accordingly.
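Diffing two ordered sets can be sketched with the standard library's set operations. Here backends are plain strings, and `BackendDiff`, `diff_backends`, and `needs_rebuild` are hypothetical names. The crate's own comparison may be implemented differently.

```rust
use std::collections::BTreeSet;

/// Hypothetical diff result; pingora's internals may differ.
#[derive(Debug, PartialEq)]
pub struct BackendDiff {
    pub added: Vec<String>,
    pub removed: Vec<String>,
}

/// Compare the previous discovery result with the fresh one.
pub fn diff_backends(old: &BTreeSet<String>, new: &BTreeSet<String>) -> BackendDiff {
    BackendDiff {
        // Present now, absent before.
        added: new.difference(old).cloned().collect(),
        // Present before, absent now.
        removed: old.difference(new).cloned().collect(),
    }
}

/// The ring or list only needs rebuilding when membership changed.
pub fn needs_rebuild(diff: &BackendDiff) -> bool {
    !diff.added.is_empty() || !diff.removed.is_empty()
}
```

Because a `BTreeSet` iterates in sorted order, two discovery results with the same members compare equal regardless of the order in which the source returned them, so an unchanged poll yields an empty diff.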

Failover semantics

The proxy retries a failed upstream connection only if the error's retry flag is true. The pattern around upstream_peer:

1. Get a backend from LoadBalancer::select.
2. Build a Peer from it and return it.
3. If the connection fails and fail_to_connect is called, the user can either:
   - Mark the error retryable and let the proxy call upstream_peer again (which might pick a different backend).
   - Mark it not-retryable and let the request fail.
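The control flow described above can be modeled without the proxy itself. In this std-only sketch, `select` stands in for upstream_peer, `connect` for the connection attempt, and `on_fail` for the fail_to_connect hook. `ConnectError`, `connect_with_failover`, and the `max_attempts` bound are hypothetical and do not reflect pingora's actual types or retry limits.

```rust
/// Hypothetical connect error carrying a retry flag.
#[derive(Debug, PartialEq)]
pub struct ConnectError {
    pub retry: bool,
}

/// Try up to `max_attempts` backends, re-selecting after each retryable failure.
pub fn connect_with_failover<S, C, F>(
    max_attempts: usize,
    mut select: S,
    mut connect: C,
    mut on_fail: F,
) -> Result<String, ConnectError>
where
    S: FnMut() -> Option<String>,
    C: FnMut(&str) -> Result<(), ConnectError>,
    F: FnMut(ConnectError) -> ConnectError,
{
    let mut last = ConnectError { retry: false };
    for _ in 0..max_attempts {
        // No backend available: give up with the last error seen.
        let backend = match select() {
            Some(b) => b,
            None => break,
        };
        match connect(&backend) {
            Ok(()) => return Ok(backend),
            Err(e) => {
                // The hook decides whether this failure is retryable.
                let e = on_fail(e);
                let retry = e.retry;
                last = e;
                if !retry {
                    break;
                }
            }
        }
    }
    Err(last)
}
```

If the hook sets `retry = true`, the loop asks `select` again and may land on a different backend. If the hook leaves the flag false, the first failure is returned to the caller.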

docs/user_guide/failover.md walks through this end-to-end.

Source map

| File | What |
| --- | --- |
| `pingora-load-balancing/src/lib.rs` | `LoadBalancer<S>`, `Backend`, `Backends`, top-level glue |
| `pingora-load-balancing/src/health_check.rs` | `TcpHealthCheck`, `HttpHealthCheck`, `HealthCheck` trait |
| `pingora-load-balancing/src/discovery.rs` | `ServiceDiscovery` trait, static impl |
| `pingora-load-balancing/src/background.rs` | `BackgroundService` wrapper |
| `pingora-load-balancing/src/selection/` | Selection strategies |
