cloudflare/pingora
Health checks and discovery
The pingora-load-balancing crate offers backend health checking and a small service-discovery interface. They're both plug-ins on a LoadBalancer<S>.
The user guide entry is docs/user_guide/failover.md.
How they fit together
graph TD
discover[Discovery::discover<br/>periodic or static]
update[LoadBalancer::update]
backends[Backends — full set]
hc[HealthCheck task<br/>per backend]
healthy[Healthy subset]
pick[LoadBalancer::select]
proxy[Proxy::upstream_peer]
discover --> update
update --> backends
backends --> hc
hc --> healthy
healthy --> pick
pick --> proxy

Both health-check and discovery loops run inside a BackgroundService (pingora-load-balancing/src/background.rs). The user wraps a LoadBalancer with background_service("name", lb) and adds the resulting service to the Server.
Health checks
Two implementations live in pingora-load-balancing/src/health_check.rs:
| Type | Behavior |
|---|---|
| TcpHealthCheck | Open a TCP socket. If TLS settings are configured, also complete the TLS handshake. |
| HttpHealthCheck | Issue an HTTP request and assert on response status code(s). |
Configuration:
let mut upstreams = LoadBalancer::try_from_iter(["1.1.1.1:443", "1.0.0.1:443"])?;
let hc = TcpHealthCheck::new();
upstreams.set_health_check(hc);
upstreams.health_check_frequency = Some(Duration::from_secs(1));

Backends start "presumed healthy." Each tick, the health-check task probes every backend and updates the healthy bitmap. Selection (LoadBalancer::select) walks the backend ring/list and skips the unhealthy ones.
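The skip-the-unhealthy walk described above can be illustrated with a minimal, std-only sketch. It is not pingora's selection code: `Addr` and `select_healthy` are hypothetical names, and a plain `HashMap` stands in for whatever health state the load balancer keeps internally.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a backend entry; not pingora's `Backend`.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Addr(pub &'static str);

/// Walk the backend list once, starting at index `start` and wrapping around,
/// and return the first backend that is not marked unhealthy.
/// Backends missing from `health` are presumed healthy.
pub fn select_healthy<'a>(
    backends: &'a [Addr],
    health: &HashMap<Addr, bool>,
    start: usize,
) -> Option<&'a Addr> {
    if backends.is_empty() {
        return None;
    }
    let n = backends.len();
    backends
        .iter()
        .cycle()
        .skip(start % n)
        .take(n) // visit each backend at most once
        .find(|b| *health.get(*b).unwrap_or(&true))
}
```

With backends a, b, c and a marked unhealthy, a walk starting at index 0 returns b; if every backend is marked unhealthy, the function returns `None` and the caller has to decide how to fail the request.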
Two thresholds are tunable on the HealthCheck impl:
- consecutive_success: how many successful probes flip an unhealthy backend back to healthy
- consecutive_failure: how many failed probes flip a healthy backend to unhealthy
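To make the two thresholds concrete, here is a small std-only model of the flip logic. It is a sketch under assumptions, not the crate's implementation: `HealthState` and `observe` are hypothetical names, and the sketch assumes that a probe agreeing with the current state resets the streak.

```rust
/// Hypothetical model of one backend's health state; not pingora's own type.
#[derive(Debug)]
pub struct HealthState {
    healthy: bool,
    /// Number of consecutive probes that disagreed with `healthy`.
    streak: usize,
}

impl HealthState {
    /// Backends start presumed healthy.
    pub fn new() -> Self {
        HealthState { healthy: true, streak: 0 }
    }

    pub fn is_healthy(&self) -> bool {
        self.healthy
    }

    /// Record one probe result. Returns true if the state flipped.
    pub fn observe(
        &mut self,
        probe_ok: bool,
        consecutive_success: usize,
        consecutive_failure: usize,
    ) -> bool {
        if probe_ok == self.healthy {
            // The probe agrees with the current state: reset the streak.
            self.streak = 0;
            return false;
        }
        self.streak += 1;
        // Pick the threshold for the direction we would flip in.
        let threshold = if probe_ok { consecutive_success } else { consecutive_failure };
        if self.streak >= threshold {
            self.healthy = probe_ok;
            self.streak = 0;
            true
        } else {
            false
        }
    }
}
```

With consecutive_failure set to 3, a healthy backend survives two failed probes and flips on the third. Setting both thresholds above 1 keeps a single flaky probe from moving a backend in or out of rotation.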
Commit b370102 (Mar 2026) added discovery_duration and build_duration to the LoadBalancer::update path so operators can observe how long the update step takes. These show up via the standard Prometheus integration.
Discovery
Discovery is a trait with one method:
#[async_trait]
pub trait ServiceDiscovery {
async fn discover(&self) -> Result<(BTreeSet<Backend>, HashMap<u64, bool>)>;
}

The default is Static, which returns the same set every time. Custom implementations might:
- Resolve DNS SRV records
- Query a Kubernetes API for endpoints
- Read a config file that's edited at runtime
- Subscribe to a service-mesh control plane
The discovery loop polls the trait at a configurable interval (update_frequency on the load balancer). Each discover() call produces a fresh BTreeSet<Backend>. The load balancer diffs old vs new and updates the ring or list accordingly.
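As a rough illustration of the "fresh set, then diff" step, the sketch below parses a runtime-edited list of addresses and compares two snapshots. It uses only the standard library. `parse_backends`, `diff`, and `Diff` are hypothetical names, `SocketAddr` stands in for the crate's Backend type, and the real update path may compare snapshots differently.

```rust
use std::collections::BTreeSet;
use std::net::SocketAddr;

/// Result of comparing two discovery snapshots (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct Diff {
    pub added: Vec<SocketAddr>,
    pub removed: Vec<SocketAddr>,
}

/// Parse one `ip:port` entry per line, skipping blank or unparsable lines.
/// Models a file-based discovery source that is edited at runtime.
pub fn parse_backends(text: &str) -> BTreeSet<SocketAddr> {
    text.lines()
        .filter_map(|line| line.trim().parse::<SocketAddr>().ok())
        .collect()
}

/// Compute which backends appeared and which disappeared between snapshots.
pub fn diff(old: &BTreeSet<SocketAddr>, new: &BTreeSet<SocketAddr>) -> Diff {
    Diff {
        added: new.difference(old).copied().collect(),
        removed: old.difference(new).copied().collect(),
    }
}
```

Because both snapshots are ordered sets, the two `difference` calls are linear merges, and an unchanged snapshot yields an empty diff, so a poll that finds nothing new does not need to rebuild the ring or list.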
Failover semantics
The proxy retries on connect failure if the error's retry flag is true. The pattern in upstream_peer:
- Get a backend from LoadBalancer::select.
- Build a Peer, return it.
- If fail_to_connect is called, the user can either:
  - Mark the error retryable and let the proxy call upstream_peer again (which might pick a different backend).
  - Mark it not-retryable and let the request fail.
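The retry decision above can be modeled without the proxy machinery. The following std-only sketch is an assumption-laden illustration, not pingora's API: `ConnectError`, its `retry` field, `connect_with_failover`, and the `max_tries` cap are all hypothetical, and the `connect` closure stands in for the proxy's connection attempt.

```rust
/// Hypothetical connect error carrying a retry flag.
#[derive(Debug, PartialEq)]
pub struct ConnectError {
    pub retry: bool,
}

/// Try backends in order, wrapping around, for at most `max_tries` attempts.
/// A retryable error moves on to the next backend; a non-retryable error
/// fails the request immediately.
pub fn connect_with_failover<F>(
    backends: &[&str],
    max_tries: usize,
    mut connect: F,
) -> Result<String, ConnectError>
where
    F: FnMut(&str) -> Result<(), ConnectError>,
{
    // Returned if there are no backends or no attempts are allowed.
    let mut last = ConnectError { retry: false };
    for &backend in backends.iter().cycle().take(max_tries) {
        match connect(backend) {
            Ok(()) => return Ok(backend.to_string()),
            // Retryable: remember the error and pick the next backend.
            Err(e) if e.retry => last = e,
            // Not retryable: let the request fail.
            Err(e) => return Err(e),
        }
    }
    Err(last)
}
```

Capping the number of attempts matters: if every backend keeps returning a retryable error, an uncapped loop would never give up on the request.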
docs/user_guide/failover.md walks through this end-to-end.
Source map
| File | What |
|---|---|
| pingora-load-balancing/src/lib.rs | LoadBalancer<S>, Backend, Backends, top-level glue |
| pingora-load-balancing/src/health_check.rs | TcpHealthCheck, HttpHealthCheck, HealthCheck trait |
| pingora-load-balancing/src/discovery.rs | ServiceDiscovery trait, static impl |
| pingora-load-balancing/src/background.rs | BackgroundService wrapper |
| pingora-load-balancing/src/selection/ | Selection strategies |
See also
- pingora-load-balancing
- pingora-ketama (the backing implementation for Consistent selection)
- docs/user_guide/failover.md
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.