hashicorp/consul

Snapshots and disaster recovery

Consul's authoritative state lives in the Raft FSM in memory plus the Raft log on disk. Operators can capture a point-in-time snapshot of the FSM and restore from it later — to a fresh cluster, to a different DC, or as part of routine backups.

Pieces

| Piece | Where |
| --- | --- |
| FSM Snapshot / Restore | agent/consul/fsm/snapshot.go, agent/consul/fsm/snapshot_ce.go |
| Snapshot RPC | agent/consul/snapshot_endpoint.go |
| Snapshot HTTP | agent/snapshot_endpoint.go |
| Archive format | snapshot/archive.go, snapshot/snapshot.go |
| CLI | command/snapshot/save, restore, inspect, decode |
| Public Go API | api/snapshot.go |

Archive format

A Consul snapshot is a gzip-compressed tar archive containing:

| Member | Purpose |
| --- | --- |
| meta.json | Raft snapshot metadata: version, ID, snapshot index, term |
| state.bin | The serialized FSM state (msgpack-encoded record stream) |
| SHA256SUMS | Manifest of expected hashes for the other members; verified on restore |

snapshot/archive.go writes and reads the tar. snapshot/snapshot.go glues it to the Raft snapshot interface.
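The checksum verification can be illustrated with a minimal standard-library sketch. It is not the code in snapshot/archive.go: the function names (verifyArchive, buildArchive, check) are hypothetical, and the SHA256SUMS line layout ("<hex digest>  <name>", as produced by the sha256sum tool) is an assumption.

```go
package main

import (
	"archive/tar"
	"bufio"
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// verifyArchive hashes every member of a gzip-compressed tar stream and
// compares the digests with the entries listed in SHA256SUMS.
// Assumption: each manifest line is "<hex digest>  <member name>".
func verifyArchive(r io.Reader) error {
	gz, err := gzip.NewReader(r)
	if err != nil {
		return fmt.Errorf("open gzip: %w", err)
	}
	defer gz.Close()

	actual := map[string]string{} // member name -> hex digest of its bytes
	var sums []byte
	seenSums := false
	tr := tar.NewReader(gz)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return fmt.Errorf("read tar: %w", err)
		}
		if hdr.Name == "SHA256SUMS" {
			if sums, err = io.ReadAll(tr); err != nil {
				return err
			}
			seenSums = true
			continue
		}
		h := sha256.New()
		if _, err := io.Copy(h, tr); err != nil {
			return err
		}
		actual[hdr.Name] = hex.EncodeToString(h.Sum(nil))
	}
	if !seenSums {
		return fmt.Errorf("archive has no SHA256SUMS member")
	}
	sc := bufio.NewScanner(bytes.NewReader(sums))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue // skip blank or malformed lines
		}
		if got, ok := actual[fields[1]]; !ok || got != fields[0] {
			return fmt.Errorf("checksum mismatch for %s", fields[1])
		}
	}
	return sc.Err()
}

// buildArchive packs two sample members plus a SHA256SUMS manifest. When
// tamper is true, state.bin is altered after its digest has been recorded.
func buildArchive(tamper bool) []byte {
	members := []struct {
		name string
		data []byte
	}{
		{"meta.json", []byte(`{"Index":42,"Term":3}`)},
		{"state.bin", []byte("serialized FSM records")},
	}
	var buf, sums bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)
	add := func(name string, data []byte) {
		// Write errors are ignored: the destination is an in-memory buffer.
		tw.WriteHeader(&tar.Header{Name: name, Mode: 0o600, Size: int64(len(data))})
		tw.Write(data)
	}
	for _, m := range members {
		fmt.Fprintf(&sums, "%x  %s\n", sha256.Sum256(m.data), m.name)
		if tamper && m.name == "state.bin" {
			m.data = []byte("corrupted FSM records!")
		}
		add(m.name, m.data)
	}
	add("SHA256SUMS", sums.Bytes())
	tw.Close()
	gz.Close()
	return buf.Bytes()
}

// check builds an archive in memory and verifies it.
func check(tamper bool) error {
	return verifyArchive(bytes.NewReader(buildArchive(tamper)))
}

func main() {
	fmt.Println("intact:  ", check(false))
	fmt.Println("tampered:", check(true))
}
```

Hashing while streaming keeps memory use flat regardless of how large state.bin is; only the small manifest is buffered.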

How it's produced

sequenceDiagram
    participant Op as Operator
    participant CLI as consul snapshot save
    participant API as api.Client
    participant Server as snapshot_endpoint.go
    participant Raft as Raft
    participant FSM as fsm
    participant Disk as state.bin

    Op->>CLI: consul snapshot save backup.snap
    CLI->>API: GET /v1/snapshot
    API->>Server: HTTP request
    Server->>Raft: raft.Snapshot()
    Raft->>FSM: Snapshot(), then Persist(raft.SnapshotSink)
    FSM->>Disk: stream every state-store record
    Raft-->>Server: stream + meta
    Server-->>API: archive (tar)
    API-->>CLI: archive bytes
    CLI-->>Op: backup.snap

The FSM iterates every MemDB table (catalog, KV, ACLs, sessions, intentions, config entries, peerings, ...) and writes records with a type tag and msgpack body. Code: agent/consul/fsm/snapshot_ce.go.
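A tagged record stream of this kind can be sketched as follows. This is an illustration only: the tag values are hypothetical, JSON stands in for msgpack so the example needs only the standard library, and a length prefix is added so the reader can skip bodies without decoding them (Consul's real stream may be framed differently). The reader tallies records per type, which is also roughly the kind of summary an inspection tool produces.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"encoding/json"
	"fmt"
	"io"
)

// Hypothetical type tags; Consul defines its own message types.
const (
	tagKV      byte = 1
	tagSession byte = 2
)

// writeRecord emits one record: 1-byte type tag, 4-byte big-endian body
// length, then the body. JSON is a stand-in for msgpack in this sketch.
func writeRecord(w io.Writer, tag byte, v any) error {
	body, err := json.Marshal(v)
	if err != nil {
		return err
	}
	hdr := make([]byte, 5)
	hdr[0] = tag
	binary.BigEndian.PutUint32(hdr[1:], uint32(len(body)))
	if _, err := w.Write(hdr); err != nil {
		return err
	}
	_, err = w.Write(body)
	return err
}

// countByTag walks the stream and counts records per type tag, skipping
// over each body instead of decoding it.
func countByTag(r io.Reader) (map[byte]int, error) {
	br := bufio.NewReader(r)
	counts := map[byte]int{}
	hdr := make([]byte, 5)
	for {
		_, err := io.ReadFull(br, hdr)
		if err == io.EOF {
			return counts, nil // clean end of stream
		}
		if err != nil {
			return nil, err // includes a truncated header
		}
		n := binary.BigEndian.Uint32(hdr[1:])
		if _, err := io.CopyN(io.Discard, br, int64(n)); err != nil {
			return nil, err
		}
		counts[hdr[0]]++
	}
}

// demo writes three records and returns the per-tag counts.
func demo() (map[byte]int, error) {
	var buf bytes.Buffer
	recs := []struct {
		tag byte
		val map[string]string
	}{
		{tagKV, map[string]string{"Key": "a", "Value": "1"}},
		{tagKV, map[string]string{"Key": "b", "Value": "2"}},
		{tagSession, map[string]string{"ID": "s1"}},
	}
	for _, r := range recs {
		if err := writeRecord(&buf, r.tag, r.val); err != nil {
			return nil, err
		}
	}
	return countByTag(&buf)
}

func main() {
	counts, err := demo()
	fmt.Println(counts, err) // prints: map[1:2 2:1] <nil>
}
```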

How it's restored

sequenceDiagram
    participant Op as Operator
    participant CLI as consul snapshot restore
    participant Server as snapshot_endpoint.go
    participant Raft as Raft
    participant FSM as fsm
    participant State as new state.Store

    Op->>CLI: consul snapshot restore backup.snap
    CLI->>Server: PUT /v1/snapshot
    Server->>Server: validate archive (sha256)
    Server->>Raft: raft.Restore(reader)
    Raft->>FSM: Restore(io.ReadCloser)
    FSM->>State: NewStateStore + bulk insert via state.Restore
    State-->>FSM: ok
    FSM-->>Raft: ok
    Server-->>CLI: 200 OK

Restore uses state.Restore (agent/consul/state/state_store.go) which builds a fresh MemDB inside a write transaction and replaces the live store atomically when done.
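The build-then-swap idea can be shown with a small sketch. It assumes nothing about Consul's actual types: store, fsm, and record are hypothetical stand-ins, and an atomic pointer replaces whatever locking the real FSM uses. The point is that a failed restore leaves the live store untouched, and readers never see a half-built one.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// record is a hypothetical snapshot entry.
type record struct{ key, value string }

// store stands in for the state store; it is never mutated after publishing.
type store struct{ kv map[string]string }

// fsm keeps the live store behind an atomic pointer.
type fsm struct{ live atomic.Pointer[store] }

func newFSM() *fsm {
	f := &fsm{}
	f.live.Store(&store{kv: map[string]string{}})
	return f
}

// restore builds a completely new store off to the side and publishes it
// with one atomic pointer swap. Any error aborts before the swap.
func (f *fsm) restore(records []record) error {
	fresh := &store{kv: make(map[string]string, len(records))}
	for _, r := range records {
		if r.key == "" {
			return errors.New("restore: record with empty key")
		}
		fresh.kv[r.key] = r.value
	}
	f.live.Store(fresh)
	return nil
}

// get reads from whichever store is currently live.
func (f *fsm) get(key string) (string, bool) {
	v, ok := f.live.Load().kv[key]
	return v, ok
}

func main() {
	f := newFSM()
	_ = f.restore([]record{{"a", "1"}})
	err := f.restore([]record{{"b", "2"}, {"", "bad"}}) // fails, no swap
	v, ok := f.get("a")
	fmt.Println(err != nil, v, ok) // prints: true 1 true
}
```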

A restore is applied through the leader and replaces the state of the whole cluster. Each follower hard-resets its FSM to the snapshot and then catches up on subsequent entries from the Raft log.

Inspecting a snapshot

consul snapshot inspect reports metadata and per-type record counts without touching the cluster. consul snapshot decode produces a JSON dump of the entire snapshot for offline analysis. Implementations live in command/snapshot/inspect/ and command/snapshot/decode/. The decoder mirrors the FSM dispatch by type.

Operator workflows

# Daily backup
consul snapshot save daily.snap

# Inspect
consul snapshot inspect daily.snap

# Restore (caution: replaces cluster state)
consul snapshot restore daily.snap

# Programmatic (Go, github.com/hashicorp/consul/api; client is an *api.Client)
snap, _, err := client.Snapshot().Save(&api.QueryOptions{})

Snapshots are typically taken on a fixed schedule (hourly or daily) and stored off-cluster. In Consul Enterprise, the snapshot agent (consul snapshot agent) takes them on a schedule and applies a retention policy; CE operators typically use cron.
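For a scheduled job that should not depend on the CLI, the save step can also be driven over plain HTTP. The sketch below streams GET /v1/snapshot into any io.Writer using only the standard library. saveSnapshot and demo are hypothetical names, and the demo runs against a local test server that merely imitates the endpoint, so the bytes it returns are not a real archive.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// saveSnapshot streams GET <addr>/v1/snapshot into w and returns the number
// of bytes written. token may be empty when ACLs are disabled.
func saveSnapshot(ctx context.Context, addr, token string, w io.Writer) (int64, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, addr+"/v1/snapshot", nil)
	if err != nil {
		return 0, err
	}
	if token != "" {
		req.Header.Set("X-Consul-Token", token)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("snapshot save: unexpected status %s", resp.Status)
	}
	return io.Copy(w, resp.Body)
}

// demo exercises saveSnapshot against a stand-in server, not a real agent.
func demo() (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet || r.URL.Path != "/v1/snapshot" {
			http.NotFound(w, r)
			return
		}
		io.WriteString(w, "fake-archive-bytes")
	}))
	defer srv.Close()

	var out strings.Builder
	_, err := saveSnapshot(context.Background(), srv.URL, "", &out)
	return out.String(), err
}

func main() {
	body, err := demo()
	fmt.Println(body, err) // prints: fake-archive-bytes <nil>
}
```

In a real backup job, w would usually be a temporary file that is renamed into place only after the copy succeeds, so a failed transfer never overwrites the previous good backup.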

Integration with Raft

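The same machinery powers Raft's internal snapshots: every RaftSnapshotInterval, Raft checks how far the log has grown since the last snapshot and, once the growth reaches RaftSnapshotThreshold entries, asks the FSM to snapshot itself and then truncates the log. Raft keeps these snapshots in its file-based snapshot store; raft-boltdb and raft-wal back the log store, not the snapshots. consul snapshot save/restore drives the same machinery on demand at user request.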
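The automatic trigger is a periodic check of log growth against an entry-count threshold. A minimal sketch of that decision, with a small simulation, might look like this. The function names and the numbers are illustrative assumptions, not values or code taken from hashicorp/raft.

```go
package main

import "fmt"

// shouldSnapshot reports whether the log has grown by at least threshold
// entries since the last snapshot. It is meant to be called on a timer.
func shouldSnapshot(lastLogIndex, lastSnapshotIndex, threshold uint64) bool {
	if lastLogIndex < lastSnapshotIndex {
		return false // nothing new since the last snapshot
	}
	return lastLogIndex-lastSnapshotIndex >= threshold
}

// simulate appends a fixed number of entries per timer tick and returns the
// log indexes at which a snapshot would have been taken.
func simulate(appendsPerTick, ticks, threshold uint64) []uint64 {
	var lastLog, lastSnap uint64
	var taken []uint64
	for i := uint64(0); i < ticks; i++ {
		lastLog += appendsPerTick
		if shouldSnapshot(lastLog, lastSnap, threshold) {
			lastSnap = lastLog
			taken = append(taken, lastSnap)
		}
	}
	return taken
}

func main() {
	// 3000 entries per tick against an illustrative threshold of 8192.
	fmt.Println(simulate(3000, 10, 8192)) // prints: [9000 18000 27000]
}
```

The simulation shows why both settings matter: the threshold bounds how much log must accumulate, while the interval bounds how quickly Raft notices that it has.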

Disaster recovery scenarios

| Scenario | Procedure |
| --- | --- |
| Lost quorum, but at least one server alive | Stop the surviving servers, write raft/peers.json listing only those servers, and restart them. consul operator raft remove-peer needs a leader, so it only helps once quorum is back. |
| Total loss, restoring from backup | Bootstrap a single-server cluster with -bootstrap=true, then consul snapshot restore. Add additional servers afterwards. |
| Migrating between regions | Snapshot + restore into a new cluster; reissue ACL tokens; reset Connect CA if changing trust domain |
| Cross-version restore | Snapshots are forward/backward compatible across recent versions; check the release notes for any explicit bumps |

Entry points for modification

  • Add a new state-store type that should appear in snapshots: register it in agent/consul/fsm/snapshot_ce.go (both write and read paths) and in agent/consul/state/state_store.go::Restore.
  • Change the archive format: bump the schema version in snapshot/archive.go. Be careful: older Consul versions won't read snapshots written in the newer format.
  • Optimize restore: the bulk-load path lives in agent/consul/state/state_store.go::Restore and uses MemDB's bulk insert.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
