hashicorp/consul
Snapshots and disaster recovery
Consul's authoritative state lives in the Raft FSM in memory plus the Raft log on disk. Operators can capture a point-in-time snapshot of the FSM and restore from it later — to a fresh cluster, to a different DC, or as part of routine backups.
Pieces
| Piece | Where |
|---|---|
| FSM Snapshot / Restore | agent/consul/fsm/snapshot.go, agent/consul/fsm/snapshot_ce.go |
| Snapshot RPC | agent/consul/snapshot_endpoint.go |
| Snapshot HTTP | agent/snapshot_endpoint.go |
| Archive format | snapshot/archive.go, snapshot/snapshot.go |
| CLI | command/snapshot/save, restore, inspect, decode |
| Public Go API | api/snapshot.go |
Archive format
A Consul snapshot is a gzip-compressed tar archive containing:
| Member | Purpose |
|---|---|
| meta.json | Raft snapshot metadata: version, ID, snapshot index, term, cluster configuration |
| state.bin | The serialized FSM state (msgpack-encoded record stream) |
| SHA256SUMS | Manifest of expected SHA-256 hashes for the other members; verified on restore |
snapshot/archive.go writes and reads the tar. snapshot/snapshot.go glues it to the Raft snapshot interface.
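To make the role of SHA256SUMS concrete, here is a minimal sketch that writes and verifies an archive with the three members listed in the table. It uses only the Go standard library and is not copied from snapshot/archive.go; the gzip wrapper, the `writeArchive` and `verifyArchive` names, and the manifest line layout are assumptions of the sketch.

```go
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// writeArchive packs meta.json and state.bin into a gzip-compressed tar and
// appends a SHA256SUMS manifest covering both (hypothetical helper).
func writeArchive(meta, state []byte) ([]byte, error) {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)
	var sums strings.Builder

	add := func(name string, data []byte) error {
		hdr := &tar.Header{Typeflag: tar.TypeReg, Name: name, Mode: 0o600, Size: int64(len(data))}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		_, err := tw.Write(data)
		return err
	}
	members := []struct {
		name string
		data []byte
	}{{"meta.json", meta}, {"state.bin", state}}
	for _, m := range members {
		if err := add(m.name, m.data); err != nil {
			return nil, err
		}
		sum := sha256.Sum256(m.data)
		fmt.Fprintf(&sums, "%s  %s\n", hex.EncodeToString(sum[:]), m.name)
	}
	if err := add("SHA256SUMS", []byte(sums.String())); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	if err := gz.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// verifyArchive re-hashes every member and compares it against SHA256SUMS.
func verifyArchive(archive []byte) error {
	gz, err := gzip.NewReader(bytes.NewReader(archive))
	if err != nil {
		return err
	}
	tr := tar.NewReader(gz)
	got := map[string]string{}
	var manifest string
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		data, err := io.ReadAll(tr)
		if err != nil {
			return err
		}
		if hdr.Name == "SHA256SUMS" {
			manifest = string(data)
			continue
		}
		sum := sha256.Sum256(data)
		got[hdr.Name] = hex.EncodeToString(sum[:])
	}
	for _, line := range strings.Split(strings.TrimSpace(manifest), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 || got[fields[1]] != fields[0] {
			return fmt.Errorf("checksum mismatch or bad manifest line: %q", line)
		}
	}
	return nil
}

func main() {
	a, err := writeArchive([]byte(`{"Index":42}`), []byte("state"))
	if err != nil {
		panic(err)
	}
	fmt.Println("verify:", verifyArchive(a))
}
```

Writing the manifest last means a truncated upload is likely to fail verification, because the manifest is either missing or does not match the members that did arrive.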
How it's produced
sequenceDiagram
participant Op as Operator
participant CLI as consul snapshot save
participant API as api.Client
participant Server as snapshot_endpoint.go
participant Raft as Raft+FSM
participant Disk as state.bin
Op->>CLI: consul snapshot save backup.snap
CLI->>API: GET /v1/snapshot
API->>Server: HTTP request
Server->>Raft: raft.Snapshot()
Raft->>FSM: Persist(io.Writer)
FSM->>Disk: stream every state-store record
Raft-->>Server: stream + meta
Server-->>API: archive (tar)
API-->>CLI: archive bytes
CLI-->>Op: backup.snap
The FSM iterates every MemDB table (catalog, KV, ACLs, sessions, intentions, config entries, peerings, ...) and writes each record as a type tag followed by a msgpack body. Code: agent/consul/fsm/snapshot_ce.go.
How it's restored
sequenceDiagram
participant Op as Operator
participant CLI as consul snapshot restore
participant Server as snapshot_endpoint.go
participant Raft as Raft
participant FSM as fsm
participant State as new state.Store
Op->>CLI: consul snapshot restore backup.snap
CLI->>Server: PUT /v1/snapshot
Server->>Server: validate archive (sha256)
Server->>Raft: raft.Restore(reader)
Raft->>FSM: Restore(io.ReadCloser)
FSM->>State: NewStateStore + bulk insert via state.Restore
State-->>FSM: ok
FSM-->>Raft: ok
Server-->>CLI: 200 OK
Restore uses state.Restore (agent/consul/state/state_store.go), which builds a fresh MemDB inside a write transaction and replaces the live store atomically when done.
Restoring on a leader replaces the cluster's state. A follower hard-resets its FSM to the snapshot, and the Raft log then replays any subsequent entries.
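The build-then-swap pattern described here can be sketched in a few lines. This is an illustration under simplifying assumptions, not Consul's code: `store`, `fsm`, and `record` are hypothetical stand-ins, a plain map replaces MemDB, and `atomic.Pointer` (Go 1.19+) models the atomic replacement.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// store is a stand-in for the state store (the real one is MemDB-backed).
type store struct{ kv map[string]string }

type fsm struct {
	state atomic.Pointer[store] // readers always observe a complete store
}

type record struct{ Key, Value string }

// restore loads every record into a fresh store and only then swaps it in.
// If any record is rejected, the live store is left untouched.
func (f *fsm) restore(records []record) error {
	next := &store{kv: make(map[string]string, len(records))}
	for _, r := range records {
		if r.Key == "" {
			return errors.New("restore aborted: record with empty key")
		}
		next.kv[r.Key] = r.Value
	}
	f.state.Store(next) // single atomic pointer swap
	return nil
}

func main() {
	f := &fsm{}
	f.state.Store(&store{kv: map[string]string{"old": "1"}})

	// A failed restore leaves the previous state in place.
	err := f.restore([]record{{Key: "a", Value: "1"}, {Key: "", Value: "bad"}})
	fmt.Println(err != nil, len(f.state.Load().kv))

	// A successful restore replaces it wholesale.
	err = f.restore([]record{{Key: "a", Value: "1"}, {Key: "b", Value: "2"}})
	fmt.Println(err, len(f.state.Load().kv))
}
```

Because the new store is populated off to the side, concurrent readers never see a half-restored state, and a corrupt snapshot cannot damage the live one.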
Inspecting a snapshot
consul snapshot inspect reports metadata and per-type record counts without touching the cluster. consul snapshot decode produces a JSON dump of the entire snapshot for offline analysis. Implementations live in command/snapshot/inspect/ and command/snapshot/decode/. The decoder mirrors the FSM dispatch by type.
Operator workflows
# Daily backup
consul snapshot save daily.snap
# Inspect
consul snapshot inspect daily.snap
# Restore (caution: replaces cluster state)
consul snapshot restore daily.snap
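# Programmatic (Go, github.com/hashicorp/consul/api; client is an *api.Client, Save returns an io.ReadCloser with the archive)
client.Snapshot().Save(&api.QueryOptions{})
Snapshots are typically taken on a fixed schedule, such as hourly or daily, and stored off-cluster. The Consul Enterprise snapshot agent (consul snapshot agent) runs them on a schedule with a retention policy; CE operators do this with cron.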
Integration with Raft
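The same machinery powers Raft's internal snapshots: when the number of log entries since the last snapshot exceeds the snapshot threshold (raft_snapshot_threshold, checked every raft_snapshot_interval), Raft asks the FSM to snapshot itself, then truncates the log. Raft persists these snapshots through its file-based snapshot store; raft-boltdb or raft-wal hold the log itself. consul snapshot save/restore taps into the same flow on operator request.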
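The snapshot-then-truncate cycle can be modeled with a toy log. This is a sketch under simplifying assumptions rather than hashicorp/raft's implementation: `node`, `entry`, and `threshold` are hypothetical names, and the whole log is discarded at snapshot time, whereas a real Raft implementation may retain some trailing entries.

```go
package main

import "fmt"

type entry struct{ Key, Value string }

// node is a toy model of a Raft server: an in-memory FSM, the log entries
// not yet covered by a snapshot, and the most recent snapshot.
type node struct {
	fsm           map[string]string
	log           []entry
	lastIndex     int
	snapshot      map[string]string
	snapshotIndex int
}

// apply appends an entry to the log and folds it into the FSM.
func (n *node) apply(e entry) {
	n.lastIndex++
	n.log = append(n.log, e)
	n.fsm[e.Key] = e.Value
}

// maybeSnapshot copies the FSM and truncates the log once more than
// `threshold` entries have accumulated since the last snapshot.
func (n *node) maybeSnapshot(threshold int) bool {
	if n.lastIndex-n.snapshotIndex <= threshold {
		return false
	}
	snap := make(map[string]string, len(n.fsm))
	for k, v := range n.fsm {
		snap[k] = v
	}
	n.snapshot, n.snapshotIndex = snap, n.lastIndex
	n.log = nil // entries up to snapshotIndex are now redundant
	return true
}

func main() {
	n := &node{fsm: map[string]string{}}
	for i := 1; i <= 5; i++ {
		n.apply(entry{Key: fmt.Sprintf("k%d", i), Value: "v"})
		if n.maybeSnapshot(2) {
			fmt.Printf("snapshot at index %d, log truncated\n", n.snapshotIndex)
		}
	}
	fmt.Println("entries left in log:", len(n.log))
}
```

The model shows why a snapshot plus the log tail is enough to rebuild state: everything before the snapshot index is already folded into the copied FSM.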
Disaster recovery scenarios
| Scenario | Procedure |
|---|---|
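| Lost quorum, but >1 server alive | Stop the surviving servers, write a raft/peers.json listing only them, and restart; consul operator raft remove-peer needs a leader, so it applies only while quorum still holds |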
| Total loss, restoring from backup | Bootstrap a single-server cluster with -bootstrap=true, then consul snapshot restore. Add additional servers afterwards. |
| Migrating between regions | Snapshot + restore into a new cluster; reissue ACL tokens; reset Connect CA if changing trust domain |
| Cross-version restore | Snapshots are forward/backward compatible across recent versions; check the release notes for any explicit bumps |
Entry points for modification
- Add a new state-store type that should appear in snapshots: register it in agent/consul/fsm/snapshot_ce.go (both write and read paths) and in agent/consul/state/state_store.go::Restore.
- Change the archive format: bump the schema version in snapshot/archive.go. Be careful: older Consul versions won't read newer snapshots.
- Optimize restore: the bulk-load path lives in agent/consul/state/state_store.go::Restore and uses MemDB's bulk insert.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.