duckdb/duckdb
Storage
Active contributors: Mytherin, Tishj, Mark
Purpose
src/storage/ owns the on-disk database file, the buffer manager that brings blocks into memory, table data structures (row groups, segments), compression codecs, the write-ahead log, and the checkpoint protocol. A DuckDB database is one file; this directory is what makes that work.
Directory layout
src/storage/
├── storage_manager.cpp Per-database init, attach/detach, version checks
├── single_file_block_manager.cpp Default block manager: maps logical blocks to file offsets
├── block.cpp Block primitive
├── block_allocator.cpp Allocates fresh blocks
├── buffer_manager.cpp Buffer manager interface
├── standard_buffer_manager.cpp Default LRU+pinning buffer manager
├── partial_block_manager.cpp Sub-block packing for small tables
├── arena_allocator.cpp Arena allocator for pipelined writes
├── checkpoint_manager.cpp Checkpoint orchestration
├── write_ahead_log.cpp WAL writer
├── wal_replay.cpp WAL replay on startup
├── temporary_file_manager.cpp Disk spill manager
├── temporary_memory_manager.cpp In-memory budget for spillable structures
├── data_table.cpp Per-table append/scan/update/delete
├── local_storage.cpp Per-transaction uncommitted changes
├── data_pointer.cpp Persisted block-pointer metadata
├── optimistic_data_writer.cpp Eager writes for large appends
├── storage_index.cpp Index entry persistence
├── table_index_list.cpp Per-table list of indexes
├── magic_bytes.cpp File-header magic-byte detection
├── storage_lock.cpp Reader/writer locks for storage state
├── storage_info.cpp Storage version constants
├── version_map.json Version compatibility map
├── open_file_storage_extension.cpp Hook for storage extensions
├── index.cpp Index registration
├── buffer/ Allocator and pinning helpers
├── checkpoint/ Checkpoint state machines
├── compression/ Per-codec compress/decompress
├── external_file_cache/ Cache for remote/external files
├── metadata/ Catalog metadata blocks
├── serialization/ Generated serialize/deserialize for plans + storage
├── statistics/ Per-segment / per-row-group statistics
└── table/ Row group, column data, segment trees
Key abstractions
| Type | File | Role |
|---|---|---|
| StorageManager | src/storage/storage_manager.cpp | Per-database top-level: opens the file, runs WAL replay, reads schemas, decides if a checkpoint is needed. |
| BlockManager | src/include/duckdb/storage/block_manager.hpp | Abstract: maps logical block IDs to bytes. Default implementation SingleFileBlockManager keeps everything in one file. |
| BufferManager | src/include/duckdb/storage/buffer_manager.hpp | Manages a memory budget for blocks. StandardBufferManager is the production implementation. |
| BlockHandle / BufferHandle | src/storage/buffer/ | RAII handles for pinning a block in memory; release happens when the handle is destroyed. |
| DataTable | src/storage/data_table.cpp | Per-table API used by INSERT/UPDATE/DELETE/SCAN. Holds row groups, indexes, statistics. |
| RowGroup | src/storage/table/row_group.cpp | A horizontal slice of a table (default 122,880 rows). |
| ColumnData | src/storage/table/column_data.cpp | Per-column storage inside a row group, made of ColumnSegments. |
| WriteAheadLog | src/storage/write_ahead_log.cpp | Streaming append-only log written by every committing transaction. |
| CheckpointManager | src/storage/checkpoint_manager.cpp | Periodically flushes dirty data into the main file and truncates the WAL. |
| LocalStorage | src/storage/local_storage.cpp | Per-transaction view of uncommitted appends/updates/deletes. |
| TemporaryFileManager | src/storage/temporary_file_manager.cpp | Spills blocks to disk when the buffer manager is over budget. |
How it works
graph TD
SQL[INSERT/UPDATE/DELETE] -->|via PhysicalOperator| LS[LocalStorage]
LS -->|on commit| DT[DataTable]
DT -->|append| RG[RowGroup -> ColumnData -> ColumnSegment]
RG -->|allocate blocks| BM[BlockManager]
BM -->|read/write| BUF[BufferManager]
BUF -->|pin/unpin| FILE[Single database file]
DT -.->|log entry| WAL[WriteAheadLog]
WAL -->|periodic| CK[CheckpointManager]
CK -->|flush dirty + truncate WAL| FILE
Single-file storage
A DuckDB database file is divided into fixed-size blocks. The default block size is 256 KB (262,144 bytes); it is fixed at database creation time. The first few blocks contain metadata (the database header, schema metadata, free-block lists). Tables and indexes live in the remaining blocks.
SingleFileBlockManager (single_file_block_manager.cpp) tracks:
- The current header.
- A free list of unused blocks.
- A used list of allocated blocks per object.
Two header copies are written alternately so that a crash at any point leaves at least one valid header.
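To make the block addressing and the alternating-header idea concrete, here is a minimal Python sketch. It is a conceptual model, not DuckDB code: the header slot size, the number of header slots, the `iteration` and `valid` fields, and all function names are assumptions chosen for illustration.

```python
# Conceptual sketch only; the header layout constants below are assumptions.
HEADER_SIZE = 4096        # assumed size of each header slot, in bytes
NUM_HEADERS = 3           # assumed: one main header plus two alternating database headers
BLOCK_SIZE = 256 * 1024   # default block size in bytes


def block_offset(block_id: int) -> int:
    """Byte offset of a logical block, assuming blocks follow the header slots."""
    if block_id < 0:
        raise ValueError("block id must be non-negative")
    return HEADER_SIZE * NUM_HEADERS + block_id * BLOCK_SIZE


def pick_active_header(headers):
    """Return the valid header with the highest iteration counter.

    `headers` is a list of dicts with hypothetical keys `iteration` and
    `valid` (valid meaning the header passed its integrity check on read).
    """
    candidates = [h for h in headers if h["valid"]]
    if not candidates:
        raise OSError("no valid database header found")
    return max(candidates, key=lambda h: h["iteration"])


def next_header_slot(active_slot: int) -> int:
    """The next header is written to the slot that is NOT currently active."""
    return 1 - active_slot
```

Because a new header always goes to the inactive slot, a crash in the middle of that write damages only the copy that was not yet in use, and `pick_active_header` falls back to the older one.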
Row groups and segments
Each table is a sequence of RowGroups. A row group:
- Has a fixed maximum row count (122,880).
- Stores per-column statistics (min/max/distinct/null counts).
- Contains one ColumnData per column, made of one or more ColumnSegments.
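The row-group properties listed above can be illustrated with a little arithmetic. The sketch below is a simplified Python model, not DuckDB's implementation: `locate_row`, `row_group_count`, `ColumnStats`, and `may_contain` are hypothetical names, and the statistics are reduced to min/max/null count on integers.

```python
from dataclasses import dataclass
from typing import Tuple

ROW_GROUP_SIZE = 122_880  # maximum rows per row group


def locate_row(row_id: int) -> Tuple[int, int]:
    """Map a table-wide row id to (row group index, offset inside that group)."""
    return divmod(row_id, ROW_GROUP_SIZE)


def row_group_count(total_rows: int) -> int:
    """Number of row groups needed for `total_rows` rows (ceiling division)."""
    return -(-total_rows // ROW_GROUP_SIZE)


@dataclass
class ColumnStats:
    """Simplified per-column statistics for one row group."""
    min_value: int
    max_value: int
    null_count: int


def may_contain(stats: ColumnStats, lo: int, hi: int) -> bool:
    """False only when no value in [lo, hi] can be present in the row group."""
    return not (hi < stats.min_value or lo > stats.max_value)
```

A scan with a range filter can skip any row group for which `may_contain` returns False, which is why statistics are kept per row group.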
Segments are compressed using one of the codecs in src/storage/compression/: uncompressed, bitpacking, dictionary, chimp, patas, alp, fsst, or rle. The codec is chosen per segment, based on a quick analysis pass over the data; the candidate codecs are configured in compression_config.cpp.
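The "analyze, then pick the cheapest codec" idea can be sketched as follows. This is an illustrative Python model with a made-up cost model, not DuckDB's analysis code: the function names, the three candidate codecs, and the assumed 2-byte run-length counter are all hypothetical.

```python
def rle_encode(values):
    """Run-length encode a list into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def rle_decode(runs):
    """Inverse of rle_encode."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out


def bitpack_width(values):
    """Bits per value needed after subtracting the minimum (frame of reference)."""
    if not values:
        return 0
    return (max(values) - min(values)).bit_length()


def estimate_sizes(values, value_bytes=8):
    """Estimated encoded size in bytes per candidate (hypothetical cost model)."""
    n = len(values)
    return {
        "uncompressed": n * value_bytes,
        # each run stores the value plus an assumed 2-byte run-length counter
        "rle": len(rle_encode(values)) * (value_bytes + 2),
        # the minimum value once, then n values at the packed bit width
        "bitpacking": value_bytes + (n * bitpack_width(values) + 7) // 8,
    }


def choose_codec(values):
    """Pick the candidate with the smallest estimated size."""
    sizes = estimate_sizes(values)
    return min(sizes, key=sizes.get)
```

Under this toy cost model, long runs of widely separated values favor RLE, dense ranges favor bitpacking, and short segments with very wide values stay uncompressed.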
Write-ahead log
Every committed write produces WAL records (write_ahead_log.cpp). On startup, wal_replay.cpp reads the WAL, replays the records into the in-memory state, and triggers a checkpoint if needed. The WAL is a separate file with a .wal suffix next to the database file.
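The append-then-replay cycle can be modeled in a few lines. The sketch below is a toy in Python and does not reflect DuckDB's actual WAL format: the framing (length, CRC32, JSON payload), the `op`/`key`/`value` record fields, and the function names are assumptions made for illustration.

```python
import json
import zlib


def wal_append(log: bytearray, record: dict) -> None:
    """Append one record framed as: 4-byte length, 4-byte CRC32, JSON payload."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    log += len(payload).to_bytes(4, "little")
    log += zlib.crc32(payload).to_bytes(4, "little")
    log += payload


def wal_replay(log: bytes, state: dict) -> int:
    """Apply complete records to `state`; stop at the first torn or corrupt one.

    Returns the number of records applied.
    """
    pos, applied = 0, 0
    while pos + 8 <= len(log):
        size = int.from_bytes(log[pos:pos + 4], "little")
        crc = int.from_bytes(log[pos + 4:pos + 8], "little")
        payload = bytes(log[pos + 8:pos + 8 + size])
        if len(payload) < size or zlib.crc32(payload) != crc:
            break  # incomplete tail left by a crash: ignore it and stop
        record = json.loads(payload)
        if record["op"] == "insert":
            state[record["key"]] = record["value"]
        elif record["op"] == "delete":
            state.pop(record["key"], None)
        pos += 8 + size
        applied += 1
    return applied
```

The checksum lets replay distinguish a fully written record from a partially written tail, so a crash mid-append loses at most the record that was being written.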
Checkpointing
CheckpointManager (checkpoint_manager.cpp) flushes all dirty in-memory data into the main file and truncates the WAL. It is triggered on database close, on user request (PRAGMA force_checkpoint), or automatically when the WAL grows past a threshold.
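The three triggers described above (close, explicit request, WAL size threshold) can be modeled with a toy store. This Python sketch is purely illustrative: `TinyStore`, its fields, and the byte-counting rule are hypothetical and much simpler than the real checkpoint protocol.

```python
class TinyStore:
    """Toy model: `main` stands in for the database file, `wal` for the .wal file."""

    def __init__(self, wal_threshold_bytes: int):
        self.main = {}     # checkpointed state
        self.dirty = {}    # committed changes not yet flushed into `main`
        self.wal = []      # list of (key, value) log entries
        self.wal_bytes = 0
        self.wal_threshold_bytes = wal_threshold_bytes
        self.checkpoints = 0

    def commit(self, key: str, value: str) -> None:
        """Log the change first, then track it as dirty in-memory data."""
        self.wal.append((key, value))
        self.wal_bytes += len(key) + len(value)
        self.dirty[key] = value
        if self.wal_bytes > self.wal_threshold_bytes:
            self.checkpoint()  # automatic trigger: WAL grew past the threshold

    def checkpoint(self) -> None:
        """Flush dirty data into the main file, then truncate the WAL."""
        self.main.update(self.dirty)
        self.dirty.clear()
        self.wal.clear()
        self.wal_bytes = 0
        self.checkpoints += 1

    def close(self) -> None:
        """Closing the database also checkpoints."""
        self.checkpoint()
```

Calling `checkpoint()` directly corresponds to the user-requested case; the other two triggers go through the same method.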
Buffer management
StandardBufferManager keeps a budget (default 80% of available memory). When a block is pinned, it is read from disk if not already in memory; if the budget is exceeded, victim blocks are spilled to a temporary file via TemporaryFileManager.
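The pin, evict, and spill behavior described above can be sketched with a small LRU pool. This is a conceptual Python model, not the StandardBufferManager implementation: `TinyBufferPool`, the block-count budget, and the dictionaries standing in for the database file and the temporary file are all assumptions.

```python
from collections import OrderedDict


class TinyBufferPool:
    """Toy LRU pool with pin counts; budget is counted in blocks, not bytes."""

    def __init__(self, capacity_blocks: int, disk: dict):
        self.capacity = capacity_blocks
        self.disk = disk            # block_id -> bytes, stands in for the database file
        self.spill = {}             # stands in for the temporary file
        self.cache = OrderedDict()  # block_id -> bytes, least recently used first
        self.pins = {}              # block_id -> pin count

    def pin(self, block_id: int) -> bytes:
        """Load the block if needed and protect it from eviction."""
        if block_id in self.cache:
            self.cache.move_to_end(block_id)  # mark as most recently used
        else:
            self._make_room()
            data = self.spill.pop(block_id, None)
            if data is None:
                data = self.disk[block_id]
            self.cache[block_id] = data
        self.pins[block_id] = self.pins.get(block_id, 0) + 1
        return self.cache[block_id]

    def unpin(self, block_id: int) -> None:
        """Release one pin; the block becomes evictable at pin count zero."""
        self.pins[block_id] -= 1

    def _make_room(self) -> None:
        """Spill least recently used unpinned blocks until one slot is free."""
        while len(self.cache) >= self.capacity:
            victim = next((b for b in self.cache if self.pins.get(b, 0) == 0), None)
            if victim is None:
                raise MemoryError("all cached blocks are pinned")
            self.spill[victim] = self.cache.pop(victim)
```

Pinned blocks are never chosen as victims, so a caller that holds too many pins at once exhausts the budget instead of silently losing data.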
temporary_memory_manager.cpp handles in-memory budgeting for spillable structures (sort buffers, hash tables) so they cooperate with the buffer manager rather than fight it.
External file cache
src/storage/external_file_cache/ caches data read from external files (on S3, over HTTP, or on the local file system) into a configurable disk-backed cache. This is the integration point for httpfs-style extensions.
Integration points
- Tables/scans in execution call into DataTable for reads (DataTable::Scan) and writes (DataTable::Append, DataTable::Update, DataTable::Delete).
- Transactions (transaction) coordinate reads of versioned data through DuckTransaction and produce undo records that mirror what is written here.
- Catalog (catalog) persists CatalogEntry objects through this layer in metadata blocks (storage/metadata/).
- Compression configuration flows from src/function/compression_config.cpp and src/storage/compression/.
Entry points for modification
- Adding a compression codec: implement CompressionFunction in src/storage/compression/<codec>/, register in compression_config.cpp, add tests in test/sql/compression/.
- Storage format changes: bump the storage version (storage_info.cpp, version_map.json), add backward-compatibility tests in test/bwc/, run scripts/test_storage_compatibility.py.
- Adjusting buffer memory: PRAGMA memory_limit = '4GB' and PRAGMA temp_directory = '/tmp/...'. Implementation: standard_buffer_manager.cpp, temporary_file_manager.cpp.
- Implementing an alternative block manager (e.g., for an embedded environment): subclass BlockManager. The interface lives in src/include/duckdb/storage/block_manager.hpp and is intentionally narrow.
- Adding row-group-level pruning hints: RowGroupPruner in optimizer consumes statistics produced here.
Key source files
| File | Purpose |
|---|---|
| src/storage/storage_manager.cpp | Database open/close lifecycle. |
| src/storage/single_file_block_manager.cpp | Single-file block layout. |
| src/storage/standard_buffer_manager.cpp | Block cache + spill coordination. |
| src/storage/data_table.cpp | Per-table read/write API. |
| src/storage/local_storage.cpp | Per-transaction local state. |
| src/storage/checkpoint_manager.cpp | Checkpoint orchestration. |
| src/storage/write_ahead_log.cpp | WAL writer. |
| src/storage/wal_replay.cpp | WAL replay. |
| src/storage/table/row_group.cpp | Row group structure. |
| src/storage/compression/*/*.cpp | Per-codec implementations. |
Continue to transaction for MVCC and durability semantics, or catalog for how schemas/tables are stored.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.