duckdb/duckdb
Architecture
DuckDB is an in-process analytical database. A query goes through a six-stage pipeline (parse → bind → plan → optimize → physical-plan → execute), then runs as a graph of vectorized pipelines on a fixed-size thread pool. This page maps that pipeline to the directories under src/.
Query lifecycle
graph TD
SQL[SQL string] -->|Parser::ParseQuery| AST[SQLStatement / ParsedExpression / TableRef]
AST -->|Binder::Bind| Bound[BoundStatement: bound Expression + LogicalOperator]
Bound -->|Planner| Logical[Logical plan]
Logical -->|Optimizer::Optimize| Optimized[Optimized logical plan]
Optimized -->|PhysicalPlanGenerator| Physical[PhysicalOperator tree]
Physical -->|Executor::Initialize| Pipelines[Pipeline DAG]
Pipelines -->|TaskScheduler| Results[DataChunk results]

Entry points to read in order:
| Stage | File | What it does |
|---|---|---|
| 1. Parse | src/parser/parser.cpp (Parser::ParseQuery) | Convert SQL → SQLStatement AST using a PEG grammar. |
| 2. Bind | src/planner/binder.cpp, src/planner/expression_binder.cpp | Resolve names against the catalog, infer types, produce bound expressions. |
| 3. Plan | src/planner/planner.cpp (Planner::CreatePlan) | Build a LogicalOperator tree from bound statements. |
| 4. Optimize | src/optimizer/optimizer.cpp (Optimizer::Optimize) | Apply rewrite rules, predicate pushdown, join ordering, statistics propagation. |
| 5. Physical | src/execution/physical_plan_generator.cpp | Lower logical operators to PhysicalOperator. |
| 6. Execute | src/parallel/executor.cpp, pipeline_executor.cpp, task_scheduler.cpp | Build pipelines, schedule tasks, return results. |
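
The stage boundaries can be illustrated with a toy model. The following is a minimal Python sketch, not DuckDB code: every name in it (`parse`, `bind`, `plan`, `optimize`, `execute`, `CATALOG`, `DATA`) is hypothetical, it accepts a single query shape, and it collapses stages 5 and 6 into one interpreter. It only aims to show that each stage consumes the previous stage's output and that name resolution happens before planning.

```python
import re

# Hypothetical toy catalog and columnar data; not DuckDB's catalog or storage.
CATALOG = {"t": {"a": int, "b": int}}
DATA = {"t": {"a": [1, 5, 3, 8], "b": [10, 20, 30, 40]}}

def parse(sql):
    """Stage 1: text -> unresolved AST (names are still plain strings)."""
    m = re.fullmatch(r"SELECT (\w+) FROM (\w+) WHERE (\w+) > (.+)", sql.strip())
    if m is None:
        raise ValueError("unsupported query")
    col, table, fcol, rhs = m.groups()
    return {"select": col, "from": table, "where": (fcol, ">", rhs)}

def bind(ast):
    """Stage 2: resolve names against the catalog and check that they exist."""
    columns = CATALOG[ast["from"]]  # KeyError if the table is unknown
    for name in (ast["select"], ast["where"][0]):
        if name not in columns:
            raise KeyError(f"unknown column {name}")
    return ast

def plan(bound):
    """Stage 3: bound statement -> logical operator tree (nested tuples)."""
    scan = ("scan", bound["from"])
    filt = ("filter", bound["where"], scan)
    return ("project", bound["select"], filt)

def fold(expr):
    """Constant-fold an 'x + y' right-hand side into a single integer."""
    col, op, rhs = expr
    return (col, op, sum(int(part) for part in rhs.split("+")))

def optimize(node):
    """Stage 4: rewrite the logical plan; this toy applies only constant folding."""
    if node[0] == "filter":
        return ("filter", fold(node[1]), optimize(node[2]))
    if node[0] == "project":
        return ("project", node[1], optimize(node[2]))
    return node

def execute(node):
    """Stages 5-6 collapsed: interpret the plan directly over columnar data."""
    kind = node[0]
    if kind == "scan":
        return DATA[node[1]]
    if kind == "filter":
        cols = execute(node[2])
        col, _, threshold = node[1]
        keep = [i for i, v in enumerate(cols[col]) if v > threshold]
        return {name: [vals[i] for i in keep] for name, vals in cols.items()}
    if kind == "project":
        return {node[1]: execute(node[2])[node[1]]}
    raise ValueError(kind)

def run(sql):
    return execute(optimize(plan(bind(parse(sql)))))
```

Calling `run("SELECT b FROM t WHERE a > 1 + 2")` walks all stages in order; the optimizer folds `1 + 2` before any data is touched.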
Where things live
src/
├── parser/ SQL → AST (PEG grammar in parser/peg/)
├── planner/ AST → bound logical plan
├── optimizer/ Logical-plan rewrites and join ordering
├── execution/ Logical → physical, vectorized operators
├── parallel/ Pipelines, events, task scheduler, executor
├── function/ Scalar/aggregate/table/window/pragma functions
├── catalog/ Schemas, tables, functions, dependencies
├── transaction/ MVCC transaction manager, undo buffers
├── storage/ Block manager, buffer manager, WAL, checkpointing
├── main/ DatabaseInstance, ClientContext, connections, C API
├── common/ Vector, DataChunk, Value, types, file system, allocators
├── logging/ Structured logging primitives
└── include/duckdb/ All public headers (mirrors the source tree)

Extensions live alongside the engine in extension/ and link in either statically (in-tree) or dynamically. See extensions.
Vectorized execution
DuckDB processes data in vectors: typed columnar buffers with a fixed maximum size (STANDARD_VECTOR_SIZE, currently 2048). A DataChunk is a set of vectors, one per output column, that share a single row count. Operators consume one chunk and emit one chunk at a time, so the engine never allocates memory one row at a time.
Key types:
| Type | File | Role |
|---|---|---|
| Vector | src/include/duckdb/common/types/vector.hpp | A columnar buffer plus type, validity mask, and optional dictionary/sequence/constant compression. |
| DataChunk | src/include/duckdb/common/types/data_chunk.hpp | A set of Vector objects with a shared cardinality. |
| Value | src/include/duckdb/common/types/value.hpp | A heap-allocated single value, used at the SQL-frontend boundary. |
| LogicalType | src/include/duckdb/common/types.hpp | Type metadata (id, width, child types). |
| ExpressionExecutor | src/execution/expression_executor.cpp | Evaluates a tree of bound Expression against a DataChunk. |
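
A toy model makes the chunk-at-a-time idea concrete. The Python sketch below is an illustration only: the `Vector` and `DataChunk` classes here are hypothetical stand-ins that borrow the names from the text, and `chunks` and `add_columns` are invented helpers. It shows columns being cut into chunks of at most STANDARD_VECTOR_SIZE rows, and one operator call processing a whole chunk while propagating NULLs through a validity mask.

```python
STANDARD_VECTOR_SIZE = 2048  # value quoted in the text above

class Vector:
    """Toy column buffer: values plus a validity mask (False marks NULL)."""
    def __init__(self, values, validity=None):
        self.values = list(values)
        if validity is None:
            validity = [True] * len(self.values)
        self.validity = list(validity)

class DataChunk:
    """Toy chunk: several vectors that share one cardinality."""
    def __init__(self, vectors):
        sizes = {len(v.values) for v in vectors}
        if len(sizes) > 1:
            raise ValueError("all vectors in a chunk must have the same length")
        self.vectors = vectors
        self.cardinality = sizes.pop() if sizes else 0
        if self.cardinality > STANDARD_VECTOR_SIZE:
            raise ValueError("chunk exceeds STANDARD_VECTOR_SIZE")

def chunks(columns, size=STANDARD_VECTOR_SIZE):
    """Split full columns into DataChunks of at most `size` rows each."""
    row_count = len(columns[0])
    for start in range(0, row_count, size):
        yield DataChunk([Vector(col[start:start + size]) for col in columns])

def add_columns(chunk, left, right):
    """One call handles a whole chunk; NULL in either input yields NULL."""
    a, b = chunk.vectors[left], chunk.vectors[right]
    valid = [x and y for x, y in zip(a.validity, b.validity)]
    # Invalid slots get a placeholder 0; readers must consult the mask first.
    values = [x + y if ok else 0 for x, y, ok in zip(a.values, b.values, valid)]
    return Vector(values, valid)
```

With 5000 input rows, `chunks` yields three chunks (2048, 2048, and 904 rows), so `add_columns` runs three times rather than 5000.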
Push-based pipelines
After physical-plan generation, the executor splits the operator tree into pipelines. A pipeline starts at a source operator (e.g., a table scan), pushes vectors through a chain of OperatorResult-returning operators (filters, projections), and ends at a sink operator (e.g., hash join build, hash aggregate). Sinks are blocking — downstream pipelines start only when their dependency sinks finish.
graph LR
Src[Source: Scan] -->|GetData chunk| Op1[Filter]
Op1 -->|Execute chunk| Op2[Projection]
Op2 -->|Sink chunk| HJ[HashJoin Build]
HJ -.->|finish| Probe[HashJoin Probe pipeline]

Pipeline orchestration lives in src/parallel/:
- meta_pipeline.cpp: groups pipelines that share an executor/sink.
- pipeline.cpp: a single linear pipeline.
- pipeline_executor.cpp: drives a single pipeline to completion in a worker thread.
- task_scheduler.cpp: fixed-size worker pool that pulls tasks from queues.
- executor.cpp: top-level Executor that owns the pipeline DAG and result collection.
See systems/parallel.
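
The source, operator, and sink roles described in this section can be sketched in a few lines of Python. This is a single-threaded toy, not the engine's interface: the class names (`Scan`, `Filter`, `Projection`, `HashBuild`) and the functions `run_pipeline` and `probe` are hypothetical, chunks are plain lists, and there is no scheduler. It is meant to show the push loop and why a dependent pipeline has to wait for a blocking sink.

```python
class Scan:
    """Source: hands out chunks (plain lists here) until the input is exhausted."""
    def __init__(self, rows, chunk_size):
        self.rows, self.chunk_size, self.pos = rows, chunk_size, 0

    def get_data(self):
        chunk = self.rows[self.pos:self.pos + self.chunk_size]
        self.pos += len(chunk)
        return chunk  # an empty list signals end of input

class Filter:
    def __init__(self, predicate):
        self.predicate = predicate

    def execute(self, chunk):
        return [v for v in chunk if self.predicate(v)]

class Projection:
    def __init__(self, fn):
        self.fn = fn

    def execute(self, chunk):
        return [self.fn(v) for v in chunk]

class HashBuild:
    """Sink: accumulates state that is only usable after finalize()."""
    def __init__(self):
        self.table, self.finished = {}, False

    def sink(self, chunk):
        for key in chunk:
            self.table[key] = self.table.get(key, 0) + 1

    def finalize(self):
        self.finished = True

def run_pipeline(source, operators, sink):
    """Fetch a chunk from the source, push it through each operator, then sink it."""
    while True:
        chunk = source.get_data()
        if not chunk:
            break
        for op in operators:
            chunk = op.execute(chunk)
        if chunk:  # a filter may have removed every row
            sink.sink(chunk)
    sink.finalize()

def probe(build, keys):
    """Dependent pipeline: refuses to run before the build sink has finished."""
    if not build.finished:
        raise RuntimeError("build side not finished")
    return [k for k in keys if k in build.table]
```

A usage example: build a table from the even numbers below 10 scaled by 10, then probe it.

```python
build = HashBuild()
run_pipeline(Scan(list(range(10)), 4),
             [Filter(lambda v: v % 2 == 0), Projection(lambda v: v * 10)],
             build)
matches = probe(build, [20, 30, 80])  # keys 20 and 80 are present
```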
Storage layout
A DuckDB database is a single file. The file is divided into fixed-size blocks (default 256 KB) managed by src/storage/single_file_block_manager.cpp. The BufferManager (src/storage/standard_buffer_manager.cpp) keeps blocks in a memory budget and spills to a temporary file when over budget. Tables are stored as a sequence of row groups (src/storage/table/), each containing column chunks compressed with one of the codecs in src/storage/compression/.
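
The memory-budget behaviour can be pictured with a small buffer pool. The Python sketch below is a toy under stated assumptions: `ToyBufferPool` and its `pin` method are hypothetical names, the least-recently-used eviction policy is an assumption chosen for simplicity (the text does not say which policy the engine uses), and a dictionary stands in for the temporary file. Only the 256 KB block size comes from the text.

```python
from collections import OrderedDict

BLOCK_SIZE = 256 * 1024  # default block size quoted above, in bytes

class ToyBufferPool:
    """Keeps a bounded number of blocks in memory and spills the oldest ones."""
    def __init__(self, budget_bytes):
        self.capacity = budget_bytes // BLOCK_SIZE  # whole blocks that fit
        if self.capacity < 1:
            raise ValueError("budget must hold at least one block")
        self.resident = OrderedDict()  # block_id -> payload, least recent first
        self.spilled = {}              # stands in for the temporary file

    def pin(self, block_id, loader):
        """Return the block, loading or un-spilling it if it is not resident."""
        if block_id in self.resident:
            self.resident.move_to_end(block_id)  # mark as most recently used
            return self.resident[block_id]
        data = self.spilled.pop(block_id, None)
        if data is None:
            data = loader(block_id)  # first access: read from the database file
        while len(self.resident) >= self.capacity:
            victim, payload = self.resident.popitem(last=False)
            self.spilled[victim] = payload  # over budget: spill the oldest block
        self.resident[block_id] = data
        return data
```

With a budget of two blocks, pinning blocks 1, 2, 1, 3 in that order spills block 2, because block 1 was touched more recently.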
Durability is provided by the WAL (src/storage/write_ahead_log.cpp, wal_replay.cpp) and periodic checkpoints (src/storage/checkpoint_manager.cpp). Multi-version concurrency control lives in src/transaction/duck_transaction_manager.cpp with undo buffers in src/transaction/undo_buffer.cpp.
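
The interplay between a write-ahead log and checkpoints follows a general pattern that a short Python sketch can show. This is the generic technique, not DuckDB's record format or recovery code: `ToyWAL`, `replay`, and `checkpoint` are hypothetical names, records are JSON strings in a list, and the database state is a plain dictionary.

```python
import json

class ToyWAL:
    """Append-only log of JSON records; a list stands in for the WAL file."""
    def __init__(self):
        self.records = []

    def write(self, txn_id, key, value):
        record = {"txn": txn_id, "op": "set", "key": key, "value": value}
        self.records.append(json.dumps(record))

    def commit(self, txn_id):
        self.records.append(json.dumps({"txn": txn_id, "op": "commit"}))

def replay(checkpointed_state, wal):
    """Recovery: start from the last checkpoint, re-apply committed writes in log order."""
    entries = [json.loads(line) for line in wal.records]
    committed = {e["txn"] for e in entries if e["op"] == "commit"}
    state = dict(checkpointed_state)
    for e in entries:
        if e["op"] == "set" and e["txn"] in committed:
            state[e["key"]] = e["value"]
    return state

def checkpoint(checkpointed_state, wal):
    """Fold the log into a new checkpoint image, then truncate the log.

    Records of uncommitted transactions are discarded along with the log.
    """
    new_state = replay(checkpointed_state, wal)
    wal.records.clear()
    return new_state
```

Writes from a transaction that never logged a commit record are ignored on replay, which is what lets recovery reconstruct only the durable state.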
See systems/storage and systems/transaction.
Embedding model
DuckDB is in-process. There is no server process. Clients call into a DatabaseInstance (src/main/database.cpp), open a Connection (src/main/connection.cpp), and run queries through a ClientContext (src/main/client_context.cpp). The C API in src/main/capi/ exposes this to non-C++ callers; language bindings (Python, R, Node, Java, Wasm) live in separate repositories that link against this engine.
See systems/main for the embedding surface.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.