duckdb/duckdb

Architecture

DuckDB is an in-process analytical database. A query goes through a six-stage pipeline (parse → bind → plan → optimize → physical-plan → execute), then runs as a graph of vectorized pipelines on a fixed-size thread pool. This page maps that pipeline to the directories under src/.

Query lifecycle

graph TD
    SQL[SQL string] -->|Parser::ParseQuery| AST[SQLStatement / ParsedExpression / TableRef]
    AST -->|Binder::Bind| Bound[BoundStatement: bound Expression + LogicalOperator]
    Bound -->|Planner| Logical[Logical plan]
    Logical -->|Optimizer::Optimize| Optimized[Optimized logical plan]
    Optimized -->|PhysicalPlanGenerator| Physical[PhysicalOperator tree]
    Physical -->|Executor::Initialize| Pipelines[Pipeline DAG]
    Pipelines -->|TaskScheduler| Results[DataChunk results]

Entry points to read in order:

Stage	File	What it does
1. Parse	`src/parser/parser.cpp` (`Parser::ParseQuery`)	Convert SQL → `SQLStatement` AST using a PEG grammar.
2. Bind	`src/planner/binder.cpp`, `src/planner/expression_binder.cpp`	Resolve names against the catalog, infer types, produce bound expressions.
3. Plan	`src/planner/planner.cpp` (`Planner::CreatePlan`)	Build a `LogicalOperator` tree from bound statements.
4. Optimize	`src/optimizer/optimizer.cpp` (`Optimizer::Optimize`)	Apply rewrite rules, predicate pushdown, join ordering, statistics propagation.
5. Physical	`src/execution/physical_plan_generator.cpp`	Lower logical operators to `PhysicalOperator`.
6. Execute	`src/parallel/executor.cpp`, `pipeline_executor.cpp`, `task_scheduler.cpp`	Build pipelines, schedule tasks, return results.
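
The stage order above can be read as a chain of functions in which each stage consumes the previous stage's output type. The following is a minimal sketch of that idea only: every type and function name in it (`Ast`, `BoundAst`, `parse_sql`, `compile_query`, and so on) is a hypothetical stand-in, not DuckDB's API, and the "optimizer" is a single hard-coded rewrite. It stops before execution.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical stand-ins for the artifact each stage hands to the next.
struct Ast { std::string sql; };
struct BoundAst { std::string sql; bool names_resolved; };
struct LogicalPlan { std::vector<std::string> ops; };
struct PhysicalPlan { std::vector<std::string> ops; };

Ast parse_sql(const std::string &sql) {
    assert(!sql.empty());
    return Ast{sql};
}

BoundAst bind_names(const Ast &ast) {
    // A real binder would resolve names against a catalog; the toy only flags success.
    return BoundAst{ast.sql, true};
}

LogicalPlan make_logical_plan(const BoundAst &bound) {
    assert(bound.names_resolved);
    LogicalPlan plan;
    plan.ops.push_back("FILTER");
    plan.ops.push_back("GET");
    return plan;
}

// Toy rewrite standing in for predicate pushdown: fold the filter into the scan.
LogicalPlan optimize_plan(LogicalPlan plan) {
    if (plan.ops.size() == 2 && plan.ops[0] == "FILTER" && plan.ops[1] == "GET") {
        plan.ops.clear();
        plan.ops.push_back("GET[filter pushed down]");
    }
    return plan;
}

PhysicalPlan make_physical_plan(const LogicalPlan &plan) {
    PhysicalPlan physical;
    for (const auto &op : plan.ops) {
        physical.ops.push_back("PHYSICAL_" + op);
    }
    return physical;
}

// Chains the stages; each call only compiles with the previous stage's output type.
PhysicalPlan compile_query(const std::string &sql) {
    return make_physical_plan(optimize_plan(make_logical_plan(bind_names(parse_sql(sql)))));
}

} // namespace sketch
```

Giving each stage its own output type means a stage cannot be skipped or reordered without a compile error, which is one reason to keep the intermediate representations distinct.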

Where things live

src/
├── parser/        SQL → AST (PEG grammar in parser/peg/)
├── planner/       AST → bound logical plan
├── optimizer/     Logical-plan rewrites and join ordering
├── execution/     Logical → physical, vectorized operators
├── parallel/      Pipelines, events, task scheduler, executor
├── function/      Scalar/aggregate/table/window/pragma functions
├── catalog/       Schemas, tables, functions, dependencies
├── transaction/   MVCC transaction manager, undo buffers
├── storage/       Block manager, buffer manager, WAL, checkpointing
├── main/          DatabaseInstance, ClientContext, connections, C API
├── common/        Vector, DataChunk, Value, types, file system, allocators
├── logging/       Structured logging primitives
└── include/duckdb/  All public headers (mirrors the source tree)

Extensions live alongside the engine in extension/ and link in either statically (in-tree) or dynamically. See extensions.

Vectorized execution

DuckDB processes data in vectors: typed columnar buffers with a fixed maximum size (STANDARD_VECTOR_SIZE, currently 2048). A DataChunk is a set of vectors, one per output column, that share a single row count. Operators consume and emit one chunk at a time rather than one row at a time, which amortizes per-call overhead across the whole batch.
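
To make the vector and chunk idea concrete, here is a minimal sketch of a column buffer with a validity mask and a batch-at-a-time addition. `IntVector`, `Chunk`, `make_chunk`, `add_columns`, and `kVectorSize` are hypothetical names chosen for illustration. The real `Vector` and `DataChunk` classes are considerably richer (multiple physical types, compressed representations), so treat this as a toy model.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

namespace sketch {

// Assumed to match the value quoted in the text; not read from DuckDB itself.
constexpr std::size_t kVectorSize = 2048;

// Toy typed column buffer with a validity mask (true means non-NULL).
struct IntVector {
    std::vector<int32_t> data;
    std::vector<bool> valid;
};

// Toy chunk: several column vectors that share one row count.
struct Chunk {
    std::vector<IntVector> columns;
    std::size_t count = 0;
};

Chunk make_chunk(std::vector<IntVector> cols) {
    Chunk chunk;
    chunk.count = cols.empty() ? 0 : cols[0].data.size();
    assert(chunk.count <= kVectorSize);
    for (const auto &col : cols) {
        // Every column must agree on the shared cardinality.
        assert(col.data.size() == chunk.count && col.valid.size() == chunk.count);
    }
    chunk.columns = std::move(cols);
    return chunk;
}

// Adds two columns in one loop over the batch; NULL in either input yields NULL.
IntVector add_columns(const Chunk &chunk, std::size_t lhs, std::size_t rhs) {
    const IntVector &a = chunk.columns[lhs];
    const IntVector &b = chunk.columns[rhs];
    IntVector out;
    out.data.resize(chunk.count);
    out.valid.resize(chunk.count);
    for (std::size_t i = 0; i < chunk.count; i++) {
        out.valid[i] = a.valid[i] && b.valid[i];
        out.data[i] = out.valid[i] ? a.data[i] + b.data[i] : 0;
    }
    return out;
}

} // namespace sketch
```

The function is entered once per batch and the loop body carries no per-row dispatch, which is the property vectorized engines rely on.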

Key types:

Type	File	Role
`Vector`	`src/include/duckdb/common/types/vector.hpp`	A columnar buffer plus type, validity mask, and optional dictionary/sequence/constant compression.
`DataChunk`	`src/include/duckdb/common/types/data_chunk.hpp`	A set of `Vector` objects, one per column, with a shared cardinality.
`Value`	`src/include/duckdb/common/types/value.hpp`	A single dynamically typed value, used at the SQL-frontend boundary.
`LogicalType`	`src/include/duckdb/common/types.hpp`	Type metadata (id, width, child types).
`ExpressionExecutor`	`src/execution/expression_executor.cpp`	Evaluates a tree of bound `Expression` against a `DataChunk`.
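
As an illustration of evaluating an expression tree against a chunk, the sketch below walks a small tree in which every node produces a whole column per call. `Expr`, `ColumnRef`, `Constant`, and `Multiply` are hypothetical classes, and columns are plain `std::vector<double>`; none of this mirrors the actual `ExpressionExecutor` interface.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

namespace sketch {

using Column = std::vector<double>;
using Chunk = std::vector<Column>; // one Column per input column

struct Expr {
    virtual ~Expr() = default;
    // Evaluates the expression for every row of the chunk in one call.
    virtual Column eval(const Chunk &chunk, std::size_t rows) const = 0;
};

struct ColumnRef : Expr {
    std::size_t index;
    explicit ColumnRef(std::size_t i) : index(i) {}
    Column eval(const Chunk &chunk, std::size_t rows) const override {
        assert(index < chunk.size() && chunk[index].size() == rows);
        return chunk[index];
    }
};

struct Constant : Expr {
    double value;
    explicit Constant(double v) : value(v) {}
    Column eval(const Chunk &, std::size_t rows) const override {
        return Column(rows, value);
    }
};

struct Multiply : Expr {
    std::unique_ptr<Expr> lhs;
    std::unique_ptr<Expr> rhs;
    Multiply(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    Column eval(const Chunk &chunk, std::size_t rows) const override {
        Column a = lhs->eval(chunk, rows);
        Column b = rhs->eval(chunk, rows);
        for (std::size_t i = 0; i < rows; i++) {
            a[i] *= b[i];
        }
        return a;
    }
};

} // namespace sketch
```

The virtual dispatch happens once per node per chunk, not once per row, so the cost of interpreting the tree is spread across the batch.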

Push-based pipelines

After physical-plan generation, the executor splits the operator tree into pipelines. A pipeline starts at a source operator (e.g., a table scan), pushes chunks through a chain of intermediate, OperatorResult-returning operators (filters, projections), and ends at a sink operator (e.g., hash join build, hash aggregate). Sinks are blocking: a downstream pipeline starts only after the sinks it depends on have finished.

graph LR
    Src[Source: Scan] -->|GetData chunk| Op1[Filter]
    Op1 -->|Execute chunk| Op2[Projection]
    Op2 -->|Sink chunk| HJ[HashJoin Build]
    HJ -.->|finish| Probe[HashJoin Probe pipeline]
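
The source, operator, and sink roles can be sketched as a small driver loop. Everything named here (`ScanSource`, `Operator`, `SumSink`, `run_pipeline`, `keep_even`, `times`) is hypothetical and single-threaded; it shows the push-based shape of one pipeline, not DuckDB's operator interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

namespace sketch {

using Chunk = std::vector<int>; // toy single-column chunk

// Source: hands out the input in fixed-size chunks until exhausted.
struct ScanSource {
    std::vector<int> rows;
    std::size_t chunk_size;
    std::size_t pos;
    ScanSource(std::vector<int> input, std::size_t size)
        : rows(std::move(input)), chunk_size(size), pos(0) {
        assert(chunk_size > 0);
    }
    // Fills `out` with the next chunk; returns false once no data is left.
    bool get_data(Chunk &out) {
        out.clear();
        while (pos < rows.size() && out.size() < chunk_size) {
            out.push_back(rows[pos++]);
        }
        return !out.empty();
    }
};

// Streaming operator: maps one input chunk to one output chunk.
using Operator = std::function<Chunk(const Chunk &)>;

Operator keep_even() {
    return [](const Chunk &in) {
        Chunk out;
        for (int v : in) {
            if (v % 2 == 0) out.push_back(v);
        }
        return out;
    };
}

Operator times(int factor) {
    return [factor](const Chunk &in) {
        Chunk out;
        for (int v : in) out.push_back(v * factor);
        return out;
    };
}

// Sink: accumulates across chunks; its result is complete only after finalize().
struct SumSink {
    long long total = 0;
    bool finished = false;
    void sink(const Chunk &chunk) {
        for (int v : chunk) total += v;
    }
    void finalize() { finished = true; }
};

// Drives one pipeline to completion: source, then operators in order, then sink.
void run_pipeline(ScanSource &source, const std::vector<Operator> &ops, SumSink &sink) {
    Chunk chunk;
    while (source.get_data(chunk)) {
        Chunk current = chunk;
        for (const auto &op : ops) current = op(current);
        sink.sink(current);
    }
    sink.finalize();
}

} // namespace sketch
```

A pipeline that reads from this sink would check `finished` before starting, which is the blocking behavior the text describes.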

Pipeline orchestration lives in src/parallel/:

meta_pipeline.cpp — groups pipelines that share an executor/sink.
pipeline.cpp — single linear pipeline.
pipeline_executor.cpp — drives a single pipeline to completion in a worker thread.
task_scheduler.cpp — fixed-size worker pool, pulls tasks from queues.
executor.cpp — top-level Executor that owns the pipeline DAG and result collection.
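
The dependency rule between pipelines amounts to running a DAG in topological order. The sketch below uses Kahn's algorithm on a single thread to compute one valid order; the function name `schedule` and the adjacency-list input are assumptions made for illustration. The engine itself dispatches ready work to worker threads, which this toy does not model.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

namespace sketch {

// deps[i] lists the pipelines whose sinks must finish before pipeline i may start.
// Returns one valid execution order, or an empty vector if the graph has a cycle.
std::vector<std::size_t> schedule(const std::vector<std::vector<std::size_t>> &deps) {
    const std::size_t n = deps.size();
    std::vector<std::size_t> pending(n, 0);            // unfinished dependencies per pipeline
    std::vector<std::vector<std::size_t>> dependents(n); // reverse edges

    for (std::size_t i = 0; i < n; i++) {
        pending[i] = deps[i].size();
        for (std::size_t d : deps[i]) {
            assert(d < n);
            dependents[d].push_back(i);
        }
    }

    std::queue<std::size_t> ready;
    for (std::size_t i = 0; i < n; i++) {
        if (pending[i] == 0) ready.push(i);
    }

    std::vector<std::size_t> order;
    while (!ready.empty()) {
        std::size_t finished = ready.front();
        ready.pop();
        order.push_back(finished);
        // Finishing a pipeline may unblock the pipelines that depend on it.
        for (std::size_t next : dependents[finished]) {
            if (--pending[next] == 0) ready.push(next);
        }
    }

    if (order.size() != n) order.clear(); // a cycle left some pipelines blocked
    return order;
}

} // namespace sketch
```

For a hash join, pipeline 0 would be the build side and pipeline 1 the probe side with `deps[1] = {0}`. Any pipelines that sit in the ready queue at the same time are independent and could run in parallel.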

See systems/parallel.

Storage layout

A DuckDB database is a single file, divided into fixed-size blocks (default 256 KB) managed by src/storage/single_file_block_manager.cpp. The BufferManager (src/storage/standard_buffer_manager.cpp) keeps loaded blocks within a memory budget and spills to a temporary file when that budget is exceeded. Tables are stored as a sequence of row groups (src/storage/table/), each containing column chunks compressed with one of the codecs in src/storage/compression/.
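
A rough model of fixed-size blocks under a memory budget is sketched below. It assumes blocks are laid out back to back after a header region whose size is a made-up parameter, and it uses plain least-recently-used eviction as a stand-in policy. `BufferPool`, `pin`, and `block_offset` are hypothetical names, and the actual buffer manager's eviction logic is not described in this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>
#include <vector>

namespace sketch {

constexpr std::size_t kBlockSize = 256 * 1024; // default block size quoted in the text

// Byte offset of a block, assuming contiguous blocks after `header_bytes` (hypothetical layout).
inline std::uint64_t block_offset(std::uint64_t block_id, std::uint64_t header_bytes) {
    return header_bytes + block_id * kBlockSize;
}

class BufferPool {
public:
    explicit BufferPool(std::size_t budget_bytes) : capacity_(budget_bytes / kBlockSize) {
        assert(capacity_ > 0); // the budget must hold at least one block
    }

    // Touches a block. Returns true on a hit; on a miss the block is loaded,
    // evicting the least recently used block if the budget is already full.
    bool pin(std::uint64_t block_id) {
        auto it = index_.find(block_id);
        if (it != index_.end()) {
            lru_.splice(lru_.begin(), lru_, it->second); // move to front
            return true;
        }
        if (lru_.size() == capacity_) {
            std::uint64_t victim = lru_.back();
            evicted_.push_back(victim); // stand-in for unloading or spilling
            index_.erase(victim);
            lru_.pop_back();
        }
        lru_.push_front(block_id);
        index_[block_id] = lru_.begin();
        return false;
    }

    std::size_t resident() const { return lru_.size(); }
    const std::vector<std::uint64_t> &evicted() const { return evicted_; }

private:
    std::size_t capacity_;                 // number of blocks that fit in the budget
    std::list<std::uint64_t> lru_;         // most recently used block at the front
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> index_;
    std::vector<std::uint64_t> evicted_;
};

} // namespace sketch
```

With a two-block budget, touching blocks 1, 2, 1, 3 evicts block 2, because block 1 was used more recently.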

Durability is provided by the WAL (src/storage/write_ahead_log.cpp, wal_replay.cpp) and periodic checkpoints (src/storage/checkpoint_manager.cpp). Multi-version concurrency control lives in src/transaction/duck_transaction_manager.cpp with undo buffers in src/transaction/undo_buffer.cpp.
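
The interplay of a write-ahead log and checkpoints can be shown with a toy key-value store: writes are logged before they are applied, a checkpoint persists the current state and truncates the log, and recovery reloads the checkpoint and then replays the log. `ToyDatabase` and its members are hypothetical, everything lives in memory, and MVCC is left out entirely.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

using State = std::map<std::string, int>;

struct ToyDatabase {
    State checkpointed;                            // stands in for the database file
    std::vector<std::pair<std::string, int>> wal;  // committed writes since the last checkpoint
    State live;                                    // in-memory state, lost on a crash

    void commit(const std::string &key, int value) {
        assert(!key.empty());
        wal.emplace_back(key, value); // log first
        live[key] = value;            // then apply
    }

    // Persists the current state and truncates the log.
    void checkpoint() {
        checkpointed = live;
        wal.clear();
    }

    // Simulates a restart: rebuild memory from the checkpoint, then replay the log in order.
    void crash_and_recover() {
        live = checkpointed;
        for (const auto &entry : wal) {
            live[entry.first] = entry.second;
        }
    }
};

} // namespace sketch
```

Replaying in log order matters: a later write to the same key must overwrite an earlier one, so the recovered state matches what was committed before the crash.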

See systems/storage and systems/transaction.

Embedding model

DuckDB is in-process. There is no server process. Clients call into a DatabaseInstance (src/main/database.cpp), open a Connection (src/main/connection.cpp), and run queries through a ClientContext (src/main/client_context.cpp). The C API in src/main/capi/ exposes this to non-C++ callers; language bindings (Python, R, Node, Java, Wasm) live in separate repositories that link against this engine.

See systems/main for the embedding surface.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.