Vectorized execution

The single most important architectural decision in DuckDB is that everything is a chunk of vectors. This page traces what that means in practice, from the data structures up through the operator interface, the expression executor, and parallel pipelines.

The unit of work: DataChunk

A DataChunk (src/include/duckdb/common/types/data_chunk.hpp) is a set of column Vectors sharing a single cardinality (number of rows). Operators pass chunks to each other; the engine never allocates rows individually in hot paths.

The default cardinality cap is STANDARD_VECTOR_SIZE, currently 2048 rows. This is small enough to fit comfortably in L2 cache and large enough to amortize per-call dispatch cost.
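To make the "columns sharing one cardinality" idea concrete, here is a minimal sketch. ToyVector, ToyChunk, kToyVectorSize and FillChunk are hypothetical stand-ins invented for illustration; they are not DuckDB's real classes, and the real DataChunk handles types, validity and encodings that this sketch ignores.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy types, not DuckDB's API.
constexpr std::size_t kToyVectorSize = 2048; // mirrors STANDARD_VECTOR_SIZE

struct ToyVector {
    std::vector<int64_t> data; // one slot per row
};

struct ToyChunk {
    std::vector<ToyVector> columns; // one ToyVector per column
    std::size_t count = 0;          // cardinality shared by every column

    explicit ToyChunk(std::size_t column_count) : columns(column_count) {
        // Allocate all buffers once so the hot path never allocates per row.
        for (auto &col : columns) {
            col.data.resize(kToyVectorSize);
        }
    }

    void SetCardinality(std::size_t n) {
        assert(n <= kToyVectorSize);
        count = n;
    }
};

// Fills the chunk with up to kToyVectorSize rows and returns how many rows
// were written. The same buffers are overwritten on every call.
std::size_t FillChunk(ToyChunk &chunk, int64_t offset, std::size_t remaining) {
    std::size_t n = remaining < kToyVectorSize ? remaining : kToyVectorSize;
    for (std::size_t c = 0; c < chunk.columns.size(); c++) {
        for (std::size_t i = 0; i < n; i++) {
            chunk.columns[c].data[i] =
                offset + static_cast<int64_t>(i) + static_cast<int64_t>(c);
        }
    }
    chunk.SetCardinality(n);
    return n;
}
```

Calling FillChunk repeatedly on one ToyChunk processes a large input in batches of at most 2048 rows while reusing the same column buffers.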

graph LR
    Source[Source op] -->|DataChunk N rows| Filter[Filter]
    Filter -->|DataChunk M <= N rows| Project[Projection]
    Project -->|DataChunk M rows| Sink[Sink op]

The columnar buffer: Vector

A Vector (src/include/duckdb/common/types/vector.hpp) carries:

  • A LogicalType (id + width + child types).
  • A buffer of values of the appropriate physical width.
  • A ValidityMask of NULL bits.
  • An optional auxiliary buffer (for variable-length types like strings, lists, structs).
  • A VectorType indicating the encoding.

Encodings

Encoding | When it is used
FLAT_VECTOR | One value per slot; the unconditional default.
CONSTANT_VECTOR | All N rows have the same value (e.g., a literal in a projection). One stored value, replicated logically.
DICTIONARY_VECTOR | Index buffer + child vector with the unique values. Skips repeated work when many rows share values.
SEQUENCE_VECTOR | Two scalars represent the whole vector as start + i * step. Used for things like range().
FSST_VECTOR | Compressed strings sharing a symbol table.

Most executor code paths can short-circuit on CONSTANT_VECTOR and compute a single result instead of touching all N rows. Vector::Flatten converts any encoding to FLAT_VECTOR when an operator cannot handle the original encoding.
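The encodings and the flatten fallback can be sketched as follows. This is a simplified model under stated assumptions: ToyVector, ToyVectorType and AddInPlace are hypothetical names, only three of the encodings are modelled, and DuckDB's actual Vector::Flatten works on typed buffers rather than a std::vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy model of three encodings, not DuckDB's Vector class.
enum class ToyVectorType { FLAT, CONSTANT, SEQUENCE };

struct ToyVector {
    ToyVectorType type = ToyVectorType::FLAT;
    // FLAT: one value per row; CONSTANT: {value}; SEQUENCE: {start, step}.
    std::vector<int64_t> data;

    static ToyVector Constant(int64_t value) {
        ToyVector v;
        v.type = ToyVectorType::CONSTANT;
        v.data = {value};
        return v;
    }

    static ToyVector Sequence(int64_t start, int64_t step) {
        ToyVector v;
        v.type = ToyVectorType::SEQUENCE;
        v.data = {start, step};
        return v;
    }

    // Materializes `count` rows so callers can treat the vector as FLAT.
    void Flatten(std::size_t count) {
        if (type == ToyVectorType::FLAT) {
            return;
        }
        std::vector<int64_t> flat(count);
        for (std::size_t i = 0; i < count; i++) {
            flat[i] = type == ToyVectorType::CONSTANT
                          ? data[0]
                          : data[0] + static_cast<int64_t>(i) * data[1];
        }
        data.swap(flat);
        type = ToyVectorType::FLAT;
    }
};

// Adds `delta` to every logical row. CONSTANT and SEQUENCE inputs are
// updated in O(1); only FLAT input touches all `count` rows.
void AddInPlace(ToyVector &v, int64_t delta, std::size_t count) {
    switch (v.type) {
    case ToyVectorType::CONSTANT:
    case ToyVectorType::SEQUENCE:
        v.data[0] += delta; // shifting the value/start shifts every row
        break;
    case ToyVectorType::FLAT:
        for (std::size_t i = 0; i < count; i++) {
            v.data[i] += delta;
        }
        break;
    }
}
```

The design point is that an operation either understands an encoding and exploits it, or calls Flatten first and runs its generic per-row loop.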

UnifiedVectorFormat

When you must read across encodings without flattening, UnifiedVectorFormat (src/include/duckdb/common/types/vector.hpp) gives you:

  • data pointer
  • sel (a SelectionVector that maps each row to its position in data, e.g. for dictionary-encoded vectors)
  • validity mask

This is what BinaryExecutor and friends use under the hood.
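As an illustration of reading through a unified view, the sketch below sums the valid rows of a vector without caring whether it was flat or dictionary-encoded. ToyUnifiedFormat and SumValid are hypothetical; in particular, the byte-per-entry validity array and the "null pointer means identity" convention are simplifying assumptions, not DuckDB's actual layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy view, loosely modelled on the three fields listed above.
struct ToyUnifiedFormat {
    const int64_t *data = nullptr;  // underlying values
    const uint32_t *sel = nullptr;  // row -> index into data; nullptr = identity
    const uint8_t *valid = nullptr; // indexed like data; nullptr = all valid

    std::size_t Index(std::size_t row) const { return sel ? sel[row] : row; }
    bool IsValid(std::size_t idx) const { return !valid || valid[idx] != 0; }
};

// One loop serves every encoding: translate the row through sel, check
// validity at the translated index, then read the value.
int64_t SumValid(const ToyUnifiedFormat &fmt, std::size_t count) {
    int64_t sum = 0;
    for (std::size_t row = 0; row < count; row++) {
        std::size_t idx = fmt.Index(row);
        if (fmt.IsValid(idx)) {
            sum += fmt.data[idx];
        }
    }
    return sum;
}
```

A flat vector uses the identity mapping, a dictionary vector supplies its index buffer as sel, and a constant vector could be expressed with a sel that maps every row to index 0.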

The operator interface

Every PhysicalOperator (src/include/duckdb/execution/physical_operator.hpp) advertises one or more roles:

  • Source. Produces chunks via GetData. Has LocalSourceState and GlobalSourceState.
  • Operator. Transforms chunks via Execute.
  • Sink. Consumes chunks via Sink, has a Combine and Finalize step.

A pipeline is a chain that starts at a source, flows through zero or more intermediate operators, and ends at a sink.

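// Source role: fill `chunk` with the next batch of rows.
SourceResultType GetData(ExecutionContext &context, DataChunk &chunk,
                         OperatorSourceInput &input) const override;
// Operator role: transform `input` into the output `chunk`.
OperatorResultType Execute(ExecutionContext &context, DataChunk &input,
                           DataChunk &chunk, GlobalOperatorState &gstate,
                           OperatorState &state) const override;
// Sink role: consume `chunk` into the sink state carried by `input`.
SinkResultType Sink(ExecutionContext &context, DataChunk &chunk,
                    OperatorSinkInput &input) const override;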

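Operators return OperatorResultType::HAVE_MORE_OUTPUT when they still have output pending for the same input chunk (e.g., a join probe whose matches for one input chunk do not fit in a single output chunk). The pipeline then calls Execute again with the same input before fetching new data from the source.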
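The source, operator and sink loop can be sketched with a toy pipeline. Everything here is hypothetical and heavily simplified: ToyChunk is a single int column, kCapacity is shrunk to 4 rows, RepeatTwice is an invented operator that emits each row twice so that its output can overflow one chunk, and RunPipeline stands in for the real pipeline executor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy pipeline, not DuckDB's PhysicalOperator interface.
enum class ToyOperatorResult { NEED_MORE_INPUT, HAVE_MORE_OUTPUT };
constexpr std::size_t kCapacity = 4;  // tiny chunk size to force overflow
using ToyChunk = std::vector<int>;    // single column, at most kCapacity rows

// Emits every input row twice, so one input chunk may need several outputs.
struct RepeatTwice {
    std::size_t next = 0; // position in the doubled output of the current input

    ToyOperatorResult Execute(const ToyChunk &input, ToyChunk &out) {
        out.clear();
        std::size_t total = input.size() * 2;
        while (next < total && out.size() < kCapacity) {
            out.push_back(input[next / 2]);
            next++;
        }
        if (next < total) {
            return ToyOperatorResult::HAVE_MORE_OUTPUT; // same input again
        }
        next = 0;
        return ToyOperatorResult::NEED_MORE_INPUT;
    }
};

// Drives source -> operator -> sink one chunk at a time.
std::vector<int> RunPipeline(const std::vector<int> &table) {
    std::vector<int> sink;
    RepeatTwice op;
    ToyChunk input, output;
    std::size_t pos = 0;
    while (pos < table.size()) {
        // Source: read the next chunk of at most kCapacity rows.
        input.clear();
        while (pos < table.size() && input.size() < kCapacity) {
            input.push_back(table[pos++]);
        }
        ToyOperatorResult result;
        do {
            result = op.Execute(input, output);
            // Sink: consume whatever the operator produced.
            sink.insert(sink.end(), output.begin(), output.end());
        } while (result == ToyOperatorResult::HAVE_MORE_OUTPUT);
    }
    return sink;
}
```

The inner do/while is the important part: the driver keeps re-invoking the operator on the same input until it stops reporting HAVE_MORE_OUTPUT.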

The expression executor

ExpressionExecutor (src/execution/expression_executor.cpp) evaluates a vector of bound Expressions over a DataChunk and produces an output DataChunk. Each expression has per-thread scratch state; intermediate vectors are reused across chunks.

Per-class dispatch lives in src/execution/expression_executor/:

  • BoundFunctionExpression → execute_function.cpp
  • BoundCastExpression → execute_cast.cpp
  • BoundComparisonExpression → execute_comparison.cpp
  • BoundConjunctionExpression → execute_conjunction.cpp
  • BoundCaseExpression → execute_case.cpp

Templated executors

For scalar function authors, src/common/vector_operations/ provides:

  • UnaryExecutor::Execute<TA, TR>(in, out, count, kernel) — one input → one output.
  • BinaryExecutor::Execute<TA, TB, TR>(left, right, out, count, kernel) — two inputs → one output.
  • TernaryExecutor and GenericExecutor — three or more inputs.

These templates handle:

  • Constant/dictionary fast paths.
  • Validity propagation (NULL in any input → NULL output, unless your kernel says otherwise).
  • Per-row dispatch in flat mode.
  • Selection vectors for dictionary inputs.

Most of the scalar functions in extension/core_functions/scalar/ use these templates rather than handwriting their own loops.
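To show roughly what such a template does for the function author, here is a toy binary executor. ToyVector and ToyBinaryExecutor are hypothetical and fixed to int64_t; the real BinaryExecutor is templated on input and result types and also handles dictionary inputs and validity bitmasks, none of which is modelled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy vector: either one constant value or one value per row.
struct ToyVector {
    bool is_constant = false;
    std::vector<int64_t> data;  // 1 entry if constant, else one per row
    std::vector<uint8_t> valid; // same length as data; 0 marks NULL
};

struct ToyBinaryExecutor {
    // Applies `kernel` row by row, with a constant fast path and
    // NULL-in, NULL-out validity propagation.
    template <class FUNC>
    static void Execute(const ToyVector &left, const ToyVector &right,
                        ToyVector &out, std::size_t count, FUNC kernel) {
        if (left.is_constant && right.is_constant) {
            // Fast path: one kernel call covers all `count` rows.
            bool ok = left.valid[0] && right.valid[0];
            out.is_constant = true;
            out.valid = {static_cast<uint8_t>(ok)};
            out.data = {ok ? kernel(left.data[0], right.data[0]) : int64_t(0)};
            return;
        }
        out.is_constant = false;
        out.data.assign(count, 0);
        out.valid.assign(count, 0);
        for (std::size_t i = 0; i < count; i++) {
            std::size_t li = left.is_constant ? 0 : i;
            std::size_t ri = right.is_constant ? 0 : i;
            if (left.valid[li] && right.valid[ri]) {
                out.data[i] = kernel(left.data[li], right.data[ri]);
                out.valid[i] = 1;
            }
        }
    }
};
```

With this shape, a scalar function author only supplies the kernel (for example a lambda that adds two values) and never writes the loop, the NULL checks, or the constant special case.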

Aggregate execution

Aggregates plug into a five-method interface:

  • state_size() — bytes per group.
  • initialize(state) — set group state to identity.
  • update(state, chunk) — fold a chunk of inputs into the state.
  • combine(left, right) — merge two states.
  • finalize(state, output) — produce result vector(s).

The hash aggregate (src/execution/operator/aggregate/physical_hash_aggregate.cpp) and the partitioned hash aggregate (src/execution/radix_partitioned_hashtable.cpp) call into these methods as chunks arrive.
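The method list above maps naturally onto a small SUM aggregate. The sketch below is an assumption-laden illustration: SumState and ToySumAggregate are hypothetical, the signatures are simplified (a std::vector stands in for an input chunk, and the result is a single value rather than an output vector), and DuckDB's real aggregate functions are registered through function pointers rather than a struct like this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-group state for SUM over int64_t.
struct SumState {
    int64_t sum;
    bool seen; // false until at least one input row was folded in
};

struct ToySumAggregate {
    // Bytes the hash table must reserve per group.
    static std::size_t state_size() { return sizeof(SumState); }

    // Set the state to the identity of the aggregate.
    static void initialize(SumState &state) {
        state.sum = 0;
        state.seen = false;
    }

    // Fold one chunk of input values into the state.
    static void update(SumState &state, const std::vector<int64_t> &chunk) {
        for (int64_t v : chunk) {
            state.sum += v;
        }
        state.seen = state.seen || !chunk.empty();
    }

    // Merge `source` into `target`, e.g. when combining per-thread states.
    static void combine(const SumState &source, SumState &target) {
        target.sum += source.sum;
        target.seen = target.seen || source.seen;
    }

    // Produce the result; SUM over zero rows is NULL.
    static void finalize(const SumState &state, int64_t &result, bool &is_null) {
        is_null = !state.seen;
        result = state.seen ? state.sum : 0;
    }
};
```

The split between update and combine is what lets each thread aggregate its own chunks independently and merge the partial states later.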

Parallelism

Pipelines may be parallelized by partitioning the source. Each parallel worker has its own LocalSourceState and LocalSinkState. After the source is exhausted, sinks merge their per-thread state via Combine.

sequenceDiagram
    participant W1 as Worker 1 (LocalSinkState)
    participant W2 as Worker 2 (LocalSinkState)
    participant Sink as Sink (GlobalSinkState)
    W1->>W1: Sink chunks into local state
    W2->>W2: Sink chunks into local state
    W1->>Sink: Combine(local) -> global
    W2->>Sink: Combine(local) -> global
    Sink->>Sink: Finalize() -> ready for downstream pipeline

Pipelines that depend on a sink's Finalize event do not start until that finalize runs (see systems/parallel).
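The Sink, Combine, Finalize sequence can be sketched with plain threads. This is a simplified model, not DuckDB's scheduler: ToyGlobalSinkState, ToyLocalSinkState and ParallelSum are hypothetical names, the source is partitioned statically into contiguous ranges, and a mutex-protected addition stands in for Combine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical toy states, not DuckDB's GlobalSinkState / LocalSinkState.
struct ToyGlobalSinkState {
    std::mutex lock;
    int64_t total = 0;
};

struct ToyLocalSinkState {
    int64_t partial = 0;
};

int64_t ParallelSum(const std::vector<int64_t> &table, std::size_t worker_count) {
    assert(worker_count > 0);
    ToyGlobalSinkState global;
    std::vector<std::thread> workers;
    // Partition the source into one contiguous range per worker.
    std::size_t per_worker = (table.size() + worker_count - 1) / worker_count;
    for (std::size_t w = 0; w < worker_count; w++) {
        workers.emplace_back([&, w] {
            ToyLocalSinkState local;
            std::size_t begin = std::min(w * per_worker, table.size());
            std::size_t end = std::min(begin + per_worker, table.size());
            // Sink: accumulate into thread-local state without locking.
            for (std::size_t i = begin; i < end; i++) {
                local.partial += table[i];
            }
            // Combine: merge the local state into the global state once.
            std::lock_guard<std::mutex> guard(global.lock);
            global.total += local.partial;
        });
    }
    for (auto &t : workers) {
        t.join();
    }
    // Finalize: all workers are done, so the result is ready for the
    // pipelines that depend on this sink.
    return global.total;
}
```

Each worker takes the lock exactly once, at Combine time, so the per-row work in the Sink phase stays free of synchronization.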

Why this works

The vectorized model gets several wins at once:

  • Cache locality. Each operator processes one column at a time within a chunk; data stays hot in L1/L2.
  • SIMD opportunities. Inner loops over int32_t[] or double[] are auto-vectorized by modern compilers.
  • Encoding fast paths. Constant/dictionary inputs short-circuit without touching every row.
  • Predictable allocation. A pipeline reuses the same DataChunk across iterations; the only new allocations are for spillable structures (hash tables, sort runs).
  • Pipelined execution. Chunks flow through operators back-to-back without intermediate materialization, and independent chunks can be processed by different threads.

Where to look

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
