duckdb/duckdb

By the numbers

A quantitative snapshot of the DuckDB codebase.

Data collected on 2026-04-30 from the main branch at commit aec1efc176.

Size

xychart-beta horizontal
    title "Source files by extension"
    x-axis ["*.cpp", "*.hpp (headers)", "*.test*", "*.py", "CMakeLists.txt"]
    y-axis "File count" 0 --> 5500
    bar [2307, 1612, 4874, 154, 243]

| Metric | Value |
| --- | --- |
| C++ source files (*.cpp) in repo | 2,307 |
| Public/internal headers (*.hpp) | 1,612 |
| C++ source lines in src/ | 337,292 |
| C++ source files in src/ | 1,384 |
| Sqllogictest files (*.test*) | 4,874 |
| Python build/codegen scripts | 154 |
| CMakeLists.txt files | 243 |
| Top-level subsystems under src/ | 13 (parser, planner, optimizer, execution, storage, catalog, transaction, parallel, function, common, main, logging, include) |
| In-tree extensions under extension/ | 11 (parquet, json, icu, jemalloc, autocomplete, core_functions, delta, demo_capi, loader, tpcds, tpch) |
| GitHub Actions workflows | 31 (.github/workflows/) |
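
Counts like those in the table can be reproduced with a short script. The sketch below is one way to do it, not the method the snapshot actually used: `count_files` and `PATTERNS` are hypothetical names, and the glob patterns are taken from the chart labels above. Results on a real checkout may differ slightly depending on how generated or vendored files are treated.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# Glob patterns copied from the chart labels; each is matched against the file name only.
PATTERNS = ["*.cpp", "*.hpp", "*.test*", "*.py", "CMakeLists.txt"]


def count_files(root, patterns=PATTERNS):
    """Count regular files under `root` whose name matches each pattern.

    Anything inside a .git directory is skipped. A file that matches more
    than one pattern is counted once per matching pattern.
    """
    root = Path(root)
    counts = {pattern: 0 for pattern in patterns}
    for path in root.rglob("*"):
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        for pattern in patterns:
            if fnmatchcase(path.name, pattern):
                counts[pattern] += 1
    return counts
```

Calling `count_files(".")` from the root of a checkout returns a dictionary keyed by pattern.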

The largest single source files reflect the project's hotspots:

| File | Size |
| --- | --- |
| src/common/enum_util.cpp | ~292 KB (machine-generated enum utilities) |
| src/storage/data_table.cpp | ~73 KB |
| src/common/types.cpp | ~69 KB |
| src/main/client_context.cpp | ~58 KB |
| src/storage/single_file_block_manager.cpp | ~54 KB |
| src/catalog/catalog.cpp | ~53 KB |
| src/optimizer/topn_window_elimination.cpp | ~49 KB |
| src/optimizer/remove_unused_columns.cpp | ~48 KB |
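
A ranking like this can be produced by sorting files by their on-disk size. The following is a minimal sketch under that assumption; `largest_files` is a hypothetical helper, and sizes are reported in bytes rather than the rounded KB values shown above.

```python
from pathlib import Path


def largest_files(root, suffix=".cpp", top=8):
    """Return the `top` largest files under `root` ending in `suffix`.

    Each entry is a (size_in_bytes, path_relative_to_root) pair, largest
    first; ties are broken alphabetically by path.
    """
    root = Path(root)
    sized = [
        (path.stat().st_size, path.relative_to(root).as_posix())
        for path in root.rglob(f"*{suffix}")
        if path.is_file()
    ]
    sized.sort(key=lambda item: (-item[0], item[1]))
    return sized[:top]
```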

Activity

| Metric | Value |
| --- | --- |
| Total commits on main | ~74,344 |
| First commit | 2018-07-13 ("Working parser + initial draft of interface", Mark Raasveldt) |
| Most recent commit at snapshot | 2026-04-30 |
| Unique authors (excluding bots) | ~726 |
| Active authors in the last 90 days | 50+ |
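
Author metrics of this kind can be derived from `git log` output. The sketch below assumes the log was exported with `git log --format=%aI%x09%aN` (strict ISO 8601 author date, a tab, then the author name) and that bot accounts are recognizable by a `[bot]` suffix. `author_stats` is a hypothetical helper, and the wiki's own counting rules, for example how it merges multiple identities of one person, are not known.

```python
from datetime import datetime, timedelta


def author_stats(log_text, snapshot, window_days=90):
    """Summarize authors from lines of the form '<ISO date>\\t<author name>'.

    `snapshot` must be a timezone-aware datetime, because git's ISO dates
    carry a UTC offset. Names ending in '[bot]' are excluded.
    """
    cutoff = snapshot - timedelta(days=window_days)
    authors = set()
    active = set()
    for line in log_text.splitlines():
        if not line.strip():
            continue
        stamp, name = line.split("\t", 1)
        if name.endswith("[bot]"):
            continue
        authors.add(name)
        if datetime.fromisoformat(stamp) >= cutoff:
            active.add(name)
    return {"unique_authors": len(authors), "active_authors": len(active)}
```

Names are compared as plain strings, so one person committing under two identities is counted twice unless a mailmap normalizes them first.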

Most active directories by commit count from 2024 onward, with their top recent contributors:

| Directory | Top contributors (recent) |
| --- | --- |
| src/common/ | Mytherin, Mark, Laurens Kuiper |
| src/execution/ | Pedro Holanda, Mark, Laurens Kuiper |
| src/storage/ | Mytherin, Tishj, Mark |
| src/main/ | Mark, Tishj, Mytherin |
| src/function/ | Tishj, Mark, Mytherin |
| src/optimizer/ | Laurens Kuiper, Tmonster, Tom Ebergen |
| src/parser/ | Mytherin, Tishj, dtenwolde |
| extension/parquet/ | Mytherin, Tishj, Laurens Kuiper |

The two highest-output committers since the project's inception are its co-creators: Mark Raasveldt (3,400+ commits under the Mytherin identity alone) and Hannes Mühleisen (3,500+ commits).

Bot-attributed commits

The DuckDB project explicitly does not accept LLM-generated pull requests (see CONTRIBUTING.md, "Generative AI Policy"). Bot-authored commits in git log are limited to:

  • dependabot[bot] for dependency bumps
  • github-actions[bot] for release-cut and CI helper commits

Together these account for well under 1% of commits on main. The commit history contains no automated AI co-author trail. That absence only bounds AI assistance from below: inline tools that do not leave a co-author tag would not be visible in git log.
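
The bot share and the co-author check described above can be approximated as follows. This is a sketch under stated assumptions: bots are identified by a `[bot]` suffix on the author name, and co-authors by the conventional `Co-authored-by:` commit trailer. `bot_share` and `coauthors` are hypothetical helpers, not part of any DuckDB tooling.

```python
import re
from collections import Counter

# Conventional trailer form: "Co-authored-by: Name <email>", one per line.
TRAILER = re.compile(
    r"^Co-authored-by:\s*(.+?)\s*<[^>]*>\s*$", re.IGNORECASE | re.MULTILINE
)


def bot_share(author_names):
    """Return (fraction of commits authored by bots, per-bot commit counts).

    `author_names` holds one author name per commit.
    """
    names = list(author_names)
    bots = Counter(name for name in names if name.endswith("[bot]"))
    share = sum(bots.values()) / len(names) if names else 0.0
    return share, dict(bots)


def coauthors(message):
    """Return the names listed in Co-authored-by trailers of a commit message."""
    return TRAILER.findall(message)
```

As the section notes, an empty result from `coauthors` shows only that no trailer was recorded, not that no tool was involved.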

Complexity

xychart-beta horizontal
    title "Lines of code by top-level src/ subsystem (approximate)"
    x-axis ["common", "execution", "storage", "function", "main", "optimizer", "planner", "parser", "catalog", "transaction", "parallel"]
    y-axis "Lines" 0 --> 100000
    bar [85000, 60000, 55000, 35000, 32000, 30000, 22000, 18000, 12000, 9000, 7000]

(Subsystem sizes are approximate, derived from find … -name '*.cpp' | xargs wc -l.)
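
A rough Python equivalent of the find and wc pipeline mentioned above is sketched here. It assumes that each immediate subdirectory of src/ is one subsystem and, like `wc -l`, counts newline bytes. `lines_by_subsystem` is a hypothetical name, and the exact command behind the chart is elided in the source, so totals may not match it exactly.

```python
from pathlib import Path


def count_newlines(path, chunk_size=1 << 16):
    """Count newline bytes in a file, reading it in chunks (like `wc -l`)."""
    total = 0
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            total += chunk.count(b"\n")
    return total


def lines_by_subsystem(src_root, suffix=".cpp"):
    """Sum line counts of `suffix` files per immediate subdirectory of src_root.

    Returns a dict ordered from the largest subsystem to the smallest.
    """
    totals = {}
    for subsystem in sorted(p for p in Path(src_root).iterdir() if p.is_dir()):
        totals[subsystem.name] = sum(
            count_newlines(path)
            for path in subsystem.rglob(f"*{suffix}")
            if path.is_file()
        )
    return dict(sorted(totals.items(), key=lambda item: -item[1]))
```

A directory that contains no matching files, such as a headers-only directory, appears with a total of 0.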

Two structural observations:

  1. common/ is the largest subsystem. It holds vector primitives, file system abstraction, type system, allocators, and the box renderer used by the CLI. Most other subsystems depend on it transitively.
  2. The optimizer has wide files, not deep ones. filter_combiner.cpp, remove_unused_columns.cpp, and topn_window_elimination.cpp are individually large but each represents a self-contained transformation rather than a layered system.

Test surface

| Metric | Value |
| --- | --- |
| test/sql/ .test files | ~4,500 |
| test/sql/ .test_slow files | ~370 |
| test/sql/ topic directories | 72 |
| C++ API tests in test/api/ | several hundred |
| Smoke test list (test/smoke_tests.list) | ~12 KB of paths |

The sqllogictest framework is the dominant test surface; C++ tests are reserved for cases that cannot be expressed in SQL (concurrency, low-level APIs, fuzzers).

Dependencies

DuckDB has no required runtime dependency outside the C++17 standard library. Bundled third-party libraries live in third_party/ and are vendored as source. They include pcg, zstd, re2, utf8proc, httplib, mbedtls, pegtl, parquet, lz4, concurrentqueue, fmt, thrift, and json (nlohmann). See reference/dependencies.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
