duckdb/duckdb
By the numbers
A quantitative snapshot of the DuckDB codebase.
Data collected on 2026-04-30 from the `main` branch at commit `aec1efc176`.
Size
xychart-beta horizontal
title "Source files by extension"
x-axis ["*.cpp", "*.hpp (headers)", "*.test*", "*.py", "CMakeLists.txt"]
y-axis "File count" 0 --> 5500
bar [2307, 1612, 4874, 154, 243]

| Metric | Value |
|---|---|
| C++ source files (`*.cpp`) in repo | 2,307 |
| Public/internal headers (`*.hpp`) | 1,612 |
| C++ source lines in `src/` | 337,292 |
| C++ source files in `src/` | 1,384 |
| Sqllogictest files (`*.test*`) | 4,874 |
| Python build/codegen scripts | 154 |
| `CMakeLists.txt` files | 243 |
| Top-level subsystems under `src/` | 13 (parser, planner, optimizer, execution, storage, catalog, transaction, parallel, function, common, main, logging, include) |
| In-tree extensions under `extension/` | 11 (parquet, json, icu, jemalloc, autocomplete, core_functions, delta, demo_capi, loader, tpcds, tpch) |
| GitHub Actions workflows | 31 (.github/workflows/) |

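
File counts like those in the table above can be reproduced with a short script. The following is a minimal sketch, not the tooling that produced this page: it assumes matching is done on the file name alone and that the `.git` directory is skipped. The function name `count_files_by_pattern` and the `PATTERNS` list are hypothetical, chosen to mirror the chart categories.

```python
from fnmatch import fnmatchcase
from pathlib import Path

# Assumed patterns, mirroring the chart categories; matched against the file name only.
PATTERNS = ["*.cpp", "*.hpp", "*.test*", "*.py", "CMakeLists.txt"]


def count_files_by_pattern(root, patterns=PATTERNS):
    """Return {pattern: number of files under root whose name matches it}."""
    root = Path(root)
    counts = {pattern: 0 for pattern in patterns}
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        # Skip repository metadata (assumption: it should not count as source).
        if ".git" in path.relative_to(root).parts:
            continue
        for pattern in patterns:
            if fnmatchcase(path.name, pattern):
                counts[pattern] += 1
    return counts
```

Calling `count_files_by_pattern(".")` from a checkout root returns one count per pattern. Note that `*.test*` also matches names such as `foo.test_slow`, which is why a single pattern can cover both kinds of sqllogictest file.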
The largest single source files reflect the project's hotspots:
| File | Size |
|---|---|
| `src/common/enum_util.cpp` | ~292 KB (machine-generated enum utilities) |
| `src/storage/data_table.cpp` | ~73 KB |
| `src/common/types.cpp` | ~69 KB |
| `src/main/client_context.cpp` | ~58 KB |
| `src/storage/single_file_block_manager.cpp` | ~54 KB |
| `src/catalog/catalog.cpp` | ~53 KB |
| `src/optimizer/topn_window_elimination.cpp` | ~49 KB |
| `src/optimizer/remove_unused_columns.cpp` | ~48 KB |

Activity
| Metric | Value |
|---|---|
| Total commits on `main` | ~74,344 |
| First commit | 2018-07-13 ("Working parser + initial draft of interface", Mark Raasveldt) |
| Most recent commit at snapshot | 2026-04-30 |
| Unique authors (excluding bots) | ~726 |
| Active authors in the last 90 days | 50+ |
Most active subsystems by commit count from 2024 onwards:

| Directory | Top contributors (recent) |
|---|---|
| `src/common/` | Mytherin, Mark, Laurens Kuiper |
| `src/execution/` | Pedro Holanda, Mark, Laurens Kuiper |
| `src/storage/` | Mytherin, Tishj, Mark |
| `src/main/` | Mark, Tishj, Mytherin |
| `src/function/` | Tishj, Mark, Mytherin |
| `src/optimizer/` | Laurens Kuiper, Tmonster, Tom Ebergen |
| `src/parser/` | Mytherin, Tishj, dtenwolde |
| `extension/parquet/` | Mytherin, Tishj, Laurens Kuiper |

The two highest-output committers since project inception are DuckDB's two co-creators: Mark Raasveldt (3,400+ commits under the Mytherin identity alone) and Hannes Mühleisen (3,500+ commits).
Bot-attributed commits
The DuckDB project explicitly does not accept LLM-generated pull requests (see CONTRIBUTING.md, "Generative AI Policy"). Bot-authored commits in git log are limited to:
- `dependabot[bot]` for dependency bumps
- `github-actions[bot]` for release-cut and CI helper commits

Together these account for well under 1% of commits on `main`. There is no automated AI co-author trail in this codebase. That absence is only a lower bound on AI assistance: inline tools that leave no co-author tag are not visible in the git history.
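
The author and bot figures above can be approximated from commit author names. The sketch below is an illustration, not the method used for this page: it assumes the input is one author name per commit (for example the lines printed by `git log --format=%an`) and that bots are recognizable by a `[bot]` suffix, as with the two accounts named above. The function name `summarize_authors` is hypothetical.

```python
from collections import Counter


def summarize_authors(author_names):
    """Summarize commit authorship from one author name per commit."""
    # Ignore blank lines and surrounding whitespace in the input.
    commits = Counter(name.strip() for name in author_names if name.strip())
    total = sum(commits.values())
    # Assumption: bot accounts are identified by a "[bot]" suffix in the author name.
    bots = {name: n for name, n in commits.items() if name.endswith("[bot]")}
    bot_commits = sum(bots.values())
    return {
        "total_commits": total,
        "unique_human_authors": len(commits) - len(bots),
        "bot_commits": bot_commits,
        "bot_share": bot_commits / total if total else 0.0,
    }
```

Because this counts distinct name strings, one person committing under several names is counted more than once, so the unique-author figure is an upper bound unless identities are merged first.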
Complexity
xychart-beta horizontal
title "Lines of code by top-level src/ subsystem (approximate)"
x-axis ["common", "execution", "storage", "function", "main", "optimizer", "planner", "parser", "catalog", "transaction", "parallel"]
y-axis "Lines" 0 --> 100000
bar [85000, 60000, 55000, 35000, 32000, 30000, 22000, 18000, 12000, 9000, 7000]

(Subsystem sizes are approximate, derived from `find … -name '*.cpp' | xargs wc -l`.)
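
A per-subsystem line count of this kind can also be computed without the shell. The following is a rough sketch under stated assumptions: the exact `find` invocation is elided in the note above, so this version assumes each immediate subdirectory of the given root is one subsystem, counts only `*.cpp` files, and counts newline characters as `wc -l` does. The function name `cpp_lines_by_subsystem` is hypothetical.

```python
from pathlib import Path


def cpp_lines_by_subsystem(src_root):
    """Return {subsystem name: total newline count across its *.cpp files}."""
    totals = {}
    subsystems = sorted(p for p in Path(src_root).iterdir() if p.is_dir())
    for subsystem in subsystems:
        total = 0
        for cpp_file in subsystem.rglob("*.cpp"):
            # Count newline bytes, matching the behaviour of `wc -l`.
            total += cpp_file.read_bytes().count(b"\n")
        totals[subsystem.name] = total
    return totals
```

Sorting the result by value, for example `sorted(totals.items(), key=lambda kv: kv[1], reverse=True)`, gives the ordering used in the chart. Headers are excluded here, so a directory holding only `*.hpp` files reports zero.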
Two structural observations:
- `common/` is the largest subsystem. It holds vector primitives, file system abstraction, type system, allocators, and the box renderer used by the CLI. Most other subsystems depend on it transitively.
- The optimizer has wide files, not deep ones. `filter_combiner.cpp`, `remove_unused_columns.cpp`, and `topn_window_elimination.cpp` are individually large, but each represents a self-contained transformation rather than a layered system.
Test surface
| Metric | Value |
|---|---|
| `test/sql/` `.test` files | ~4,500 |
| `test/sql/` `.test_slow` files | ~370 |
| `test/sql/` topic directories | 72 |
| C++ API tests in `test/api/` | several hundred |
| Smoke test list (`test/smoke_tests.list`) | ~12 KB of paths |

The sqllogictest framework is the dominant test surface; C++ tests are reserved for cases that cannot be expressed in SQL (concurrency, low-level APIs, fuzzers).
Dependencies
DuckDB has no required runtime dependency outside the C++17 standard library. Bundled third-party libraries live in third_party/ and are vendored as source. They include pcg, zstd, re2, utf8proc, httplib, mbedtls, pegtl, parquet, lz4, concurrentqueue, fmt, thrift, and json (nlohmann). See reference/dependencies.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.