duckdb/duckdb
CSV and file I/O
How DuckDB reads CSV/Parquet/JSON/Arrow/Delta files transparently, including auto-detection, parallelism, multi-file scans, and the file-system abstraction that lets local paths and remote URLs use the same code.
The file-system abstraction
Every read or write goes through FileSystem (src/include/duckdb/common/file_system.hpp). Implementations:
LocalFileSystem(src/common/local_file_system.cpp) — POSIX/Windows.VirtualFileSystem(src/common/virtual_file_system.cpp) — composite that delegates by URI prefix.OpenerFileSystem(src/common/opener_file_system.cpp) — wraps a per-connectionFileOpenerso settings/secrets follow the connection.- Compressed readers:
gzip_file_system.cpp,compressed_file_system.cpp,extension/parquet/zstd_file_system.cpp. - Extensions register others:
httpfsforhttp(s)://ands3://,awsfor AWS-specific behaviors,azurefor Azure Blob.
A user can opt into a file system simply by using the right URI:
SELECT * FROM 's3://bucket/data.parquet';
SELECT * FROM 'https://example.com/file.parquet';
SELECT * FROM 'file:///abs/path.parquet';PRAGMA threads=N and PRAGMA memory_limit='..' apply to all reads regardless of file system.
Multi-file scans
src/common/multi_file/ provides the shared infrastructure used by the parquet, json, csv, arrow, and delta readers. It handles:
- Globbing.
read_parquet('data/*.parquet')expands across the underlying file system. - Hive partitioning.
data/year=2024/month=01/file.parquetproduces virtualyearandmonthcolumns. Seesrc/common/hive_partitioning.cpp. - Schema reconciliation. Different files in a single scan are validated to share a schema (or unified, depending on options).
- Filter pushdown. Partition-column filters skip files entirely.
- Per-file error handling. Configurable union/strict/ignore modes.
Each format-specific scanner provides a MultiFileInfo (e.g., parquet_multi_file_info.cpp, json_multi_file_info.cpp, the CSV scanner does the same).
CSV scanner
The CSV scanner is one of the most polished pieces of the engine. It is parallel, auto-detecting, and pushdown-aware.
src/execution/operator/csv_scanner/ contains the scanner machinery:
csv_scanner/
├── scanner/ Per-thread scanning state machines
├── sniffer/ Auto-detection: delimiter, quote, header, types
├── state_machine/ Byte-level state machine for the parser
├── string_value/ String parsing fast paths
├── encoding/ Character-set encoding handlers
├── table_function/ The read_csv / read_csv_auto bind
└── util/ Buffer and chunk helpersAuto-detection
read_csv_auto('file.csv') (or read_csv with auto_detect=true) samples the first portion of the file and infers:
- Delimiter:
,,;,\t,|, … - Quote:
",', none. - Escape character:
\,", none. - Header presence. Heuristics on row 1 vs. row 2 type compatibility.
- Per-column type. Tries
BIGINT,DOUBLE,DATE,TIMESTAMP,BOOLEAN, falls back toVARCHAR.
The sniffer lives in csv_scanner/sniffer/. Detection settings are reported back to the user when a scan fails so they can override (delim=..., quote=..., header=..., dtypes={'col': 'BIGINT'}, etc.).
Parallel parsing
graph LR
File[CSV file] -->|byte-aligned chunks| Workers[Per-thread parsers]
Workers -->|state machine| Tokens[Token stream]
Tokens -->|cast per column type| Vectors[Vector outputs]
Vectors --> Chunks[DataChunks downstream]The file is split into byte-aligned chunks; each worker resyncs to the next record boundary using the state machine. After resync, parsing proceeds independently across workers. The state machine in csv_scanner/state_machine/ models the CSV grammar (quoted fields, escapes, embedded newlines) at the byte level.
Error handling
The CSV scanner supports IGNORE_ERRORS = true, MAX_LINE_SIZE = ..., STRICT_MODE = true, and STORE_REJECTS = true (which captures rejected rows into a rejects_table for inspection). Implementation: csv_scanner/scanner/.
Writing CSV
COPY (SELECT ...) TO 'out.csv' (FORMAT CSV, HEADER, DELIMITER ',') uses src/common/csv_writer.cpp for output. The writer produces RFC-4180-compatible CSV by default and supports streaming.
Other readers
| Format | Reader | Writer |
|---|---|---|
| Parquet | extension/parquet/parquet_reader.cpp |
parquet_writer.cpp |
| JSON / NDJSON | extension/json/json_reader.cpp |
via csv_writer.cpp-style streaming |
| Arrow IPC | src/common/arrow/ and extension/parquet/-adjacent code |
streaming via Arrow C Data Interface |
| Delta | extension/delta/ (delta-kernel-rs) |
not supported (writes via Spark) |
| TSV / Pipe | CSV scanner with delimiter override | CSV writer with delimiter override |
External file cache
src/storage/external_file_cache/ caches data read from remote sources into a configurable disk-backed cache:
- Set via
SET file_cache_directory = '/path/to/cache'andSET file_cache_max_size = '1GB'. - Indexed by URI + ETag/Last-Modified.
- Useful for repeated scans of large remote files.
Hive partitioning
SELECT *
FROM read_parquet('data/year=*/month=*/file.parquet', hive_partitioning=true)
WHERE year = 2024 AND month = 6;Partition values become virtual columns. The optimizer pushes down predicates on those columns to skip whole directories. Implementation: src/common/hive_partitioning.cpp plus per-format multi-file infos.
File-format introspection
Each format ships introspection helpers:
| Function | Source |
|---|---|
parquet_metadata(path), parquet_schema(path), parquet_file_metadata(path) |
extension/parquet/parquet_metadata.cpp |
read_csv_auto (with return_files=true and header=true) |
csv_scanner/table_function/ |
json_structure(json) |
extension/json/json_functions/ |
Where to look
- File system: systems/common.
- CSV scanner:
src/execution/operator/csv_scanner/, tests intest/csv_scanner/andtest/sql/copy/csv/. - Parquet: extensions/parquet.
- JSON: extensions/json.
- Multi-file infrastructure:
src/common/multi_file/. - Httpfs / S3 (out-of-tree): the
httpfsextension repo at https://github.com/duckdb/duckdb-httpfs.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.