duckdb/duckdb

CSV and file I/O

How DuckDB reads CSV/Parquet/JSON/Arrow/Delta files transparently, including auto-detection, parallelism, multi-file scans, and the file-system abstraction that lets local paths and remote URLs use the same code.

The file-system abstraction

Every read or write goes through FileSystem (src/include/duckdb/common/file_system.hpp). Implementations:

LocalFileSystem (src/common/local_file_system.cpp) — POSIX/Windows.
VirtualFileSystem (src/common/virtual_file_system.cpp) — composite that delegates by URI prefix.
OpenerFileSystem (src/common/opener_file_system.cpp) — wraps a per-connection FileOpener so settings/secrets follow the connection.
Compressed readers: gzip_file_system.cpp, compressed_file_system.cpp, extension/parquet/zstd_file_system.cpp.
Extensions register others: httpfs for http(s):// and s3://, aws for AWS-specific behaviors, azure for Azure Blob.

A user can opt into a file system simply by using the right URI:

SELECT * FROM 's3://bucket/data.parquet';
SELECT * FROM 'https://example.com/file.parquet';
SELECT * FROM 'file:///abs/path.parquet';

PRAGMA threads=N and PRAGMA memory_limit='..' apply to all reads regardless of file system.

Multi-file scans

src/common/multi_file/ provides the shared infrastructure used by the parquet, json, csv, arrow, and delta readers. It handles:

Globbing. read_parquet('data/*.parquet') expands across the underlying file system.
Hive partitioning. data/year=2024/month=01/file.parquet produces virtual year and month columns. See src/common/hive_partitioning.cpp.
Schema reconciliation. Different files in a single scan are validated to share a schema (or unified, depending on options).
Filter pushdown. Partition-column filters skip files entirely.
Per-file error handling. Configurable union/strict/ignore modes.

Each format-specific scanner provides a MultiFileInfo (e.g., parquet_multi_file_info.cpp, json_multi_file_info.cpp, the CSV scanner does the same).

CSV scanner

The CSV scanner is one of the most polished pieces of the engine. It is parallel, auto-detecting, and pushdown-aware.

src/execution/operator/csv_scanner/ contains the scanner machinery:

csv_scanner/
├── scanner/             Per-thread scanning state machines
├── sniffer/             Auto-detection: delimiter, quote, header, types
├── state_machine/       Byte-level state machine for the parser
├── string_value/        String parsing fast paths
├── encoding/            Character-set encoding handlers
├── table_function/      The read_csv / read_csv_auto bind
└── util/                Buffer and chunk helpers

Auto-detection

read_csv_auto('file.csv') (or read_csv with auto_detect=true) samples the first portion of the file and infers:

Delimiter: ,, ;, \t, |, …
Quote: ", ', none.
Escape character: \, ", none.
Header presence. Heuristics on row 1 vs. row 2 type compatibility.
Per-column type. Tries BIGINT, DOUBLE, DATE, TIMESTAMP, BOOLEAN, falls back to VARCHAR.

The sniffer lives in csv_scanner/sniffer/. Detection settings are reported back to the user when a scan fails so they can override (delim=..., quote=..., header=..., dtypes={'col': 'BIGINT'}, etc.).

Parallel parsing

graph LR
    File[CSV file] -->|byte-aligned chunks| Workers[Per-thread parsers]
    Workers -->|state machine| Tokens[Token stream]
    Tokens -->|cast per column type| Vectors[Vector outputs]
    Vectors --> Chunks[DataChunks downstream]

The file is split into byte-aligned chunks; each worker resyncs to the next record boundary using the state machine. After resync, parsing proceeds independently across workers. The state machine in csv_scanner/state_machine/ models the CSV grammar (quoted fields, escapes, embedded newlines) at the byte level.

Error handling

The CSV scanner supports IGNORE_ERRORS = true, MAX_LINE_SIZE = ..., STRICT_MODE = true, and STORE_REJECTS = true (which captures rejected rows into a rejects_table for inspection). Implementation: csv_scanner/scanner/.

Writing CSV

COPY (SELECT ...) TO 'out.csv' (FORMAT CSV, HEADER, DELIMITER ',') uses src/common/csv_writer.cpp for output. The writer produces RFC-4180-compatible CSV by default and supports streaming.

Format	Reader	Writer
Parquet	`extension/parquet/parquet_reader.cpp`	`parquet_writer.cpp`
JSON / NDJSON	`extension/json/json_reader.cpp`	via `csv_writer.cpp`-style streaming
Arrow IPC	`src/common/arrow/` and `extension/parquet/`-adjacent code	streaming via Arrow C Data Interface
Delta	`extension/delta/` (delta-kernel-rs)	not supported (writes via Spark)
TSV / Pipe	CSV scanner with delimiter override	CSV writer with delimiter override

External file cache

src/storage/external_file_cache/ caches data read from remote sources into a configurable disk-backed cache:

Set via SET file_cache_directory = '/path/to/cache' and SET file_cache_max_size = '1GB'.
Indexed by URI + ETag/Last-Modified.
Useful for repeated scans of large remote files.

Hive partitioning

SELECT *
FROM read_parquet('data/year=*/month=*/file.parquet', hive_partitioning=true)
WHERE year = 2024 AND month = 6;

Partition values become virtual columns. The optimizer pushes down predicates on those columns to skip whole directories. Implementation: src/common/hive_partitioning.cpp plus per-format multi-file infos.

File-format introspection

Each format ships introspection helpers:

Function	Source
`parquet_metadata(path)`, `parquet_schema(path)`, `parquet_file_metadata(path)`	`extension/parquet/parquet_metadata.cpp`
`read_csv_auto` (with `return_files=true` and `header=true`)	`csv_scanner/table_function/`
`json_structure(json)`	`extension/json/json_functions/`

Where to look

File system: systems/common.
CSV scanner: src/execution/operator/csv_scanner/, tests in test/csv_scanner/ and test/sql/copy/csv/.
Parquet: extensions/parquet.
JSON: extensions/json.
Multi-file infrastructure: src/common/multi_file/.
Httpfs / S3 (out-of-tree): the httpfs extension repo at https://github.com/duckdb/duckdb-httpfs.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.