Open-Source Wikis

/

DuckDB

/

Features

/

CSV and file I/O

duckdb/duckdb

CSV and file I/O

How DuckDB reads CSV/Parquet/JSON/Arrow/Delta files transparently, including auto-detection, parallelism, multi-file scans, and the file-system abstraction that lets local paths and remote URLs use the same code.

The file-system abstraction

Every read or write goes through FileSystem (src/include/duckdb/common/file_system.hpp). Implementations:

  • LocalFileSystem (src/common/local_file_system.cpp) — POSIX/Windows.
  • VirtualFileSystem (src/common/virtual_file_system.cpp) — composite that delegates by URI prefix.
  • OpenerFileSystem (src/common/opener_file_system.cpp) — wraps a per-connection FileOpener so settings/secrets follow the connection.
  • Compressed readers: gzip_file_system.cpp, compressed_file_system.cpp, extension/parquet/zstd_file_system.cpp.
  • Extensions register others: httpfs for http(s):// and s3://, aws for AWS-specific behaviors, azure for Azure Blob.

A user can opt into a file system simply by using the right URI:

SELECT * FROM 's3://bucket/data.parquet';
SELECT * FROM 'https://example.com/file.parquet';
SELECT * FROM 'file:///abs/path.parquet';

PRAGMA threads=N and PRAGMA memory_limit='..' apply to all reads regardless of file system.

Multi-file scans

src/common/multi_file/ provides the shared infrastructure used by the parquet, json, csv, arrow, and delta readers. It handles:

  • Globbing. read_parquet('data/*.parquet') expands across the underlying file system.
  • Hive partitioning. data/year=2024/month=01/file.parquet produces virtual year and month columns. See src/common/hive_partitioning.cpp.
  • Schema reconciliation. Different files in a single scan are validated to share a schema (or unified, depending on options).
  • Filter pushdown. Partition-column filters skip files entirely.
  • Per-file error handling. Configurable union/strict/ignore modes.

Each format-specific scanner provides a MultiFileInfo (e.g., parquet_multi_file_info.cpp, json_multi_file_info.cpp, the CSV scanner does the same).

CSV scanner

The CSV scanner is one of the most polished pieces of the engine. It is parallel, auto-detecting, and pushdown-aware.

src/execution/operator/csv_scanner/ contains the scanner machinery:

csv_scanner/
├── scanner/             Per-thread scanning state machines
├── sniffer/             Auto-detection: delimiter, quote, header, types
├── state_machine/       Byte-level state machine for the parser
├── string_value/        String parsing fast paths
├── encoding/            Character-set encoding handlers
├── table_function/      The read_csv / read_csv_auto bind
└── util/                Buffer and chunk helpers

Auto-detection

read_csv_auto('file.csv') (or read_csv with auto_detect=true) samples the first portion of the file and infers:

  • Delimiter: ,, ;, \t, |, …
  • Quote: ", ', none.
  • Escape character: \, ", none.
  • Header presence. Heuristics on row 1 vs. row 2 type compatibility.
  • Per-column type. Tries BIGINT, DOUBLE, DATE, TIMESTAMP, BOOLEAN, falls back to VARCHAR.

The sniffer lives in csv_scanner/sniffer/. Detection settings are reported back to the user when a scan fails so they can override (delim=..., quote=..., header=..., dtypes={'col': 'BIGINT'}, etc.).

Parallel parsing

graph LR
    File[CSV file] -->|byte-aligned chunks| Workers[Per-thread parsers]
    Workers -->|state machine| Tokens[Token stream]
    Tokens -->|cast per column type| Vectors[Vector outputs]
    Vectors --> Chunks[DataChunks downstream]

The file is split into byte-aligned chunks; each worker resyncs to the next record boundary using the state machine. After resync, parsing proceeds independently across workers. The state machine in csv_scanner/state_machine/ models the CSV grammar (quoted fields, escapes, embedded newlines) at the byte level.

Error handling

The CSV scanner supports IGNORE_ERRORS = true, MAX_LINE_SIZE = ..., STRICT_MODE = true, and STORE_REJECTS = true (which captures rejected rows into a rejects_table for inspection). Implementation: csv_scanner/scanner/.

Writing CSV

COPY (SELECT ...) TO 'out.csv' (FORMAT CSV, HEADER, DELIMITER ',') uses src/common/csv_writer.cpp for output. The writer produces RFC-4180-compatible CSV by default and supports streaming.

Other readers

Format Reader Writer
Parquet extension/parquet/parquet_reader.cpp parquet_writer.cpp
JSON / NDJSON extension/json/json_reader.cpp via csv_writer.cpp-style streaming
Arrow IPC src/common/arrow/ and extension/parquet/-adjacent code streaming via Arrow C Data Interface
Delta extension/delta/ (delta-kernel-rs) not supported (writes via Spark)
TSV / Pipe CSV scanner with delimiter override CSV writer with delimiter override

External file cache

src/storage/external_file_cache/ caches data read from remote sources into a configurable disk-backed cache:

  • Set via SET file_cache_directory = '/path/to/cache' and SET file_cache_max_size = '1GB'.
  • Indexed by URI + ETag/Last-Modified.
  • Useful for repeated scans of large remote files.

Hive partitioning

SELECT *
FROM read_parquet('data/year=*/month=*/file.parquet', hive_partitioning=true)
WHERE year = 2024 AND month = 6;

Partition values become virtual columns. The optimizer pushes down predicates on those columns to skip whole directories. Implementation: src/common/hive_partitioning.cpp plus per-format multi-file infos.

File-format introspection

Each format ships introspection helpers:

Function Source
parquet_metadata(path), parquet_schema(path), parquet_file_metadata(path) extension/parquet/parquet_metadata.cpp
read_csv_auto (with return_files=true and header=true) csv_scanner/table_function/
json_structure(json) extension/json/json_functions/

Where to look

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.

CSV and file I/O – DuckDB wiki | Factory