duckdb/duckdb

Parquet extension

Active contributors: Mytherin, Tishj, Laurens Kuiper

Purpose

extension/parquet/ provides full Parquet read and write support: parquet_scan / read_parquet (table function), COPY ... TO '...' (FORMAT PARQUET) (copy function), the parquet_metadata / parquet_schema introspection functions, encryption per the Parquet spec, and decimal/geometry/UUID/INTERVAL handling. It is the single largest in-tree extension and a frequent target of optimization work.

Directory layout

extension/parquet/
├── parquet_extension.cpp           Registration entry: read/write/copy + replacement scans
├── parquet_reader.cpp              Read path: row-group/column-chunk reader
├── parquet_writer.cpp              Write path: row-group construction + footer
├── column_reader.cpp               Per-column reader: levels + values
├── column_writer.cpp               Per-column writer
├── parquet_column_schema.cpp       Logical-to-Parquet schema mapping
├── parquet_field_id.cpp            Field-ID handling
├── parquet_file_metadata_cache.cpp  Cache of file metadata
├── parquet_metadata.cpp            parquet_metadata(), parquet_schema() functions
├── parquet_multi_file_info.cpp     Multi-file scan info (used with hive partitions)
├── parquet_statistics.cpp          Row-group / page statistics
├── parquet_geometry.cpp            GEOMETRY / GeoArrow handling
├── parquet_timestamp.cpp           Timestamp + INT64 ns/us/ms semantics
├── parquet_float16.cpp             FLOAT16 support
├── parquet_crypto.cpp              Parquet Modular Encryption
├── parquet_shredding.cpp           Variant shredding (nested type optimization)
├── serialize_parquet.cpp           Serialize parquet bind-data
├── zstd_file_system.cpp            ZSTD compressed-file wrapper (whole-file .zst streams)
├── decoder/                        Page decoders (PLAIN, RLE_DICTIONARY, DELTA_*, BYTE_STREAM_SPLIT, …)
├── reader/                         Type-specific column readers (string, list, struct, …)
├── writer/                         Type-specific column writers (primitive, list, struct, …)
└── include/                        Public headers (parquet_extension.hpp + helpers)

What it provides

Functions registered

| Function | Kind | Purpose |
| --- | --- | --- |
| parquet_scan(path, ...) / read_parquet(path, ...) | Table | Read one or many Parquet files. Supports projection pushdown, filter pushdown, and hive partitioning. |
| parquet_metadata(path) | Table | Per-row-group, per-column-chunk metadata including codec, encoding, and statistics. |
| parquet_schema(path) | Table | Logical and physical schema. |
| parquet_file_metadata(path) | Table | File-level metadata. |
| COPY (SELECT ...) TO 'out.parquet' (FORMAT PARQUET, ...) | Copy | Write a query result to one or more Parquet files. |

A replacement scan registers '.parquet' patterns so that SELECT * FROM 'data.parquet' works without an explicit function call.

Encodings supported

| Encoding | Reader | Writer |
| --- | --- | --- |
| PLAIN | yes | yes |
| RLE_DICTIONARY | yes | yes |
| DELTA_BINARY_PACKED | yes | yes |
| DELTA_BYTE_ARRAY | yes | yes |
| DELTA_LENGTH_BYTE_ARRAY | yes | yes |
| BYTE_STREAM_SPLIT | yes | yes |
| RLE (booleans) | yes | yes |

Compression codecs: SNAPPY, GZIP, ZSTD, LZ4_RAW, BROTLI, uncompressed.
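
Of the encodings listed above, BYTE_STREAM_SPLIT is the easiest to illustrate in a few lines. The sketch below is plain Python, not DuckDB's decoder: it scatters byte k of every fixed-width value into stream k and concatenates the streams, which is the layout the Parquet format describes for this encoding. The function names are hypothetical.

```python
import struct

def byte_stream_split_encode(values, fmt="<f"):
    """Scatter byte k of each fixed-width value into stream k, then concatenate the streams."""
    width = struct.calcsize(fmt)
    raw = [struct.pack(fmt, v) for v in values]
    return b"".join(bytes(r[k] for r in raw) for k in range(width))

def byte_stream_split_decode(data, fmt="<f"):
    """Gather byte i of every stream back into the i-th value."""
    width = struct.calcsize(fmt)
    count = len(data) // width
    return [
        struct.unpack(fmt, bytes(data[k * count + i] for k in range(width)))[0]
        for i in range(count)
    ]

# Little-endian float32 values; the high-order bytes end up next to each other.
encoded = byte_stream_split_encode([1.0, 2.0, 0.5])
decoded = byte_stream_split_decode(encoded)
```

Grouping bytes of equal significance tends to produce long runs of similar bytes, which is why this encoding is usually paired with a general-purpose codec such as ZSTD.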

Type mapping

parquet_column_schema.cpp maps DuckDB LogicalTypes to Parquet schemas:

  • Integer types map to physical INT32 / INT64 with Parquet's INT(bit width, signed) logical-type annotation.
  • DECIMAL uses INT32 / INT64 / FIXED_LEN_BYTE_ARRAY depending on precision.
  • TIMESTAMP and TIMESTAMP_TZ use INT64 with the TIMESTAMP logical annotation (configurable as MS, US, or NS).
  • LIST<T>, STRUCT<...>, MAP<K, V> use Parquet's nested encoding.
  • UUID and INTERVAL use FIXED_LEN_BYTE_ARRAY (16 and 12 bytes respectively) with documented annotations; geometry values are stored as WKB in variable-length BYTE_ARRAY.
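
The precision-dependent choice for DECIMAL can be sketched as follows. The cut-offs of 9 and 18 digits are the limits the Parquet format sets for INT32- and INT64-backed decimals; that DuckDB's writer switches at exactly these points is an assumption, and both function names are hypothetical.

```python
def decimal_physical_type(precision):
    """Pick a Parquet physical type for DECIMAL(precision, scale)."""
    if precision < 1:
        raise ValueError("precision must be positive")
    if precision <= 9:       # every 9-digit value fits in a signed 32-bit integer
        return "INT32"
    if precision <= 18:      # every 18-digit value fits in a signed 64-bit integer
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"

def max_precision_for_byte_length(n):
    """Largest precision whose values all fit in n bytes of two's complement."""
    largest = 2 ** (8 * n - 1) - 1
    # Number of digits minus one equals floor(log10(largest)) for positive integers.
    return len(str(largest)) - 1
```

For instance, a 16-byte FIXED_LEN_BYTE_ARRAY can hold any decimal of up to 38 digits under this rule.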

How it works

Reader

graph LR
    Open[Open file] --> Footer[Read footer]
    Footer --> Schema[Parse schema + metadata]
    Schema --> Cache[parquet_file_metadata_cache]
    Cache --> Plan[Plan scan: row-group + column projection]
    Plan --> Workers[Per-row-group, per-column tasks]
    Workers --> Decoders[Page decoders]
    Decoders --> Vectors[Convert pages to Vectors]
    Vectors --> Chunk[Emit DataChunk]

Key steps:

  1. Open the file via FileSystem (so httpfs, S3, etc. work transparently).
  2. Read the footer and parse the Thrift FileMetaData. The parsed metadata is cached (parquet_file_metadata_cache.cpp) so that repeated scans of the same file can reuse it.
  3. Apply projection pushdown (only the requested columns are read) and filter pushdown (use row-group statistics from parquet_statistics.cpp to skip whole row groups).
  4. For each row group, dispatch per-column readers. column_reader.cpp orchestrates the per-page decoder loop.
  5. Decode pages using the appropriate decoder from decoder/ (e.g., delta_binary_packed_decoder.cpp, dictionary_decoder.cpp, byte_stream_split_decoder.cpp).
  6. Convert the decoded values into DuckDB Vectors, honoring nullability (Parquet's repetition/definition levels).
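
Step 6 can be illustrated for the simplest case, a flat OPTIONAL column with no repetition levels. This is a minimal Python sketch, not the C++ reader: Parquet stores only the non-null values, and the definition level of each row says whether a value is present. The real reader fills a DuckDB Vector and its validity mask; here a list with None stands in for both, and the function name is hypothetical.

```python
def assemble_optional_column(definition_levels, values, max_definition_level=1):
    """Expand densely packed non-null values into one slot per row."""
    packed = iter(values)
    rows = []
    for level in definition_levels:
        # A level equal to the maximum means the value is present; anything lower is NULL.
        rows.append(next(packed) if level == max_definition_level else None)
    return rows

# Five rows, three of them non-null.
rows = assemble_optional_column([1, 0, 1, 1, 0], [10, 20, 30])
```

Nested types (LIST, STRUCT, MAP) additionally need repetition levels to decide where one list ends and the next begins, which this sketch leaves out.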

Writer

graph LR
    Sink[COPY TO sink] -->|chunk| Buffer[Per-column buffers]
    Buffer -->|when full| Pages[Emit pages with chosen encoding]
    Pages -->|after row group size| RG[Finalize row group statistics]
    RG -->|on finish| Footer[Write footer]
    Footer --> File[Close file]

The writer buffers a row group in memory until the configured limit is reached, then flushes its pages and updates statistics. The limit is set by row_group_size (documented default: 122,880 rows) or row_group_size_bytes (documented default: row_group_size × 1024 bytes, about 120 MiB). Encoding selection is per column and per page, driven by simple heuristics in column_writer.cpp (e.g., dictionary if cardinality is low; delta for sorted integers; plain otherwise).
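
The three heuristics named above can be sketched as a toy decision function. This is an illustration only: the 0.5 distinct-value ratio and the function name are invented for the example and are not taken from column_writer.cpp, whose actual rules and thresholds may differ.

```python
def choose_encoding(values, max_dictionary_ratio=0.5):
    """Return an encoding name for one page of values using toy heuristics."""
    if not values:
        return "PLAIN"
    # Low cardinality: few distinct values relative to the page size.
    if len(set(values)) / len(values) <= max_dictionary_ratio:
        return "RLE_DICTIONARY"
    # Sorted integers: consecutive differences are small and non-negative.
    all_ints = all(type(v) is int for v in values)
    if all_ints and all(a <= b for a, b in zip(values, values[1:])):
        return "DELTA_BINARY_PACKED"
    return "PLAIN"
```

A production writer would typically decide incrementally while buffering, for example by abandoning the dictionary once it grows past a size limit, rather than inspecting the whole page up front.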

Encryption

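parquet_crypto.cpp implements Parquet Modular Encryption for both reading and writing. Per the DuckDB documentation, keys are registered by name with PRAGMA add_parquet_key and selected through the encryption_config option (footer_key) of read_parquet and COPY ... TO; the footer and all columns are encrypted with that key. AES_GCM_V1 is the supported algorithm; files that declare AES_GCM_CTR_V1 are rejected.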

Statistics

parquet_statistics.cpp writes per-column min/max/null-count/distinct-count for every row group and every page. The reader uses these to:

  • Skip row groups whose stats prove no row matches a pushed-down filter.
  • Provide cardinality hints to the optimizer when binding parquet_scan.
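
Row-group skipping from min/max statistics can be sketched as an interval-overlap test. This is a minimal Python illustration for a single integer column and a range filter; the RowGroupStats class and row_groups_to_scan function are hypothetical names, not part of the extension.

```python
from dataclasses import dataclass

@dataclass
class RowGroupStats:
    index: int
    min_value: int
    max_value: int

def row_groups_to_scan(stats, lower=None, upper=None):
    """Keep row groups whose [min, max] range can overlap lower <= col <= upper."""
    keep = []
    for rg in stats:
        if lower is not None and rg.max_value < lower:
            continue  # every value in this row group is below the filter range
        if upper is not None and rg.min_value > upper:
            continue  # every value in this row group is above the filter range
        keep.append(rg.index)
    return keep

stats = [RowGroupStats(0, 1, 100), RowGroupStats(1, 101, 200), RowGroupStats(2, 201, 300)]
selected = row_groups_to_scan(stats, lower=150, upper=180)
```

The test is conservative: a kept row group may still contain no matching rows, so the filter must be re-applied to the decoded values. Skipping is most effective when the data is sorted or clustered on the filtered column.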

Variant shredding

parquet_shredding.cpp implements Parquet Variant shredding, which decomposes a VARIANT column into typed sub-columns for better compression and pushdown.
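
The idea behind shredding can be sketched with a toy column. In the Parquet shredding layout, a value that matches the chosen type goes into a typed_value column, and anything else stays in a residual value column. The sketch below is plain Python with a hypothetical function name; it keeps the residual as a Python object, whereas the real format stores it as binary-encoded Variant, and it does not distinguish a Variant null from a missing value.

```python
def shred_column(values, shredded_type=int):
    """Split each value into (typed_value, value): typed when it matches, residual otherwise."""
    typed_value, residual = [], []
    for v in values:
        if type(v) is shredded_type:
            typed_value.append(v)
            residual.append(None)
        else:
            typed_value.append(None)
            residual.append(v)
    return typed_value, residual

typed, rest = shred_column([1, "x", 3, {"a": 1}])
```

Because the typed column has a single physical type, it can carry ordinary min/max statistics and use the regular encodings, which is where the compression and pushdown benefits come from.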

Integration points

  • File system: All I/O goes through DuckDB's FileSystem, so httpfs, S3, Azure, and encrypted file systems work transparently.
  • Multi-file: parquet_multi_file_info.cpp plugs into src/common/multi_file/ to support hive partitions and globs in one scan.
  • Replacement scan: Registered in parquet_extension.cpp so file paths are auto-routed.
  • Compression codecs: Page compression uses the vendored libraries under third_party/ (snappy, zstd, and others); zstd_file_system.cpp wraps the same zstd library for whole-file compression.
  • Catalog: Bind-time payloads (ParquetReadBindData) participate in plan serialization through serialize_parquet.cpp.

Entry points for modification

  • Adding a new encoding decoder: drop a class into decoder/ and register it in the dispatch in column_reader.cpp.
  • Adding writer encodings: see writer/ and column_writer.cpp.
  • Adding a bind-time pushdown: see parquet_extension.cpp for pushdown_complex_filter and projection_pushdown registration.
  • Adjusting metadata caching: parquet_file_metadata_cache.cpp (LRU keyed on path + last-modified).
  • Tests: test/parquet/, test/sql/copy/parquet/, and the slow stress tests in test/sql/copy/parquet/parquet_*.test_slow.
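
The caching policy mentioned in the metadata-caching bullet (LRU keyed on path + last-modified) can be sketched as follows. This is a hypothetical Python illustration of that policy, not the C++ code in parquet_file_metadata_cache.cpp; the class name, capacity, and string payloads are invented for the example.

```python
from collections import OrderedDict

class MetadataCache:
    """Tiny LRU cache keyed on (path, last_modified)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, path, last_modified):
        key = (path, last_modified)
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, path, last_modified, metadata):
        key = (path, last_modified)
        self.entries[key] = metadata
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry

cache = MetadataCache(capacity=2)
cache.put("a.parquet", 100, "footer-a")
cache.put("b.parquet", 100, "footer-b")
cache.get("a.parquet", 100)              # touch a, so b becomes the eviction candidate
cache.put("c.parquet", 100, "footer-c")  # evicts b
```

Including the modification time in the key means a rewritten file never matches its stale entry; the old entry simply ages out of the cache.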

Key source files

| File | Purpose |
| --- | --- |
| extension/parquet/parquet_extension.cpp | Top-level registration. |
| extension/parquet/parquet_reader.cpp | Read path. |
| extension/parquet/parquet_writer.cpp | Write path. |
| extension/parquet/column_reader.cpp | Per-column reader. |
| extension/parquet/column_writer.cpp | Per-column writer. |
| extension/parquet/parquet_metadata.cpp | Metadata table functions. |
| extension/parquet/parquet_statistics.cpp | Statistics emit/consume. |
| extension/parquet/parquet_crypto.cpp | Modular Encryption. |
| extension/parquet/decoder/ | Page decoders. |
| extension/parquet/reader/ | Type-specific column readers. |

See extensions/json for the JSON extension and systems/common for the multi-file scan helpers reused here.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
