ggml-org/llama.cpp

gguf-py

Active contributors: CISC

gguf-py/ is a small Python package that reads and writes GGUF files. It is the Python counterpart to the C++ reader/writer in ggml/src/gguf.cpp, and it is what convert_hf_to_gguf.py uses to emit GGUFs.

Purpose

Provide a pip-installable Python API for inspecting and creating GGUF files.
Power the HuggingFace conversion pipeline (convert_hf_to_gguf.py).
Document the canonical names for tensors, hparams, and special tokens that the C++ side expects.

Directory layout

gguf-py/
├── pyproject.toml         # poetry-managed package config
├── README.md
├── examples/              # Tiny demo scripts
├── gguf/                  # The package itself
│   ├── __init__.py
│   ├── constants.py       # Python equivalent of LLM_KV_*, LLM_TENSOR_*
│   ├── gguf_reader.py     # Read a GGUF file
│   ├── gguf_writer.py     # Write a GGUF file
│   ├── tensor_mapping.py  # HF tensor name -> canonical GGUF tensor name
│   ├── vocab.py           # Tokenizer helpers
│   ├── lazy.py            # Lazy tensor wrappers that defer NumPy operations until the data is needed
│   ├── quants.py          # NumPy-based (slow) quantization and dequantization
│   └── ...
└── tests/

Key Python APIs

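import numpy as np
from gguf import GGUFReader, GGUFWriter, Keys, MODEL_ARCH, MODEL_TENSOR

# Read: get_field() returns a ReaderField, or None if the key is absent.
reader = GGUFReader("model.gguf")
field = reader.get_field(Keys.General.ARCHITECTURE)
# contents() decodes the raw field parts (available in recent gguf-py releases).
print(field.contents() if field is not None else "architecture key not found")

# Write: the second argument is the architecture name, used to expand "{arch}" in key names.
writer = GGUFWriter("out.gguf", "llama")
writer.add_context_length(4096)  # stored under "llama.context_length"
numpy_array = np.zeros((32, 8), dtype=np.float32)  # placeholder tensor data
writer.add_tensor("token_embd.weight", numpy_array)
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()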

The full key/tensor name catalog lives in gguf/constants.py and is the authoritative naming convention shared with src/llama-arch.cpp.
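To make the file format behind GGUFReader and GGUFWriter more concrete, here is a minimal standard-library sketch that parses only the fixed-size header. It assumes the little-endian layout used from GGUF version 2 onward (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key/value count). The function name read_gguf_header is hypothetical and not part of gguf-py.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def read_gguf_header(stream):
    """Return (version, tensor_count, kv_count) from a binary stream.

    Assumption: little-endian file using the version >= 2 header layout.
    """
    magic = stream.read(4)
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    # "<IQQ" = uint32 version, uint64 tensor count, uint64 KV count (20 bytes).
    version, tensor_count, kv_count = struct.unpack("<IQQ", stream.read(20))
    return version, tensor_count, kv_count


# Build a fake header in memory so the sketch runs without a model file.
fake_file = io.BytesIO(GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5))
header = read_gguf_header(fake_file)
print(header)  # (3, 2, 5)
```

For real files, GGUFReader remains the better choice: it also decodes the metadata fields and exposes the tensor data, which this sketch does not attempt.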

How conversion works

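convert_hf_to_gguf.py (~651 KB at the repo root) defines a base converter class (Model in older revisions; ModelBase, with TextModel and MmprojModel beneath it, in newer ones) plus one subclass per supported HuggingFace model family. Each subclass: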

Loads the HuggingFace model files (config.json, tokenizer.json, weight shards).
Maps HF tensor names to canonical GGUF tensor names via tensor_mapping.py.
Optionally upcasts/transposes weights to match ggml's expected layouts.
Writes a GGUF using GGUFWriter.
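The name-mapping step above can be sketched in plain Python. This is an illustrative approximation, not the code in tensor_mapping.py: the table is a small hand-picked subset, and the helper map_tensor_name is a hypothetical name. It assumes that per-layer names carry a numeric block index, written here as a {bid} placeholder.

```python
import re

# Illustrative subset; "{bid}" stands for the block (layer) index.
HF_TO_GGUF = {
    "model.embed_tokens": "token_embd",
    "model.norm": "output_norm",
    "lm_head": "output",
    "model.layers.{bid}.self_attn.q_proj": "blk.{bid}.attn_q",
    "model.layers.{bid}.mlp.down_proj": "blk.{bid}.ffn_down",
}


def map_tensor_name(hf_name, suffixes=(".weight", ".bias")):
    """Translate an HF tensor name to its GGUF name, or return None if unknown."""
    # Split off a known suffix so the lookup is done on the base name only.
    suffix = next((s for s in suffixes if hf_name.endswith(s)), "")
    base = hf_name[: len(hf_name) - len(suffix)] if suffix else hf_name
    # Replace the first ".<digits>." component with the {bid} placeholder.
    match = re.search(r"\.(\d+)\.", base)
    if match:
        bid = match.group(1)
        key = base[: match.start()] + ".{bid}." + base[match.end():]
    else:
        bid = None
        key = base
    target = HF_TO_GGUF.get(key)
    if target is None:
        return None
    return target.format(bid=bid) + suffix


print(map_tensor_name("model.layers.7.self_attn.q_proj.weight"))  # blk.7.attn_q.weight
print(map_tensor_name("model.embed_tokens.weight"))               # token_embd.weight
```

Returning None for unmapped names lets the caller decide whether an unknown tensor is an error or can be skipped.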

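The companion script convert_hf_to_gguf_update.py regenerates the pre-tokenizer detection table inside convert_hf_to_gguf.py, which maps a hash of a tokenizer's output to a pre-tokenizer name. src/llama-vocab.cpp then selects its pre-tokenization rules by that name. When a new BPE pre-tokenizer appears, it is added to the update script first.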
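The fingerprinting idea behind pre-tokenizer detection can be sketched as follows. The sketch assumes the fingerprint is a SHA-256 digest of the stringified token IDs that a tokenizer produces for a fixed probe text; the token IDs, the table contents, and the function names below are all hypothetical.

```python
import hashlib


def pretokenizer_fingerprint(token_ids):
    """Hash the token IDs a tokenizer produced for a fixed probe string."""
    return hashlib.sha256(str(token_ids).encode()).hexdigest()


# Hypothetical lookup table: fingerprint -> pre-tokenizer name.
KNOWN_PRE_TOKENIZERS = {
    pretokenizer_fingerprint([1, 15043, 3186]): "example-bpe",
}


def identify_pre_tokenizer(token_ids):
    """Return the registered pre-tokenizer name, or None if it is unknown."""
    return KNOWN_PRE_TOKENIZERS.get(pretokenizer_fingerprint(token_ids))


print(identify_pre_tokenizer([1, 15043, 3186]))  # example-bpe
print(identify_pre_tokenizer([1, 2, 3]))         # None
```

Two tokenizers that split the probe text differently yield different ID sequences and therefore different fingerprints, which is why an unrecognised fingerprint signals that a new entry is needed.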

Other conversion scripts:

convert_lora_to_gguf.py — HuggingFace PEFT LoRA → GGUF adapter.
convert_llama_ggml_to_gguf.py — Legacy GGML/GGJT → GGUF.
examples/convert_legacy_llama.py — Original LLaMA weights → GGUF.

Integration points

convert_hf_to_gguf.py is the primary user. Adding a new model architecture usually starts here.
src/llama-vocab.cpp matches on the pre-tokenizer names that convert_hf_to_gguf_update.py registers in the converter.
src/llama-arch.cpp must agree with gguf/constants.py on every key name and tensor name.

Entry points for modification

New tensor name. Add it to gguf/constants.py::MODEL_TENSOR and to gguf/tensor_mapping.py for the relevant arches. Mirror in src/llama-arch.cpp.
New metadata key. Add to gguf/constants.py::Keys and to the LLM_KV_* enum in src/llama-arch.h.
New conversion path. Subclass the converter base class in convert_hf_to_gguf.py and register it with the base class's register decorator.
New tokenizer pre-type. Run convert_hf_to_gguf_update.py, add the corresponding case in src/llama-vocab.cpp, add fixtures to tests/test-tokenizer-*.
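The "subclass and register" pattern for a new conversion path can be illustrated with a toy registry. This is a self-contained sketch of the general decorator technique, not the converter's actual code: ConverterBase, from_architecture, ExampleConverter, and the architecture strings are hypothetical names chosen for illustration.

```python
class ConverterBase:
    """Toy stand-in for the converter's base class (hypothetical name)."""

    _registry = {}  # HF architecture string -> converter subclass

    @classmethod
    def register(cls, *hf_architectures):
        """Class decorator that records a subclass under one or more names."""
        def decorator(subclass):
            for name in hf_architectures:
                cls._registry[name] = subclass
            return subclass
        return decorator

    @classmethod
    def from_architecture(cls, name):
        """Look up the converter registered for an HF architecture string."""
        try:
            return cls._registry[name]
        except KeyError:
            raise NotImplementedError(f"architecture {name!r} is not supported") from None


@ConverterBase.register("ExampleForCausalLM")
class ExampleConverter(ConverterBase):
    gguf_arch = "example"  # hypothetical GGUF architecture name


converter_cls = ConverterBase.from_architecture("ExampleForCausalLM")
print(converter_cls.__name__)  # ExampleConverter
```

Keeping the registry on the base class means a new model family only needs its own subclass and decorator line; the lookup code does not change.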

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
