ggml-org/llama.cpp
gguf-py
Active contributors: CISC
gguf-py/ is a small Python package that reads and writes GGUF files. It is the Python counterpart to the C++ reader/writer in ggml/src/gguf.cpp, and it is what convert_hf_to_gguf.py uses to emit GGUFs.
Purpose
- Provide a pip-installable Python API for inspecting and creating GGUF files.
- Power the HuggingFace conversion pipeline (convert_hf_to_gguf.py).
- Document the canonical names for tensors, hparams, and special tokens that the C++ side expects.
Directory layout
gguf-py/
├── pyproject.toml # poetry-managed package config
├── README.md
├── examples/ # Tiny demo scripts
├── gguf/ # The package itself
│ ├── __init__.py
│ ├── constants.py # Python equivalent of LLM_KV_*, LLM_TENSOR_*
│ ├── gguf_reader.py # Read a GGUF file
│ ├── gguf_writer.py # Write a GGUF file
│ ├── tensor_mapping.py # HF tensor name -> canonical GGUF tensor name
│ ├── vocab.py # Tokenizer helpers
│ ├── lazy.py # Lazy mmap-backed numpy arrays
│ ├── quants.py # Pure-Python (slow) quantization for testing
│ └── ...
└── tests/
Key Python APIs
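import numpy as np
from gguf import GGUFReader, GGUFWriter, Keys, MODEL_ARCH, MODEL_TENSOR

# Read: look up a metadata field by key (returns None if the key is absent)
reader = GGUFReader("model.gguf")
print(reader.get_field(Keys.General.ARCHITECTURE))

# Write: metadata first, then tensors
writer = GGUFWriter("out.gguf", "llama")
# Keys.LLM.* are "{arch}."-prefixed templates, so fill in the architecture
writer.add_uint32(Keys.LLM.CONTEXT_LENGTH.format(arch="llama"), 4096)
numpy_array = np.zeros((8, 4), dtype=np.float32)  # placeholder weights
writer.add_tensor("token_embd.weight", numpy_array)
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()

The full key/tensor name catalog lives in gguf/constants.py and is the authoritative naming convention shared with src/llama-arch.cpp.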
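Many names in the catalog are templates rather than fixed strings: per-architecture metadata keys carry an `{arch}` placeholder and per-layer tensor names carry a `{bid}` (block id) placeholder. The sketch below copies three such templates by hand to show how they expand. The helper functions are hypothetical and exist only for illustration; they are not part of gguf-py.

```python
# Templates copied by hand for illustration; the real ones live in gguf/constants.py.
CONTEXT_LENGTH = "{arch}.context_length"
BLOCK_COUNT = "{arch}.block_count"
ATTN_Q = "blk.{bid}.attn_q"


def metadata_key(template, arch):
    """Expand a per-architecture metadata key (hypothetical helper)."""
    return template.format(arch=arch)


def tensor_name(template, bid, suffix="weight"):
    """Expand a per-block tensor name and append its suffix (hypothetical helper)."""
    return f"{template.format(bid=bid)}.{suffix}"


ctx_key = metadata_key(CONTEXT_LENGTH, "llama")
blk_key = metadata_key(BLOCK_COUNT, "llama")
q_name = tensor_name(ATTN_Q, 7)
print(ctx_key, blk_key, q_name)
```

Writing a template without expanding it would store the literal placeholder text as the key, which is why the architecture string has to be substituted before a key is written.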
How conversion works
convert_hf_to_gguf.py (~651 KB at the repo root) defines a Model base class plus one subclass per supported HuggingFace model family. Each subclass:
- Loads the HuggingFace model files (config.json, tokenizer.json, weight shards).
- Maps HF tensor names to canonical GGUF tensor names via tensor_mapping.py.
- Optionally upcasts/transposes weights to match ggml's expected layouts.
- Writes a GGUF using GGUFWriter.
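The name-mapping step can be pictured with a small, self-contained sketch. It assumes a hand-picked subset of LLaMA-style names and a hypothetical `map_tensor_name` helper; the real table in gguf/tensor_mapping.py is far larger and is organized differently.

```python
import re

# Hypothetical, hand-picked subset of a LLaMA-style mapping (illustration only).
HF_TO_GGUF = {
    "model.embed_tokens": "token_embd",
    "model.norm": "output_norm",
    "lm_head": "output",
    "model.layers.{bid}.self_attn.q_proj": "blk.{bid}.attn_q",
    "model.layers.{bid}.self_attn.k_proj": "blk.{bid}.attn_k",
    "model.layers.{bid}.self_attn.v_proj": "blk.{bid}.attn_v",
    "model.layers.{bid}.self_attn.o_proj": "blk.{bid}.attn_output",
}

_LAYER_RE = re.compile(r"\.layers\.(\d+)\.")


def map_tensor_name(hf_name):
    """Translate an HF tensor name to a GGUF-style name, or return None if unknown."""
    # Split off a trailing ".weight" / ".bias" so only the stem is looked up.
    stem, dot, suffix = hf_name.rpartition(".")
    if suffix not in ("weight", "bias"):
        stem, dot, suffix = hf_name, "", ""
    # Replace the concrete layer index with the "{bid}" placeholder for the lookup.
    match = _LAYER_RE.search(stem)
    key = _LAYER_RE.sub(".layers.{bid}.", stem) if match else stem
    target = HF_TO_GGUF.get(key)
    if target is None:
        return None
    if match:
        target = target.format(bid=match.group(1))
    return target + dot + suffix


for name in ("model.embed_tokens.weight", "model.layers.3.self_attn.q_proj.weight"):
    print(name, "->", map_tensor_name(name))
```

Returning None for an unmapped name lets the caller decide whether to skip the tensor or fail the conversion.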
The companion script convert_hf_to_gguf_update.py updates the pre-tokenizer hash table that src/llama-vocab.cpp consults. When a new BPE pre-tokenizer appears, it is added here.
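The idea behind a pre-tokenizer hash can be sketched without any HuggingFace dependency: tokenize a fixed probe text, hash the resulting token ids, and look the digest up in a table. The sketch assumes a SHA-256 digest over the stringified id list; the two toy tokenizers, the probe text, and the table contents are all hypothetical.

```python
import re
from hashlib import sha256

PROBE_TEXT = "Hello world 123"  # hypothetical probe; the real script uses its own text


def whitespace_tokenizer(text):
    """Toy tokenizer: one token id per whitespace-separated word."""
    return [sum(word.encode("utf-8")) for word in text.split()]


def digit_splitting_tokenizer(text):
    """Toy tokenizer: like the above, but every digit becomes its own token."""
    pieces = re.findall(r"\d|[^\s\d]+", text)
    return [sum(piece.encode("utf-8")) for piece in pieces]


def fingerprint(tokenize):
    """Hash the token ids that a tokenizer produces for the probe text."""
    return sha256(str(tokenize(PROBE_TEXT)).encode()).hexdigest()


# Table from digest to pre-tokenizer name (hypothetical names).
KNOWN_PRE_TOKENIZERS = {
    fingerprint(whitespace_tokenizer): "toy-whitespace",
    fingerprint(digit_splitting_tokenizer): "toy-digits",
}


def detect_pre_tokenizer(tokenize):
    """Return the registered name for a tokenizer, or None if its hash is unknown."""
    return KNOWN_PRE_TOKENIZERS.get(fingerprint(tokenize))


print(detect_pre_tokenizer(digit_splitting_tokenizer))
```

Two tokenizers that split the probe text differently yield different id lists and therefore different digests, which is what makes the digest usable as an identifier.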
Other conversion scripts:
- convert_lora_to_gguf.py: HuggingFace PEFT LoRA → GGUF adapter.
- convert_llama_ggml_to_gguf.py: legacy GGML/GGJT → GGUF.
- examples/convert_legacy_llama.py: original LLaMA weights → GGUF.
Lazy loading
gguf/lazy.py wraps numpy arrays in lazy mmap-backed proxies so that a 70B-parameter model can be opened without reading all of its weights into memory. The C++ loader uses the same trick on the reader side.
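The general pattern (record where a tensor lives, read its bytes only when asked) can be sketched with the standard library alone. `LazyFloatTensor` is a hypothetical class written for this illustration and does not mirror the actual classes in gguf/lazy.py.

```python
import mmap
import os
import struct
import tempfile


class LazyFloatTensor:
    """Hypothetical proxy: remembers where a float32 tensor lives, reads on demand."""

    def __init__(self, mm, offset, count):
        self._mm = mm
        self._offset = offset
        self._count = count
        self.loaded = False

    def materialize(self):
        """Decode the tensor's bytes from the mapping into a list of floats."""
        end = self._offset + 4 * self._count
        self.loaded = True
        return list(struct.unpack(f"<{self._count}f", self._mm[self._offset:end]))


# Write two tiny float32 "tensors" back to back into a scratch file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(struct.pack("<3f", 1.0, 2.0, 3.0))
    f.write(struct.pack("<2f", 4.0, 5.0))
    path = f.name

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    tensors = {"a": LazyFloatTensor(mm, 0, 3), "b": LazyFloatTensor(mm, 12, 2)}
    values_b = tensors["b"].materialize()  # only tensor "b" is decoded
    mm.close()
os.remove(path)

print(values_b, tensors["a"].loaded)
```

Creating the proxies costs nothing beyond the bookkeeping; the operating system pages data in only for the byte range that `materialize` actually slices.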
Tests
gguf-py/tests/ covers reader/writer round-trips. The Python tests are run by .github/workflows/python-* and locally via pytest gguf-py/tests/.
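A round-trip check can be illustrated at the byte level. The sketch below does not use gguf-py and is not taken from its test suite; it packs and re-reads only the fixed-size GGUF header (magic, version, tensor count, metadata key/value count, all little-endian) with the standard library. The function names are hypothetical.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"   # first four bytes of a GGUF file
HEADER_FMT = "<4sIQQ"  # magic, uint32 version, uint64 tensor count, uint64 KV count


def write_header(buf, version, n_tensors, n_kv):
    """Write the fixed-size header to a binary stream."""
    buf.write(struct.pack(HEADER_FMT, GGUF_MAGIC, version, n_tensors, n_kv))


def read_header(buf):
    """Read the fixed-size header and return its fields as a dict."""
    size = struct.calcsize(HEADER_FMT)
    magic, version, n_tensors, n_kv = struct.unpack(HEADER_FMT, buf.read(size))
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    return {"version": version, "tensor_count": n_tensors, "kv_count": n_kv}


# Round trip: what was written must come back unchanged.
stream = io.BytesIO()
write_header(stream, version=3, n_tensors=2, n_kv=5)
stream.seek(0)
header = read_header(stream)
print(header)
```

A full round-trip test would go on to compare every metadata value and tensor payload, but the write, rewind, read, compare shape stays the same.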
Integration points
- convert_hf_to_gguf.py: the primary user. Adding a new model architecture usually starts here.
- src/llama-vocab.cpp: pre-tokenizer hashes are generated by convert_hf_to_gguf_update.py.
- src/llama-arch.cpp: must agree with gguf/constants.py on every key name and tensor name.
Entry points for modification
- New tensor name. Add it to gguf/constants.py::MODEL_TENSOR and to gguf/tensor_mapping.py for the relevant arches. Mirror in src/llama-arch.cpp.
- New metadata key. Add to gguf/constants.py::Keys and to the LLM_KV_* enum in src/llama-arch.h.
- New conversion path. Subclass Model in convert_hf_to_gguf.py and register it.
- New tokenizer pre-type. Run convert_hf_to_gguf_update.py, add the corresponding case in src/llama-vocab.cpp, add fixtures to tests/test-tokenizer-*.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.