ggml-org/llama.cpp

gguf-py

Active contributors: CISC

gguf-py/ is a small Python package that reads and writes GGUF files. It is the Python counterpart to the C++ reader/writer in ggml/src/gguf.cpp, and it is what convert_hf_to_gguf.py uses to emit GGUFs.

Purpose

  • Provide a pip-installable Python API for inspecting and creating GGUF files.
  • Power the HuggingFace conversion pipeline (convert_hf_to_gguf.py).
  • Document the canonical names for tensors, hparams, and special tokens that the C++ side expects.

Directory layout

gguf-py/
├── pyproject.toml         # poetry-managed package config
├── README.md
├── examples/              # Tiny demo scripts
├── gguf/                  # The package itself
│   ├── __init__.py
│   ├── constants.py       # Python equivalent of LLM_KV_*, LLM_TENSOR_*
│   ├── gguf_reader.py     # Read a GGUF file
│   ├── gguf_writer.py     # Write a GGUF file
│   ├── tensor_mapping.py  # HF tensor name -> canonical GGUF tensor name
│   ├── vocab.py           # Tokenizer helpers
│   ├── lazy.py            # Lazy (deferred-evaluation) numpy array wrappers
│   ├── quants.py          # Pure-Python (slow) quantization for testing
│   └── ...
└── tests/

Key Python APIs

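import numpy as np
from gguf import GGUFReader, GGUFWriter, Keys, MODEL_ARCH, MODEL_TENSOR

# Read: get_field returns a ReaderField, or None if the key is absent.
reader = GGUFReader("model.gguf")
print(reader.get_field(Keys.General.ARCHITECTURE))

# Write: metadata and tensors are queued, then flushed in a fixed order.
writer = GGUFWriter("out.gguf", "llama")
# Keys.LLM.* are templates such as "{arch}.context_length"; fill in the arch.
writer.add_uint32(Keys.LLM.CONTEXT_LENGTH.format(arch="llama"), 4096)
# Placeholder tensor data; a real converter passes the model's weights.
numpy_array = np.zeros((8, 4), dtype=np.float32)
writer.add_tensor("token_embd.weight", numpy_array)
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()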

The full key/tensor name catalog lives in gguf/constants.py and is the authoritative naming convention shared with src/llama-arch.cpp.
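
To make the naming convention concrete, here is a minimal sketch of how templated names expand. It assumes the `{arch}.` and `blk.{bid}.` template forms; the dictionaries and helper functions are hypothetical stand-ins for illustration, not the package's API.

```python
from typing import Optional

# Hypothetical stand-ins for a few catalog entries (assumed values).
KEY_TEMPLATES = {
    "architecture": "general.architecture",      # global key, no placeholder
    "context_length": "{arch}.context_length",   # namespaced by architecture
    "block_count": "{arch}.block_count",
}

TENSOR_TEMPLATES = {
    "token_embd": "token_embd",        # one tensor per model
    "attn_q": "blk.{bid}.attn_q",      # one tensor per transformer block
}


def metadata_key(name: str, arch: str) -> str:
    """Expand a metadata key template for one architecture."""
    return KEY_TEMPLATES[name].format(arch=arch)


def tensor_name(name: str, bid: Optional[int] = None, suffix: str = ".weight") -> str:
    """Expand a tensor name template for one block index and append a suffix."""
    return TENSOR_TEMPLATES[name].format(bid=bid) + suffix


print(metadata_key("context_length", "llama"))
print(tensor_name("attn_q", bid=3))
```

Keeping names as templates in one place is what lets the Python and C++ sides stay in agreement: both expand the same pattern rather than hard-coding per-model strings.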

How conversion works

convert_hf_to_gguf.py (~651 KB at the repo root) defines a Model base class plus one subclass per supported HuggingFace model family. Each subclass:

  1. Loads the HuggingFace model files (config.json, tokenizer.json, weight shards).
  2. Maps HF tensor names to canonical GGUF tensor names via tensor_mapping.py.
  3. Optionally upcasts/transposes weights to match ggml's expected layouts.
  4. Writes a GGUF using GGUFWriter.
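
Step 2 above can be sketched as a table lookup with a block-index placeholder. This is a simplified illustration with a small assumed subset of names; the table and the `map_tensor_name` helper are hypothetical and do not reproduce the logic in tensor_mapping.py.

```python
import re
from typing import Optional, Tuple

# Simplified, assumed subset of an HF -> GGUF name table.
# "{bid}" stands for the transformer block index.
HF_TO_GGUF = {
    "model.embed_tokens": "token_embd",
    "model.norm": "output_norm",
    "lm_head": "output",
    "model.layers.{bid}.self_attn.q_proj": "blk.{bid}.attn_q",
    "model.layers.{bid}.self_attn.k_proj": "blk.{bid}.attn_k",
    "model.layers.{bid}.mlp.down_proj": "blk.{bid}.ffn_down",
}

_BLOCK_RE = re.compile(r"\.layers\.(\d+)\.")


def map_tensor_name(
    hf_name: str, suffixes: Tuple[str, ...] = (".weight", ".bias")
) -> Optional[str]:
    """Return the GGUF name for an HF tensor name, or None if unmapped."""
    # Split off a known suffix so the lookup works on the base name.
    suffix = next((s for s in suffixes if hf_name.endswith(s)), "")
    base = hf_name[: len(hf_name) - len(suffix)] if suffix else hf_name

    match = _BLOCK_RE.search(base)
    if match:
        bid = match.group(1)
        # Swap the concrete block index for the placeholder before lookup.
        key = base[: match.start(1)] + "{bid}" + base[match.end(1):]
        target = HF_TO_GGUF.get(key)
        return target.format(bid=bid) + suffix if target else None

    target = HF_TO_GGUF.get(base)
    return target + suffix if target else None


print(map_tensor_name("model.layers.12.self_attn.q_proj.weight"))
```

Returning None for unknown names lets the caller decide whether an unmapped tensor is an error or something to skip.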

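The companion script convert_hf_to_gguf_update.py regenerates the pre-tokenizer hash table. convert_hf_to_gguf.py uses that table to identify a model's BPE pre-tokenizer and records its name in the GGUF; src/llama-vocab.cpp then selects the matching pre-tokenization rules by that name. When a new BPE pre-tokenizer enters the wild, it gets added here.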
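
The idea behind a pre-tokenizer hash table can be sketched as follows: tokenize a fixed probe text, hash the resulting token IDs, and look the digest up in a table. The sketch assumes the fingerprint is a SHA-256 digest of the stringified ID list; the token IDs, the table entry, and the function names are made up for illustration.

```python
from hashlib import sha256
from typing import List, Optional


def fingerprint(token_ids: List[int]) -> str:
    """Hash the token IDs a tokenizer produced for a fixed probe text."""
    return sha256(str(token_ids).encode()).hexdigest()


# Hypothetical table: fingerprint -> pre-tokenizer name.
# Real entries come from real tokenizers; this one uses made-up IDs.
KNOWN_PRE_TOKENIZERS = {
    fingerprint([101, 7592, 2088, 102]): "example-bpe",
}


def identify_pre_tokenizer(token_ids: List[int]) -> Optional[str]:
    """Return the known pre-tokenizer name, or None if it is unrecognized."""
    return KNOWN_PRE_TOKENIZERS.get(fingerprint(token_ids))


print(identify_pre_tokenizer([101, 7592, 2088, 102]))
print(identify_pre_tokenizer([1, 2, 3]))
```

Two tokenizers that split the probe text differently produce different ID lists and therefore different digests, which is why an unrecognized digest signals that a new pre-tokenizer needs to be added.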

Other conversion scripts:

  • convert_lora_to_gguf.py — HuggingFace PEFT LoRA → GGUF adapter.
  • convert_llama_ggml_to_gguf.py — Legacy GGML/GGJT → GGUF.
  • examples/convert_legacy_llama.py — Original LLaMA weights → GGUF.

Lazy loading

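gguf/lazy.py wraps numpy arrays in lazy proxies that defer work until the data is actually needed, and GGUFReader memory-maps the file rather than reading it eagerly, so a 70B-parameter model can be opened without reading all weights into memory. The C++ loader uses the same mmap trick on the reader side.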
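
The memory-mapping idea can be illustrated with the standard library alone. The sketch below uses Python's `mmap` and `struct` modules rather than the package's own classes, and the helper names and the file name are hypothetical.

```python
import mmap
import os
import struct
import tempfile


def write_floats(path: str, values) -> None:
    """Write values as packed little-endian float32."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(values)}f", *values))


def read_float_slice(path: str, start: int, count: int):
    """Read `count` float32 values starting at element `start` via mmap.

    Only the requested byte range is decoded into Python objects;
    the rest of the file is left to the operating system's page cache.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return struct.unpack_from(f"<{count}f", mm, start * 4)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.bin")
    write_floats(path, [float(i) for i in range(1000)])
    window = read_float_slice(path, 500, 3)

print(window)
```

The same principle scales to multi-gigabyte weight files: opening the mapping is cheap, and cost is paid only for the regions that are touched.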

Tests

gguf-py/tests/ covers reader/writer round-trips. The Python tests are run by .github/workflows/python-* and locally via pytest gguf-py/tests/.
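
A round-trip check boils down to "write, read back, compare". The sketch below shows that shape with the standard library only, serializing a GGUF header (magic, version, tensor count, key/value count) followed by uint32 metadata. It is an illustration, not the package's test code: it handles no tensors and no other value types, and the function names are hypothetical.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"
GGUF_VERSION = 3
GGUF_TYPE_UINT32 = 4  # value type ID for uint32 metadata


def write_minimal_gguf(kv: dict) -> bytes:
    """Serialize a header plus uint32 key/value pairs (no tensors)."""
    out = io.BytesIO()
    out.write(GGUF_MAGIC)
    # version (uint32), tensor count (uint64), key/value count (uint64)
    out.write(struct.pack("<IQQ", GGUF_VERSION, 0, len(kv)))
    for key, value in kv.items():
        raw = key.encode("utf-8")
        out.write(struct.pack("<Q", len(raw)) + raw)   # length-prefixed key
        out.write(struct.pack("<II", GGUF_TYPE_UINT32, value))
    return out.getvalue()


def read_minimal_gguf(data: bytes) -> dict:
    """Parse what write_minimal_gguf produced and return the metadata."""
    buf = io.BytesIO(data)
    if buf.read(4) != GGUF_MAGIC:
        raise ValueError("missing GGUF magic")
    version, n_tensors, n_kv = struct.unpack("<IQQ", buf.read(20))
    if version != GGUF_VERSION or n_tensors != 0:
        raise ValueError("sketch only handles version 3 files without tensors")
    kv = {}
    for _ in range(n_kv):
        (key_len,) = struct.unpack("<Q", buf.read(8))
        key = buf.read(key_len).decode("utf-8")
        value_type, value = struct.unpack("<II", buf.read(8))
        if value_type != GGUF_TYPE_UINT32:
            raise ValueError(f"unsupported value type {value_type}")
        kv[key] = value
    return kv


original = {"llama.context_length": 4096, "llama.block_count": 32}
restored = read_minimal_gguf(write_minimal_gguf(original))
assert restored == original
```

A test against the real package would follow the same pattern with GGUFWriter and GGUFReader in place of the two helper functions.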

Integration points

  • convert_hf_to_gguf.py — the primary user. Adding a new model architecture usually starts here.
  • src/llama-vocab.cpp — pre-tokenizer hashes are generated by convert_hf_to_gguf_update.py.
  • src/llama-arch.cpp — must agree with gguf/constants.py on every key name and tensor name.

Entry points for modification

  • New tensor name. Add it to gguf/constants.py::MODEL_TENSOR and to gguf/tensor_mapping.py for the relevant arches. Mirror in src/llama-arch.cpp.
  • New metadata key. Add to gguf/constants.py::Keys and to the LLM_KV_* enum in src/llama-arch.h.
  • New conversion path. Subclass Model in convert_hf_to_gguf.py and register it.
  • New tokenizer pre-type. Run convert_hf_to_gguf_update.py, add the corresponding case in src/llama-vocab.cpp, add fixtures to tests/test-tokenizer-*.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
