ggml-org/llama.cpp

llama-quantize

Active contributors: Georgi Gerganov

llama-quantize produces quantized GGUFs from higher-precision sources. It's a thin CLI wrapping llama_model_quantize from libllama. See tools/quantize/README.md for the full usage matrix.

Purpose

Take an f16/bf16/f32 GGUF (or, with --allow-requantize, an already-quantized GGUF) and write a smaller GGUF using one of the supported quantization recipes: k-quants, IQ-quants, MXFP4, or legacy block formats.
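To give a feel for how much smaller the output can be, here is a rough sketch that estimates tensor-data size from block layouts. The byte counts per block are the commonly documented ggml layouts for these types; the 7B parameter count is a hypothetical example, and the estimate assumes every weight uses a single type, which mixed recipes such as Q4_K_M do not do. Treat the numbers as ballpark figures, not as what llama-quantize will report.

```python
# (bytes per block, weights per block) for a few ggml block layouts.
# F16 is unquantized; the others store scales next to packed integers.
BLOCK_LAYOUTS = {
    "F16":  (2, 1),      # 2 bytes for 1 weight
    "Q8_0": (34, 32),    # fp16 scale + 32 int8 values
    "Q4_0": (18, 32),    # fp16 scale + 32 packed 4-bit values
    "Q4_K": (144, 256),  # super-block of 256 weights with sub-block scales
}


def bits_per_weight(type_name: str) -> float:
    """Average storage cost of one weight, in bits."""
    block_bytes, block_weights = BLOCK_LAYOUTS[type_name]
    return 8.0 * block_bytes / block_weights


def estimate_bytes(n_weights: int, type_name: str) -> int:
    """Rough size if every weight used one type (no per-tensor mixing)."""
    block_bytes, block_weights = BLOCK_LAYOUTS[type_name]
    n_blocks = -(-n_weights // block_weights)  # ceiling division
    return n_blocks * block_bytes


if __name__ == "__main__":
    n = 7_000_000_000  # hypothetical 7B-parameter model
    for name in BLOCK_LAYOUTS:
        gib = estimate_bytes(n, name) / 2**30
        print(f"{name:5s} {bits_per_weight(name):5.2f} bpw  ~{gib:6.2f} GiB")
```

The estimate ignores metadata, unquantized 1-D tensors, and any tensors a recipe keeps at higher precision, so real files will differ somewhat.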

Usage

llama-quantize [options] input.gguf output.gguf TYPE

TYPE is a recipe name like Q4_K_M, Q5_K_S, IQ3_XS, Q8_0, MXFP4, etc. Common flags:

Flag	Effect
`--imatrix path`	Use an importance matrix from `llama-imatrix` (most useful at low bit widths, such as the IQ-quants)
`--include-weights name`, `--exclude-weights name`	Apply the importance matrix only to the named tensors / to all tensors except the named ones
`--token-embedding-type TYPE`, `--output-tensor-type TYPE`	Override per-role types
`--keep-split`	Preserve a split-GGUF input layout
`--allow-requantize`	Allow input that is already quantized
`--pure`	Quantize every tensor to `TYPE` without per-tensor mixing
`--leave-output-tensor`	Skip quantizing the output projection
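When the tool is driven from a script, the command line can be assembled programmatically. The following is a minimal sketch, assuming options are placed before the positional arguments and that the binary is on PATH as `llama-quantize`; the helper name `build_quantize_cmd` and the file names are hypothetical. It only builds the argument list and does not run anything.

```python
import shlex
from typing import List, Optional


def build_quantize_cmd(
    input_path: str,
    output_path: str,
    recipe: str,
    imatrix: Optional[str] = None,
    allow_requantize: bool = False,
    pure: bool = False,
    leave_output_tensor: bool = False,
    binary: str = "llama-quantize",  # assumption: binary name on PATH
) -> List[str]:
    """Assemble an argv list using only flags from the table above."""
    cmd = [binary]
    if imatrix is not None:
        cmd += ["--imatrix", imatrix]
    if allow_requantize:
        cmd.append("--allow-requantize")
    if pure:
        cmd.append("--pure")
    if leave_output_tensor:
        cmd.append("--leave-output-tensor")
    # Assumption: positional arguments follow the options.
    cmd += [input_path, output_path, recipe]
    return cmd


if __name__ == "__main__":
    cmd = build_quantize_cmd(
        "model-f16.gguf", "model-q4_k_m.gguf", "Q4_K_M", imatrix="imatrix.dat"
    )
    print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

Passing the returned list to `subprocess.run(cmd, check=True)` would execute it; a list is used rather than a single string so that paths with spaces need no manual quoting.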

How it works

tools/quantize/quantize.cpp parses the recipe name into a llama_ftype, fills in a llama_model_quantize_params, and calls llama_model_quantize(input, output, &params) from include/llama.h. The implementation is in src/llama-quant.cpp; see Quantization for the per-tensor mixing rules and the role of imatrix.

graph LR
    Args[argv] -->|parse| Params[llama_model_quantize_params]
    Params --> Driver[llama_model_quantize in src/llama-quant.cpp]
    Driver -->|read tensors| Loader[model loader, no mmap]
    Driver -->|per-tensor type pick| Rules[mix rules + imatrix]
    Driver -->|quantize| Quants[ggml-quants.c]
    Quants --> Out[output.gguf]
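The "per-tensor type pick" step in the diagram can be illustrated with a toy model of how the CLI flags might interact. This is a sketch under stated assumptions, not the logic in src/llama-quant.cpp: `QuantParams` is a hypothetical, simplified stand-in for llama_model_quantize_params, `toy_mix_rule` is a single made-up rule, and the precedence between flags is assumed for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuantParams:
    """Hypothetical, simplified stand-in for llama_model_quantize_params."""
    default_type: str                           # type implied by the recipe
    pure: bool = False                          # --pure
    leave_output_tensor: bool = False           # --leave-output-tensor
    token_embedding_type: Optional[str] = None  # --token-embedding-type
    output_tensor_type: Optional[str] = None    # --output-tensor-type


def toy_mix_rule(name: str, default_type: str) -> str:
    """Made-up single rule standing in for the real mixing logic."""
    if name == "output.weight":
        return "Q6_K"
    return default_type


def pick_type(name: str, n_dims: int, current_type: str, p: QuantParams) -> str:
    # Assumption: 1-D tensors such as norms are copied through unchanged.
    if n_dims < 2:
        return current_type
    if name == "output.weight" and p.leave_output_tensor:
        return current_type
    # Assumption: explicit per-role overrides win over pure and mixing.
    if name == "output.weight" and p.output_tensor_type is not None:
        return p.output_tensor_type
    if name == "token_embd.weight" and p.token_embedding_type is not None:
        return p.token_embedding_type
    if p.pure:
        return p.default_type
    return toy_mix_rule(name, p.default_type)


if __name__ == "__main__":
    params = QuantParams(default_type="Q4_K")
    for name, dims in [("token_embd.weight", 2),
                       ("blk.0.attn_norm.weight", 1),
                       ("blk.0.ffn_down.weight", 2),
                       ("output.weight", 2)]:
        src = "F32" if dims == 1 else "F16"
        print(f"{name:26s} {src} -> {pick_type(name, dims, src, params)}")
```

The real rules also depend on the architecture, layer index, and whether an importance matrix is present, none of which this sketch models.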

Recipe naming

Most recipe names follow the pattern Q<bits>_<variant> or IQ<bits>_<variant>, where the variant is 0, 1, K, K_S, K_M, K_L, NL, XS, XXS, etc. The S/M/L suffixes on k-quants select different per-tensor mixes:

_S — smallest; aggressively quantize attention/FFN, keep embedding and output at higher bits.
_M — medium; the most common default.
_L — largest; preserves more high-precision tensors.
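As a small illustration of the naming scheme, the sketch below splits names such as Q4_K_M or IQ3_XS into family, bit width, and variant. The regular expression and the `parse_recipe` helper are hypothetical conveniences for this page; the tool itself looks names up in a table rather than parsing them, and names that do not fit the pattern (MXFP4, F16, and so on) simply return None here.

```python
import re
from typing import NamedTuple, Optional


class Recipe(NamedTuple):
    family: str   # "Q" for legacy/k-quants, "IQ" for IQ-quants
    bits: int     # nominal bit width from the name
    variant: str  # e.g. "0", "K", "K_M", "XS"


# One family prefix, one digit, then underscore-separated variant parts.
_RECIPE_RE = re.compile(r"^(IQ|Q)(\d)_([A-Z0-9]+(?:_[A-Z0-9]+)*)$")


def parse_recipe(name: str) -> Optional[Recipe]:
    """Split a recipe name into its parts, or return None if it does not fit."""
    m = _RECIPE_RE.match(name.upper())
    if m is None:
        return None
    return Recipe(family=m.group(1), bits=int(m.group(2)), variant=m.group(3))


if __name__ == "__main__":
    for name in ["Q4_K_M", "Q5_K_S", "IQ3_XS", "Q8_0", "MXFP4"]:
        print(name, "->", parse_recipe(name))
```

The nominal bit width in the name is not the exact storage cost: per-block scales and per-tensor mixing push the effective bits per weight above it.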

For IQ-quants, the variants encode codebook size and block layout.

Integration points

libllama — llama_model_quantize is the entry point.
ggml-quants.c — reference per-type kernels.
Backends — every backend that should support a new type needs its own dot-product kernel.
gguf-py/ — Python writers used by convert_hf_to_gguf.py produce f16/bf16 GGUFs that llama-quantize consumes.
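To make the split between reference kernels and backend dot-product kernels more concrete, here is a pure-Python sketch in the spirit of the Q8_0 format: 32 values share one scale, and a dot product accumulates integer products per block before applying the scales once. It is an illustration, not the code in ggml-quants.c; it keeps the scale as a Python float (the real format stores it as fp16), uses Python's `round`, and the function names are hypothetical.

```python
from typing import List, Tuple

QK = 32  # values per block, as in Q8_0

Block = Tuple[float, List[int]]  # (scale, 32 signed 8-bit quants)


def quantize_block(x: List[float]) -> Block:
    """Quantize 32 floats to int8 with one shared scale."""
    assert len(x) == QK
    amax = max(abs(v) for v in x)
    d = amax / 127.0
    if d == 0.0:
        return 0.0, [0] * QK
    return d, [int(round(v / d)) for v in x]


def dequantize_block(block: Block) -> List[float]:
    d, q = block
    return [d * qi for qi in q]


def dot_q8(a: List[Block], b: List[Block]) -> float:
    """Dot product over quantized blocks: integer sum first, then scale."""
    total = 0.0
    for (da, qa), (db, qb) in zip(a, b):
        isum = sum(x * y for x, y in zip(qa, qb))  # exact integer arithmetic
        total += da * db * isum
    return total


if __name__ == "__main__":
    x = [i / 10 - 1.6 for i in range(QK)]
    blk = quantize_block(x)
    err = max(abs(u - v) for u, v in zip(x, dequantize_block(blk)))
    exact = sum(v * v for v in x)
    print("max round-trip error:", err)
    print("exact dot:", exact, "quantized dot:", dot_q8([blk], [blk]))
```

Doing the inner sum in integers is what lets a backend map the kernel onto SIMD or GPU integer instructions, which is why each new block layout needs its own kernel per backend.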

Entry points for modification

New recipe name. Add the mapping in tools/quantize/quantize.cpp and the per-tensor rules in src/llama-quant.cpp.
New CLI option. Add it to tools/quantize/quantize.cpp and to llama_model_quantize_params in include/llama.h if it's library-level.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
