ggml-org/llama.cpp
llama-quantize
Active contributors: Georgi Gerganov
llama-quantize produces quantized GGUFs from higher-precision sources. It's a thin CLI wrapping llama_model_quantize from libllama. See tools/quantize/README.md for the full usage matrix.
Purpose
Take an f16/bf16/f32 GGUF (or, with --allow-requantize, an already-quantized one) and write a smaller GGUF using one of the supported ggml_type recipes: k-quants, IQ-quants, MXFP4, or legacy block formats.
Usage
llama-quantize input.gguf output.gguf TYPE
TYPE is a recipe name like Q4_K_M, Q5_K_S, IQ3_XS, Q8_0, MXFP4, etc. Common flags:
| Flag | Effect |
|---|---|
| --imatrix path | Use an importance matrix from llama-imatrix (best with IQ-quants) |
| --include-weights name, --exclude-weights name | Restrict / exempt specific tensors |
| --token-embedding-type TYPE, --output-tensor-type TYPE | Override per-role types |
| --keep-split | Preserve a split-GGUF input layout |
| --allow-requantize | Allow input that is already quantized |
| --pure | Quantize every tensor to TYPE without per-tensor mixing |
| --leave-output-tensor | Skip quantizing the output projection |
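As a rough illustration of how these flags combine, the sketch below assembles an argument list in Python without executing anything. The helper name build_quantize_cmd and its keyword parameters are hypothetical; only the flag spellings come from the table above. Placing the options before the positional arguments is an assumption about the expected ordering, so check llama-quantize --help for your build.

```python
import shlex


def build_quantize_cmd(src, dst, recipe, imatrix=None, pure=False,
                       allow_requantize=False, leave_output_tensor=False,
                       keep_split=False, binary="llama-quantize"):
    """Return an argv list for llama-quantize (hypothetical helper, runs nothing)."""
    cmd = [binary]
    if imatrix is not None:
        cmd += ["--imatrix", imatrix]
    if pure:
        cmd.append("--pure")
    if allow_requantize:
        cmd.append("--allow-requantize")
    if leave_output_tensor:
        cmd.append("--leave-output-tensor")
    if keep_split:
        cmd.append("--keep-split")
    # Assumption: options come first, then input, output, and recipe name.
    cmd += [src, dst, recipe]
    return cmd


def as_shell(cmd):
    """Render an argv list as a copy-pasteable, safely quoted shell line."""
    return shlex.join(cmd)
```

Calling as_shell(build_quantize_cmd("model-f16.gguf", "model-Q4_K_M.gguf", "Q4_K_M", imatrix="imatrix.gguf")) yields a single line that can be reviewed before it is run by hand.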
How it works
tools/quantize/quantize.cpp parses the recipe name into a llama_ftype, fills in a llama_model_quantize_params, and calls llama_model_quantize(input, output, &params) from include/llama.h. The implementation is in src/llama-quant.cpp; see Quantization for the per-tensor mixing rules and the role of imatrix.
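The per-tensor type pick in that flow can be pictured with a toy Python model. Everything here is illustrative: QuantParams, pick_type, plan, and the rules themselves are hypothetical stand-ins, and they do not reproduce the actual logic in src/llama-quant.cpp.

```python
from dataclasses import dataclass


@dataclass
class QuantParams:
    """Hypothetical stand-in for llama_model_quantize_params."""
    base_type: str
    pure: bool = False
    quantize_output_tensor: bool = True


def pick_type(name, n_dims, src_type, params):
    """Toy per-tensor rule: return the type this tensor would be written as."""
    if n_dims < 2:
        return src_type              # 1-D tensors (norms, biases) stay as-is in this sketch
    if name == "output.weight" and not params.quantize_output_tensor:
        return src_type              # the idea behind --leave-output-tensor
    if params.pure:
        return params.base_type      # the idea behind --pure: no per-tensor mixing
    if name == "output.weight":
        return "Q6_K"                # assumed bump for a sensitive tensor, for illustration only
    return params.base_type


def plan(tensors, params):
    """Map each (name, n_dims, src_type) tuple to its chosen output type."""
    return {name: pick_type(name, n_dims, src, params)
            for name, n_dims, src in tensors}
```

Separating the plan from the actual quantization keeps the mixing rules easy to inspect: the driver can decide every output type first and only then hand each tensor to the matching kernel.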
graph LR
Args[argv] -->|parse| Params[llama_model_quantize_params]
Params --> Driver[llama_model_quantize in src/llama-quant.cpp]
Driver -->|read tensors| Loader[model loader, no mmap]
Driver -->|per-tensor type pick| Rules[mix rules + imatrix]
Driver -->|quantize| Quants[ggml-quants.c]
Quants --> Out[output.gguf]
Recipe naming
Most recipe names follow the pattern Q<bits>_<variant> or IQ<bits>_<variant>, where variant is 0, 1, K, K_S, K_M, K_L, NL, XS, XXS, etc. The K_S/M/L variants in k-quants apply different per-tensor mixes:
- _S: smallest; aggressively quantize attention/FFN, keep embedding and output at higher bits.
- _M: medium; the most common default.
- _L: largest; preserves more high-precision tensors.
For IQ-quants, the variants encode codebook size and block layout.
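To make the naming scheme concrete, here is a small Python sketch that splits a recipe name into family, bit width, and variant. The function parse_recipe and its regular expression are hypothetical and cover only names shaped like Q4_K_M or IQ3_XS; names that do not fit that shape (MXFP4, for instance) simply return None.

```python
import re

# Family prefix (IQ or Q), nominal bit width, then the variant suffix.
_RECIPE = re.compile(r"^(IQ|Q)(\d+)_([A-Z0-9_]+)$")


def parse_recipe(name):
    """Return (family, bits, variant) or None if the name has another shape."""
    match = _RECIPE.match(name)
    if match is None:
        return None
    family, bits, variant = match.groups()
    return family, int(bits), variant
```

For example, parse_recipe("Q4_K_M") gives ("Q", 4, "K_M") and parse_recipe("IQ3_XS") gives ("IQ", 3, "XS").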
Importance matrix
To get the best quality from IQ-quants, run llama-imatrix first to produce an activation profile, then pass --imatrix imatrix.gguf to llama-quantize. See imatrix tool and the IQ-quant references.
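One way to picture what an importance matrix buys: when some weights matter more to the activations than others, the quantizer can choose its scale to reduce error where it counts. The Python sketch below is a simplified illustration of that idea (a weighted least-squares refit of a block scale), not the algorithm llama.cpp uses; all function names and the qmax default are assumptions made for the example.

```python
def quantize_block(values, qmax=7):
    """Round values to integers in [-qmax, qmax] using a max-magnitude scale."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return 0.0, [0] * len(values)
    d = amax / qmax
    q = [max(-qmax, min(qmax, round(v / d))) for v in values]
    return d, q


def refit_scale(values, q, weights, fallback):
    """Scale minimizing sum(w * (v - d*q)^2) for fixed integers q."""
    num = sum(w * v * qi for w, v, qi in zip(weights, values, q))
    den = sum(w * qi * qi for w, qi in zip(weights, q))
    return num / den if den > 0 else fallback


def weighted_error(values, q, d, weights):
    """Importance-weighted squared reconstruction error."""
    return sum(w * (v - d * qi) ** 2 for w, v, qi in zip(weights, values, q))
```

With the integers held fixed, the refit scale is the least-squares optimum, so its weighted error can never exceed that of the max-magnitude scale. The importance weights decide which elements the scale favors.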
Validation
After quantization, validate quality with llama-perplexity and speed with llama-bench. The CONTRIBUTING.md standard for new quant types is to provide perplexity, KL-divergence, and benchmark numbers vs FP16/BF16 and vs same-size types.
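For readers unfamiliar with the two quality metrics, the following Python sketch shows the textbook definitions. It is not the llama-perplexity implementation; the function names are hypothetical, and the inputs are assumed to be per-token natural-log probabilities and discrete distributions over the same vocabulary.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def kl_divergence(p, q):
    """KL(p || q), with p from the reference model and q from the quantized one."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

Lower is better for both: a quantized model whose perplexity stays close to the FP16/BF16 baseline and whose KL-divergence against it is near zero has lost little.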
Integration points
- libllama: llama_model_quantize is the entry point.
- ggml-quants.c: reference per-type kernels.
- Backends: every backend that should support a new type needs its own dot-product kernel.
- gguf-py/: Python writers used by convert_hf_to_gguf.py produce f16/bf16 GGUFs that llama-quantize consumes.
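To show why a quantized type needs both a reference quantizer and a matching dot-product kernel, here is a pure-Python sketch of a Q8_0-style block: 32 weights stored as int8 values plus one half-precision scale (34 bytes per block, or 8.5 bits per weight). It follows the general shape of the format but is a teaching aid, not the ggml code; Python's round() also breaks ties differently from C's roundf.

```python
import struct

QK8_0 = 32  # weights per block


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize_q8_0(values):
    """Quantize one block of 32 floats to (fp16 scale, list of int8 values)."""
    assert len(values) == QK8_0
    amax = max(abs(v) for v in values)
    d = amax / 127.0
    inv = 1.0 / d if d else 0.0
    q = [round(v * inv) for v in values]
    return to_fp16(d), q


def dequantize_q8_0(block):
    """Reconstruct approximate floats from a block."""
    d, q = block
    return [d * x for x in q]


def dot_q8_0(block_a, block_b):
    """Dot product of two blocks: integer accumulate, then apply both scales."""
    (da, qa), (db, qb) = block_a, block_b
    return da * db * sum(x * y for x, y in zip(qa, qb))
```

The dot product never dequantizes element by element; it sums integer products and applies the two scales once per block. That is the operation each backend has to reimplement for every type it supports.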
Entry points for modification
- New recipe name. Add the mapping in tools/quantize/quantize.cpp and the per-tensor rules in src/llama-quant.cpp.
- New CLI option. Add it to tools/quantize/quantize.cpp and to llama_model_quantize_params in include/llama.h if it's library-level.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.