ggml-org/llama.cpp

llama-quantize

Active contributors: Georgi Gerganov

llama-quantize produces quantized GGUFs from higher-precision sources. It's a thin CLI wrapping llama_model_quantize from libllama. See tools/quantize/README.md for the full usage matrix.

Purpose

Take an f16/bf16/f32 GGUF (or, with --allow-requantize, an already-quantized one) and write a smaller GGUF using one of the supported ggml_type recipes: k-quants, IQ-quants, MXFP4, or legacy block formats.
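
To make "smaller" concrete, the sketch below works out bits per weight for a few legacy block formats. The block layouts in the comments (32 weights per block, an fp16 scale, packed quants) are stated from the ggml block definitions as I understand them; the helper names are made up for illustration, and the size estimate ignores metadata and per-tensor type mixing.

```python
# Bytes per block for a few legacy block formats, each covering 32 weights.
# Assumed layouts: Q8_0 = fp16 scale + 32 int8 values; Q4_0 = fp16 scale +
# 16 bytes of packed 4-bit values; Q4_1 adds an fp16 minimum; Q5_0 adds
# 4 bytes that hold the fifth bit of each weight.
BLOCK_WEIGHTS = 32
BLOCK_BYTES = {"Q4_0": 18, "Q4_1": 20, "Q5_0": 22, "Q8_0": 34}


def bits_per_weight(fmt: str) -> float:
    """Effective storage cost of one weight, including the per-block scale."""
    return BLOCK_BYTES[fmt] * 8 / BLOCK_WEIGHTS


def estimated_gib(n_weights: int, fmt: str) -> float:
    """Rough tensor-data size in GiB if every tensor used the same format."""
    return n_weights * bits_per_weight(fmt) / 8 / 2**30


if __name__ == "__main__":
    for fmt in BLOCK_BYTES:
        print(f"{fmt}: {bits_per_weight(fmt):.2f} bits/weight")
```

The per-block scale is why Q4_0 costs 4.5 bits per weight rather than 4. Real outputs will differ from this estimate because most recipes mix types across tensors.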

Usage

llama-quantize input.gguf output.gguf TYPE

TYPE is a recipe name such as Q4_K_M, Q5_K_S, IQ3_XS, Q8_0, or MXFP4. Flags go before the input path. Common flags:

Flag                                                    Effect
--imatrix path                                          Use an importance matrix from llama-imatrix (best with IQ-quants)
--include-weights name, --exclude-weights name          Apply / do not apply the importance matrix to the named tensors
--token-embedding-type TYPE, --output-tensor-type TYPE  Override the type used for the token-embedding / output tensor
--keep-split                                            Preserve a split-GGUF input layout
--allow-requantize                                      Allow input that is already quantized
--pure                                                  Quantize every tensor to TYPE without per-tensor mixing
--leave-output-tensor                                   Skip quantizing the output projection
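
As a sketch of how the flags combine on one command line, the helper below assembles an argv list for llama-quantize. The function name, its parameters, and the file names are hypothetical; it assumes the binary is on PATH and that flags precede the positional arguments, as the tool's usage text lists them.

```python
import shlex


def build_quantize_argv(src, dst, recipe, imatrix=None, pure=False,
                        allow_requantize=False, binary="llama-quantize"):
    """Assemble an argv list with flags first, then input, output, and TYPE."""
    argv = [binary]
    if imatrix:
        argv += ["--imatrix", imatrix]
    if pure:
        argv.append("--pure")
    if allow_requantize:
        argv.append("--allow-requantize")
    argv += [src, dst, recipe]
    return argv


if __name__ == "__main__":
    argv = build_quantize_argv("model-f16.gguf", "model-q4_k_m.gguf", "Q4_K_M",
                               imatrix="imatrix.gguf")
    print(shlex.join(argv))
    # To actually run it: subprocess.run(argv, check=True)
```

Building a list rather than a shell string avoids quoting problems with paths that contain spaces.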

How it works

tools/quantize/quantize.cpp parses the recipe name into a llama_ftype, fills in a llama_model_quantize_params, and calls llama_model_quantize(input, output, &params) from include/llama.h. The implementation is in src/llama-quant.cpp; see Quantization for the per-tensor mixing rules and the role of imatrix.

graph LR
    Args[argv] -->|parse| Params[llama_model_quantize_params]
    Params --> Driver[llama_model_quantize in src/llama-quant.cpp]
    Driver -->|read tensors| Loader[model loader, no mmap]
    Driver -->|per-tensor type pick| Rules[mix rules + imatrix]
    Driver -->|quantize| Quants[ggml-quants.c]
    Quants --> Out[output.gguf]
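
To give a feel for the "quantize" step in the graph, here is a minimal Python sketch of absmax block quantization in the style of Q8_0: one scale per block plus one signed 8-bit value per weight. It is an illustration under simplifying assumptions, not the code in ggml-quants.c; in particular it keeps the scale as a Python float, whereas the on-disk format stores it in half precision.

```python
import math


def _round_half_away(x: float) -> int:
    """Round to nearest, ties away from zero (like C's roundf)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize_block_q8(values):
    """Return (scale, quants) so that scale * quants[i] approximates values[i]."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0
    inv = 1.0 / scale if scale else 0.0  # an all-zero block gets scale 0
    return scale, [_round_half_away(v * inv) for v in values]


def dequantize_block_q8(scale, quants):
    """Reconstruct approximate float weights from one block."""
    return [scale * q for q in quants]


if __name__ == "__main__":
    block = [i / 10 - 1.6 for i in range(32)]
    scale, quants = quantize_block_q8(block)
    restored = dequantize_block_q8(scale, quants)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"scale={scale:.6f} worst abs error={worst:.6f}")
```

The reconstruction error per weight is bounded by half the scale, which is why blocks with a single large outlier lose precision on their small values.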

Recipe naming

The standard naming is Q<bits>_<variant> for legacy and k-quants and IQ<bits>_<variant> for IQ-quants, where variant is 0, 1, K, K_S, K_M, K_L, NL, XS, XXS, etc. The K_S/M/L variants in k-quants apply different per-tensor mixes:

  • _S — smallest; aggressively quantize attention/FFN, keep embedding and output at higher bits.
  • _M — medium; the most common default.
  • _L — largest; preserves more high-precision tensors.

For IQ-quants, the variants encode codebook size and block layout.
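
The naming scheme can be illustrated with a small parser. This is a sketch that assumes names such as Q4_K_M and IQ3_XS split into a family prefix, a bit width, and a variant suffix; the Recipe type and parse_recipe function are hypothetical and are not how quantize.cpp maps names to llama_ftype. Names that do not fit the pattern, such as MXFP4, are reported as None.

```python
import re
from typing import NamedTuple, Optional


class Recipe(NamedTuple):
    family: str   # "Q" (legacy or k-quant) or "IQ"
    bits: int     # nominal bit width taken from the name
    variant: str  # e.g. "0", "K_M", "XS"


_PATTERN = re.compile(r"^(IQ|Q)(\d+)_([A-Z0-9_]+)$")


def parse_recipe(name: str) -> Optional[Recipe]:
    """Split a recipe name into its parts, or return None if it does not fit."""
    m = _PATTERN.match(name.upper())
    if m is None:
        return None
    return Recipe(m.group(1), int(m.group(2)), m.group(3))


if __name__ == "__main__":
    for name in ["Q4_K_M", "Q8_0", "IQ3_XS", "MXFP4"]:
        print(name, "->", parse_recipe(name))
```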

Importance matrix

To get the best quality from IQ-quants, run llama-imatrix first to produce an activation profile, then pass --imatrix imatrix.gguf to llama-quantize. See imatrix tool and the IQ-quant references.
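
The two-step flow can be sketched as a pair of command lines. The helper name and the file names are placeholders; the sketch assumes both binaries are on PATH and that llama-imatrix takes the model with -m, the calibration text with -f, and the output path with -o.

```python
import shlex


def imatrix_pipeline(src, calibration_text, dst, recipe,
                     imatrix_path="imatrix.gguf"):
    """Return the two argv lists: collect the imatrix, then quantize with it."""
    collect = ["llama-imatrix", "-m", src, "-f", calibration_text,
               "-o", imatrix_path]
    quantize = ["llama-quantize", "--imatrix", imatrix_path, src, dst, recipe]
    return [collect, quantize]


if __name__ == "__main__":
    steps = imatrix_pipeline("model-f16.gguf", "calibration.txt",
                             "model-iq3_xs.gguf", "IQ3_XS")
    for argv in steps:
        print(shlex.join(argv))
        # To actually run each step: subprocess.run(argv, check=True)
```

The importance matrix is collected from the high-precision model, and the same model is then passed to llama-quantize as the input.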

Validation

After quantization, validate quality with llama-perplexity and speed with llama-bench. The CONTRIBUTING.md standard for new quant types is to provide perplexity, KL-divergence, and benchmark numbers vs FP16/BF16 and vs same-size types.
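
For readers unfamiliar with the two quality metrics, the sketch below gives their textbook definitions in Python. It is not how llama-perplexity is implemented; it only shows what the reported numbers mean, using made-up inputs.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary.

    p would be the reference (e.g. FP16) next-token distribution and q the
    quantized model's distribution for the same context.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


if __name__ == "__main__":
    # A model that assigns probability 0.25 to every correct token.
    print(perplexity([math.log(0.25)] * 4))
    # Identical distributions have zero divergence.
    print(kl_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]))
```

Perplexity compares each model against the text, while KL divergence compares the quantized model directly against the reference model, so it can expose drift that perplexity averages away.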

Integration points

  • libllama. llama_model_quantize is the entry point.
  • ggml-quants.c — reference per-type kernels.
  • Backends — every backend that should support a new type needs its own dot-product kernel.
  • gguf-py/ — Python writers used by convert_hf_to_gguf.py produce f16/bf16 GGUFs that llama-quantize consumes.
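
To show why each backend needs a kernel per type, here is a minimal sketch of a dot product computed directly on quantized blocks, assuming a Q8_0-style layout of one scale plus a list of signed 8-bit values per block. The function name and data layout are illustrative only; real kernels work on packed bytes with SIMD or GPU code.

```python
def dot_q8_blocks(blocks_a, blocks_b):
    """Dot product of two vectors stored as (scale, quants) blocks.

    The integer products are accumulated per block and scaled once, so the
    weights are never expanded back to floats individually.
    """
    total = 0.0
    for (scale_a, quants_a), (scale_b, quants_b) in zip(blocks_a, blocks_b):
        isum = sum(x * y for x, y in zip(quants_a, quants_b))
        total += scale_a * scale_b * isum
    return total


if __name__ == "__main__":
    a = [(0.5, [2, -4]), (1.0, [1, 1])]   # dequantizes to [1, -2, 1, 1]
    b = [(2.0, [1, 1]), (0.25, [4, 8])]   # dequantizes to [2, 2, 1, 2]
    print(dot_q8_blocks(a, b))
```

Because the inner loop depends on how each format packs its bits and scales, a new type needs a matching kernel in every backend that should run it.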

Entry points for modification

  • New recipe name. Add the mapping in tools/quantize/quantize.cpp and the per-tensor rules in src/llama-quant.cpp.
  • New CLI option. Add it to tools/quantize/quantize.cpp and to llama_model_quantize_params in include/llama.h if it's library-level.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
