ggml-org/llama.cpp

Quantization

Active contributors: Georgi Gerganov, Johannes Gäßler

Quantization is what lets large models fit on small hardware, and it is one of llama.cpp's defining features. The runtime supports a wide range of quant types: k-quants, IQ-quants, MXFP4, the legacy Q4_0/Q5_0/Q8_0 family, and per-tensor mixes. The pipeline that produces those quants lives in src/llama-quant.cpp and the llama-quantize tool.

Concern	Where it lives
Quant type implementations (kernels, dequant, dot products)	`ggml/src/ggml-common.h`, `ggml/src/ggml-quants.c`, plus per-backend specializations under `ggml/src/ggml-<backend>/`
Quantization driver (read source model, decide per-tensor type, write output GGUF)	`src/llama-quant.cpp`, `tools/quantize/`

This page focuses on the driver. Per-format kernel details live on the GGML side; see Backends.

Purpose

Read a higher-precision GGUF file (typically F16, BF16, or F32).
For each tensor, pick a target ggml_type based on user options and (optionally) an importance matrix.
Write a new GGUF file with the chosen types and metadata.

Key abstractions

Type	Role	File
`enum ggml_type`	The quant types (`GGML_TYPE_F16`, `..._Q4_0`, `..._Q4_K`, `..._IQ2_XS`, `..._MXFP4`, ...)	`ggml/include/ggml.h`
Per-type block layout	Bit-packed format for each quant	`ggml/src/ggml-common.h`
`llama_model_quantize_params`	User options (target file type, thread count, output and token-embedding tensor types, importance-matrix data, per-tensor type overrides, ...)	`include/llama.h`
`llama_model_quantize_impl` (formerly `llama_quantize_internal`)	Orchestrator that walks tensors and dispatches by type	`src/llama-quant.cpp`
`--imatrix` data	Per-tensor activation statistics produced by `llama-imatrix`	`tools/imatrix/`

Supported quant families

Categories rather than an exhaustive list (the canonical list is enum ggml_type):

Family	Examples	Notes
Legacy block	`Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`, `Q8_0`	The original 32-element-block formats. Still loadable; superseded by k-quants.
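k-quants	`Q2_K`, `Q3_K_S/M/L`, `Q4_K_S/M`, `Q5_K_S/M`, `Q6_K`	The usual defaults, built on 256-element super-blocks. The `_S`/`_M`/`_L` suffixes name small/medium/large recipes that mix precisions across tensors. `Q8_K` is also in the enum, but it serves as an intermediate format for k-quant dot products rather than as a storage target.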
IQ-quants	`IQ1_S`, `IQ1_M`, `IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ2_M`, `IQ3_XXS`, `IQ3_S`, `IQ3_M`, `IQ4_NL`, `IQ4_XS`	Low-bit quants that benefit most from an importance matrix. Use with `--imatrix`; `llama-quantize` refuses the lowest-bit types without one.
Repacked `Q4_0`	`Q4_0_4_4`, `Q4_0_4_8`, `Q4_0_8_8`	Interleaved `Q4_0` block layouts for SIMD-friendly dot products on ARM.
Native low-bit	`MXFP4`	OCP MX FP4 format; added in 2025 alongside `gpt-oss` support.
Float	`F16`, `BF16`, `F32`	Pass-through and "upcast" types.
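To make the "32-element block" idea concrete, here is a minimal sketch of a Q4_0-style block: 32 weights share one scale and are stored as 4-bit codes, two per byte. The names `BlockQ4Sketch`, `quantize_block_q4`, `dequantize_block_q4`, and `bits_per_weight` are hypothetical and are not ggml's identifiers. The sketch also stores the scale as a `float` for simplicity, whereas the on-disk Q4_0 block uses a 16-bit scale (18 bytes per 32 weights).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int kBlockSize = 32;  // weights per block

// Hypothetical in-memory block; the real format stores `d` as fp16.
struct BlockQ4Sketch {
    float   d;                   // per-block scale
    uint8_t qs[kBlockSize / 2];  // two 4-bit codes per byte
};

// Quantize one block of 32 floats to 4-bit codes in [0, 15].
inline BlockQ4Sketch quantize_block_q4(const float * x) {
    assert(x != nullptr);
    // Find the element with the largest magnitude, keeping its sign.
    float amax = 0.0f, vmax = 0.0f;
    for (int i = 0; i < kBlockSize; ++i) {
        if (std::fabs(x[i]) > amax) { amax = std::fabs(x[i]); vmax = x[i]; }
    }
    BlockQ4Sketch b{};
    b.d = vmax / -8.0f;  // the largest-magnitude element maps to code 0
    const float id = (b.d != 0.0f) ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < kBlockSize / 2; ++i) {
        // x * id lies in [-8, 8]; adding 8.5 and truncating rounds to nearest.
        const int lo = std::min(15, (int) (x[i] * id + 8.5f));
        const int hi = std::min(15, (int) (x[i + kBlockSize / 2] * id + 8.5f));
        b.qs[i] = (uint8_t) (lo | (hi << 4));  // low nibble: first half of the block
    }
    return b;
}

// Reconstruct 32 floats from one block.
inline void dequantize_block_q4(const BlockQ4Sketch & b, float * y) {
    assert(y != nullptr);
    for (int i = 0; i < kBlockSize / 2; ++i) {
        y[i]                  = ((b.qs[i] & 0x0F) - 8) * b.d;
        y[i + kBlockSize / 2] = ((b.qs[i] >> 4) - 8) * b.d;
    }
}

// Storage cost in bits per weight for a given block layout.
constexpr double bits_per_weight(int block_bytes, int block_elems) {
    return 8.0 * block_bytes / block_elems;
}
```

With an 18-byte block covering 32 weights, `bits_per_weight(18, 32)` evaluates to 4.5, which is why a nominally 4-bit format costs slightly more than 4 bits per weight.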

How quantization runs

sequenceDiagram
    participant User
    participant Q as llama-quantize
    participant Loader as model loader
    participant Imatrix as imatrix file
    participant Driver as src/llama-quant.cpp
    participant GGML as ggml-quants.c

    User->>Q: in.gguf out.gguf Q4_K_M [--imatrix imat.gguf]
    Q->>Loader: open in.gguf (mmap on Linux/Windows)
    Q->>Imatrix: load activations (optional)
    Q->>Driver: llama_model_quantize(in, out, params)
    loop each tensor
        Driver->>Driver: pick per-tensor target type (mix)
        Driver->>GGML: ggml_quantize_chunk(type, src, dst, ...)
        GGML-->>Driver: quantized bytes
        Driver->>Driver: write to out.gguf
    end
    Driver-->>User: report original and quantized sizes + timing

The driver applies per-tensor mixing rules. For example, Q4_K_M quantizes most weights to Q4_K but keeps the output tensor and selected attention and FFN tensors at higher precision. The exact rules are encoded in src/llama-quant.cpp. Users can override them with options such as --token-embedding-type, --output-tensor-type, and --tensor-type.
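The shape of such a mixing rule can be sketched as a small function that maps a tensor name to a target type. Everything here is hypothetical: `QType`, `Overrides`, and `pick_type` are illustration names, and the rules are simplified stand-ins rather than the actual recipe in src/llama-quant.cpp.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a handful of ggml_type values.
enum class QType { None, F32, F16, Q4_K, Q6_K, Q8_0 };

// User overrides; QType::None means "no override given".
struct Overrides {
    QType token_embedding_type = QType::None;  // mirrors --token-embedding-type
    QType output_tensor_type   = QType::None;  // mirrors --output-tensor-type
};

inline bool ends_with(const std::string & s, const std::string & suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Pick a target type for one tensor. `src` is the tensor's current type,
// `base` is the default type of the requested recipe.
inline QType pick_type(const std::string & name, int n_dims, QType src, QType base,
                       const Overrides & ov) {
    assert(n_dims >= 1);
    // Assumption: 1-D tensors (norms, biases) and non-weight tensors are copied as-is.
    if (n_dims < 2 || !ends_with(name, "weight")) {
        return src;
    }
    // Explicit user overrides win over the recipe.
    if (name == "token_embd.weight" && ov.token_embedding_type != QType::None) {
        return ov.token_embedding_type;
    }
    if (name == "output.weight") {
        return ov.output_tensor_type != QType::None ? ov.output_tensor_type : QType::Q6_K;
    }
    // Illustrative rule: give quality-sensitive tensors more bits.
    if (ends_with(name, "attn_v.weight") || ends_with(name, "ffn_down.weight")) {
        return QType::Q6_K;
    }
    return base;
}
```

A driver loop would call something like `pick_type` once per tensor and then hand the chosen type to the quantization routine. Keeping the decision in one pure function makes the recipe easy to test in isolation.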

Importance matrix

llama-imatrix runs a calibration corpus through the model, accumulates activation statistics for each weight tensor (one value per input channel), and writes them to an imatrix GGUF file. When llama-quantize is given --imatrix, the driver passes these statistics to the quantization kernels as per-weight importance. The kernels then choose scales and codes that reduce error on the channels that matter most for the calibration data, improving quality at the same size. The lowest-bit IQ-quants depend on this the most.
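The effect of importance weights can be illustrated with a weighted least-squares fit of a block scale. This is a sketch of the general idea only, not the ggml kernels: `QuantResult`, `weighted_error`, and `quantize_weighted` are hypothetical names, and the candidate-scale search is deliberately simple. For fixed integer codes q, the scale that minimizes the weighted error sum of w_i (x_i - d q_i)^2 is d = (sum of w_i x_i q_i) / (sum of w_i q_i^2).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct QuantResult {
    float            d;    // fitted scale
    std::vector<int> q;    // integer codes in [-nmax, nmax]
    double           err;  // weighted squared error of (d, q)
};

// Weighted squared reconstruction error: sum_i w[i] * (x[i] - d * q[i])^2.
inline double weighted_error(const std::vector<float> & x, const std::vector<float> & w,
                             float d, const std::vector<int> & q) {
    assert(x.size() == w.size() && x.size() == q.size());
    double err = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double diff = x[i] - (double) d * q[i];
        err += w[i] * diff * diff;
    }
    return err;
}

// Try a few candidate scales, refit each one by weighted least squares,
// and keep the candidate with the lowest weighted error.
inline QuantResult quantize_weighted(const std::vector<float> & x, const std::vector<float> & w,
                                     int nmax) {
    assert(x.size() == w.size() && !x.empty() && nmax > 0);
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));

    // Baseline: all-zero codes.
    QuantResult best;
    best.d = 0.0f;
    best.q.assign(x.size(), 0);
    best.err = weighted_error(x, w, 0.0f, best.q);
    if (amax == 0.0f) return best;

    for (int step = -4; step <= 4; ++step) {
        const float d0 = amax / (nmax + 0.1f * step);  // candidate scale
        std::vector<int> q(x.size());
        double sum_wxq = 0.0, sum_wqq = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            const int qi = std::max(-nmax, std::min(nmax, (int) std::lround(x[i] / d0)));
            q[i] = qi;
            sum_wxq += (double) w[i] * x[i] * qi;
            sum_wqq += (double) w[i] * qi * qi;
        }
        if (sum_wqq <= 0.0) continue;
        const float  d   = (float) (sum_wxq / sum_wqq);  // closed-form refit
        const double err = weighted_error(x, w, d, q);
        if (err < best.err) { best.d = d; best.q = q; best.err = err; }
    }
    return best;
}
```

Passing all-ones for `w` gives an ordinary unweighted fit. Passing larger weights for some elements pulls the scale toward values that reconstruct those elements more accurately, at the expense of the rest.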

See imatrix tool.

Integration points

Loader. Reads quantized tensors transparently — every backend implements per-type dot products.
llama-quantize. Thin CLI in tools/quantize/quantize.cpp that calls llama_model_quantize.
llama-perplexity. The standard quality yardstick. Run before/after quantization to validate.
Backends. Each backend ships its own per-type kernels (ggml/src/ggml-cuda/mmvq.cuh, ggml/src/ggml-metal/ggml-metal.metal, etc.). Adding a new type requires touching every backend that should support it.

Adding a new quant type

CONTRIBUTING.md calls this out as carrying a "disproportionate maintenance burden." The minimum bar:

Define the block layout in ggml/src/ggml-common.h and the reference kernel in ggml/src/ggml-quants.c.
Add it to enum ggml_type in ggml/include/ggml.h.
Provide CPU dot product, dequantize, and quantize.
Convert a small model to GGUF using the new type and upload it to Hugging Face.
Provide perplexity and KL-divergence comparisons against the model's native precision (FP16 or BF16) and against existing types of similar size.
Provide llama-bench performance numbers on CPU.
Add it to the per-tensor mixing rules in src/llama-quant.cpp for relevant _S/_M/_L recipes.
Add tests/test-quantize-fns.cpp cases.

Backend-specific kernels can come in follow-up PRs.
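For readers unfamiliar with the two quality metrics in the checklist above, the following sketch shows how they are defined for a single token position and for a sequence. It is not the llama-perplexity implementation, and `log_softmax`, `kl_divergence`, and `perplexity` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Numerically stable log-softmax over one row of logits.
inline std::vector<double> log_softmax(const std::vector<float> & logits) {
    assert(!logits.empty());
    const double mx = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (float l : logits) sum += std::exp(l - mx);
    const double lse = mx + std::log(sum);
    std::vector<double> out;
    out.reserve(logits.size());
    for (float l : logits) out.push_back(l - lse);
    return out;
}

// KL(P_base || P_quant) for one token position, in nats.
// Zero means the quantized model predicts exactly the same distribution.
inline double kl_divergence(const std::vector<float> & base_logits,
                            const std::vector<float> & quant_logits) {
    assert(base_logits.size() == quant_logits.size());
    const std::vector<double> lp = log_softmax(base_logits);
    const std::vector<double> lq = log_softmax(quant_logits);
    double kl = 0.0;
    for (std::size_t i = 0; i < lp.size(); ++i) {
        kl += std::exp(lp[i]) * (lp[i] - lq[i]);
    }
    return kl;
}

// Perplexity = exp(-mean log-probability assigned to the correct tokens).
inline double perplexity(const std::vector<double> & token_log_probs) {
    assert(!token_log_probs.empty());
    double sum = 0.0;
    for (double lp : token_log_probs) sum += lp;
    return std::exp(-sum / (double) token_log_probs.size());
}
```

Perplexity measures how well a model predicts the reference text, while KL divergence measures how far the quantized model's predictions drift from the unquantized model's, which is why the checklist asks for both.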

Tests

tests/test-quantize-fns.cpp — round-trip and dot-product correctness vs the reference.
tests/test-quantize-perf.cpp — micro-benchmarks.
tests/test-quantize-stats.cpp — bias and error distribution.
tests/test-backend-ops.cpp — exercises quantized matmul on every backend.
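The kind of check these tests perform can be pictured with a small stand-alone example: quantize two vectors to 8-bit blocks, compute the dot product in the quantized domain, and compare it against the float reference. This is only a sketch under simplifying assumptions. `BlockQ8Sketch`, `quantize_q8`, `dequantize_q8`, `dot_q8`, and `dot_f32` are hypothetical names, and the scale is kept as a `float` instead of the 16-bit scale used by the real Q8_0 block.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kQ8Block = 32;  // weights per block

struct BlockQ8Sketch {
    float  d;             // per-block scale
    int8_t qs[kQ8Block];  // signed 8-bit codes
};

// Quantize a vector whose length is a multiple of the block size.
inline std::vector<BlockQ8Sketch> quantize_q8(const std::vector<float> & x) {
    assert(x.size() % kQ8Block == 0);
    std::vector<BlockQ8Sketch> out(x.size() / kQ8Block);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float * xb = x.data() + b * kQ8Block;
        float amax = 0.0f;
        for (int i = 0; i < kQ8Block; ++i) amax = std::max(amax, std::fabs(xb[i]));
        const float d  = amax / 127.0f;
        const float id = (d != 0.0f) ? 1.0f / d : 0.0f;
        out[b].d = d;
        for (int i = 0; i < kQ8Block; ++i) {
            out[b].qs[i] = (int8_t) std::lround(xb[i] * id);
        }
    }
    return out;
}

// Reconstruct floats from the quantized blocks (the round-trip half of the check).
inline std::vector<float> dequantize_q8(const std::vector<BlockQ8Sketch> & blocks) {
    std::vector<float> y;
    y.reserve(blocks.size() * kQ8Block);
    for (const BlockQ8Sketch & blk : blocks) {
        for (int i = 0; i < kQ8Block; ++i) y.push_back(blk.qs[i] * blk.d);
    }
    return y;
}

// Dot product in the quantized domain: integer accumulation per block,
// then one multiply by the two block scales.
inline float dot_q8(const std::vector<BlockQ8Sketch> & a, const std::vector<BlockQ8Sketch> & b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t k = 0; k < a.size(); ++k) {
        int32_t isum = 0;
        for (int i = 0; i < kQ8Block; ++i) isum += (int32_t) a[k].qs[i] * (int32_t) b[k].qs[i];
        sum += a[k].d * b[k].d * (float) isum;
    }
    return sum;
}

// Float reference the quantized result is compared against.
inline float dot_f32(const std::vector<float> & x, const std::vector<float> & y) {
    assert(x.size() == y.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) sum += (double) x[i] * y[i];
    return (float) sum;
}
```

A test built on these helpers would assert that each dequantized element is within half a quantization step of the original and that `dot_q8` stays within a chosen tolerance of `dot_f32`. The tolerances are a design choice that depends on the quant type.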
