ggml-org/llama.cpp
Quantization
Active contributors: Georgi Gerganov, Johannes Gäßler
Quantization is what makes large models fit on small hardware, and it is one of llama.cpp's defining features. The runtime handles a deep zoo of quant types — k-quants, IQ-quants, MXFP4, the legacy Q4_0/Q5_0/Q8_0 family, and per-tensor mixes. The build pipeline that produces those quants lives in src/llama-quant.cpp and the llama-quantize tool.
Two related concerns
| Concern | Where it lives |
|---|---|
| Quant type implementations (kernels, dequant, dot products) | ggml/src/ggml-common.h, ggml/src/ggml-quants.c, plus per-backend specializations under ggml/src/ggml-<backend>/ |
| Quantization driver (read source model, decide per-tensor type, write output GGUF) | src/llama-quant.cpp, tools/quantize/ |
This page focuses on the driver. Per-format kernel details live on the GGML side; see Backends.
Purpose
- Read a higher-precision GGUF file (typically F16, BF16, or F32).
- For each tensor, pick a target `ggml_type` based on user options and (optionally) an importance matrix.
- Write a new GGUF file with the chosen types and metadata.
Key abstractions
| Type | Role | File |
|---|---|---|
| `enum ggml_type` | The quant types (`GGML_TYPE_F16`, `..._Q4_0`, `..._Q4_K`, `..._IQ2_XS`, `..._MXFP4`, ...) | `ggml/include/ggml.h` |
| Per-type block layout | Bit-packed format for each quant | `ggml/src/ggml-common.h` |
| `llama_model_quantize_params` | User options (target type, imatrix path, output type, exclude-tensors, ...) | `include/llama.h` |
| `llama_quantize_internal` | Orchestrator that walks tensors and dispatches by type | `src/llama-quant.cpp` |
| `--imatrix` data | Per-tensor activation statistics produced by `llama-imatrix` | `tools/imatrix/` |
Supported quant families
Categories rather than an exhaustive list (the canonical list is enum ggml_type):
| Family | Examples | Notes |
|---|---|---|
| Legacy block | Q4_0, Q4_1, Q5_0, Q5_1, Q8_0 | The original 32-element-block formats. Still loadable; superseded by k-quants. |
| k-quants | Q2_K, Q3_K_S/M/L, Q4_K_S/M, Q5_K_S/M, Q6_K, Q8_K | The "good defaults". Mixed precisions per tensor; _M is medium, _S is small. |
| IQ-quants | IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, IQ3_XXS, IQ3_S, IQ3_M, IQ4_NL, IQ4_XS | Importance-matrix-aware quants. Use with --imatrix. |
| Sub-byte | Q4_0_4_4, Q4_0_4_8, Q4_0_8_8, IQ1_* | Special blocking for SIMD-friendly dot products on ARM. |
| Native low-bit | MXFP4 | OCP MX FP4 format; added in 2025 alongside gpt-oss support. |
| Float | F16, BF16, F32 | Pass-through and "upcast" types. |
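To make the "32-element block" idea concrete, here is a minimal sketch of a Q8_0-style round trip: each block of 32 weights shares one scale and stores 32 signed 8-bit integers. It is an illustration, not the ggml implementation. The names `BlockQ8`, `quantize_q8`, `dequantize_q8`, and `bits_per_weight` are hypothetical, and the scale is kept as a `float` for self-containment, whereas ggml stores it as fp16.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a legacy 32-element block.
// Assumption: the real ggml block stores the scale as fp16, not float.
constexpr int kBlockSize = 32;

struct BlockQ8 {
    float  d;               // per-block scale
    int8_t qs[kBlockSize];  // quantized values in [-127, 127]
};

// Quantize x (size must be a multiple of 32) block by block.
std::vector<BlockQ8> quantize_q8(const std::vector<float> & x) {
    assert(x.size() % kBlockSize == 0);
    std::vector<BlockQ8> out(x.size() / kBlockSize);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float * src = x.data() + b * kBlockSize;
        float amax = 0.0f;  // largest magnitude in this block
        for (int i = 0; i < kBlockSize; ++i) {
            amax = std::fmax(amax, std::fabs(src[i]));
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        out[b].d = d;
        for (int i = 0; i < kBlockSize; ++i) {
            out[b].qs[i] = static_cast<int8_t>(std::lround(src[i] * id));
        }
    }
    return out;
}

// Reconstruct approximate floats: value = scale * integer.
std::vector<float> dequantize_q8(const std::vector<BlockQ8> & blocks) {
    std::vector<float> y;
    y.reserve(blocks.size() * kBlockSize);
    for (const BlockQ8 & blk : blocks) {
        for (int i = 0; i < kBlockSize; ++i) {
            y.push_back(blk.d * static_cast<float>(blk.qs[i]));
        }
    }
    return y;
}

// Effective bits per weight for a block of block_bytes bytes holding 32 weights.
constexpr double bits_per_weight(int block_bytes) {
    return 8.0 * block_bytes / kBlockSize;
}
```

With ggml's fp16 scale, a Q8_0 block is 34 bytes (2 + 32) and a Q4_0 block is 18 bytes (2 + 16), so `bits_per_weight` gives 8.5 and 4.5 respectively. The per-block scale is why these formats cost slightly more than their nominal bit width.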
How quantization runs
sequenceDiagram
participant User
participant Q as llama-quantize
participant Loader as model loader
participant Imatrix as imatrix file
participant Driver as src/llama-quant.cpp
participant GGML as ggml-quants.c
User->>Q: in.gguf out.gguf Q4_K_M [--imatrix imat.gguf]
Q->>Loader: open in.gguf (no mmap; raw read)
Q->>Imatrix: load activations (optional)
Q->>Driver: llama_model_quantize(in, out, params)
loop each tensor
Driver->>Driver: pick per-tensor target type (mix)
Driver->>GGML: ggml_quantize_chunk(type, src, dst, ...)
GGML-->>Driver: quantized bytes
Driver->>Driver: write to out.gguf
end
Driver-->>User: report sizes + perplexity-style stats

The driver applies per-tensor mixing rules (e.g. Q4_K_M keeps the embedding and output layers at higher precision while quantizing attention/FFN weights more aggressively). The exact rules are encoded in src/llama-quant.cpp. Users can override them with --token-embedding-type, --output-tensor-type, --exclude-tensors, etc.
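The shape of such a mixing rule can be sketched as a small function from tensor name to target type. Everything here is hypothetical: `QType`, `MixOptions`, and `pick_type` are illustrative names, and the choice of Q6_K for the embedding and output tensors is an assumption made for the example, not the actual recipe in src/llama-quant.cpp. The tensor names `token_embd.weight` and `output.weight` follow the usual GGUF naming.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a small subset of enum ggml_type.
// None means "no override" in options and "leave the tensor as is" in results.
enum class QType { None, Q4_K, Q6_K, Q8_0 };

// Hypothetical options mirroring --token-embedding-type / --output-tensor-type.
struct MixOptions {
    QType base                 = QType::Q4_K;  // default target for most weights
    QType token_embedding_type = QType::None;  // user override, if any
    QType output_tensor_type   = QType::None;  // user override, if any
};

// Pick a target type for one tensor (illustrative rule only).
QType pick_type(const std::string & name, int n_dims, const MixOptions & opt) {
    assert(n_dims >= 1);
    // 1-D tensors (norms, biases) are tiny; this sketch leaves them untouched.
    if (n_dims < 2) {
        return QType::None;
    }
    // Embedding and output layers get higher precision unless overridden.
    if (name == "token_embd.weight") {
        return opt.token_embedding_type != QType::None ? opt.token_embedding_type
                                                       : QType::Q6_K;
    }
    if (name == "output.weight") {
        return opt.output_tensor_type != QType::None ? opt.output_tensor_type
                                                     : QType::Q6_K;
    }
    // Everything else (attention/FFN weights) uses the base type.
    return opt.base;
}
```

Keeping the rule a pure function of tensor metadata and options makes the override precedence easy to read: an explicit user choice wins, then the recipe's special cases, then the base type.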
Importance matrix
llama-imatrix runs a calibration corpus through the model, captures per-tensor activation magnitudes, and writes them to a .imatrix GGUF file. When llama-quantize --imatrix is provided, the driver passes the activations to the IQ-quant kernels; they bias their codebooks toward channels that matter for the calibration data, improving quality at the same size.
See imatrix tool.
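One way to picture how importance data can steer quantization is as a weighted error: instead of minimizing plain squared error, the quantizer minimizes squared error scaled by each element's importance. The sketch below does a simple grid search over candidate scales under that objective. It is an assumption-laden illustration, not the IQ-quant kernels: `weighted_error` and `best_scale` are hypothetical names, and the search range (0.5x to 1.5x of the naive scale) is arbitrary.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Importance-weighted squared error when x is rounded to integers in
// [-qmax, qmax] with scale d. w[i] is the importance of element i.
double weighted_error(const std::vector<float> & x, const std::vector<float> & w,
                      float d, int qmax) {
    assert(x.size() == w.size() && d > 0.0f && qmax > 0);
    double err = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        long q = std::lround(x[i] / d);
        q = std::max<long>(-qmax, std::min<long>(qmax, q));  // clamp to range
        const double diff = x[i] - d * static_cast<double>(q);
        err += w[i] * diff * diff;
    }
    return err;
}

// Grid search for the scale with the lowest weighted error.
// Starts from the naive scale amax / qmax, so the result is never worse than it.
float best_scale(const std::vector<float> & x, const std::vector<float> & w, int qmax) {
    assert(x.size() == w.size() && qmax > 0);
    float amax = 0.0f;
    for (float v : x) {
        amax = std::max(amax, std::fabs(v));
    }
    if (amax == 0.0f) {
        return 1.0f;  // all zeros: any positive scale reproduces x exactly
    }
    const float base = amax / qmax;
    float  best_d   = base;
    double best_err = weighted_error(x, w, base, qmax);
    for (int k = 0; k <= 40; ++k) {
        const float  d = base * (0.5f + 0.025f * k);  // 0.5x .. 1.5x of base
        const double e = weighted_error(x, w, d, qmax);
        if (e < best_err) {
            best_err = e;
            best_d   = d;
        }
    }
    return best_d;
}
```

When the weights are heavily skewed toward small-magnitude elements, the search can prefer a smaller scale that clips the rare large values, because the error on those values counts for little. That is the intuition behind biasing the quantizer toward channels that matter for the calibration data.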
Integration points
- Loader. Reads quantized tensors transparently — every backend implements per-type dot products.
- `llama-quantize`. Thin CLI in `tools/quantize/quantize.cpp` that calls `llama_model_quantize`.
- `llama-perplexity`. The standard quality yardstick. Run before/after quantization to validate.
- Backends. Each backend ships its own per-type kernels (`ggml/src/ggml-cuda/mmvq.cuh`, `ggml/src/ggml-metal/ggml-metal.metal`, etc.). Adding a new type requires touching every backend that should support it.
Adding a new quant type
CONTRIBUTING.md calls this out as carrying a "disproportionate maintenance burden." The minimum bar:
- Define the block layout in `ggml/src/ggml-common.h` and the reference kernel in `ggml/src/ggml-quants.c`.
- Add it to `enum ggml_type` in `ggml/include/ggml.h`.
- Provide CPU dot product, dequantize, and quantize.
- Convert a small model and upload it to HuggingFace.
- Provide perplexity and KL-divergence comparisons vs the native FP16/BF16 and vs types of similar size.
- Provide `llama-bench` performance numbers on CPU.
- Add it to the per-tensor mixing rules in `src/llama-quant.cpp` for relevant `_S`/`_M`/`_L` recipes.
- Add `tests/test-quantize-fns.cpp` cases.
Backend-specific kernels can come in follow-up PRs.
Tests
- `tests/test-quantize-fns.cpp`: round-trip and dot-product correctness vs the reference.
- `tests/test-quantize-perf.cpp`: micro-benchmarks.
- `tests/test-quantize-stats.cpp`: bias and error distribution.
- `tests/test-backend-ops.cpp`: exercises quantized matmul on every backend.
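A dot-product correctness check of the kind listed above can be sketched as follows: quantize two rows, compute the dot product with integer accumulation per block, and compare against a plain float reference. This is a simplified illustration under stated assumptions, not code from the test suite. `Block`, `quantize_row`, `dot_quantized`, and `dot_reference` are hypothetical names, and the block uses a `float` scale rather than ggml's fp16.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kQK = 32;  // elements per block

// Hypothetical 8-bit block: one float scale plus 32 signed integers.
struct Block {
    float  d;
    int8_t qs[kQK];
};

// Quantize a row whose length is a multiple of 32.
std::vector<Block> quantize_row(const std::vector<float> & x) {
    assert(x.size() % kQK == 0);
    std::vector<Block> out(x.size() / kQK);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float * src = x.data() + b * kQK;
        float amax = 0.0f;
        for (int i = 0; i < kQK; ++i) amax = std::fmax(amax, std::fabs(src[i]));
        out[b].d = amax / 127.0f;
        const float id = out[b].d != 0.0f ? 1.0f / out[b].d : 0.0f;
        for (int i = 0; i < kQK; ++i) {
            out[b].qs[i] = static_cast<int8_t>(std::lround(src[i] * id));
        }
    }
    return out;
}

// Dot product on quantized data: integer sum inside each block,
// then one multiply by the product of the two block scales.
float dot_quantized(const std::vector<Block> & a, const std::vector<Block> & b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t k = 0; k < a.size(); ++k) {
        int32_t isum = 0;
        for (int i = 0; i < kQK; ++i) {
            isum += static_cast<int32_t>(a[k].qs[i]) * static_cast<int32_t>(b[k].qs[i]);
        }
        sum += a[k].d * b[k].d * static_cast<float>(isum);
    }
    return sum;
}

// Float reference the quantized result is compared against.
float dot_reference(const std::vector<float> & x, const std::vector<float> & y) {
    assert(x.size() == y.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) sum += x[i] * y[i];
    return sum;
}
```

A test would then assert that `std::fabs(dot_reference(x, y) - dot_quantized(quantize_row(x), quantize_row(y)))` stays under a tolerance chosen for the type. Accumulating in integers per block and applying the scales once is what makes quantized matmul cheaper than dequantizing first.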
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.