ggml-org/llama.cpp

Quantization

Active contributors: Georgi Gerganov, Johannes Gäßler

Quantization is what lets large models fit on small hardware, and it is one of llama.cpp's defining features. The runtime supports a wide range of quant types: k-quants, IQ-quants, MXFP4, the legacy Q4_0/Q5_0/Q8_0 family, and per-tensor mixes. The pipeline that produces those quants lives in src/llama-quant.cpp and the llama-quantize tool.

| Concern | Where it lives |
| --- | --- |
| Quant type implementations (kernels, dequant, dot products) | ggml/src/ggml-common.h, ggml/src/ggml-quants.c, plus per-backend specializations under ggml/src/ggml-<backend>/ |
| Quantization driver (read source model, decide per-tensor type, write output GGUF) | src/llama-quant.cpp, tools/quantize/ |

This page focuses on the driver. Per-format kernel details live on the GGML side; see Backends.

Purpose

  • Read a higher-precision GGUF file (typically F16, BF16, or F32).
  • For each tensor, pick a target ggml_type based on user options and (optionally) an importance matrix.
  • Write a new GGUF file with the chosen types and metadata.
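
As a rough illustration of the first step, the sketch below parses the fixed-size header at the start of a GGUF file. It assumes GGUF version 2 or later, where the tensor count and metadata key/value count are little-endian 64-bit integers. The function name `read_gguf_header` is made up for this example and is not part of llama.cpp; the real loader also reads the metadata and tensor descriptors that follow.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header (assumes GGUF v2+ field widths)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("buffer too small for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={data[:4]!r}")
    # Little-endian: uint32 version, uint64 tensor count, uint64 KV count.
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}


# Synthetic header for illustration; the counts are arbitrary.
example_header = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
info = read_gguf_header(example_header)
print(info)
```

In practice the bytes would come from the first 24 bytes of the input file rather than from a synthetic buffer.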

Key abstractions

| Abstraction | Role | File |
| --- | --- | --- |
| enum ggml_type | The quant types (GGML_TYPE_F16, ..._Q4_0, ..._Q4_K, ..._IQ2_XS, ..._MXFP4, ...) | ggml/include/ggml.h |
| Per-type block layout | Bit-packed format for each quant | ggml/src/ggml-common.h |
| llama_model_quantize_params | User options (target type, imatrix path, output type, exclude-tensors, ...) | include/llama.h |
| llama_quantize_internal | Orchestrator that walks tensors and dispatches by type | src/llama-quant.cpp |
| --imatrix data | Per-tensor activation statistics produced by llama-imatrix | tools/imatrix/ |

Supported quant families

Categories rather than an exhaustive list (the canonical list is enum ggml_type):

| Family | Examples | Notes |
| --- | --- | --- |
| Legacy block | Q4_0, Q4_1, Q5_0, Q5_1, Q8_0 | The original 32-element-block formats. Still loadable; superseded by k-quants. |
| k-quants | Q2_K, Q3_K_S/M/L, Q4_K_S/M, Q5_K_S/M, Q6_K, Q8_K | The "good defaults". Mixed precisions per tensor; _M is medium, _S is small. |
| IQ-quants | IQ1_S, IQ1_M, IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M, IQ3_XXS, IQ3_S, IQ3_M, IQ4_NL, IQ4_XS | Importance-matrix-aware quants. Use with --imatrix. |
| Sub-byte | Q4_0_4_4, Q4_0_4_8, Q4_0_8_8, IQ1_* | Special blocking for SIMD-friendly dot products on ARM. |
| Native low-bit | MXFP4 | OCP MX FP4 format; added in 2025 alongside gpt-oss support. |
| Float | F16, BF16, F32 | Pass-through and "upcast" types. |
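
To make the "32-element block" idea concrete, here is a minimal sketch of a Q8_0-style block: one half-precision scale followed by 32 signed 8-bit values, 34 bytes in total (8.5 bits per weight). It is an illustration written for this page, not the reference kernel in ggml/src/ggml-quants.c, and the function names are assumptions. Note that Python's `round` breaks ties to even, which may differ from the C rounding used upstream.

```python
import struct

QK8_0 = 32  # elements per block


def quantize_q8_0_block(xs):
    """Pack 32 floats as one fp16 scale plus 32 int8 quants (34 bytes)."""
    if len(xs) != QK8_0:
        raise ValueError("expected exactly 32 values")
    amax = max(abs(x) for x in xs)
    d = amax / 127.0                    # scale so the largest value maps to +/-127
    inv_d = 1.0 / d if d else 0.0
    qs = [round(x * inv_d) for x in xs]
    return struct.pack("<e32b", d, *qs)  # 'e' is IEEE 754 half precision


def dequantize_q8_0_block(blob):
    """Recover approximate floats: value = scale * quant."""
    d, *qs = struct.unpack("<e32b", blob)
    return [d * q for q in qs]


values = [(i - 16) / 8 for i in range(QK8_0)]  # -2.0 .. 1.875
blob = quantize_q8_0_block(values)
restored = dequantize_q8_0_block(blob)
max_err = max(abs(a - b) for a, b in zip(values, restored))
print(len(blob), "bytes, max abs error", max_err)
```

The 4-bit legacy formats follow the same pattern with two quants packed per byte, which is where most of the size saving comes from.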

How quantization runs

sequenceDiagram
    participant User
    participant Q as llama-quantize
    participant Loader as model loader
    participant Imatrix as imatrix file
    participant Driver as src/llama-quant.cpp
    participant GGML as ggml-quants.c

    User->>Q: in.gguf out.gguf Q4_K_M [--imatrix imat.gguf]
    Q->>Loader: open in.gguf (no mmap; raw read)
    Q->>Imatrix: load activations (optional)
    Q->>Driver: llama_model_quantize(in, out, params)
    loop each tensor
        Driver->>Driver: pick per-tensor target type (mix)
        Driver->>GGML: ggml_quantize_chunk(type, src, dst, ...)
        GGML-->>Driver: quantized bytes
        Driver->>Driver: write to out.gguf
    end
    Driver-->>User: report sizes + perplexity-style stats

The driver applies per-tensor mixing rules (e.g. Q4_K_M keeps the embedding and output layers at higher precision while quantizing attention/FFN weights more aggressively). The exact rules are encoded in src/llama-quant.cpp. Users can override them with --token-embedding-type, --output-tensor-type, --exclude-tensors, etc.
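
The shape of such a per-tensor decision can be sketched as a small lookup with overrides. Everything below is hypothetical: the recipe table is not the actual rule set in src/llama-quant.cpp, and `pick_tensor_type` is a name invented for this example. The sketch also assumes that tensors with fewer than two dimensions (norms, biases) keep their source type.

```python
# Hypothetical recipe: NOT the real Q4_K_M rules from src/llama-quant.cpp.
HYPOTHETICAL_RECIPE = {
    "default": "Q4_K",
    "output.weight": "Q6_K",      # keep the output layer at higher precision
    "token_embd.weight": "Q4_K",
}


def pick_tensor_type(name, n_dims, src_type, recipe, overrides=None):
    """Choose a target type for one tensor.

    overrides maps tensor names to types and stands in for user options
    such as --output-tensor-type or --token-embedding-type.
    """
    overrides = overrides or {}
    if n_dims < 2:
        return src_type                  # assumption: 1-D tensors stay as-is
    if name in overrides:
        return overrides[name]           # user choice wins over the recipe
    return recipe.get(name, recipe["default"])


tensors = [
    ("token_embd.weight", 2),
    ("blk.0.attn_norm.weight", 1),
    ("blk.0.ffn_down.weight", 2),
    ("output.weight", 2),
]
plan = {
    name: pick_tensor_type(name, dims, "F16", HYPOTHETICAL_RECIPE,
                           overrides={"token_embd.weight": "Q8_0"})
    for name, dims in tensors
}
print(plan)
```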

Importance matrix

llama-imatrix runs a calibration corpus through the model, accumulates per-tensor activation statistics, and writes them to a .imatrix GGUF file. When llama-quantize is given --imatrix, the driver passes those statistics to the quantization kernels (the IQ-quants in particular), which weight their rounding decisions toward the channels that matter most for the calibration data. The result is better quality at the same file size.

See imatrix tool.
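
One way to picture importance weighting is a scale search that minimizes a weighted squared error instead of a plain one. The sketch below is a toy under stated assumptions: a symmetric 4-bit grid, a brute-force search over candidate scales, and invented function names. It is not the algorithm used by the kernels in ggml/src/ggml-quants.c.

```python
def quantize_block(xs, d, qmax):
    """Round each value to the nearest grid point in [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(x / d))) for x in xs]


def weighted_sq_error(xs, ws, d, qmax):
    """Sum of importance-weighted squared reconstruction errors."""
    qs = quantize_block(xs, d, qmax)
    return sum(w * (x - d * q) ** 2 for x, w, q in zip(xs, ws, qs))


def search_scale(xs, ws, qmax=7, steps=32):
    """Try scales from 0.5x to 1.0x of the naive scale; keep the best."""
    amax = max(abs(x) for x in xs)
    if amax == 0.0:
        return 1.0  # any scale reproduces an all-zero block
    naive = amax / qmax
    candidates = [naive * (0.5 + 0.5 * k / steps) for k in range(steps + 1)]
    return min(candidates, key=lambda d: weighted_sq_error(xs, ws, d, qmax))


# One large but unimportant outlier (2.0) among small, important values.
xs = [0.11, -0.23, 0.07, 0.31, -0.18, 0.02, 2.0, -0.09]
ws = [9.0, 9.0, 9.0, 9.0, 9.0, 9.0, 0.1, 9.0]
naive_scale = max(abs(x) for x in xs) / 7
best_scale = search_scale(xs, ws)
print("naive error:", weighted_sq_error(xs, ws, naive_scale, 7))
print("best error: ", weighted_sq_error(xs, ws, best_scale, 7))
```

Because the naive scale is itself one of the candidates, the search can never do worse than it on the weighted metric; the weights decide which values are allowed to be clipped or rounded coarsely.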

Integration points

  • Loader. Reads quantized tensors transparently; each backend implements dot products for the quant types it supports.
  • llama-quantize. Thin CLI in tools/quantize/quantize.cpp that calls llama_model_quantize.
  • llama-perplexity. The standard quality yardstick. Run before/after quantization to validate.
  • Backends. Each backend ships its own per-type kernels (ggml/src/ggml-cuda/mmvq.cuh, ggml/src/ggml-metal/ggml-metal.metal, etc.). Adding a new type requires touching every backend that should support it.
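
For readers unfamiliar with the metric behind the quality check, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below computes it from a list of per-token log-probabilities. The numbers are synthetic and the function is an illustration, not the llama-perplexity implementation.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Synthetic log-probabilities: the "quantized" model is slightly less
# confident about each correct token than the baseline.
ppl_baseline = perplexity([math.log(0.25)] * 8)
ppl_quantized = perplexity([math.log(0.20)] * 8)
print(f"baseline={ppl_baseline:.3f} quantized={ppl_quantized:.3f} "
      f"delta={ppl_quantized - ppl_baseline:+.3f}")
```

A before/after comparison is only meaningful when both runs use the same evaluation text and context settings.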

Adding a new quant type

CONTRIBUTING.md calls this out as carrying a "disproportionate maintenance burden." The minimum bar:

  1. Define the block layout in ggml/src/ggml-common.h and the reference kernel in ggml/src/ggml-quants.c.
  2. Add it to enum ggml_type in ggml/include/ggml.h.
  3. Provide CPU dot product, dequantize, and quantize.
  4. Convert a small model and upload it to HuggingFace.
  5. Provide perplexity and KL-divergence comparisons vs the native FP16/BF16 and vs types of similar size.
  6. Provide llama-bench performance numbers on CPU.
  7. Add it to the per-tensor mixing rules in src/llama-quant.cpp for relevant _S/_M/_L recipes.
  8. Add tests/test-quantize-fns.cpp cases.

Backend-specific kernels can come in follow-up PRs.
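
The KL-divergence comparison in step 5 measures how far the quantized model's next-token distribution drifts from the baseline's. A minimal sketch for a single position, using made-up logits and helper names that are assumptions of this example, might look like:

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(base_logits, quant_logits):
    """KL(P_base || P_quant) for one token position, in nats."""
    p = softmax(base_logits)
    q = softmax(quant_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


base = [2.0, 0.5, -1.0, 0.0]        # made-up baseline logits
quant = [1.8, 0.7, -0.9, 0.1]       # made-up logits after quantization
kl_same = kl_divergence(base, base)
kl_diff = kl_divergence(base, quant)
print(f"KL(base||base)={kl_same:.6f}  KL(base||quant)={kl_diff:.6f}")
```

A full evaluation would average this quantity over every position in the test corpus.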

Tests

  • tests/test-quantize-fns.cpp — round-trip and dot-product correctness vs the reference.
  • tests/test-quantize-perf.cpp — micro-benchmarks.
  • tests/test-quantize-stats.cpp — bias and error distribution.
  • tests/test-backend-ops.cpp — exercises quantized matmul on every backend.
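
The idea behind a dot-product correctness check can be sketched in a few lines: quantize both operands, compute the dot product on the integer values, and compare against the float reference. The 8-bit quantizer, the per-element error metric, and the 0.01 threshold below are all assumptions chosen for this illustration; they are not taken from tests/test-quantize-fns.cpp.

```python
import random


def quantize_i8(xs):
    """Symmetric 8-bit quantization with a single scale for the vector."""
    amax = max(abs(x) for x in xs)
    d = amax / 127.0 if amax else 1.0
    return d, [round(x / d) for x in xs]


def dot_quantized(a, b):
    """Integer dot product of the quants, rescaled by both scales."""
    da, qa = quantize_i8(a)
    db, qb = quantize_i8(b)
    return da * db * sum(x * y for x, y in zip(qa, qb))


def dot_error_per_element(a, b):
    """Absolute dot-product error divided by the vector length."""
    reference = sum(x * y for x, y in zip(a, b))
    return abs(dot_quantized(a, b) - reference) / len(a)


rng = random.Random(0)               # fixed seed for a repeatable check
vec_a = [rng.uniform(-1.0, 1.0) for _ in range(256)]
vec_b = [rng.uniform(-1.0, 1.0) for _ in range(256)]
dot_err = dot_error_per_element(vec_a, vec_b)
MAX_DOT_ERROR = 0.01                 # hypothetical threshold
print("per-element dot error:", dot_err, "ok:", dot_err < MAX_DOT_ERROR)
```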

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
