ggml-org/llama.cpp

llama-imatrix

llama-imatrix produces an importance matrix (per-tensor activation statistics collected over a calibration corpus) that llama-quantize then uses to bias quantization, most notably the low-bit IQ-quants, toward the channels that matter for real text. See tools/imatrix/README.md for usage details.

Purpose

Run a calibration text through a model in evaluation mode, capture activation magnitudes for every weight tensor, and write them to an .imatrix GGUF file.

Usage

llama-imatrix -m model.gguf -f calibration.txt -o model.imatrix

Common flags:

Flag                             Effect
-f path                          Calibration corpus (plain text)
--from-chunked path              Use a pre-chunked dataset format
-o path                          Output .imatrix GGUF
--chunks N, --ctx-size, -b, -ub  Calibration knobs (chunk count, context size, batch sizes)
--save-frequency N               Periodically flush partial results
--in-file path                   Continue from a previous run
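
To make the chunk count and context size knobs concrete, here is a minimal Python sketch of how a token stream might be cut into calibration chunks. It assumes that only full context-sized chunks are evaluated and that a negative limit means "use everything"; the function name and parameters are hypothetical, not the tool's API.

```python
def split_into_chunks(tokens, n_ctx, max_chunks=-1):
    """Cut a token list into full n_ctx-sized chunks (illustrative helper).

    Assumption: a trailing partial chunk is dropped, and max_chunks < 0
    means no limit on the number of chunks.
    """
    if n_ctx <= 0:
        raise ValueError("n_ctx must be positive")
    n_available = len(tokens) // n_ctx
    n_chunks = n_available if max_chunks < 0 else min(max_chunks, n_available)
    return [tokens[i * n_ctx:(i + 1) * n_ctx] for i in range(n_chunks)]


# Ten tokens with a context of four give two full chunks; the last two
# tokens do not fill a chunk and are left out.
all_chunks = split_into_chunks(list(range(10)), n_ctx=4)
limited = split_into_chunks(list(range(10)), n_ctx=4, max_chunks=1)
```

Under these assumptions, a larger context size yields fewer, longer chunks from the same corpus, and the chunk limit simply truncates the run.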

The calibration corpus is typically wikitext-style English plus code snippets. tools/imatrix/README.md cites the conventional sources used by community quants.

How it works

graph LR
    Text[calibration text] --> Tok[tokenize]
    Tok --> Decode[llama_decode]
    Decode --> Hooks[per-tensor activation accumulators]
    Hooks --> Stats[per-channel sum of squares per tensor]
    Stats --> Out[.imatrix GGUF]

The implementation registers a GGML tensor-evaluation callback through the same eval-callback hook that examples/eval-callback/ uses. On every forward pass, for each matmul it squares the input activations and adds them to a running sum kept per input channel, so each channel's total covers every token seen so far. After all chunks are processed, the running sums are written to a GGUF as one tensor per matmul.
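
The accumulation step described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the code in tools/imatrix/imatrix.cpp: the class, its methods, and the example tensor name are all illustrative, and activations are plain lists with one row per token.

```python
class ImatrixAccumulator:
    """Toy per-tensor accumulator (hypothetical names, not the real implementation)."""

    def __init__(self):
        self.sums = {}    # tensor name -> per-input-channel running sum of x*x
        self.counts = {}  # tensor name -> number of token rows accumulated

    def observe(self, name, rows):
        """Add one batch of input activations: rows[t][j] is channel j of token t."""
        if not rows:
            return
        n_in = len(rows[0])
        sums = self.sums.setdefault(name, [0.0] * n_in)
        if len(sums) != n_in:
            raise ValueError(f"channel count changed for {name}")
        for row in rows:
            for j, x in enumerate(row):
                sums[j] += x * x
        self.counts[name] = self.counts.get(name, 0) + len(rows)


acc = ImatrixAccumulator()
# Two forward passes feeding the same (illustratively named) weight tensor.
acc.observe("blk.0.attn_q.weight", [[1.0, 2.0], [3.0, 0.0]])
acc.observe("blk.0.attn_q.weight", [[0.0, 2.0]])
```

Keeping the row count next to the sums lets a consumer turn the totals into a mean square per channel, which makes runs of different lengths comparable.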

Output format

An .imatrix GGUF contains:

  • One tensor per source matmul, named after the source weight tensor.
  • Metadata recording the calibration corpus name, chunk count, and source model.
  • Token / chunk counts so you can resume a partial run.

Both llama-quantize --imatrix and downstream tooling read the file via the standard ggml/src/gguf.cpp reader.
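
Because the statistics are running sums with counts attached, resuming or combining partial runs amounts to element-wise addition. The sketch below illustrates that idea on a hypothetical in-memory layout (a dict with a chunk count and per-tensor sums); it does not reflect the on-disk GGUF key or tensor names.

```python
def merge_partial(a, b):
    """Merge two partial results of the form {"chunks": int, "sums": {name: [float]}}.

    The dict layout is an assumption made for this example only.
    """
    merged = {"chunks": a["chunks"] + b["chunks"], "sums": {}}
    for name in sorted(set(a["sums"]) | set(b["sums"])):
        sa = a["sums"].get(name)
        sb = b["sums"].get(name)
        if sa is None or sb is None:
            # Tensor seen in only one run: copy its sums unchanged.
            merged["sums"][name] = list(sb if sa is None else sa)
            continue
        if len(sa) != len(sb):
            raise ValueError(f"channel count mismatch for {name}")
        merged["sums"][name] = [x + y for x, y in zip(sa, sb)]
    return merged


run_a = {"chunks": 2, "sums": {"w1": [1.0, 2.0], "w2": [0.5]}}
run_b = {"chunks": 3, "sums": {"w1": [4.0, 0.0]}}
combined = merge_partial(run_a, run_b)
```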

Integration points

  • Quantization. The primary consumer; see llama-quantize and Quantization system.
  • Eval callbacks. The same hook used by examples/eval-callback/ to inspect tensors.
  • GGUF. Standard reader/writer; nothing imatrix-specific in the format.
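
To illustrate why a quantizer wants these statistics, here is a hedged sketch of importance-weighted rounding. It is not the llama-quantize algorithm: it assumes a single row, a symmetric integer grid, and a small brute-force search over candidate scales, and all names are hypothetical. The only point is that the importance values weight the squared error, so high-importance channels steer the choice of scale.

```python
def quantize_row(weights, importance, qmax=7, n_candidates=21):
    """Pick the scale minimising sum(importance[j] * (w[j] - scale * q[j])**2).

    Illustrative only: real quantizers work per block with format-specific searches.
    """
    amax = max(abs(w) for w in weights)
    if amax == 0.0:
        return 0.0, [0] * len(weights)
    best = None
    for k in range(n_candidates):
        # Candidate scales span 0.7x to 1.3x of the naive amax / qmax scale.
        scale = (amax / qmax) * (0.7 + 0.6 * k / (n_candidates - 1))
        q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
        err = sum(imp * (w - scale * qi) ** 2
                  for w, qi, imp in zip(weights, q, importance))
        if best is None or err < best[0]:
            best = (err, scale, q)
    return best[1], best[2]


row = [0.9, -0.31, 0.12, 0.55]
scale_uniform, q_uniform = quantize_row(row, [1.0, 1.0, 1.0, 1.0])
# Put all the importance on channel 0: its error can only stay equal or shrink.
scale_focus, q_focus = quantize_row(row, [1.0, 0.0, 0.0, 0.0])
```

With uniform importance this reduces to ordinary least-squares rounding; skewing the importance trades accuracy on rarely active channels for accuracy on the ones the calibration text exercises.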

Entry points for modification

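  • Calibration metric. The current sum-of-squares metric is encoded in tools/imatrix/imatrix.cpp. A running maximum would be a small change to the accumulation step there; a percentile would also need per-channel history or a streaming estimate, because it cannot be computed from a single running value.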
  • Per-expert imatrix. MoE models present multiple experts per layer; the existing implementation handles this by accumulating per-expert stats. Extending this for new MoE variants happens in this file.
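
The per-expert bookkeeping can be pictured as the same sum-of-squares update, indexed by the expert each token row was routed to. The sketch below is an assumption-laden illustration (hypothetical function, one expert id per row, plain lists), not the layout used in tools/imatrix/imatrix.cpp.

```python
def accumulate_per_expert(sums, counts, rows, expert_ids):
    """Add x*x for each row into the slot of the expert that row was routed to.

    sums[e][j] is the running sum for channel j of expert e; counts[e] is the
    number of rows expert e has seen. Both are updated in place.
    """
    if len(rows) != len(expert_ids):
        raise ValueError("need exactly one expert id per row")
    for row, expert in zip(rows, expert_ids):
        for j, x in enumerate(row):
            sums[expert][j] += x * x
        counts[expert] += 1


n_expert, n_in = 3, 2
expert_sums = [[0.0] * n_in for _ in range(n_expert)]
expert_counts = [0] * n_expert
accumulate_per_expert(
    expert_sums,
    expert_counts,
    rows=[[1.0, 2.0], [3.0, 4.0], [0.5, 0.0]],
    expert_ids=[0, 2, 0],
)
```

In this example expert 1 receives no rows, so its sums stay at zero. A calibration corpus that leaves some experts unexercised produces exactly this kind of gap, which is why the per-expert counts are worth tracking alongside the sums.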

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
