ggml-org/llama.cpp

llama-imatrix

llama-imatrix produces an importance matrix: per-channel activation statistics for each weight tensor, collected over a calibration corpus. llama-quantize then uses these statistics to weight quantization error toward the channels that matter for real text, which helps most with low-bit formats such as the IQ-quants. See tools/imatrix/README.md for usage details.

Purpose

Run calibration text through a model (inference only, no training), capture the activation magnitudes feeding each matmul weight tensor, and write them to an .imatrix GGUF file.

Usage

llama-imatrix -m model.gguf -f calibration.txt -o model.imatrix

Common flags:

Flag	Effect
`-f path`	Calibration corpus (plain text)
`--chunk N`, `--from-chunk N`	Start processing from chunk N of the corpus, skipping the earlier chunks
`-o path`	Output `.imatrix` GGUF
`--chunks N`, `--ctx-size`, `-b`, `-ub`	Calibration knobs (chunk count, ctx size, batch sizes)
`--save-frequency N`	Save a snapshot of the partial results every N chunks
`--in-file path`	Load an existing imatrix file as the starting point, for example to continue or merge a previous run
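
The chunk-related knobs can be pictured as slicing the tokenized corpus into windows of the context size. The sketch below is an illustration under stated assumptions, not the tool's code: the function name `select_chunks` and its parameters are hypothetical, and details such as how a trailing partial chunk is treated should be checked against tools/imatrix/imatrix.cpp.

```python
def select_chunks(tokens, n_ctx, max_chunks=-1, from_chunk=0):
    """Split a token list into fixed-size calibration chunks.

    Assumed semantics (for illustration only):
      - from_chunk skips that many leading chunks,
      - a trailing partial chunk is dropped,
      - max_chunks < 0 means "use every full chunk".
    """
    if n_ctx <= 0:
        raise ValueError("n_ctx must be positive")
    if from_chunk < 0:
        raise ValueError("from_chunk must be non-negative")

    # Drop the skipped prefix, then count how many full windows remain.
    remaining = tokens[from_chunk * n_ctx:]
    n_full = len(remaining) // n_ctx
    n_use = n_full if max_chunks < 0 else min(max_chunks, n_full)

    return [remaining[i * n_ctx:(i + 1) * n_ctx] for i in range(n_use)]
```

Under these assumptions, a 10-token corpus with a context size of 4 yields two full chunks, and the last two tokens go unused. This is why a short corpus combined with a large context size can leave very little calibration data.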

The calibration corpus is typically wikitext-style English plus code snippets. tools/imatrix/README.md cites the conventional sources used by community quants.

How it works

graph LR
    Text[calibration text] --> Tok[tokenize]
    Tok --> Decode[llama_decode]
    Decode --> Hooks[per-tensor activation accumulators]
    Hooks --> Stats[per-channel sum of squares per tensor]
    Stats --> Out[.imatrix GGUF]

The implementation registers a GGML tensor-evaluation callback through the eval-callback hook (the same mechanism as examples/eval-callback/). On every forward pass, it takes the activations feeding each matmul and accumulates x^2 separately for each input channel, summing over all tokens seen so far. After all chunks are processed, the running sums are written to a GGUF file, keyed by the name of the weight tensor in each matmul.
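
The accumulation step can be sketched in a few lines. This is a minimal model of the idea rather than the code in tools/imatrix/imatrix.cpp: the `ChannelStats` class and the `collect` function are hypothetical names, and the tensor name in the usage note is only an example.

```python
class ChannelStats:
    """Running per-channel statistics for one weight tensor (hypothetical helper)."""

    def __init__(self, n_channels):
        self.n_channels = n_channels
        self.sum_sq = [0.0] * n_channels  # one running sum of x^2 per input channel
        self.n_tokens = 0                 # number of activation rows seen so far

    def accumulate(self, activations):
        """Add one batch of matmul inputs, shaped [n_tokens][n_channels]."""
        for row in activations:
            if len(row) != self.n_channels:
                raise ValueError("row width does not match channel count")
            for j, x in enumerate(row):
                self.sum_sq[j] += x * x
            self.n_tokens += 1

    def mean_sq(self):
        """Average squared activation per channel (all zeros if nothing was seen)."""
        if self.n_tokens == 0:
            return [0.0] * self.n_channels
        return [s / self.n_tokens for s in self.sum_sq]


def collect(events):
    """Simulate a callback that fires once per (weight tensor name, activation batch)."""
    stats = {}
    for name, batch in events:
        if name not in stats:
            stats[name] = ChannelStats(n_channels=len(batch[0]))
        stats[name].accumulate(batch)
    return stats
```

Feeding `collect` two batches for the same name, such as `("blk.0.attn_q.weight", [[1.0, 2.0], [3.0, 0.0]])` followed by `("blk.0.attn_q.weight", [[0.0, 2.0]])`, leaves running sums of `[10.0, 8.0]` over three tokens. Keeping the token count next to the sums is what lets separate runs be merged or averaged later.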

Output format

An .imatrix GGUF contains:

One tensor per source matmul, named after the source weight tensor.
Metadata recording the calibration corpus name, chunk count, and source model.
Token / chunk counts so you can resume a partial run.

Both llama-quantize --imatrix and downstream tooling read the file via the standard ggml/src/gguf.cpp reader.

Integration points

Quantization. The primary consumer; see llama-quantize and Quantization system.
Eval callbacks. The same hook used by examples/eval-callback/ to inspect tensors.
GGUF. Standard reader/writer; nothing imatrix-specific in the format.
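
To see why per-channel importance changes what the quantizer picks, consider choosing a scale for one row of weights by minimizing an importance-weighted squared error. This is a toy sketch, not the routine llama-quantize uses: the function names, the candidate scales, and the number of levels are all made up for illustration.

```python
def quantize_row(weights, scale, n_levels=7):
    """Round each weight to the nearest multiple of `scale`, clamped to +/- n_levels steps."""
    out = []
    for w in weights:
        q = round(w / scale)
        q = max(-n_levels, min(n_levels, q))
        out.append(q * scale)
    return out


def weighted_error(weights, dequantized, importance):
    """Squared reconstruction error, with each channel scaled by its importance."""
    return sum(i * (w - d) ** 2 for w, d, i in zip(weights, dequantized, importance))


def best_scale(weights, importance, candidates, n_levels=7):
    """Pick the candidate scale with the lowest importance-weighted error."""
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize_row(weights, s, n_levels), importance),
    )
```

With weights `[0.9, 0.35]` and candidate scales `[0.3, 0.35]`, importance concentrated on the first channel selects 0.3 (which reproduces 0.9 almost exactly), while importance concentrated on the second channel selects 0.35. The same weights quantize differently depending on which channels the calibration data exercised.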

Entry points for modification

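Calibration metric. The current sum-of-squares metric is encoded in tools/imatrix/imatrix.cpp. A max-based alternative would be a small change to the accumulation step there. A percentile metric would need more than that, because it cannot be computed from a single running value per channel and requires keeping a history or a sketch of the distribution.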
Per-expert imatrix. MoE models have multiple experts per layer, and each token is routed to only some of them; the existing implementation handles this by accumulating stats per expert. Support for new MoE variants would also be added in tools/imatrix/imatrix.cpp.
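
Per-expert accumulation can be pictured as one set of running sums per expert, updated only for the experts a token was routed to. The sketch below makes that concrete under assumptions: `ExpertStats` and its methods are hypothetical names, and the layout (a list of per-expert rows) is chosen for readability rather than to mirror the real data structure.

```python
class ExpertStats:
    """Per-expert running sums of squared activations (hypothetical helper)."""

    def __init__(self, n_experts, n_channels):
        self.sum_sq = [[0.0] * n_channels for _ in range(n_experts)]
        self.counts = [0] * n_experts  # tokens routed to each expert

    def accumulate(self, activations, expert_ids):
        """activations: [n_tokens][n_channels]; expert_ids: [n_tokens][k] routed experts."""
        if len(activations) != len(expert_ids):
            raise ValueError("one list of expert ids is required per token")
        for row, ids in zip(activations, expert_ids):
            for e in ids:
                acc = self.sum_sq[e]
                for j, x in enumerate(row):
                    acc[j] += x * x
                self.counts[e] += 1

    def unseen_experts(self):
        """Experts that received no calibration tokens and so have no statistics."""
        return [e for e, c in enumerate(self.counts) if c == 0]
```

The `unseen_experts` check points at a practical concern with MoE calibration: an expert that the corpus never activates ends up with all-zero statistics, so a broader or longer corpus may be needed before its importance values mean anything.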

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.