ggml-org/llama.cpp
llama-imatrix
llama-imatrix produces an importance matrix — per-tensor activation statistics over a calibration corpus — that llama-quantize then uses to bias IQ-quants toward channels that matter for real text. See tools/imatrix/README.md for usage details.
Purpose
Run a calibration text through a model in evaluation mode, capture activation magnitudes for every weight tensor, and write them to an .imatrix GGUF file.
Usage
llama-imatrix -m model.gguf -f calibration.txt -o model.imatrix
Common flags:
| Flag | Effect |
|---|---|
| -f path | Calibration corpus (plain text) |
| --from-chunked path | Use a pre-chunked dataset format |
| -o path | Output .imatrix GGUF |
| --chunks N, --ctx-size, -b, -ub | Calibration knobs (chunk count, ctx size, batch sizes) |
| --save-frequency N | Periodically flush partial results |
| --in-file path | Continue from a previous run |
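For scripted calibration runs, the invocation above can be assembled programmatically. The following is a minimal sketch, not part of llama.cpp: `build_imatrix_command` is a hypothetical helper, and it only uses flags that appear in the table above.

```python
import shlex


def build_imatrix_command(model, corpus, output, chunks=None, save_frequency=None):
    """Assemble an llama-imatrix argument list (hypothetical helper)."""
    cmd = ["llama-imatrix", "-m", model, "-f", corpus, "-o", output]
    if chunks is not None:
        # Limit how many chunks of the corpus are processed.
        cmd += ["--chunks", str(chunks)]
    if save_frequency is not None:
        # Periodically flush partial results.
        cmd += ["--save-frequency", str(save_frequency)]
    return cmd


cmd = build_imatrix_command(
    "model.gguf", "calibration.txt", "model.imatrix", chunks=100
)
print(shlex.join(cmd))
```

To actually launch the run, pass `cmd` to `subprocess.run(cmd, check=True)`; this assumes the llama-imatrix binary is on `PATH`.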
The calibration corpus is typically wikitext-style English plus code snippets. tools/imatrix/README.md cites the conventional sources used by community quants.
How it works
graph LR
Text[calibration text] --> Tok[tokenize]
Tok --> Decode[llama_decode]
Decode --> Hooks[per-tensor activation accumulators]
Hooks --> Stats[per-channel L2 magnitude per tensor]
Stats --> Out[.imatrix GGUF]
The implementation registers GGML tensor-evaluation callbacks via the eval-callback hook (similar to examples/eval-callback/). On every forward pass, it accumulates sum(|x|^2) across each tensor's input channel dimension. After all chunks are processed, the running sums are written to a GGUF as one tensor per matmul.
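The accumulation step described above can be illustrated in a few lines. This is a simplified sketch in Python rather than the C++ in tools/imatrix/imatrix.cpp; `on_matmul` is a hypothetical stand-in for the evaluation callback, and the tensor name is only an example.

```python
def new_accumulator(n_channels):
    """Running per-channel sum of squares plus the number of rows seen."""
    return {"sums": [0.0] * n_channels, "count": 0}


def accumulate(acc, activations):
    """activations: one row per token, each row holding n_channels floats."""
    for row in activations:
        for j, x in enumerate(row):
            acc["sums"][j] += x * x  # sum(|x|^2) per input channel
        acc["count"] += 1


stats = {}  # tensor name -> accumulator


def on_matmul(tensor_name, activations):
    """Hypothetical hook, called once per matmul input on each forward pass."""
    n_channels = len(activations[0])
    acc = stats.setdefault(tensor_name, new_accumulator(n_channels))
    accumulate(acc, activations)


# Two forward passes feeding the same (illustrative) weight tensor.
on_matmul("blk.0.attn_q.weight", [[1.0, -2.0], [3.0, 0.5]])
on_matmul("blk.0.attn_q.weight", [[0.0, 1.0]])
print(stats["blk.0.attn_q.weight"])
```

Keeping the row count next to the sums lets a consumer turn the running totals into a per-channel mean, and lets partial runs be combined by adding sums and counts.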
Output format
An .imatrix GGUF contains:
- One tensor per source matmul, named after the source weight tensor.
- Metadata recording the calibration corpus name, chunk count, and source model.
- Token / chunk counts so you can resume a partial run.
Both llama-quantize --imatrix and downstream tooling read the file via the standard ggml/src/gguf.cpp reader.
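Because the file is ordinary GGUF, its fixed-size header can be checked without any imatrix-specific code. The sketch below is a stand-in for illustration, not the ggml/src/gguf.cpp reader: it parses only the leading magic, version, tensor count, and metadata key/value count, and it assumes a little-endian file. `read_gguf_header` is a hypothetical name.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(data):
    """Parse the fixed GGUF header from the first 24 bytes of a file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file (bad magic)")
    # '<' selects little-endian with no padding: uint32, uint64, uint64.
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}


# Synthetic header standing in for the first bytes of a real file.
header = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
info = read_gguf_header(header)
print(info)
```

With a real file, read the first 24 bytes via `open(path, "rb").read(24)` and pass them in; the tensor count should then match the number of matmuls captured during calibration.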
Integration points
- Quantization. The primary consumer; see llama-quantize and Quantization system.
- Eval callbacks. The same hook used by examples/eval-callback/ to inspect tensors.
- GGUF. Standard reader/writer; nothing imatrix-specific in the format.
Entry points for modification
- Calibration metric. The current sum-of-squares metric is encoded in tools/imatrix/imatrix.cpp. Alternatives (max, percentile) would be one-liners there.
- Per-expert imatrix. MoE models present multiple experts per layer; the existing implementation handles this by accumulating per-expert stats. Extending this for new MoE variants happens in this file.
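To make the two extension points concrete, here is a small sketch of per-expert accumulation with a swappable metric. It is illustrative only and does not mirror the code in tools/imatrix/imatrix.cpp: `per_expert_stats`, `sum_of_squares`, and `max_abs` are hypothetical names, and each token is assumed to be routed to exactly one expert, whereas real MoE routers typically pick several.

```python
def sum_of_squares(values):
    return sum(v * v for v in values)


def max_abs(values):
    return max(abs(v) for v in values)


def per_expert_stats(rows, expert_ids, n_experts, reduce=sum_of_squares):
    """rows[i] is token i's activation row; expert_ids[i] is its expert."""
    n_channels = len(rows[0])
    # buckets[e][j] collects channel j values for tokens routed to expert e.
    buckets = [[[] for _ in range(n_channels)] for _ in range(n_experts)]
    counts = [0] * n_experts
    for row, e in zip(rows, expert_ids):
        counts[e] += 1
        for j, x in enumerate(row):
            buckets[e][j].append(x)
    # Experts that saw no tokens get a zero statistic.
    stats = [[reduce(ch) if ch else 0.0 for ch in expert] for expert in buckets]
    return stats, counts


rows = [[1.0, 2.0], [3.0, -4.0], [0.5, 0.0]]
expert_ids = [0, 1, 0]

ss_stats, counts = per_expert_stats(rows, expert_ids, n_experts=3)
mx_stats, _ = per_expert_stats(rows, expert_ids, n_experts=3, reduce=max_abs)
print(ss_stats, counts)
print(mx_stats)
```

The sketch stores raw values so that any reducer can be plugged in. A running sum or running max needs only one number per channel, while a percentile metric would have to retain or approximate the per-channel distribution.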
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.