ggml-org/llama.cpp
llama-perplexity
Active contributors: Georgi Gerganov, Johannes Gäßler
llama-perplexity is the project's reference yardstick for model quality. It computes perplexity, KL divergence vs another model, and various accuracy metrics on standard datasets. PRs that change quantization or any kernel that affects numerics are expected to ship perplexity numbers from this tool.
See tools/perplexity/README.md for full usage and CLI flags.
Purpose
- Compute next-token perplexity on a fixed corpus (typically wikitext).
- Compute KL divergence between two models on the same corpus (e.g. F16 vs Q4_K_M).
- Run multiple-choice accuracy benchmarks: HellaSwag, ARC-easy/challenge, Winogrande, MMLU, etc.
Usage
# Plain perplexity
llama-perplexity -m model.gguf -f wikitext-2-raw-v1/wiki.test.raw
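# KL divergence: record logits from the reference model, then score the candidate against them
llama-perplexity -m reference.gguf -f corpus.txt --kl-divergence-base reference.kld
llama-perplexity -m candidate.gguf --kl-divergence-base reference.kld --kl-divergence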
# HellaSwag accuracy
llama-perplexity -m model.gguf --hellaswag --hellaswag-tasks 400 -f hellaswag_val_full.txt
# MMLU
llama-perplexity -m model.gguf --multiple-choice -f mmlu.bin
Notable flags:
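| Flag | Effect |
|---|---|
| `-f path` | Input file: plain text corpus, or the dataset file for a benchmark mode |
| `--kl-divergence-base path` | File of reference logits for KL divergence (written by a run of the reference model, read back together with `--kl-divergence`) |
| `--hellaswag` / `--hellaswag-tasks N` | HellaSwag mode |
| `--winogrande`, `--multiple-choice` | Other multi-choice modes |
| `--ctx-size`, `-b`, `-ub`, `-c` | Context / batch sizes |
| `--ppl-stride N`, `--ppl-output-type` | Stride and output formatting |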
Mathematics
Perplexity per chunk is computed as exp(mean negative log-likelihood) over the chunk's predicted-token log-probs. The implementation in tools/perplexity/perplexity.cpp runs the model in batched mode, captures logits with --logits-all, and accumulates per-token log-probs.
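The formula can be illustrated with a short Python sketch. It is not the code in tools/perplexity/perplexity.cpp: the function name `perplexity` is hypothetical, the inputs are assumed to be natural-log probabilities of the tokens that actually occurred, and the error estimate uses a standard delta-method approximation that is an assumption here rather than something this page confirms.

```python
import math

def perplexity(token_logprobs):
    """Return (ppl, stderr) from natural-log probabilities of the observed tokens.

    Hypothetical helper for illustration only; not the tool's implementation.
    """
    n = len(token_logprobs)
    if n < 2:
        raise ValueError("need at least two tokens")
    nll = [-lp for lp in token_logprobs]   # negative log-likelihood per token
    mean_nll = sum(nll) / n
    ppl = math.exp(mean_nll)               # exp(mean NLL)
    # Sample variance of the per-token NLL.
    var = sum((x - mean_nll) ** 2 for x in nll) / (n - 1)
    # Delta method (assumption): stderr(ppl) ~ ppl * stderr(mean NLL).
    stderr = ppl * math.sqrt(var / n)
    return ppl, stderr

# A model that assigns probability 1/4 to every token has perplexity 4.
ppl, err = perplexity([math.log(0.25)] * 8)
print(f"PPL = {ppl:.4f} +/- {err:.4f}")
```

Lower is better: a perplexity of 4 means the model is, on average, as uncertain as a uniform choice among four tokens.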
KL divergence compares two models token-by-token:
`KL(p || q) = sum_i p_i * (log p_i - log q_i)`
taken over the full vocabulary, where p is the reference model's next-token distribution and q is the candidate's.
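A minimal Python sketch of the same computation, assuming each model yields one vocabulary-sized logit vector per token position. The helper names (`log_softmax`, `kl_divergence`, `mean_kl`) and the toy three-token vocabulary are made up for illustration and do not mirror the C++ implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(ref_logits, cand_logits):
    """KL(p || q) at one position; p comes from the reference, q from the candidate."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(cand_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(ref_rows, cand_rows):
    """Average the per-position KL over all positions (one logit row each)."""
    values = [kl_divergence(r, c) for r, c in zip(ref_rows, cand_rows)]
    return sum(values) / len(values)

# Toy logits: two positions, vocabulary of three tokens.
ref = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
cand = [[1.8, 0.7, -1.1], [0.1, 0.2, 0.3]]
print(mean_kl(ref, ref))    # identical distributions give zero
print(mean_kl(ref, cand))   # a small positive value
```

Working in log space avoids overflow for large logits, and the divergence is asymmetric: swapping the reference and candidate generally gives a different number.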
Conventional usage
The community convention is to compare quantized models against the F16/BF16 baseline using the same wikitext-2 raw test split. PRs typically post a small table:
| Model | PPL @ 4096 ctx |
|---|---|
| F16 | 5.1234 ± 0.0123 |
| Q4_K_M | 5.1567 ± 0.0123 |
| ... | ... |
For new quant types, the additional KL-divergence comparison is required by CONTRIBUTING.md.
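To keep such tables consistent across PRs, the rows can be generated rather than typed by hand. The sketch below assumes the perplexity and its uncertainty have already been read off each run; `ppl_table` is a hypothetical helper and the numbers are the placeholder values from the table above, not measurements.

```python
def ppl_table(results, ctx):
    """Render (model name, ppl, error) tuples as a Markdown table."""
    lines = [f"| Model | PPL @ {ctx} ctx |", "|---|---|"]
    for name, ppl, err in results:
        lines.append(f"| {name} | {ppl:.4f} ± {err:.4f} |")
    return "\n".join(lines)

# Placeholder numbers copied from the illustrative table, not real results.
rows = [("F16", 5.1234, 0.0123), ("Q4_K_M", 5.1567, 0.0123)]
print(ppl_table(rows, ctx=4096))
```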
Integration points
- libllama. Runs `llama_decode` with `--logits-all` and reads the full logit tensor.
- Datasets. Wikitext, HellaSwag, Winogrande, ARC, MMLU. The repo doesn't ship the data; users download it.
- Bench pipelines. `ci/run.sh` invokes `llama-perplexity` as part of nightly validation on self-hosted runners.
Entry points for modification
- New benchmark mode. Add a switch + handler in `tools/perplexity/perplexity.cpp`. Existing modes (HellaSwag, Winogrande, MMLU) are good templates.
- Different metric. Most metrics are pure functions over the model logits; add them as a new mode rather than perturbing existing ones.
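To illustrate what "a pure function over the model logits" can look like, here is a Python sketch of one possible metric: the fraction of positions where two models agree on the most likely token. The tool itself is C++, so this only shows the shape of such a function; `argmax` and `top1_agreement` are hypothetical names, and the toy logits are invented.

```python
def argmax(row):
    """Index of the largest logit in one vocabulary-sized row."""
    return max(range(len(row)), key=row.__getitem__)

def top1_agreement(ref_rows, cand_rows):
    """Fraction of positions where both models pick the same top token."""
    if not ref_rows or len(ref_rows) != len(cand_rows):
        raise ValueError("need two non-empty logit lists of equal length")
    same = sum(argmax(r) == argmax(c) for r, c in zip(ref_rows, cand_rows))
    return same / len(ref_rows)

# Toy logits: three positions, vocabulary of three tokens.
ref = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [0.0, 1.0, 0.5]]
cand = [[1.8, 0.7, -1.1], [0.1, 0.4, 0.3], [0.0, 0.9, 0.6]]
print(top1_agreement(ref, cand))  # the models agree on 2 of 3 positions
```

Because the function has no state and touches nothing but its inputs, it can sit alongside the existing metrics without changing their results.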
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.