ggml-org/llama.cpp

llama-perplexity

Active contributors: Georgi Gerganov, Johannes Gäßler

llama-perplexity is the project's reference yardstick for model quality. It computes perplexity, KL divergence vs another model, and various accuracy metrics on standard datasets. PRs that change quantization or any kernel that affects numerics are expected to ship perplexity numbers from this tool.

See tools/perplexity/README.md for full usage and CLI flags.

Purpose

Compute next-token perplexity on a fixed corpus (typically wikitext).
Compute KL divergence between two models on the same corpus (e.g. F16 vs Q4_K_M).
Run multiple-choice accuracy benchmarks: HellaSwag, ARC-easy/challenge, Winogrande, MMLU, etc.
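
As a rough illustration of the multiple-choice modes listed above, the sketch below picks the candidate ending with the highest length-normalized log-likelihood and reports accuracy. It assumes per-token log-probs have already been extracted from the model; the names `mean_logprob`, `pick_ending` and `accuracy` are hypothetical and do not come from perplexity.cpp. The tool's exact scoring rule may also differ per benchmark.

```python
def mean_logprob(token_logprobs):
    """Average log-prob of one candidate ending (length-normalized score)."""
    return sum(token_logprobs) / len(token_logprobs)


def pick_ending(endings):
    """Return the index of the ending with the highest mean log-prob.

    endings: one list of per-token log-probs per candidate ending.
    """
    scores = [mean_logprob(lp) for lp in endings]
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(tasks):
    """tasks: list of (endings, gold_index) pairs; returns the fraction correct."""
    correct = sum(1 for endings, gold in tasks if pick_ending(endings) == gold)
    return correct / len(tasks)


if __name__ == "__main__":
    # Two toy tasks with made-up log-probs (assumption: not real model output).
    tasks = [
        ([[-1.0, -2.0], [-0.5, -0.4, -0.6]], 1),
        ([[-0.1], [-3.0]], 1),
    ]
    print(f"accuracy = {accuracy(tasks):.2f}")
```

Normalizing by length keeps longer endings from being penalized just for having more tokens. It is one plausible choice here, not a statement about what each benchmark mode does.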

Usage

# Plain perplexity
llama-perplexity -m model.gguf -f wikitext-2-raw-v1/wiki.test.raw

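# KL divergence vs a reference model (two passes)
# 1) run the reference model and save its logits to a file
llama-perplexity -m reference.gguf -f corpus.txt --kl-divergence-base reference-logits.kld
# 2) run the candidate against the saved logits
llama-perplexity -m candidate.gguf --kl-divergence-base reference-logits.kld --kl-divergence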

# HellaSwag accuracy
llama-perplexity -m model.gguf --hellaswag --hellaswag-tasks 400 -f hellaswag_val_full.txt

# MMLU
llama-perplexity -m model.gguf --multiple-choice -f mmlu.bin

Notable flags:

Flag	Effect
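`-f path`	Input file: plain-text corpus, or the benchmark data file in multiple-choice modes
`--kl-divergence-base path`	File holding the reference model's logits for KL-div (written by a reference run, read when `--kl-divergence` is set)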
`--hellaswag` / `--hellaswag-tasks N`	HellaSwag mode
`--winogrande`, `--multiple-choice`	Other multi-choice modes
`-c` / `--ctx-size`, `-b`, `-ub`	Context size, batch size, micro-batch size
`--ppl-stride N`, `--ppl-output-type`	Stride and output formatting

Mathematics

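Perplexity per chunk is the exponential of the mean negative log-likelihood of the chunk's predicted tokens: PPL = exp(-(1/N) * sum_i log p(x_i | x_<i)), where N is the number of scored tokens. The implementation in tools/perplexity/perplexity.cpp runs the model in batched mode, captures the logits for every position (--logits-all), converts them to log-probs, and accumulates the log-prob of each actual next token.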
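
The definition above can be sketched in a few lines. This is a minimal illustration, not the code in perplexity.cpp: `log_softmax` and `chunk_perplexity` are hypothetical names, and the logits are assumed to be plain Python lists with one row per scored position.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def chunk_perplexity(logits_per_position, target_tokens):
    """exp(mean negative log-likelihood) over one chunk.

    logits_per_position[i] holds the logits that predict target_tokens[i].
    """
    nll = 0.0
    for logits, target in zip(logits_per_position, target_tokens):
        nll -= log_softmax(logits)[target]
    return math.exp(nll / len(target_tokens))


if __name__ == "__main__":
    # A model that is uniform over a 4-token vocabulary has perplexity 4.
    uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
    print(chunk_perplexity(uniform, [0, 1, 2]))
```

The uniform case is a handy sanity check: a model that spreads its probability evenly over V tokens always has perplexity V.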

KL divergence compares two models token-by-token:

KL(p || q) = sum_i p_i * (log p_i - log q_i)

where i ranges over the full vocabulary, p is the reference model's next-token distribution, and q is the candidate's. The per-token values are then averaged over the corpus.
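
A minimal sketch of the formula above, working from raw logits. The function names are hypothetical, and averaging over positions is shown as one natural way to summarize the per-token values.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, cand_logits):
    """KL(p || q) for one position: p from the reference, q from the candidate."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(cand_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(ref_rows, cand_rows):
    """Average the per-token KL divergence over all scored positions."""
    values = [kl_divergence(r, c) for r, c in zip(ref_rows, cand_rows)]
    return sum(values) / len(values)


if __name__ == "__main__":
    ref = [math.log(0.5), math.log(0.5)]
    cand = [math.log(0.25), math.log(0.75)]
    print(kl_divergence(ref, cand))  # equals 0.5 * ln(4/3)
    print(kl_divergence(ref, ref))   # identical distributions give 0
```

KL divergence is not symmetric, so swapping the reference and the candidate changes the result.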

Conventional usage

The community convention is to compare quantized models against the F16/BF16 baseline using the same wikitext-2 raw test split. PRs typically post a small table:

Model	PPL @ 4096 ctx
F16	5.1234 ± 0.0123
Q4_K_M	5.1567 ± 0.0123
...	...
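
The ± column in a table like the one above is an uncertainty estimate. One common way to get such a number, sketched here as an assumption rather than as the tool's exact method, is to take the standard error of the mean per-token negative log-likelihood and propagate it through the exponential. `ppl_with_error` and `format_row` are hypothetical helpers.

```python
import math


def ppl_with_error(nlls):
    """Return (perplexity, uncertainty) from per-token negative log-likelihoods.

    Uncertainty is the standard error of the mean NLL, scaled by the
    perplexity (first-order propagation through exp).
    """
    n = len(nlls)
    mean = sum(nlls) / n
    var = sum((x - mean) ** 2 for x in nlls) / (n - 1)  # sample variance
    stderr_nll = math.sqrt(var / n)
    ppl = math.exp(mean)
    return ppl, ppl * stderr_nll


def format_row(name, nlls):
    """Format one tab-separated table row, e.g. 'F16<TAB>5.1234 ± 0.0123'."""
    ppl, err = ppl_with_error(nlls)
    return f"{name}\t{ppl:.4f} ± {err:.4f}"


if __name__ == "__main__":
    # Made-up NLL values, for illustration only.
    print(format_row("toy", [1.0, 2.0, 3.0]))
```

When comparing two models on the same corpus, their per-token errors are strongly correlated, so the two ± values overstate the uncertainty of the difference between them.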

For new quant types, the additional KL-divergence comparison is required by CONTRIBUTING.md.

Integration points

libllama. Runs llama_decode with --logits-all and reads the full logit tensor.
Datasets. Wikitext, HellaSwag, Winogrande, ARC, MMLU. The repo doesn't ship the data; users download it.
Bench pipelines. ci/run.sh invokes llama-perplexity as part of nightly validation on self-hosted runners.

Entry points for modification

New benchmark mode. Add a switch + handler in tools/perplexity/perplexity.cpp. Existing modes (HellaSwag, Winogrande, MMLU) are good templates.
Different metric. Most metrics are pure functions over the model logits; add them as a new mode rather than perturbing existing ones.
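
To illustrate what "a pure function over the model logits" can look like, here is a hedged sketch of a top-1 agreement metric between a reference and a candidate model. It is written in Python for brevity, whereas a real addition would live in tools/perplexity/perplexity.cpp. The names `argmax` and `top1_agreement` are hypothetical.

```python
def argmax(row):
    """Index of the largest logit in one vocabulary row."""
    return max(range(len(row)), key=row.__getitem__)


def top1_agreement(ref_rows, cand_rows):
    """Fraction of positions where both models rank the same token first.

    Takes only logits as input and has no side effects, so it can be
    added alongside existing metrics without changing them.
    """
    same = sum(1 for r, c in zip(ref_rows, cand_rows) if argmax(r) == argmax(c))
    return same / len(ref_rows)


if __name__ == "__main__":
    ref = [[0.1, 0.9], [2.0, 1.0]]
    cand = [[0.2, 0.8], [0.0, 3.0]]
    print(top1_agreement(ref, cand))  # agrees on the first position only
```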
