ggml-org/llama.cpp
llama-bench
llama-bench is the throughput benchmark that PRs are expected to use whenever they change inference performance. See tools/llama-bench/README.md for the long-form options.
Purpose
- Measure prompt-processing speed (pp): tokens/sec while ingesting a long prompt.
- Measure token-generation speed (tg): tokens/sec while generating one token at a time.
- Sweep configurations: model files, threads, batch sizes, GPU layers, KV types, tensor splits.
Usage
# Default sweep on a single model
llama-bench -m model.gguf
# Custom prompt + generation lengths
llama-bench -m model.gguf -p 512 -n 128
# Sweep multiple values
llama-bench -m model.gguf -p 128,512,2048 -n 0,16,128 -t 4,8 -ngl 0,32,99
# Specific tensor split
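llama-bench -m model.gguf -ngl 99 -ts 0.5/0.5
Output is a tabular ASCII report; CSV / JSON output is available via -o. Within a single tensor split the per-GPU proportions are separated by /, because commas separate sweep values. Common flags: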
| Flag | Effect |
| --------------------- | ------------------------------ |
| -m model[,model...] | One or more models |
| -p N[,N...] | Prompt lengths |
| -n N[,N...] | Generation lengths |
| -pg N,M | Prompt + generation combos |
| -t N[,N...] | Threads |
| -b, -ub | Logical / physical batch sizes |
| -ngl N[,N...] | GPU layers |
| -ts a/b/... | Tensor split |
| -mg N | Main GPU |
| -fa 0\|1 | Flash attention |
| -ctk, -ctv | KV cache types (K / V) |
| -r N | Repetitions of each measurement |
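The JSON output is convenient for scripted comparisons. The following is a minimal Python sketch, not part of the repository: `build_command` and `summarize` are hypothetical helper names, and the record fields `n_prompt`, `n_gen` and `avg_ts` are assumptions about the shape of the `-o json` output that should be checked against a real run. The sample string is hand-written for illustration and is not benchmark data.

```python
import json


def build_command(model, prompts=(512,), gens=(128,), binary="llama-bench"):
    """Build an argument list using only -m, -p, -n and -o from the flag table."""
    return [
        binary,
        "-m", model,
        "-p", ",".join(str(p) for p in prompts),
        "-n", ",".join(str(n) for n in gens),
        "-o", "json",
    ]


def summarize(json_text):
    """Map (n_prompt, n_gen) -> average tokens/sec.

    Assumption: every record carries n_prompt, n_gen and avg_ts fields.
    """
    return {(r["n_prompt"], r["n_gen"]): r["avg_ts"] for r in json.loads(json_text)}


# Hand-written sample in the assumed shape; NOT real benchmark results.
SAMPLE = (
    '[{"n_prompt": 512, "n_gen": 0, "avg_ts": 100.0},'
    ' {"n_prompt": 0, "n_gen": 128, "avg_ts": 20.0}]'
)

if __name__ == "__main__":
    print(" ".join(build_command("model.gguf")))
    for key, rate in summarize(SAMPLE).items():
        print(key, rate)
```

In practice the argument list would be passed to `subprocess.run(..., capture_output=True, text=True)` and the captured stdout fed to `summarize`.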
How it works
tools/llama-bench/llama-bench.cpp enumerates the user-provided sweeps as a list of bench_params, loads each model once, and for each parameter combination runs a warm-up pass followed by -r timed passes. Per-pass timing comes from ggml_time_us and per-token rates are derived from the resulting microsecond counts.
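The measurement loop described above can be illustrated with a short Python sketch. It is not a transcription of llama-bench.cpp: `expand_sweep`, `tokens_per_second` and `bench` are hypothetical names, `time.perf_counter_ns` stands in for the timer used by the tool, and the workload is an arbitrary callable rather than a decode call.

```python
import itertools
import time


def expand_sweep(sweep):
    """Expand {name: [values]} into one dict per parameter combination."""
    names = list(sweep)
    value_lists = (sweep[name] for name in names)
    return [dict(zip(names, combo)) for combo in itertools.product(*value_lists)]


def tokens_per_second(n_tokens, elapsed_us):
    """Convert a token count and an elapsed time in microseconds to tokens/sec."""
    return n_tokens * 1e6 / elapsed_us


def bench(workload, n_tokens, reps):
    """Run one untimed warm-up pass, then `reps` timed passes."""
    workload()  # warm-up pass, excluded from the results
    rates = []
    for _ in range(reps):
        start_ns = time.perf_counter_ns()
        workload()
        # Clamp to a tiny positive value so a very fast workload cannot divide by zero.
        elapsed_us = max((time.perf_counter_ns() - start_ns) / 1000, 1e-3)
        rates.append(tokens_per_second(n_tokens, elapsed_us))
    return rates


if __name__ == "__main__":
    # Stand-in workload; a real benchmark would process n_prompt tokens here.
    for params in expand_sweep({"n_prompt": [128, 512], "n_threads": [4, 8]}):
        rates = bench(lambda: sum(range(10_000)), params["n_prompt"], reps=3)
        print(params, len(rates), "timed passes")
```

The warm-up pass matters because the first run typically pays one-time costs (cache fills, lazy allocations) that would otherwise skew the first timed sample.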
The tool deliberately runs the same llama_decode path as the production tools — there is no special "benchmark mode" inside libllama. This is what makes its numbers comparable to real workloads.
CI usage
ci/run.sh invokes llama-bench against canonical models on the self-hosted ggml-ci runners. Results are stored under benches/ (e.g. benches/dgx-spark/, benches/mac-m2-ultra/, benches/nemotron/) so contributors can compare across platforms.
Integration points
- libllama: exercises the same llama_decode and sampler code paths as production tools.
- Backends: sweeps -ngl and -ts to cover the GPU offload spectrum.
- tools/results/: small helper for parsing benchmark output across runs.
Entry points for modification
- New parameter to sweep. Add a column to the bench_params struct and parser in tools/llama-bench/llama-bench.cpp plus a printer column in the same file.
- New output format. Add a printer subclass next to the existing CSV/JSON/Markdown printers.
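The printer-subclass pattern mentioned above can be sketched in Python for illustration. The real printers are C++ classes in llama-bench.cpp; every class and method name below (`Printer`, `TsvPrinter`, `print_header`, `print_test`, `print_footer`, `render`) is hypothetical and chosen only to show the shape of the extension, with a tab-separated format as the made-up example.

```python
import io


class Printer:
    """Base class: one header, one row per test result, one footer."""

    def __init__(self, out):
        self.out = out

    def print_header(self, fields):
        pass

    def print_test(self, row):
        raise NotImplementedError

    def print_footer(self):
        pass


class TsvPrinter(Printer):
    """Hypothetical new format: tab-separated values."""

    def print_header(self, fields):
        self.fields = list(fields)
        self.out.write("\t".join(self.fields) + "\n")

    def print_test(self, row):
        self.out.write("\t".join(str(row[f]) for f in self.fields) + "\n")


def render(printer, fields, rows):
    """Drive any printer through the header / rows / footer sequence."""
    printer.print_header(fields)
    for row in rows:
        printer.print_test(row)
    printer.print_footer()


if __name__ == "__main__":
    buf = io.StringIO()
    # Illustrative row only; not real benchmark data.
    render(TsvPrinter(buf), ["n_prompt", "avg_ts"], [{"n_prompt": 512, "avg_ts": 100.0}])
    print(buf.getvalue(), end="")
```

Keeping the driver (`render`) unaware of the concrete format is what lets a new format be added as a single subclass without touching the measurement code.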
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.