ggml-org/llama.cpp

llama-bench

llama-bench is the throughput benchmark that PRs are expected to use whenever they change inference performance. See tools/llama-bench/README.md for the long-form options.

Purpose

  • Measure prompt-processing speed (pp) — tokens/sec while ingesting a long prompt.
  • Measure token-generation speed (tg) — tokens/sec while generating one token at a time.
  • Sweep configurations: model files, threads, batch sizes, GPU layers, KV types, tensor splits.

Usage

# Default sweep on a single model
llama-bench -m model.gguf

# Custom prompt + generation lengths
llama-bench -m model.gguf -p 512 -n 128

# Sweep multiple values
llama-bench -m model.gguf -p 128,512,2048 -n 0,16,128 -t 4,8 -ngl 0,32,99

# Specific tensor split
llama-bench -m model.gguf -ngl 99 -ts 0.5,0.5
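To see how many measurements a sweep like the "Sweep multiple values" command produces, the expansion can be modelled as below. This is a simplified sketch, not the tool's code. The helper name expand_sweep is hypothetical. It assumes that every -p value becomes a prompt-only test and every -n value a generation-only test, that these are crossed with all other swept parameters, and that zero lengths are skipped. Check the README for the exact rules.

```python
from itertools import product

def expand_sweep(prompts, gens, threads, gpu_layers):
    """Return one dict per benchmark configuration (hypothetical helper)."""
    configs = []
    # Cross every thread count with every GPU-layer count.
    for t, ngl in product(threads, gpu_layers):
        # Assumption: each prompt length is a prompt-only (pp) test.
        for p in prompts:
            if p > 0:
                configs.append({"n_prompt": p, "n_gen": 0, "threads": t, "ngl": ngl})
        # Assumption: each generation length is a generation-only (tg) test.
        for n in gens:
            if n > 0:
                configs.append({"n_prompt": 0, "n_gen": n, "threads": t, "ngl": ngl})
    return configs

# Mirrors: -p 128,512,2048 -n 0,16,128 -t 4,8 -ngl 0,32,99
configs = expand_sweep([128, 512, 2048], [0, 16, 128], [4, 8], [0, 32, 99])
print(len(configs), "configurations")
```

Under these assumptions the sweep grows multiplicatively with each extra parameter list, so it is worth trimming lists before a long run.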

By default the output is a Markdown-formatted table; other formats such as CSV and JSON are selected with -o. Common flags:

| Flag | Effect |
| --------------------- | ------------------------------ |
| -m model[,model...] | One or more models |
| -p N[,N...] | Prompt lengths |
| -n N[,N...] | Generation lengths |
| -pg N,M | Prompt + generation combos |
| -t N[,N...] | Threads |
| -b, -ub | Logical / physical batch sizes |
| -ngl N[,N...] | GPU layers |
| -ts a,b,... | Tensor split |
| -mg N | Main GPU |
| -fa 0\|1 | Flash attention |
| -ctk, -ctv | KV quant types |
| -r N | Repeat each measurement |
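Machine-readable output from -o json is convenient for comparing runs. The sketch below is one way to summarize it, under assumptions: the field names n_prompt, n_gen, avg_ts and stddev_ts and the pp/tg labels should be checked against real output from your build, the helper names are hypothetical, and the sample rows are made-up placeholders rather than measurements.

```python
import json

# Made-up placeholder rows, NOT real measurements. Field names are assumed.
SAMPLE = """
[
  {"n_prompt": 512, "n_gen": 0,   "avg_ts": 100.0, "stddev_ts": 2.0},
  {"n_prompt": 0,   "n_gen": 128, "avg_ts": 10.0,  "stddev_ts": 0.5}
]
"""

def label_for(row):
    """Build a short test label such as pp512 or tg128 (hypothetical helper)."""
    p, g = row["n_prompt"], row["n_gen"]
    if p > 0 and g > 0:
        return f"pp{p}+tg{g}"
    return f"pp{p}" if p > 0 else f"tg{g}"

def summarize(json_text):
    """Map each test label to (mean tokens/sec, standard deviation)."""
    return {label_for(r): (r["avg_ts"], r["stddev_ts"]) for r in json.loads(json_text)}

for name, (avg, sd) in summarize(SAMPLE).items():
    print(f"{name}: {avg:.2f} +/- {sd:.2f} t/s")
```

In practice the text passed to summarize would come from a file written with something like `llama-bench -m model.gguf -o json > run.json`.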

How it works

tools/llama-bench/llama-bench.cpp enumerates the user-provided sweeps as a list of bench_params, loads each model once, and for each parameter combination runs a warm-up pass followed by -r timed passes. Per-pass timing comes from ggml_time_us and per-token rates are derived from the resulting microsecond counts.
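The warm-up, repetition and rate calculation described above can be sketched in a few lines. This is a Python model of the idea, not the C++ implementation: bench, run_pass and now_us are hypothetical names, and Python's perf_counter_ns stands in for ggml_time_us. The rate is simply tokens multiplied by 1e6 and divided by the elapsed microseconds.

```python
import statistics
import time

def tokens_per_second(n_tokens, elapsed_us):
    """Convert a token count and a microsecond duration into tokens/sec."""
    return n_tokens * 1e6 / elapsed_us

def bench(run_pass, n_tokens, repetitions,
          now_us=lambda: time.perf_counter_ns() // 1000):
    """Run one untimed warm-up pass, then `repetitions` timed passes."""
    run_pass()  # warm-up: excluded from the statistics
    rates = []
    for _ in range(repetitions):
        start = now_us()
        run_pass()
        rates.append(tokens_per_second(n_tokens, now_us() - start))
    mean = statistics.mean(rates)
    stddev = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return mean, stddev

# Deterministic demo with a fake clock: two passes taking 1.0 s and 0.5 s.
fake_times = iter([0, 1_000_000, 1_000_000, 1_500_000])
mean, stddev = bench(lambda: None, n_tokens=512, repetitions=2,
                     now_us=lambda: next(fake_times))
print(f"{mean:.1f} t/s (stddev {stddev:.1f})")
```

Injecting the clock keeps the demo reproducible; with the default clock, run_pass must do enough work that the elapsed time is not zero.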

The tool deliberately runs the same llama_decode path as the production tools — there is no special "benchmark mode" inside libllama. This is what makes its numbers comparable to real workloads.

CI usage

ci/run.sh invokes llama-bench against canonical models on the self-hosted ggml-ci runners. Results are stored under benches/ (e.g. benches/dgx-spark/, benches/mac-m2-ultra/, benches/nemotron/) so contributors can compare across platforms.

Integration points

  • libllama — exercises the same llama_decode and sampler code paths as production tools.
  • Backends — sweeps -ngl and -ts to cover the GPU offload spectrum.
  • tools/results/ — small helper for parsing benchmark output across runs.

Entry points for modification

  • New parameter to sweep. Add a field to the bench_params struct, handle it in the argument parser in tools/llama-bench/llama-bench.cpp, and add a matching printer column in the same file.
  • New output format. Add a printer subclass next to the existing CSV/JSON/Markdown printers.
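The printer extension point follows a familiar pattern: a base type with header, per-test and footer hooks, and one subclass per format. The sketch below models that pattern in Python for illustration only. The real printers are C++ and their method names and signatures may differ; Printer, TsvPrinter and the field names here are hypothetical.

```python
import io

class Printer:
    """Base class: one hook per stage of a report (hypothetical model)."""
    def __init__(self, out):
        self.out = out

    def print_header(self, fields):
        pass

    def print_test(self, result):
        raise NotImplementedError

    def print_footer(self):
        pass

class TsvPrinter(Printer):
    """Example new format: tab-separated values, one row per test."""
    def print_header(self, fields):
        self.fields = list(fields)
        self.out.write("\t".join(self.fields) + "\n")

    def print_test(self, result):
        self.out.write("\t".join(str(result[f]) for f in self.fields) + "\n")

# Usage: write a header and one placeholder row into an in-memory buffer.
buf = io.StringIO()
printer = TsvPrinter(buf)
printer.print_header(["test", "avg_ts"])
printer.print_test({"test": "pp512", "avg_ts": 1.0})  # placeholder value
printer.print_footer()
print(buf.getvalue(), end="")
```

Keeping format-specific logic inside the subclass means the benchmark loop only ever calls the three hooks, so a new format does not touch the measurement code.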

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
