ggml-org/llama.cpp

Testing

Active contributors: Georgi Gerganov, Johannes Gäßler

llama.cpp tests run through CMake's CTest integration. The fast suite targets correctness; a long-form CI under ci/ exercises end-to-end performance and quality on self-hosted runners.

Fast tests (CTest)

Configure with tests enabled, then run:

cmake -B build -DLLAMA_BUILD_TESTS=ON
cmake --build build --config Release -j
ctest --test-dir build --output-on-failure

The test sources live in tests/. Highlights:

Test	What it covers
`test-tokenizer-0`, `test-tokenizer-1-*`	BPE / SPM / WPM / UGM / RWKV round-trips against a reference tokenizer
`test-vocab`	Vocab loader edge cases
`test-grammar-parser`, `test-grammar-integration`, `test-grammar-llguidance`	GBNF parser and the optional llguidance backend
`test-json-schema-to-grammar`	Schema → GBNF conversion
`test-chat`, `test-chat-template`, `test-chat-parser`	Chat templating and tool/function-call parsing
`test-llama-archs`	Architecture switch in `llama-arch.cpp`
`test-sampling`, `test-mtmd-c-api`	Sampler chains, MTMD C-API
`test-quantize-fns`, `test-quantize-perf`, `test-quantize-stats`	Quantization correctness and perf vs reference
`test-backend-ops`	The headline test — every `ggml` op compared across every loaded backend against the CPU reference
`test-thread-safety`, `test-mmap`, `test-arg-parser`, `test-log`	Misc plumbing
`tests/peg-parser/`	PEG parser snapshots and behavior

tests/snapshots/ and tests/peg-parser/snapshots/ hold golden outputs for the parser tests.

tests/test-backend-ops.cpp is the cross-backend conformance test. It enumerates every ggml_op, runs it through the CPU implementation and through every other registered backend, and compares results within a numerical tolerance. If you change a backend kernel, run this test against at least two backends to catch regressions:

GGML_BACKEND_LOG_LEVEL=info ./build/bin/test-backend-ops

CONTRIBUTING.md calls this test out explicitly: "If you modified a ggml operator or added a new one, add the corresponding test cases to test-backend-ops."
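The comparison described above (a backend's output checked against the CPU reference within a numerical tolerance) can be sketched in a few lines. This is an illustrative sketch only: the normalized mean squared error metric, the `1e-7` threshold, and the names `nmse`, `outputs_match`, `cpu_out`, `gpu_ok`, and `gpu_bad` are assumptions chosen for the example. The real test is written in C++ and defines its own per-op metrics and tolerances.

```python
import math


def nmse(reference, candidate):
    """Normalized mean squared error: sum((c - r)^2) / sum(r^2)."""
    if len(reference) != len(candidate):
        raise ValueError("length mismatch")
    err = sum((c - r) ** 2 for r, c in zip(reference, candidate))
    norm = sum(r ** 2 for r in reference)
    if norm == 0.0:
        # All-zero reference: fall back to the absolute squared error
        # so we never divide by zero.
        return err
    return err / norm


def outputs_match(reference, candidate, max_nmse=1e-7):
    """True if the candidate is finite and close enough to the reference."""
    # Any NaN or Inf in the candidate is an automatic failure.
    if any(not math.isfinite(c) for c in candidate):
        return False
    return nmse(reference, candidate) <= max_nmse


# Hypothetical outputs of one op: CPU reference vs. two backend results.
cpu_out = [0.5, -1.25, 3.0, 2.0]
gpu_ok = [0.5000001, -1.2500001, 3.0000002, 2.0]   # rounding-level noise
gpu_bad = [0.5, -1.25, 3.1, 2.0]                   # one element is off

print(outputs_match(cpu_out, gpu_ok))
print(outputs_match(cpu_out, gpu_bad))
```

A relative metric such as this one scales with the magnitude of the reference, so a single threshold can serve ops whose outputs differ by orders of magnitude.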

Long-form CI (`ci/`)

ci/ houses scripts for the self-hosted ggml-ci runners that exercise:

Multi-backend builds (CUDA, Metal, Vulkan, SYCL, HIP, ...)
Real model downloads
Perplexity runs against reference models
Throughput benchmarks via llama-bench

ci/README.md documents the entry points. The relevant runner labels appear in .github/workflows/. Collaborators trigger the long CI manually by including the ggml-ci keyword in a commit message.

Server tests

tools/server/tests/ is a Python pytest suite that boots a llama-server, hits its HTTP endpoints, and verifies behavior end-to-end. Run it after changes to anything under tools/server/:

cd tools/server/tests
pip install -r requirements.txt
pytest -x -v
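The "boot a server, then hit its endpoints" pattern usually starts by polling a health endpoint until the process is ready. The sketch below shows that pattern in isolation. It is not the suite's actual code: `StubHandler` is a hypothetical stand-in for llama-server so the example runs without a model, and the `/health` path and `{"status": "ok"}` body are assumptions made for illustration.

```python
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for llama-server: answers GET /health only."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # keep test output quiet


def wait_until_healthy(base_url, timeout_s=5.0, interval_s=0.1):
    """Poll base_url/health until it returns HTTP 200 or the timeout expires."""
    # Bypass any proxy configured in the environment for this local call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener.open(base_url + "/health", timeout=1.0) as resp:
                if resp.status == 200:
                    return json.loads(resp.read())
        except OSError:
            pass  # server is not accepting connections yet
        time.sleep(interval_s)
    raise TimeoutError(f"{base_url} did not become healthy in {timeout_s}s")


def run_health_check():
    # Port 0 asks the OS for a free port, so parallel runs do not collide.
    server = HTTPServer(("127.0.0.1", 0), StubHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        return wait_until_healthy(f"http://127.0.0.1:{port}")
    finally:
        server.shutdown()
        server.server_close()


health = run_health_check()
print(health)
```

Polling with a deadline, rather than sleeping for a fixed time, keeps the test fast on quick machines and tolerant of slow model loads.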

Multimodal tests

tools/mtmd/tests.sh is a shell driver that downloads a small multimodal model, runs llama-mtmd-cli against tools/mtmd/test-1.jpeg and test-2.mp3, and checks the output. Use it after changes under tools/mtmd/.

Performance & quality benchmarks

Two binaries are the standard yardsticks for PRs that touch numerical code:

llama-bench — token-throughput, prompt-processing, and generation benchmarks, with multi-GPU and per-backend modes. See tools/llama-bench/README.md.
llama-perplexity — perplexity, KL divergence, and HellaSwag-style accuracy on a reference dataset. See tools/perplexity/README.md.

When you change quantization or any kernel that affects numerics, post llama-perplexity and llama-bench numbers in your PR.
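For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood over the evaluated tokens, so lower is better. The sketch below computes it from per-token log-probabilities and reports the relative change between a baseline and a modified build. The log-probabilities are made-up values rather than real model output, and the helper names `perplexity` and `relative_change` are hypothetical. For real numbers, use llama-perplexity itself.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


def relative_change(baseline, candidate):
    """Fractional change of candidate relative to baseline (0.25 means +25%)."""
    return (candidate - baseline) / baseline


# Hypothetical per-token log-probabilities, not real model output.
baseline_lp = [math.log(0.25)] * 4    # every token predicted with p = 0.25
candidate_lp = [math.log(0.20)] * 4   # every token predicted with p = 0.20

ppl_base = perplexity(baseline_lp)
ppl_cand = perplexity(candidate_lp)

print(f"baseline PPL:  {ppl_base:.3f}")
print(f"candidate PPL: {ppl_cand:.3f}")
print(f"relative change: {relative_change(ppl_base, ppl_cand):+.1%}")
```

Reporting the relative change alongside the raw values makes it easier to judge whether a quantization or kernel change has a meaningful effect on quality.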

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
