ggml-org/llama.cpp

Debugging

Active contributors: Georgi Gerganov, Daniel Bevenius

A grab bag of techniques and entry points for tracking down failures in llama.cpp.

Logging

Logging is centralized in common/log.cpp (with the public macros in common/log.h). Most binaries accept -v / --verbose and --log-disable.

libllama itself uses LLAMA_LOG_* macros (src/llama-impl.cpp, src/llama-impl.h) which call into a callback registered via llama_log_set. Tools register their own callback through common/log.cpp. Set GGML_LOG_LEVEL, LLAMA_LOG_LEVEL, or GGML_BACKEND_LOG_LEVEL to bump verbosity.

For backend kernel selection, ggml-backend-reg.cpp logs which devices were registered and which were chosen. This is the first thing to check when "the GPU is not being used."

Common failure modes

Symptom	Likely cause	First place to look
`error loading model: ... unknown architecture`	Model arch not in `LLM_ARCH_*` enum	`src/llama-arch.cpp`, `convert_hf_to_gguf.py`
Garbage output / NaN logits on GPU	Backend op mismatch with CPU reference	Run `test-backend-ops`; check `ggml/src/ggml-<backend>/<op>.*`
Tokenizer mismatch with HF reference	Wrong vocab type or pre-tokenizer	`src/llama-vocab.cpp`, `tests/test-tokenizer-*`
Slow prompt processing	Missing BLAS / wrong number of threads	`-t N`, `GGML_BLAS=ON`, see `ggml/src/ggml-blas/`
`llama-server` 503 / "no slot available"	All KV slots busy	`tools/server/server-context.cpp` slot loop, `--parallel N`
`Killed` during model load	Out of memory (OOM)	Check `--n-gpu-layers`, `--mlock`, `--no-mmap`, `--cache-type-k` / `--cache-type-v`
`unknown chat template`	Template not embedded in GGUF and none supplied	`src/llama-chat.cpp`, `--chat-template-file`
Weird CUDA error on multi-GPU	Tensor split misconfigured	`--tensor-split`, `--main-gpu`, `ggml/src/ggml-cuda/`
`terminate called after throwing 'std::runtime_error'`	Often a malformed GGUF or missing tensor	Run with `--verbose`; `ggml/src/gguf.cpp`, `src/llama-model-loader.cpp`
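
For the "garbage output / NaN logits on GPU" row, it can help to diff one logit vector from a CPU-only run against the same position from a GPU run before digging into individual ops. The sketch below is only a minimal illustration: how the two vectors are dumped is left open, and both the function name `compare_logits` and the default tolerance are assumptions, not part of llama.cpp.

```python
import math
from typing import Sequence


def compare_logits(reference: Sequence[float],
                   candidate: Sequence[float],
                   atol: float = 1e-3) -> dict:
    """Compare a candidate logit vector (e.g. GPU) against a reference (e.g. CPU).

    Hypothetical helper: the tolerance is an arbitrary starting point and
    should be loosened for quantized models.
    """
    if len(reference) != len(candidate):
        raise ValueError("logit vectors differ in length")

    # Positions where the candidate produced NaN or +/-inf.
    non_finite = [i for i, x in enumerate(candidate) if not math.isfinite(x)]

    # Largest absolute difference over positions where both values are finite.
    max_abs_diff = max(
        (abs(r - c) for r, c in zip(reference, candidate)
         if math.isfinite(r) and math.isfinite(c)),
        default=0.0,
    )

    return {
        "non_finite_indices": non_finite,
        "max_abs_diff": max_abs_diff,
        "ok": not non_finite and max_abs_diff <= atol,
    }
```

A non-empty `non_finite_indices` points toward a numerically broken kernel, while a large but finite `max_abs_diff` is more consistent with an op that disagrees with the CPU reference; either way `test-backend-ops` is the next step.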

Useful flags for debugging inference

Flag	What it shows
`-v` / `--verbose`	Per-step logging (tokenizer output, prompt eval)
`--logits-all`	Return logits for every token, not just the last
`--no-warmup`	Skip the warm-up batch — useful when chasing first-decode issues
`--check-tensors`	`llama-cli` flag that validates tensor numerics on load
`--gpu-layers N` (alias `--n-gpu-layers`)	Number of layers offloaded to the GPU; try `N=0` to confirm CPU-only behavior
`LLAMA_LOG_LEVEL=DEBUG`	Bump the log level globally
`GGML_BACKEND_LOG_LEVEL=info`	Show backend selection / kernel scheduling
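
If `N=0` behaves correctly and full offload does not, the `--gpu-layers` check can be extended into a bisection over `N`. The following is a sketch under two loud assumptions: `is_bad` is a hypothetical predicate you supply (for example, one that runs the binary with `--gpu-layers N` and inspects the output), and the failure is monotonic, meaning that once some `N` is bad every larger `N` is bad too.

```python
from typing import Callable, Optional


def first_bad_layer_count(max_layers: int,
                          is_bad: Callable[[int], bool]) -> Optional[int]:
    """Return the smallest N in [1, max_layers] for which is_bad(N) is true.

    Returns None if even full offload is fine. Assumes monotonic failure;
    if that does not hold, the result is just one bad N, not the first.
    """
    if is_bad(0):
        raise ValueError("N=0 already fails, so the problem is not GPU-specific")
    if not is_bad(max_layers):
        return None

    lo, hi = 0, max_layers  # invariant: lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

The layer whose offload first triggers the failure narrows down which tensors and ops to examine, at the cost of roughly log2(max_layers) extra runs.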

Tools that exist for debugging

examples/eval-callback/ — registers a callback that fires after each tensor evaluation, useful for inspecting intermediate activations.
examples/debug/ — a scratchpad for ad-hoc tensor experiments.
tools/mtmd/debug/ — diagnostics for the multimodal stack.
examples/gguf/ — inspect a GGUF file's metadata and tensor layout.
examples/gguf-hash/ — hash and verify GGUF tensor data.
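
Before reaching for the GGUF tools, a quick header check can rule out a truncated download or a file that is not GGUF at all. The sketch below assumes a little-endian file in GGUF version 2 or later, where the header is the 4-byte magic `GGUF`, a 32-bit version, a 64-bit tensor count, and a 64-bit metadata key/value count. The function name `read_gguf_header` is made up for this example.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Little-endian: 4-byte magic, uint32 version, uint64 tensor count, uint64 KV count.
_HEADER = struct.Struct("<4sIQQ")


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the first bytes of a file.

    Assumes GGUF v2+ and little-endian layout; a big-endian file will
    pass the magic check but report an implausible version.
    """
    if len(data) < _HEADER.size:
        raise ValueError(f"need at least {_HEADER.size} bytes, got {len(data)}")
    magic, version, n_tensors, n_kv = _HEADER.unpack_from(data)
    if magic != GGUF_MAGIC:
        raise ValueError(f"bad magic {magic!r}, expected {GGUF_MAGIC!r}")
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}


# Usage (the path is a placeholder):
#   with open("model.gguf", "rb") as f:
#       print(read_gguf_header(f.read(24)))
```

This only validates the first 24 bytes. Metadata and tensor-level problems still need the tools listed above.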

Sanitizers

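The CMake build supports AddressSanitizer and UndefinedBehaviorSanitizer: configure with -DCMAKE_BUILD_TYPE=Debug and add -fsanitize=address,undefined to the compile and link flags (for example through CMAKE_C_FLAGS and CMAKE_CXX_FLAGS). CI runs an ASan/UBSan build on every PR (.github/workflows/build.yml).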

Reading a stack trace

libllama uses GGML_ASSERT and LLAMA_ASSERT (defined in ggml/src/ggml-impl.h and src/llama-impl.h). Both abort the process and print the file:line of the failed check. When triaging a crash, that file:line is usually the right place to start, before walking further up the stack trace.
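
When crash logs arrive as pasted text, pulling the file:line out of the assert message can be automated. The sketch below assumes the message looks like `<file>:<line>: GGML_ASSERT(<condition>) failed`. That format, the function name `find_assert_location`, and the sample path in the comment are all assumptions for illustration, so adjust the pattern to whatever your build actually prints.

```python
import re
from typing import Optional, Tuple

# Assumed message shape: "<file>:<line>: <PREFIX>_ASSERT(<condition>) failed"
_ASSERT_RE = re.compile(
    r"^(?P<file>.+?):(?P<line>\d+): (?P<macro>\w+_ASSERT)\((?P<cond>.*)\) failed\s*$"
)


def find_assert_location(log_text: str) -> Optional[Tuple[str, int, str]]:
    """Return (file, line, condition) for the first assert message in a log.

    Returns None when no line matches the assumed format.
    """
    for raw in log_text.splitlines():
        match = _ASSERT_RE.match(raw.strip())
        if match:
            return match["file"], int(match["line"]), match["cond"]
    return None


# Example with a made-up log line:
#   find_assert_location("ggml/src/ggml.c:1234: GGML_ASSERT(n > 0) failed")
```

The returned condition is often enough to tell whether the failure is a shape mismatch, a null buffer, or an unsupported type, before opening the source file.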

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.