ggml-org/llama.cpp
Debugging
Active contributors: Georgi Gerganov, Daniel Bevenius
A grab bag of techniques and entry points for tracking down failures in llama.cpp.
Logging
Logging is centralized in common/log.cpp (with the public macros in common/log.h). Most binaries accept -v / --verbose and --log-disable.
libllama itself uses LLAMA_LOG_* macros (src/llama-impl.cpp, src/llama-impl.h) which call into a callback registered via llama_log_set. Tools register their own callback through common/log.cpp. Set GGML_LOG_LEVEL, LLAMA_LOG_LEVEL, or GGML_BACKEND_LOG_LEVEL to bump verbosity.
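The callback mechanism described above can be sketched without linking against libllama. The block below is a minimal stand-in, not the library's code: `sketch_log_level`, `sketch_log_callback`, `log_sink`, and `filtering_log_callback` are hypothetical names that mirror the `(level, text, user_data)` shape of the callback passed to `llama_log_set`. The level enum is simplified and assumes higher values mean higher severity.

```cpp
#include <string>
#include <vector>

// Simplified stand-in for the log level enum (assumption: higher = more severe).
enum sketch_log_level {
    SKETCH_LOG_LEVEL_DEBUG = 1,
    SKETCH_LOG_LEVEL_INFO  = 2,
    SKETCH_LOG_LEVEL_WARN  = 3,
    SKETCH_LOG_LEVEL_ERROR = 4,
};

// Same shape as a log callback: level, message text, opaque user pointer.
typedef void (*sketch_log_callback)(sketch_log_level level, const char * text, void * user_data);

// Hypothetical state carried through user_data.
struct log_sink {
    sketch_log_level         min_level; // messages below this level are dropped
    std::vector<std::string> lines;     // messages that passed the filter
};

// Keep only messages at or above the sink's threshold.
void filtering_log_callback(sketch_log_level level, const char * text, void * user_data) {
    log_sink * sink = static_cast<log_sink *>(user_data);
    if (level < sink->min_level) {
        return;
    }
    sink->lines.emplace_back(text);
}
```

In a real tool the same idea would presumably be wired up by passing a callback with the library's own callback type and a pointer to the sink to `llama_log_set`, so that only warnings and errors reach the captured output.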
For backend kernel selection, ggml-backend-reg.cpp logs which devices were registered and which were chosen. This is the first thing to check when "the GPU is not being used."
Common failure modes
| Symptom | Likely cause | First place to look |
|---|---|---|
| `error loading model: ... unknown architecture` | Model arch not in LLM_ARCH_* enum | src/llama-arch.cpp, convert_hf_to_gguf.py |
| Garbage output / NaN logits on GPU | Backend op mismatch with CPU reference | Run test-backend-ops; check ggml/src/ggml-<backend>/<op>.* |
| Tokenizer mismatch with HF reference | Wrong vocab type or pre-tokenizer | src/llama-vocab.cpp, tests/test-tokenizer-* |
| Slow prompt processing | Missing BLAS / wrong number of threads | -t N, GGML_BLAS=ON, see ggml/src/ggml-blas/ |
| `llama-server` 503 / "no slot available" | All KV slots busy | tools/server/server-context.cpp slot loop, --parallel N |
| `Killed` during model load | OOM | Check --n-gpu-layers, --mlock, --no-mmap, --cache-type-k/-v |
| `unknown chat template` | Template not embedded in GGUF and none supplied | src/llama-chat.cpp, --chat-template-file |
| Weird CUDA error on multi-GPU | Tensor split misconfigured | --tensor-split, --main-gpu, ggml/src/ggml-cuda/ |
| `terminate called after throwing 'std::runtime_error'` | Often a malformed GGUF or missing tensor | Run with --verbose; ggml/src/gguf.cpp, src/llama-model-loader.cpp |
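For the malformed-GGUF case, a quick sanity check of the file header can rule out truncated or mislabeled files before digging into the loader. The following is a rough sketch, not the parser in ggml/src/gguf.cpp: `gguf_header_check` and `check_gguf_header` are hypothetical names, and it assumes a little-endian file that starts with the 4-byte magic `GGUF` followed by a 32-bit version.

```cpp
#include <cstdint>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// Result of the header probe (hypothetical helper type).
struct gguf_header_check {
    bool     valid;   // true if the magic matched and a version could be read
    uint32_t version; // little-endian version field; 0 when valid is false
};

// Read the first 8 bytes of a stream and check them against the GGUF layout.
gguf_header_check check_gguf_header(std::istream & in) {
    gguf_header_check result{false, 0};

    char magic[4];
    if (!in.read(magic, 4) || std::memcmp(magic, "GGUF", 4) != 0) {
        return result; // too short, or not a GGUF file
    }

    unsigned char v[4];
    if (!in.read(reinterpret_cast<char *>(v), 4)) {
        return result; // truncated before the version field
    }

    // Assemble the version byte by byte so host endianness does not matter.
    result.version = static_cast<uint32_t>(v[0])
                   | (static_cast<uint32_t>(v[1]) << 8)
                   | (static_cast<uint32_t>(v[2]) << 16)
                   | (static_cast<uint32_t>(v[3]) << 24);
    result.valid = true;
    return result;
}
```

Passing a `std::ifstream` opened in binary mode works the same way as the in-memory stream used for testing. A file that fails this check is unlikely to get past the model loader either.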
Useful flags for debugging inference
| Flag | What it shows |
|---|---|
| `-v` / `--verbose` | Per-step logging (tokenizer output, prompt eval) |
| `--logits-all` | Return logits for every token, not just the last |
| `--no-warmup` | Skip the warm-up batch; useful when chasing first-decode issues |
| `--check-tensors` | llama-cli flag that validates tensor numerics on load |
| `--gpu-layers N` | Try N=0 to confirm CPU-only behavior |
| `LLAMA_LOG_LEVEL=DEBUG` | Bump the log level globally |
| `GGML_BACKEND_LOG_LEVEL=info` | Show backend selection / kernel scheduling |
Tools that exist for debugging
- `examples/eval-callback/`: registers a callback that fires after each tensor evaluation, useful for inspecting intermediate activations.
- `examples/debug/`: a scratchpad for ad-hoc tensor experiments.
- `tools/mtmd/debug/`: diagnostics for the multimodal stack.
- `examples/gguf/`: inspect a GGUF file's metadata and tensor layout.
- `examples/gguf-hash/`: hash and verify GGUF tensor data.
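When inspecting intermediate activations from a per-tensor callback, a compact numeric summary is often more useful than dumping raw values, especially when hunting NaN logits. The sketch below is one possible shape for such a summary and is not taken from examples/eval-callback: `activation_stats` and `summarize_activations` are hypothetical names, and it assumes the tensor data has already been copied to host memory as 32-bit floats.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>

// Summary of one tensor's values (hypothetical helper type).
struct activation_stats {
    std::size_t n_nan;  // count of NaN entries
    std::size_t n_inf;  // count of +/-infinity entries
    float       min;    // smallest finite value (+inf if there are none)
    float       max;    // largest finite value (-inf if there are none)
    double      mean;   // mean of the finite values (0 if there are none)
};

// Scan a host-side float buffer and collect basic statistics.
activation_stats summarize_activations(const float * data, std::size_t n) {
    activation_stats s{0, 0,
                       std::numeric_limits<float>::infinity(),
                       -std::numeric_limits<float>::infinity(),
                       0.0};
    double      sum      = 0.0;
    std::size_t n_finite = 0;

    for (std::size_t i = 0; i < n; ++i) {
        const float v = data[i];
        if (std::isnan(v)) { s.n_nan++; continue; }
        if (std::isinf(v)) { s.n_inf++; continue; }
        if (v < s.min) { s.min = v; }
        if (v > s.max) { s.max = v; }
        sum += v;
        n_finite++;
    }

    if (n_finite > 0) {
        s.mean = sum / static_cast<double>(n_finite);
    }
    return s;
}
```

Comparing these summaries layer by layer between a CPU run and a GPU run is one way to narrow down the first tensor where the two diverge.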
Sanitizers
The CMake build supports the address and undefined-behavior sanitizers: configure with -DCMAKE_BUILD_TYPE=Debug and add -fsanitize=address,undefined to the compiler and linker flags. CI runs an ASan/UBSan build on every PR (.github/workflows/build.yml).
Reading a stack trace
libllama uses GGML_ASSERT and LLAMA_ASSERT (defined in ggml/src/ggml-impl.h and src/llama-impl.h). Both abort the process and print the file:line of the failing check. When triaging a crash, the file:line in the assert message is usually the right place to start.
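To make the file:line behavior concrete, here is a minimal sketch of an assert macro in the same spirit. It is an illustration only, not the definition used by ggml or libllama: `SKETCH_ASSERT` and `format_assert_failure` are hypothetical names, and the exact message format is an assumption.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Build the "file:line: ASSERT(expr) failed" message for a failed check.
std::string format_assert_failure(const char * file, int line, const char * expr) {
    return std::string(file) + ":" + std::to_string(line) +
           ": SKETCH_ASSERT(" + expr + ") failed";
}

// On failure: print the location of the check to stderr, then abort.
// __FILE__ and __LINE__ expand at the call site, which is why the message
// points at the failing check rather than at the macro definition.
#define SKETCH_ASSERT(x)                                                   \
    do {                                                                   \
        if (!(x)) {                                                        \
            std::fprintf(stderr, "%s\n",                                   \
                format_assert_failure(__FILE__, __LINE__, #x).c_str());    \
            std::abort();                                                  \
        }                                                                  \
    } while (0)
```

Because the process aborts, running under a debugger or with core dumps enabled gives a full backtrace from the same location the message names.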
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.