ggml-org/llama.cpp

Debugging

Active contributors: Georgi Gerganov, Daniel Bevenius

A grab bag of techniques and entry points for tracking down failures in llama.cpp.

Logging

Logging is centralized in common/log.cpp (with the public macros in common/log.h). Most binaries accept -v / --verbose and --log-disable.

libllama itself uses LLAMA_LOG_* macros (src/llama-impl.cpp, src/llama-impl.h) which call into a callback registered via llama_log_set. Tools register their own callback through common/log.cpp. Set GGML_LOG_LEVEL, LLAMA_LOG_LEVEL, or GGML_BACKEND_LOG_LEVEL to bump verbosity.
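As a rough illustration of the callback mechanism, the sketch below defines a log callback with ctypes and shows how it could be handed to llama_log_set. The callback signature (level, text, user_data) and the numeric level values are assumptions based on one reading of ggml.h and have changed between versions, so verify them against the headers you build with. The library path in the usage note is hypothetical.

```python
import ctypes

# Assumed to mirror the C typedef:
#   void (*ggml_log_callback)(enum ggml_log_level level, const char * text, void * user_data)
LOG_CALLBACK = ctypes.CFUNCTYPE(None, ctypes.c_int, ctypes.c_char_p, ctypes.c_void_p)

# Assumed enum values; these have been renumbered in the past, so check ggml.h.
LEVEL_NAMES = {0: "NONE", 1: "DEBUG", 2: "INFO", 3: "WARN", 4: "ERROR", 5: "CONT"}

captured = []  # (level_name, text) pairs collected by the callback


def _on_log(level, text, user_data):
    # text arrives as bytes; decode defensively since it may not be valid UTF-8.
    message = text.decode("utf-8", errors="replace") if text else ""
    captured.append((LEVEL_NAMES.get(level, str(level)), message))


# Keep a module-level reference so the callback is not garbage-collected
# while native code still holds the function pointer.
log_callback = LOG_CALLBACK(_on_log)


def install(lib):
    """Register the callback on an already-loaded libllama (a ctypes.CDLL)."""
    lib.llama_log_set.argtypes = [LOG_CALLBACK, ctypes.c_void_p]
    lib.llama_log_set.restype = None
    lib.llama_log_set(log_callback, None)
```

With a shared build, usage might look like `install(ctypes.CDLL("./libllama.so"))`, after which `captured` fills up as the library logs. Collecting messages in a list rather than printing them makes it easy to filter by level afterwards.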

For backend selection, ggml/src/ggml-backend-reg.cpp logs which devices were registered and which were chosen. Check this output first when the GPU is not being used.

Common failure modes

| Symptom | Likely cause | First place to look |
|---|---|---|
| error loading model: ... unknown architecture | Model arch not in LLM_ARCH_* enum | src/llama-arch.cpp, convert_hf_to_gguf.py |
| Garbage output / NaN logits on GPU | Backend op mismatch with CPU reference | Run test-backend-ops; check ggml/src/ggml-<backend>/<op>.* |
| Tokenizer mismatch with HF reference | Wrong vocab type or pre-tokenizer | src/llama-vocab.cpp, tests/test-tokenizer-* |
| Slow prompt processing | Missing BLAS / wrong number of threads | -t N, GGML_BLAS=ON, see ggml/src/ggml-blas/ |
| llama-server 503 / "no slot available" | All KV slots busy | tools/server/server-context.cpp slot loop, --parallel N |
| Killed during model load | OOM | Check --n-gpu-layers, --mlock, --no-mmap, --cache-type-k/-v |
| unknown chat template | Template not embedded in GGUF and none supplied | src/llama-chat.cpp, --chat-template-file |
| Weird CUDA error on multi-GPU | Tensor split misconfigured | --tensor-split, --main-gpu, ggml/src/ggml-cuda/ |
| terminate called after throwing 'std::runtime_error' | Often a malformed GGUF or missing tensor | Run with --verbose; ggml/src/gguf.cpp, src/llama-model-loader.cpp |
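For the "garbage output / NaN logits on GPU" row, test-backend-ops checks individual ops. A complementary whole-model check is to dump logits for the same prompt from a CPU-only run and a GPU run and compare them. The helper below is a minimal sketch of that comparison; how the logits are dumped is left open, and the tolerance is an arbitrary starting point, not a project-defined threshold.

```python
import math


def compare_logits(reference, candidate, atol=1e-3):
    """Compare two equal-length logit vectors, e.g. CPU (reference) vs GPU (candidate)."""
    if len(reference) != len(candidate):
        raise ValueError("logit vectors differ in length")
    # Indices where the candidate produced NaN or +/-inf.
    non_finite = [i for i, x in enumerate(candidate) if not math.isfinite(x)]
    # Absolute differences over positions where both values are finite.
    diffs = [
        abs(r - c)
        for r, c in zip(reference, candidate)
        if math.isfinite(r) and math.isfinite(c)
    ]
    max_abs_diff = max(diffs, default=0.0)
    return {
        "non_finite": non_finite,
        "max_abs_diff": max_abs_diff,
        "ok": not non_finite and max_abs_diff <= atol,
    }
```

A non-empty `non_finite` list points at a numerical failure in some op, while a large `max_abs_diff` with finite values suggests a precision or kernel mismatch worth narrowing down with test-backend-ops.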

Useful flags for debugging inference

| Flag | What it shows |
|---|---|
| -v / --verbose | Per-step logging (tokenizer output, prompt eval) |
| --logits-all | Return logits for every token, not just the last |
| --no-warmup | Skip the warm-up batch; useful when chasing first-decode issues |
| --check-tensors | llama-cli flag that validates tensor numerics on load |
| --gpu-layers N | Try N=0 to confirm CPU-only behavior |
| LLAMA_LOG_LEVEL=DEBUG | Bump the log level globally |
| GGML_BACKEND_LOG_LEVEL=info | Show backend selection / kernel scheduling |
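The flags above are often combined into one reproducible invocation. The sketch below only assembles the argument list and environment; it runs nothing. The helper name is hypothetical, the -m and -p options (model path and prompt) are not listed in the table and are assumed from llama-cli, and the environment variable is taken from the table as-is.

```python
import os


def debug_invocation(binary, model, prompt, gpu_layers=None, extra_env=None):
    """Return (argv, env) for a verbose, warm-up-free run; nothing is executed."""
    argv = [
        binary,
        "-m", model,        # assumed llama-cli option: model path
        "-p", prompt,       # assumed llama-cli option: prompt text
        "--verbose",
        "--no-warmup",
        "--check-tensors",
    ]
    if gpu_layers is not None:
        # gpu_layers=0 forces a CPU-only run for comparison.
        argv += ["--gpu-layers", str(gpu_layers)]
    env = dict(os.environ)      # start from the current environment
    env.update(extra_env or {}) # overlay debugging variables
    return argv, env
```

Calling it twice, once with `gpu_layers=0` and once with the usual value, and passing each result to `subprocess.run(argv, env=env)` gives a CPU/GPU pair of logs that differ only in offload settings.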

Tools that exist for debugging

  • examples/eval-callback/ — registers a callback that fires after each tensor evaluation, useful for inspecting intermediate activations.
  • examples/debug/ — a scratchpad for ad-hoc tensor experiments.
  • tools/mtmd/debug/ — diagnostics for the multimodal stack.
  • examples/gguf/ — inspect a GGUF file's metadata and tensor layout.
  • examples/gguf-hash/ — hash and verify GGUF tensor data.
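When a suspect file needs a quick sanity check before reaching for the tools above, the fixed-size GGUF header can be read directly. This is a minimal sketch, assuming a little-endian file and GGUF version 2 or later, where the header is the 4-byte magic "GGUF", a uint32 version, a uint64 tensor count, and a uint64 metadata key/value count. It does not parse metadata or tensor info.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path):
    """Return version and counts from a GGUF header; raise ValueError if malformed."""
    with open(path, "rb") as f:
        head = f.read(HEADER_SIZE)
    if len(head) < HEADER_SIZE:
        raise ValueError("file too short to contain a GGUF header")
    if head[:4] != GGUF_MAGIC:
        raise ValueError(f"bad magic {head[:4]!r}, expected {GGUF_MAGIC!r}")
    # "<" selects little-endian with no padding: I = uint32, Q = uint64.
    version, tensor_count, kv_count = struct.unpack("<IQQ", head[4:])
    if version < 2:
        # Assumption: version 1 used 32-bit counts, which this layout does not cover.
        raise ValueError(f"unsupported GGUF version {version}")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

A bad magic or an implausibly large count here usually means a truncated download or a non-GGUF file, which is cheaper to discover this way than through a loader exception.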

Sanitizers

The CMake build can be instrumented with the address and undefined-behavior sanitizers by configuring with -DCMAKE_BUILD_TYPE=Debug and adding -fsanitize=address,undefined to the compiler flags. CI runs an ASan/UBSan build on every PR (.github/workflows/build.yml).
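One way to script such a configuration is sketched below. It only builds the cmake command line; the helper name and the build directory are hypothetical, and passing the flags through the standard CMAKE_C_FLAGS and CMAKE_CXX_FLAGS variables is an assumption about how you want to inject them, not necessarily what the project's CI does.

```python
def sanitizer_configure_cmd(source_dir=".", build_dir="build-san",
                            sanitizers=("address", "undefined")):
    """Return a cmake configure command for a Debug build with sanitizers enabled."""
    # -fno-omit-frame-pointer keeps sanitizer stack traces readable.
    flags = "-fsanitize=" + ",".join(sanitizers) + " -fno-omit-frame-pointer"
    return [
        "cmake",
        "-S", source_dir,
        "-B", build_dir,
        "-DCMAKE_BUILD_TYPE=Debug",
        f"-DCMAKE_C_FLAGS={flags}",    # applied to C sources (ggml)
        f"-DCMAKE_CXX_FLAGS={flags}",  # applied to C++ sources (llama, tools)
    ]
```

Run the result with `subprocess.run(cmd, check=True)` and then build with `cmake --build build-san`. Using a separate build directory keeps the instrumented binaries apart from a regular release build.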

Reading a stack trace

libllama uses GGML_ASSERT and LLAMA_ASSERT (defined in ggml/src/ggml-impl.h and src/llama-impl.h). Both abort the process and print the file:line of the failed check. When triaging a crash, that file:line is usually the right place to start, before walking further up the stack.
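When logs are long, the assert location can be pulled out mechanically. The sketch below assumes the message has the shape `<file>:<line>: GGML_ASSERT(<condition>) failed`; that format is an assumption and may differ between versions, so adjust the pattern to what your build prints. The sample path in the usage note is made up.

```python
import re

# Assumed message shape: "<file>:<line>: GGML_ASSERT(<condition>) failed"
ASSERT_RE = re.compile(
    r"^(?P<file>.+?):(?P<line>\d+): (?P<macro>[A-Z_]*ASSERT)\((?P<cond>.*)\) failed"
)


def find_asserts(log_text):
    """Return (file, line, condition) for every assert-failure line in a log."""
    hits = []
    for raw in log_text.splitlines():
        m = ASSERT_RE.match(raw.strip())
        if m:
            hits.append((m["file"], int(m["line"]), m["cond"]))
    return hits
```

For example, feeding it a log containing `/src/ggml/src/ggml.c:1234: GGML_ASSERT(a->ne[0] == b->ne[0]) failed` yields the file, the line number as an integer, and the failed condition, ready to paste into an editor's go-to-line command.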

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
