ggml-org/llama.cpp
Library entry point
Active contributors: Georgi Gerganov
src/llama.cpp is small and acts as glue. The bulk of the library lives in sibling files (llama-model.cpp, llama-context.cpp, llama-vocab.cpp, ...). This page describes how the entry point wires everything together so newcomers can find their way around.
Purpose
Bind the public C API declared in include/llama.h to the internal C++ subsystems, register the GGML backends, and own the global init/free machinery.
Key entry points (in include/llama.h)
| Public symbol | Bound to |
|---|---|
| llama_backend_init / llama_backend_free | Global GGML init via ggml_backend_load_all (src/llama.cpp) |
| llama_model_load_from_file / llama_model_load_from_splits | llama_model_loader::load_* in src/llama-model-loader.cpp |
| llama_model_free | llama_model::~llama_model in src/llama-model.cpp |
| llama_init_from_model | llama_context constructor in src/llama-context.cpp |
| llama_free | llama_context::~llama_context |
| llama_decode, llama_encode | llama_context::decode / encode (graph build → schedule → run) |
| llama_get_logits*, llama_get_embeddings* | llama_context accessors |
| llama_kv_*, llama_memory_* | Forwarded to the active llama_memory impl (src/llama-memory*.cpp) |
| llama_tokenize, llama_token_to_piece, llama_detokenize | llama_vocab |
| llama_chat_apply_template | src/llama-chat.cpp |
| llama_sampler_* | src/llama-sampler.cpp |
| llama_grammar_* | src/llama-grammar.cpp |
| llama_adapter_lora_* | src/llama-adapter.cpp |
| llama_state_* | Save/restore via src/llama-context.cpp and src/llama-model-saver.cpp |
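The binding pattern in the table (a C function that hands out an opaque pointer and forwards to a C++ constructor, destructor, or method) can be illustrated with a minimal, self-contained sketch. Every name below (demo_context, demo_init, demo_free, demo_decode, demo_get_logits) is hypothetical and exists only for illustration; none of it is taken from llama.cpp, and the "decode" body is a placeholder rather than real inference.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Internal C++ side: hypothetical stand-in for a class such as llama_context.
struct demo_context {
    explicit demo_context(int32_t n_vocab) : logits(n_vocab, 0.0f) {}

    // Placeholder for the real work: bump one fake logit per input token.
    // Returns 0 on success, -1 on an invalid batch or out-of-range token id.
    int32_t decode(const int32_t * tokens, int32_t n_tokens) {
        if (tokens == nullptr || n_tokens <= 0) {
            return -1;
        }
        for (int32_t i = 0; i < n_tokens; ++i) {
            const int32_t id = tokens[i];
            if (id < 0 || id >= static_cast<int32_t>(logits.size())) {
                return -1;
            }
            logits[id] += 1.0f;
        }
        return 0;
    }

    std::vector<float> logits;
};

// Public C side: callers only ever see a pointer to an opaque struct.
extern "C" {

demo_context * demo_init(int32_t n_vocab) {
    assert(n_vocab > 0);
    return new demo_context(n_vocab);      // bound to the constructor
}

void demo_free(demo_context * ctx) {
    delete ctx;                            // bound to the destructor
}

int32_t demo_decode(demo_context * ctx, const int32_t * tokens, int32_t n_tokens) {
    return ctx->decode(tokens, n_tokens);  // one-line forwarder
}

const float * demo_get_logits(demo_context * ctx) {
    return ctx->logits.data();             // accessor
}

} // extern "C"
```

In this sketch a caller pairs demo_init with demo_free, the same way the table pairs llama_init_from_model with llama_free. Keeping the C layer this thin means the C++ class can change its internals without touching the public signatures.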
Registration of backends
llama_backend_init calls ggml_backend_load_all, which iterates the registered backends in ggml/src/ggml-backend-reg.cpp. With BUILD_SHARED_LIBS=ON, each backend ships as a separate libggml-<backend>.so/.dll and is loaded through ggml/src/ggml-backend-dl.cpp.
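The idea of a process-wide registry that is filled once and then iterated can be sketched as follows. This is a toy model under stated assumptions, not the ggml implementation: demo_backend_reg, demo_registry, demo_register_backend, demo_load_all, demo_reg_count, demo_reg_get, and demo_total_devices are all hypothetical names, and the backend list is hard-coded where a shared-library build would open separate backend libraries.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical registration entry: a backend name plus a device count.
struct demo_backend_reg {
    std::string name;
    int         n_devices;
};

// Process-wide registry; the function-local static is created on first use.
std::vector<demo_backend_reg> & demo_registry() {
    static std::vector<demo_backend_reg> regs;
    return regs;
}

// Register a backend once; a second registration with the same name is ignored.
void demo_register_backend(const std::string & name, int n_devices) {
    assert(n_devices >= 0);
    for (const auto & reg : demo_registry()) {
        if (reg.name == name) {
            return;
        }
    }
    demo_registry().push_back({name, n_devices});
}

// Stand-in for a "load all" step (assumption: the list is hard-coded here).
void demo_load_all() {
    demo_register_backend("cpu", 1);
    demo_register_backend("cuda", 2);
}

std::size_t demo_reg_count() {
    return demo_registry().size();
}

const demo_backend_reg & demo_reg_get(std::size_t index) {
    return demo_registry().at(index);  // throws std::out_of_range on a bad index
}

// Iterate the registry the way a caller would enumerate backends.
int demo_total_devices() {
    int total = 0;
    for (std::size_t i = 0; i < demo_reg_count(); ++i) {
        total += demo_reg_get(i).n_devices;
    }
    return total;
}
```

Making registration idempotent keeps repeated initialization harmless in this sketch, which is a convenient property for a global init call that several parts of an application might invoke.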
graph TD
App[Tool / app] -->|llama_backend_init| Llama[src/llama.cpp]
Llama --> Reg[ggml-backend-reg.cpp]
Reg --> CPU[ggml-cpu]
Reg --> CUDA[ggml-cuda]
Reg --> Metal[ggml-metal]
Reg --> Vulkan[ggml-vulkan]
Reg --> Other[sycl, opencl, hip, hexagon, rpc, webgpu, ...]
Implementation file
| File | Lines (~) | Purpose |
|---|---|---|
| src/llama.cpp | 19k | Public C API → internal C++ glue, backend init/free |
| src/llama-impl.h / .cpp | 6k | Logging macros, internal asserts, helper utilities |
| src/llama-ext.h | 3k | Extra symbols not in the stable public header |
Where to start when reading the code
- Open include/llama.h and pick the function you care about (e.g. llama_decode).
- Find its definition in src/llama.cpp; it's typically a one- or two-line forwarder.
- Follow the call into src/llama-context.cpp, src/llama-model.cpp, etc. for the actual work.
Most "real" inference logic lives in llama-context.cpp (the largest of the per-subsystem files) and the per-architecture builders under src/models/.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.