ggml-org/llama.cpp

Library entry point

Active contributors: Georgi Gerganov

src/llama.cpp is small and acts as glue. The bulk of the library lives in sibling files (llama-model.cpp, llama-context.cpp, llama-vocab.cpp, ...). This page describes how the entry point wires everything together so newcomers can find their way around.

Purpose

Bind the public C API declared in include/llama.h to the internal C++ subsystems, register the GGML backends, and own the global init/free machinery.

Key entry points (in `include/llama.h`)

| Public symbol | Bound to |
| --- | --- |
| `llama_backend_init` / `llama_backend_free` | Global GGML init via `ggml_backend_load_all` (`src/llama.cpp`) |
| `llama_model_load_from_file` / `llama_model_load_from_splits` | `llama_model_loader::load_*` in `src/llama-model-loader.cpp` |
| `llama_model_free` | `llama_model::~llama_model` in `src/llama-model.cpp` |
| `llama_init_from_model` | `llama_context` constructor in `src/llama-context.cpp` |
| `llama_free` | `llama_context::~llama_context` |
| `llama_decode`, `llama_encode` | `llama_context::decode` / `encode` (graph build → schedule → run) |
| `llama_get_logits`, `llama_get_embeddings` | `llama_context` accessors |
| `llama_kv_*`, `llama_memory_*` | Forwarded to the active `llama_memory` impl (`src/llama-memory*.cpp`) |
| `llama_tokenize`, `llama_token_to_piece`, `llama_detokenize` | `llama_vocab` |
| `llama_chat_apply_template` | `src/llama-chat.cpp` |
| `llama_sampler_*` | `src/llama-sampler.cpp` |
| `llama_grammar_*` | `src/llama-grammar.cpp` |
| `llama_adapter_lora_*` | `src/llama-adapter.cpp` |
| `llama_state_*` | Save/restore via `src/llama-context.cpp` and `src/llama-model-saver.cpp` |
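
Several rows above pair a C-style create function with a free function that ends in a C++ destructor. Callers of such an API often wrap each pair in a smart pointer so the free call cannot be forgotten. The sketch below shows that pattern in isolation. Every name in it (`demo_model`, `demo_model_load`, `demo_model_free`, `demo_model_ptr`) is hypothetical and only stands in for the real symbols; it does not use or reproduce the llama.cpp API.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for an opaque object handed out by a C-style API.
struct demo_model {
    int n_layers;
};

// Counts live objects so the effect of the deleter can be observed.
static int g_live_models = 0;

// C-style constructor: returns an owning raw pointer.
demo_model * demo_model_load(int n_layers) {
    assert(n_layers > 0);
    ++g_live_models;
    return new demo_model{n_layers};
}

// C-style destructor: must be called exactly once per loaded object.
void demo_model_free(demo_model * m) {
    if (m != nullptr) {
        --g_live_models;
        delete m;
    }
}

// Deleter that forwards to the C-style free function.
struct demo_model_deleter {
    void operator()(demo_model * m) const { demo_model_free(m); }
};

// Owning handle: the free function runs when the handle goes out of scope.
using demo_model_ptr = std::unique_ptr<demo_model, demo_model_deleter>;
```

Constructing a `demo_model_ptr` from `demo_model_load(4)` inside a scope frees the object when the scope ends, with no explicit call to `demo_model_free`.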

Registration of backends

llama_backend_init calls ggml_backend_load_all, which iterates the registered backends in ggml/src/ggml-backend-reg.cpp. With BUILD_SHARED_LIBS=ON, each backend ships as a separate libggml-<backend>.so/.dll and is loaded through ggml/src/ggml-backend-dl.cpp.

graph TD
    App[Tool / app] -->|llama_backend_init| Llama[src/llama.cpp]
    Llama --> Reg[ggml-backend-reg.cpp]
    Reg --> CPU[ggml-cpu]
    Reg --> CUDA[ggml-cuda]
    Reg --> Metal[ggml-metal]
    Reg --> Vulkan[ggml-vulkan]
    Reg --> Other[sycl, opencl, hip, hexagon, rpc, webgpu, ...]
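
To make the "iterate the registered backends" idea concrete, here is a minimal, self-contained sketch of a registry that holds backend descriptors and probes each one. It is a toy model under stated assumptions, not the ggml implementation: `demo_backend`, `demo_registry`, and the probe functions are hypothetical names, and the real registry additionally deals with devices and dynamically loaded libraries.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical descriptor for one backend (not a ggml type).
struct demo_backend {
    std::string name;
    bool (*probe)(); // returns true if the backend is usable on this machine
};

class demo_registry {
public:
    // Add a backend descriptor to the registry.
    void register_backend(const demo_backend & b) {
        assert(!b.name.empty());
        backends_.push_back(b);
    }

    // Probe every registered backend and return the names of the usable ones.
    std::vector<std::string> load_all() const {
        std::vector<std::string> usable;
        for (const demo_backend & b : backends_) {
            if (b.probe != nullptr && b.probe()) {
                usable.push_back(b.name);
            }
        }
        return usable;
    }

    std::size_t size() const { return backends_.size(); }

private:
    std::vector<demo_backend> backends_;
};

// Trivial probes used to exercise the registry.
inline bool demo_probe_always() { return true; }
inline bool demo_probe_never()  { return false; }
```

Registering a "cpu" backend with `demo_probe_always` and a "cuda" backend with `demo_probe_never` makes `load_all()` return only "cpu", which mirrors how an application ends up with just the backends that are present on the host.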

Implementation file

| File | Lines (~) | Purpose |
| --- | --- | --- |
| `src/llama.cpp` | 19k | Public C API → internal C++ glue, backend init/free |
| `src/llama-impl.h` / `.cpp` | 6k | Logging macros, internal asserts, helper utilities |
| `src/llama-ext.h` | 3k | Extra symbols not in the stable public header |

Where to start when reading the code

1. Open include/llama.h and pick the function you care about (e.g. llama_decode).
2. Find its definition in src/llama.cpp; it is typically a one- or two-line forwarder.
3. Follow the call into src/llama-context.cpp, src/llama-model.cpp, etc. for the actual work.
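
The forwarder shape mentioned in the steps above can be sketched as follows. This is an illustrative, self-contained model only: `demo_context` and the `demo_*` functions are hypothetical stand-ins, and the arithmetic inside `decode` is a placeholder rather than anything the library computes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical internal C++ class; the real work lives in its methods.
struct demo_context {
    std::vector<float> logits;

    // Returns 0 on success, -1 on invalid input.
    int32_t decode(const int32_t * tokens, int32_t n_tokens) {
        if (tokens == nullptr || n_tokens <= 0) {
            return -1;
        }
        logits.assign(static_cast<std::size_t>(n_tokens), 0.0f);
        for (int32_t i = 0; i < n_tokens; ++i) {
            // Placeholder computation standing in for graph build and execution.
            logits[static_cast<std::size_t>(i)] = static_cast<float>(tokens[i]) * 0.5f;
        }
        return 0;
    }
};

// Public C-style surface: each function is a thin forwarder into the class.
extern "C" {

demo_context * demo_init(void) {
    return new demo_context();
}

void demo_free(demo_context * ctx) {
    delete ctx;
}

int32_t demo_decode(demo_context * ctx, const int32_t * tokens, int32_t n_tokens) {
    assert(ctx != nullptr);
    return ctx->decode(tokens, n_tokens);
}

const float * demo_get_logits(demo_context * ctx) {
    assert(ctx != nullptr);
    return ctx->logits.data();
}

} // extern "C"
```

Keeping the C layer this thin means the public header can stay stable while the C++ classes behind it are reorganized, which is why reading the forwarder first is usually enough to locate the method that does the work.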

Most "real" inference logic lives in llama-context.cpp (the largest of the per-subsystem files) and the per-architecture builders under src/models/.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
