ggml-org/llama.cpp
Systems
The libllama library is built from src/. It is organized into a handful of internal subsystems that together turn a GGUF file plus a sequence of tokens into logits, sampled tokens, and chat-formatted text. Each subsystem is roughly one or two .cpp files plus matching headers.
This wiki uses the systems lens because llama.cpp's main library is best understood as architectural building blocks, not as a workspace of packages or independent applications. The deployable units of the project (the binaries) are documented under Tools.
Map
graph LR
subgraph Loading
Loader[llama-model-loader]
Mmap[llama-mmap]
Adapter[llama-adapter]
Saver[llama-model-saver]
end
subgraph Model
Arch[llama-arch]
Hparams[llama-hparams]
Cparams[llama-cparams]
Model2[llama-model]
Vocab[llama-vocab]
Models[src/models/*]
end
subgraph Runtime
Context[llama-context]
Batch[llama-batch]
Graph[llama-graph]
IO[llama-io]
end
subgraph Memory
Memory[llama-memory]
KV[llama-kv-cache]
ISWA[llama-kv-cache-iswa]
Recurrent[llama-memory-recurrent]
Hybrid[llama-memory-hybrid + iswa]
end
subgraph Generation
Sampler[llama-sampler]
Grammar[llama-grammar]
Chat[llama-chat]
end
Loader --> Model2
Model2 --> Models
Model2 --> Context
Context --> Batch
Context --> Graph
Graph --> Memory
Memory --> KV
Memory --> ISWA
Memory --> Recurrent
Memory --> Hybrid
Context --> Sampler
Sampler --> Grammar
Context --> Chat
Pages in this section
- Library entry point: what `src/llama.cpp` actually contains and how the pieces are wired.
- Model loader: reading GGUF files and constructing `llama_model`.
- Architecture switch: how `LLM_ARCH_*` enum values map to per-model graph builders under `src/models/`.
- Vocab and tokenizer: the BPE/SPM/WPM/UGM/RWKV implementations.
- Computation graph: how `llama-graph.cpp` builds per-batch tensor graphs.
- KV cache and memory: the standard, sliding-window, recurrent, and hybrid memory backends.
- Sampler: the chained sampler API and built-in samplers.
- Grammar: GBNF parser and llguidance integration.
- Chat templates: `llama-chat` (built-in) plus `common/chat.cpp` (Jinja-based).
- Quantization: the `llama-quantize` path and supported quant types.
- Adapters: LoRA and control-vector loading.
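The "architecture switch" idea in the list above, where an enum value selects a per-model graph builder, can be illustrated with a small self-contained sketch. Everything here is hypothetical: `toy_arch`, `toy_graph`, and the `build_*` functions are stand-ins invented for illustration and are not the real `LLM_ARCH_*` enum or the builders under `src/models/`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an architecture enum (not the real LLM_ARCH_* values).
enum class toy_arch { LLAMA, MAMBA, UNKNOWN };

// Hypothetical stand-in for a built compute graph: just a text summary.
struct toy_graph {
    std::string description;
};

// One builder per architecture, in the spirit of one source file per model.
static toy_graph build_llama(int n_layer) {
    return { "attention+ffn x" + std::to_string(n_layer) };
}

static toy_graph build_mamba(int n_layer) {
    return { "ssm x" + std::to_string(n_layer) };
}

// The switch: map an enum value to the builder that knows that architecture.
static toy_graph toy_build_graph(toy_arch arch, int n_layer) {
    assert(n_layer > 0);
    switch (arch) {
        case toy_arch::LLAMA: return build_llama(n_layer);
        case toy_arch::MAMBA: return build_mamba(n_layer);
        default: throw std::runtime_error("unsupported architecture");
    }
}
```

In this sketch, adding an architecture means adding one enum value, one builder, and one `case`. The rest of the runtime only ever calls `toy_build_graph`.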
Key abstractions
| Type | Role | File |
|---|---|---|
| `llama_model` | Loaded weights + vocab + hparams | `src/llama-model.h` |
| `llama_vocab` | Tokenizer state | `src/llama-vocab.h` |
| `llama_context` | Per-conversation runtime state | `src/llama-context.h` |
| `llama_batch` | Tokens to decode in one call | `include/llama.h`, `src/llama-batch.cpp` |
| `llama_kv_cache` | KV slots for attention | `src/llama-kv-cache.h` |
| `llama_sampler` | A single sampling step | `src/llama-sampler.cpp` (chain in `include/llama.h`) |
| `llama_grammar` | Constrained-decode rule set | `src/llama-grammar.h` |
| `llama_adapter_lora` | LoRA adapter | `src/llama-adapter.h` |
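To make the "single sampling step" and "chain" roles in the table more concrete, here is a minimal sketch of a sampler chain: each stage transforms a candidate list in order, and the final pick is greedy. It assumes nothing about the real `llama_sampler` API. All `toy_*` names are hypothetical, and the greedy final pick is a simplification chosen for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical toy types; these are not the llama_sampler structures.
struct toy_candidate {
    int   token;
    float logit;
};
using toy_candidates = std::vector<toy_candidate>;
using toy_stage      = std::function<void(toy_candidates &)>;

// Stage: keep only the k highest-logit candidates.
static toy_stage toy_top_k(std::size_t k) {
    return [k](toy_candidates & c) {
        std::sort(c.begin(), c.end(),
                  [](const toy_candidate & a, const toy_candidate & b) {
                      return a.logit > b.logit;
                  });
        if (c.size() > k) {
            c.resize(k);
        }
    };
}

// Stage: divide every logit by a positive temperature.
static toy_stage toy_temperature(float t) {
    assert(t > 0.0f);
    return [t](toy_candidates & c) {
        for (auto & x : c) {
            x.logit /= t;
        }
    };
}

// A chain applies its stages in insertion order, then picks the best candidate.
struct toy_chain {
    std::vector<toy_stage> stages;

    void add(toy_stage s) { stages.push_back(std::move(s)); }

    int sample(const std::vector<float> & logits) const {
        assert(!logits.empty());
        toy_candidates c;
        for (std::size_t i = 0; i < logits.size(); ++i) {
            c.push_back({ static_cast<int>(i), logits[i] });
        }
        for (const auto & s : stages) {
            s(c);
        }
        assert(!c.empty());
        // Greedy final pick: the candidate with the highest remaining logit.
        return std::max_element(c.begin(), c.end(),
                                [](const toy_candidate & a, const toy_candidate & b) {
                                    return a.logit < b.logit;
                                })->token;
    }
};
```

Because every stage shares one signature, a constraint such as a grammar filter could in principle be expressed as one more stage that removes disallowed candidates before the final pick.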
For per-tool usage of these subsystems, see Tools. For the public C API surface that exposes them, see API.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.