ggml-org/llama.cpp
Architecture
llama.cpp is layered. At the bottom is ggml, a portable tensor library with one compute backend per accelerator family. On top of ggml sits libllama, which loads GGUF model files, builds the per-architecture computation graph, manages the KV cache, and exposes a stable C API in include/llama.h. On top of libllama sits common/, a collection of helpers (argument parsing, sampling presets, chat templates, downloading) that the in-tree command-line tools share. Each tool under tools/ is a separate binary built against libllama plus common.
graph TD
subgraph Tools["tools/ (binaries)"]
CLI[llama-cli]
Server[llama-server]
Quantize[llama-quantize]
Bench[llama-bench]
Mtmd[llama-mtmd-cli]
Perp[llama-perplexity]
Imatrix[llama-imatrix]
Other[gguf-split, tokenize, tts, ...]
end
subgraph Common["common/ (helpers)"]
Args[arg.cpp]
Sampling[sampling.cpp]
Chat[chat.cpp + jinja/]
Download[download.cpp + hf-cache.cpp]
Console[console.cpp + log.cpp]
end
subgraph Llama["src/ (libllama, public API in include/llama.h)"]
ModelLoader[llama-model-loader]
Model[llama-model]
Vocab[llama-vocab]
Context[llama-context]
Graph[llama-graph]
KV[llama-kv-cache + memory-*]
Sampler[llama-sampler]
Grammar[llama-grammar]
Chat2[llama-chat]
end
subgraph GGML["ggml/ (libggml)"]
GgmlCore[ggml.c / ggml-alloc / ggml-backend]
Quants[ggml-quants.c]
CPU[ggml-cpu/]
CUDA[ggml-cuda/]
Metal[ggml-metal/]
Vulkan[ggml-vulkan/]
Other2[sycl, opencl, hexagon, rpc, webgpu, zdnn, ...]
end
Tools --> Common
Tools --> Llama
Common --> Llama
Llama --> GGML
GgmlCore --> CPU
GgmlCore --> CUDA
GgmlCore --> Metal
GgmlCore --> Vulkan
GgmlCore --> Other2
Inference data flow
A typical generation request walks the stack like this:
sequenceDiagram
participant U as User / tool
participant API as libllama (llama.h)
participant Loader as llama-model-loader
participant Vocab as llama-vocab
participant Ctx as llama-context
participant Graph as llama-graph
participant KV as llama-kv-cache
participant Smp as llama-sampler
participant GGML as ggml backend
U->>API: llama_model_load_from_file(path)
API->>Loader: read GGUF header + metadata
Loader->>Vocab: build tokenizer
Loader-->>API: llama_model
U->>API: llama_init_from_model(params)
API->>Ctx: allocate context, KV cache, scheduler
U->>API: llama_tokenize(text)
API->>Vocab: BPE / SPM / WPM / UGM
U->>API: llama_decode(batch)
API->>Graph: build_graph(arch, batch)
Graph->>KV: read / write KV slots
Graph->>GGML: ggml_backend_sched_compute(graph)
GGML-->>API: logits
U->>API: llama_sampler_sample(smpl, ctx, idx)
API->>Smp: chain (top-k, top-p, temp, grammar, penalties)
Smp-->>U: next token
The llama_decode call is where most of the cost lives. It builds a per-batch ggml computation graph (different for each architecture; see the per-model files under src/models/), splits it across the registered backends with ggml_backend_sched, runs the kernels, and returns logits or embeddings.
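The final step of the sequence above, the sampler chain, can be pictured as a list of stages that each narrow or reshape the candidate tokens before one is picked. The sketch below is a standalone toy, not libllama code: `Candidate`, `make_top_k`, `make_temperature`, `apply_chain`, and `pick_greedy` are hypothetical names invented for illustration, and the real llama-sampler implementation may be organized differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// One candidate token with its (possibly rescaled) logit. Hypothetical type.
struct Candidate {
    int   token;
    float logit;
};

using Candidates   = std::vector<Candidate>;
using SamplerStage = std::function<void(Candidates &)>;

// Stage: keep only the k highest-logit candidates, sorted best first.
SamplerStage make_top_k(std::size_t k) {
    return [k](Candidates & c) {
        std::sort(c.begin(), c.end(), [](const Candidate & a, const Candidate & b) {
            return a.logit > b.logit;
        });
        if (c.size() > k) c.resize(k);
    };
}

// Stage: divide every logit by the temperature (t < 1 sharpens, t > 1 flattens).
SamplerStage make_temperature(float t) {
    return [t](Candidates & c) {
        for (auto & x : c) x.logit /= t;
    };
}

// Turn raw logits into candidates and run every stage in order.
Candidates apply_chain(const std::vector<float> & logits,
                       const std::vector<SamplerStage> & chain) {
    Candidates c;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        c.push_back({static_cast<int>(i), logits[i]});
    }
    for (const auto & stage : chain) stage(c);
    return c;
}

// Final selection: the highest-logit survivor, or -1 if nothing is left.
int pick_greedy(const Candidates & c) {
    if (c.empty()) return -1;
    auto best = std::max_element(c.begin(), c.end(), [](const Candidate & a, const Candidate & b) {
        return a.logit < b.logit;
    });
    return best->token;
}
```

Calling `pick_greedy(apply_chain(logits, {make_top_k(40), make_temperature(0.8f)}))` would run both stages and return a token index. A production chain would end in a random draw from the resulting distribution rather than a greedy pick; the greedy choice keeps this sketch deterministic.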
Backends
GGML backends are loaded at runtime through ggml-backend-reg.cpp. Each backend implements a small interface (allocator, buffer type, device list, kernel dispatch) declared in ggml/src/ggml-backend-impl.h. Some backends (CUDA, Metal, Vulkan) ship as dynamically loaded plugins through ggml-backend-dl.cpp so the same libllama build can pick the right accelerator at runtime.
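To make the idea of runtime backend selection concrete, here is a minimal sketch of a registry that picks the best backend that is actually present. It is a toy model under stated assumptions: `ToyBackend`, `ToyRegistry`, and the numeric `priority` field are hypothetical and do not mirror the interface declared in ggml-backend-impl.h.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified description of a backend.
struct ToyBackend {
    std::string name;
    int         priority;   // higher wins, e.g. an accelerator above the CPU
    bool        available;  // whether a matching device was found at runtime
};

class ToyRegistry {
public:
    // Register a backend; in this toy model registration order does not matter.
    void add(const ToyBackend & b) { backends_.push_back(b); }

    // Return the available backend with the highest priority, or nullptr if none.
    // The pointer is only valid until the next call to add().
    const ToyBackend * pick() const {
        const ToyBackend * best = nullptr;
        for (const auto & b : backends_) {
            if (!b.available) continue;
            if (best == nullptr || b.priority > best->priority) best = &b;
        }
        return best;
    }

private:
    std::vector<ToyBackend> backends_;
};
```

The point of the design is that the caller never names an accelerator: it registers whatever was discovered and asks the registry for the best match, which is how one build can serve machines with different hardware.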
Schedulers in ggml/src/ggml-backend.cpp decide which tensors live on which backend. CPU+GPU hybrid inference is implemented by letting ggml_backend_sched move tensors between CPU and GPU buffers, allowing models larger than VRAM to spill to RAM.
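The spill-to-RAM behaviour can be illustrated with a simple placement rule: fill the GPU with layers in order until the memory budget runs out, then keep everything else on the CPU. This is only a sketch of the idea and does not reproduce the actual logic of ggml_backend_sched; `Device`, `place_layers`, and the byte sizes are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class Device { GPU, CPU };

// Place layers on the GPU in order until one no longer fits in the remaining
// budget; that layer and all later ones stay on the CPU (one contiguous split).
std::vector<Device> place_layers(const std::vector<std::uint64_t> & layer_bytes,
                                 std::uint64_t vram_budget) {
    std::vector<Device> placement;
    placement.reserve(layer_bytes.size());
    std::uint64_t used    = 0;      // invariant: used <= vram_budget
    bool          spilled = false;  // set once the first layer fails to fit
    for (std::uint64_t sz : layer_bytes) {
        if (!spilled && sz <= vram_budget - used) {
            used += sz;
            placement.push_back(Device::GPU);
        } else {
            spilled = true;
            placement.push_back(Device::CPU);
        }
    }
    return placement;
}
```

With four layers of 4 units each and a budget of 10, the first two land on the GPU and the last two on the CPU. Keeping the split contiguous means activations cross the CPU/GPU boundary once per forward pass in this toy model, rather than bouncing back and forth.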
See Backends for per-backend details.
Process layout
llama.cpp is intentionally single-process. There is no daemon, no IPC layer, and no shared scheduler. Each tool binary is self-contained: it loads a model, holds it in memory, and exits when it finishes. The HTTP server (tools/server) is the one exception to this run-and-exit pattern: it stays resident, owns a single in-process model, and serves multiple HTTP clients via a queue (tools/server/server-queue.cpp) and a per-slot server_context (tools/server/server-context.cpp).
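The queue-plus-slots arrangement in the server can be sketched as a FIFO of pending requests feeding a fixed number of slots. The class below is a hypothetical toy (`ToySlotServer`, `submit`, `dispatch`, and `finish` are invented names) and makes no claim about how server-queue.cpp or server-context.cpp are actually written.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Marker for a slot that is not serving any request (hypothetical convention).
constexpr int kIdleSlot = -1;

// Toy model: request ids wait in a FIFO queue until a slot becomes free.
class ToySlotServer {
public:
    explicit ToySlotServer(std::size_t n_slots) : slots_(n_slots, kIdleSlot) {}

    // Enqueue a request; it is not started until dispatch() finds an idle slot.
    void submit(int request_id) { pending_.push_back(request_id); }

    // Move queued requests into idle slots, oldest first.
    // Returns the (slot index, request id) pairs that were started.
    std::vector<std::pair<std::size_t, int>> dispatch() {
        std::vector<std::pair<std::size_t, int>> started;
        for (std::size_t i = 0; i < slots_.size() && !pending_.empty(); ++i) {
            if (slots_[i] != kIdleSlot) continue;
            slots_[i] = pending_.front();
            pending_.pop_front();
            started.emplace_back(i, slots_[i]);
        }
        return started;
    }

    // Mark a slot as finished so a later dispatch() can reuse it.
    void finish(std::size_t slot) { slots_.at(slot) = kIdleSlot; }

    std::size_t queued() const { return pending_.size(); }

private:
    std::vector<int> slots_;
    std::deque<int>  pending_;
};
```

With two slots and three submitted requests, the first `dispatch()` starts two of them and leaves one queued; after `finish(0)`, the next `dispatch()` places the waiting request in slot 0. All clients share the one loaded model, so the number of slots bounds concurrency, not the number of model copies.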
File and directory map
| Top-level path | Purpose |
|---|---|
| include/llama.h | Public C API |
| include/llama-cpp.h | C++ smart-pointer convenience header |
| src/ | libllama implementation |
| src/models/ | Per-architecture graph builders (LLaMA, Gemma, Qwen, Phi, Mamba, RWKV, ...) |
| ggml/include/, ggml/src/ | libggml and its backends |
| common/ | Shared helpers used by every binary tool |
| tools/ | Standalone command-line programs |
| examples/ | Smaller demos and platform integrations (Android, Swift, vim plugin) |
| tests/ | C++ unit and integration tests |
| convert_*.py, gguf-py/ | Python tooling for GGUF conversion |
| docs/ | Markdown user docs (build, backends, multimodal, function calling, ops) |
| vendor/ | Third-party single-header libraries |
| cmake/, CMakeLists.txt, CMakePresets.json | CMake build system |
| .github/workflows/ | CI definitions (lint, build, server, release) |
| ci/ | Long-form CI scripts run on ggml-ci self-hosted runners |
| models/templates/ | Reference Jinja chat templates |
| grammars/ | Sample GBNF grammars used by the constrained sampler |
For statistics on size and churn, see By the numbers.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.