ggml-org/llama.cpp
Computation graph
Active contributors: Sigbjørn Skjæret, Georgi Gerganov
Every call to llama_decode builds a fresh ggml_cgraph for the given batch, hands it to ggml_backend_sched for split-and-execute, and returns logits. src/llama-graph.cpp is where that graph is constructed.
Purpose
- Build a computation graph that evaluates the model for the tokens in a llama_batch.
- Connect each architecture's per-model builder (src/models/<arch>.cpp) to shared layer primitives (RoPE, attention, MoE routing, layer norms).
- Provide a uniform interface to the active memory backend (llama-kv-cache, llama-kv-cache-iswa, llama-memory-recurrent, llama-memory-hybrid).
Directory layout
src/
├── llama-graph.h # llm_build_*, helpers for attention/RoPE/FFN/MoE/SSM
├── llama-graph.cpp # ~101 KB; entry point + shared primitives
├── llama-context.cpp # owns the graph build + scheduler + KV cache
└── models/<arch>.cpp # per-arch builder using llama-graph helpers
Key abstractions
| Type | Role |
|---|---|
| llm_graph_input_* | Per-batch input tensors (token ids, positions, mask, MoE expert mapping, ...) |
| llm_graph_context | Holds the current graph, the model's hparams/cparams, and the memory state |
| llm_build_<arch> (in src/models/) | Subclass that wires layers together |
| llm_graph_result | Outputs: logits, embeddings, hidden states |
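The relationship between these types can be sketched with a toy model. Every name below (toy_inputs, toy_context, toy_build_llama, toy_result) is a hypothetical stand-in rather than the real llama.cpp class, and the "graph" is only a list of op names. The sketch assumes a shared context that offers layer helpers and a per-architecture subclass that only decides the wiring order.

```cpp
#include <cassert>
#include <string>
#include <vector>

// All names here are hypothetical stand-ins, not the real llama.cpp types.

// Per-batch inputs: one position per token.
struct toy_inputs {
    std::vector<int> tokens;
    std::vector<int> positions;
};

// Result of a build: the emitted ops and the name of the output tensor.
struct toy_result {
    std::vector<std::string> ops;
    std::string output;
};

// Shared context: holds model parameters and exposes layer helpers.
struct toy_context {
    int n_layer;
    toy_result res;

    explicit toy_context(int n) : n_layer(n) {}
    virtual ~toy_context() = default;

    void norm() { res.ops.push_back("norm"); }
    void attn() { res.ops.push_back("attn"); }
    void ffn()  { res.ops.push_back("ffn"); }

    virtual void build(const toy_inputs & inp) = 0;
};

// Per-architecture subclass: only decides the order of the helpers.
struct toy_build_llama : toy_context {
    using toy_context::toy_context;

    void build(const toy_inputs & inp) override {
        assert(inp.tokens.size() == inp.positions.size());
        for (int il = 0; il < n_layer; ++il) {
            norm(); attn(); // attention sub-block
            norm(); ffn();  // feed-forward sub-block
        }
        res.output = "logits";
    }
};
```

Constructing toy_build_llama with two layers and calling build on a three-token input leaves eight op names in res.ops. The point of the shape is that a new architecture adds a subclass and leaves the shared helpers untouched.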
How it works
graph TD
Decode[llama_decode]
Ctx[llama_context::decode]
PrepBatch[llama_batch_allocr]
Mem[memory_state apply / commit]
BuildGraph[llama-graph.cpp build_graph]
PerArch[src/models/<arch>.cpp llm_build_<arch>]
Sched[ggml_backend_sched_alloc_graph]
Compute[ggml_backend_sched_graph_compute]
Output[get logits / embeddings]
Decode --> Ctx
Ctx --> PrepBatch
PrepBatch --> Mem
Mem --> BuildGraph
BuildGraph --> PerArch
PerArch --> Sched
Sched --> Compute
Compute --> Output
Phases
- Batch preparation. src/llama-batch.cpp validates and packs the user batch (llama_batch) into an internal representation with sequence ids and per-token positions.
- Memory apply. The active memory backend (KV cache or recurrent state) computes which slots will be read/written and produces a "memory state" object the graph builds against. See KV cache and memory.
- Graph build. llama-graph.cpp::build dispatches on model->arch and invokes the per-arch builder. The builder emits tensor ops using helpers like build_attn, build_ffn, build_moe_ffn, build_rope, build_rwkv6_time_mix, etc., all defined in llama-graph.cpp.
- Schedule. ggml_backend_sched_alloc_graph splits the graph across the registered backends, allocating per-tensor buffers.
- Compute. ggml_backend_sched_graph_compute runs the kernels. Each backend implements its kernels in ggml/src/ggml-<backend>/.
- Output. llama-context.cpp reads the logits/embeddings tensor out of the scheduler and exposes them via llama_get_logits / llama_get_embeddings.
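The Schedule phase can be pictured with a small sketch. It assumes every node already carries a backend assignment and that a split is a contiguous run of nodes on one backend. toy_node, toy_split, and make_splits are hypothetical names for illustration; the real scheduler also handles buffer allocation and cross-backend copies, which are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical node: an op name plus the backend it was assigned to.
struct toy_node {
    std::string op;
    std::string backend;
};

// A split is a contiguous run of nodes that all run on one backend.
struct toy_split {
    std::string backend;
    std::size_t begin; // index of the first node in the run
    std::size_t end;   // one past the last node in the run
};

// Walk the nodes in graph order and start a new split whenever the
// assigned backend changes.
inline std::vector<toy_split> make_splits(const std::vector<toy_node> & nodes) {
    std::vector<toy_split> splits;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        assert(!nodes[i].backend.empty() && "node has no backend assigned");
        if (splits.empty() || splits.back().backend != nodes[i].backend) {
            splits.push_back({nodes[i].backend, i, i + 1});
        } else {
            splits.back().end = i + 1; // extend the current run
        }
    }
    return splits;
}
```

With four nodes assigned CPU, GPU, GPU, CPU, this yields three splits, and each boundary between splits is where data would have to move between backends.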
Shared layer primitives
src/llama-graph.cpp exposes building blocks that almost every architecture uses:
| Primitive | Purpose |
|---|---|
| build_attn | Multi-head attention with optional GQA, ALiBi, sliding window |
| build_attn_qkv | Combined QKV projection variants |
| build_rope | RoPE position encoding (vanilla, NeoX, GLM, frequency scaling) |
| build_ffn | Dense feed-forward (gate/up/down) |
| build_moe_ffn | Mixture-of-Experts routing + expert selection |
| build_norm | RMSNorm / LayerNorm wrapper |
| build_lora_mm | LoRA-aware matmul that applies adapters when present |
| build_rwkv*_time_mix / build_mamba_layer | Recurrent / SSM cores |
| build_inp_* | Helpers to materialize per-batch input tensors |
A typical src/models/<arch>.cpp is ~200–800 lines that call these primitives in the right order.
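To show what one of these primitives computes, here is the textbook RMSNorm formula on a plain vector. This is not the ggml implementation. The function name rms_norm, its signature, and the default eps are assumptions for the sketch, and the real primitive emits graph ops instead of computing values eagerly.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Textbook RMSNorm: y[i] = x[i] / sqrt(mean(x^2) + eps) * w[i].
// Hypothetical helper for illustration only.
inline std::vector<float> rms_norm(const std::vector<float> & x,
                                   const std::vector<float> & w,
                                   float eps = 1e-6f) {
    assert(!x.empty() && x.size() == w.size());

    // Accumulate the sum of squares in double to limit rounding error.
    double sum_sq = 0.0;
    for (float v : x) {
        sum_sq += static_cast<double>(v) * v;
    }
    const double inv_rms = 1.0 / std::sqrt(sum_sq / x.size() + eps);

    // Normalize, then apply the learned per-channel weight.
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = static_cast<float>(x[i] * inv_rms) * w[i];
    }
    return y;
}
```

For x = {3, 4} with unit weights the mean square is 12.5, so each element is divided by roughly 3.5355.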
Output paths
A graph can produce one of three things, controlled by cparams.embeddings and similar flags:
- Logits for the last token of each sequence, which is the standard generation case.
- All-token logits when --logits-all is set.
- Embeddings, either pooled (mean / cls / last) or per-token, when the context was created with embeddings=true. The pooling type is encoded in GGUF metadata for embedding models.
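The three pooled variants can be illustrated on plain vectors. The sketch assumes the common meaning of each mode: mean averages all token embeddings of a sequence, cls takes the first token, and last takes the final token. toy_pooling and pool are hypothetical names, and the input is a single sequence's per-token embeddings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pooling selector for illustration.
enum class toy_pooling { MEAN, CLS, LAST };

// Reduce per-token embeddings (n_tokens x n_embd) of one sequence
// to a single vector of size n_embd.
inline std::vector<float> pool(const std::vector<std::vector<float>> & tok,
                               toy_pooling type) {
    assert(!tok.empty() && "need at least one token");
    switch (type) {
        case toy_pooling::CLS:  return tok.front(); // first token
        case toy_pooling::LAST: return tok.back();  // final token
        case toy_pooling::MEAN: break;              // handled below
    }

    // Mean pooling: element-wise average over all tokens.
    std::vector<float> out(tok.front().size(), 0.0f);
    for (const auto & row : tok) {
        assert(row.size() == out.size());
        for (std::size_t j = 0; j < out.size(); ++j) {
            out[j] += row[j];
        }
    }
    for (float & v : out) {
        v /= static_cast<float>(tok.size());
    }
    return out;
}
```

Per-token embeddings correspond to skipping this reduction and returning the rows unchanged.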
Integration points
- llama-context. Owns the graph build, the GGML scheduler, and the per-call buffers. Most callers only see llama_decode.
- ggml-backend-sched. Decides which tensors live on which backend; tunable via --tensor-split, --main-gpu, --n-gpu-layers.
- Memory backends. The graph reads from and writes to whichever llama_memory was selected at context init. See KV cache and memory.
- Adapters. When LoRA adapters are loaded, build_lora_mm injects them. See Adapters.
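The idea behind a LoRA-aware matmul can be sketched on plain vectors using the standard LoRA formulation: y = W x, plus scale * B (A x) when an adapter is present. The names toy_lora, matvec, and lora_mm are hypothetical, the single scalar scale is a simplification, and the real helper builds graph ops on tensors instead of computing values eagerly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using vec = std::vector<float>;
using mat = std::vector<vec>; // row-major: mat[row][col]

// Plain matrix-vector product.
inline vec matvec(const mat & m, const vec & x) {
    vec y(m.size(), 0.0f);
    for (std::size_t r = 0; r < m.size(); ++r) {
        assert(m[r].size() == x.size());
        for (std::size_t c = 0; c < x.size(); ++c) {
            y[r] += m[r][c] * x[c];
        }
    }
    return y;
}

// Hypothetical adapter: low-rank factors A (rank x in) and B (out x rank).
struct toy_lora {
    mat   a;
    mat   b;
    float scale;
};

// y = W x, plus scale * B (A x) when an adapter is supplied.
inline vec lora_mm(const mat & w, const vec & x, const toy_lora * lora = nullptr) {
    vec y = matvec(w, x);
    if (lora != nullptr) {
        const vec delta = matvec(lora->b, matvec(lora->a, x));
        assert(delta.size() == y.size());
        for (std::size_t i = 0; i < y.size(); ++i) {
            y[i] += lora->scale * delta[i];
        }
    }
    return y;
}
```

Passing no adapter leaves the base projection unchanged, which mirrors the "applies adapters when present" behavior described above.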
Entry points for modification
- Add a new layer primitive. Add a method to llm_graph_context in src/llama-graph.h/.cpp. Existing primitives are the right reference for how to expose backend-portable ops.
- Tweak a single architecture. Edit src/models/<arch>.cpp only; the shared primitives and dispatcher should rarely change.
- Add a new ggml op. Add it to ggml/include/ggml.h + ggml/src/ggml.c (CPU reference) + each backend that should support it. Add a test case to tests/test-backend-ops.cpp.
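When adding an op to more than one backend, the usual check is that each backend's output agrees with the CPU reference within a tolerance. The sketch below shows one way to measure that, using a normalized mean squared error. The helper name nmse and the choice of metric are assumptions for illustration and are not taken from the test file itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Normalized mean squared error between a reference output and a
// candidate output: sum((out - ref)^2) / sum(ref^2).
// Hypothetical helper; requires a reference that is not all zeros.
inline double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    assert(!ref.empty() && ref.size() == out.size());
    double err  = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(out[i]) - ref[i];
        err  += d * d;
        norm += static_cast<double>(ref[i]) * ref[i];
    }
    assert(norm > 0.0 && "reference must not be all zeros");
    return err / norm;
}
```

A test would then compare the returned value against a small threshold chosen per op, since lower-precision kernels need a looser bound than exact ones.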
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.