ggml-org/llama.cpp

Computation graph

Active contributors: Sigbjørn Skjæret, Georgi Gerganov

Every call to llama_decode builds a fresh ggml_cgraph for the given batch and hands it to ggml_backend_sched to be split and executed. llama_decode itself returns a status code; the resulting logits are then read through llama_get_logits. src/llama-graph.cpp is where that graph is constructed.

Purpose

Build a computation graph that evaluates the model for the tokens in a llama_batch.
Connect each architecture's per-model builder (src/models/<arch>.cpp) to shared layer primitives (RoPE, attention, MoE routing, layer norms).
Provide a uniform interface to the active memory backend (llama-kv-cache, llama-kv-cache-iswa, llama-memory-recurrent, llama-memory-hybrid).

Directory layout

src/
├── llama-graph.h         # llm_build_*, helpers for attention/RoPE/FFN/MoE/SSM
├── llama-graph.cpp       # ~101 KB; entry point + shared primitives
├── llama-context.cpp     # owns the graph build + scheduler + KV cache
└── models/<arch>.cpp     # per-arch builder using llama-graph helpers

Key abstractions

Type	Role
`llm_graph_input_*`	Per-batch input tensors (token ids, positions, mask, MoE expert mapping, ...)
`llm_graph_context`	Holds the current graph, the model's hparams/cparams, and the memory state
`llm_build_<arch>` (in `src/models/`)	Subclass that wires layers together
`llm_graph_result`	Outputs: logits, embeddings, hidden states

How it works

graph TD
    Decode[llama_decode]
    Ctx[llama_context::decode]
    PrepBatch[llama_batch_allocr]
    Mem[memory_state apply / commit]
    BuildGraph[llama-graph.cpp build_graph]
    PerArch[src/models/<arch>.cpp llm_build_<arch>]
    Sched[ggml_backend_sched_alloc_graph]
    Compute[ggml_backend_sched_graph_compute]
    Output[get logits / embeddings]

    Decode --> Ctx
    Ctx --> PrepBatch
    PrepBatch --> Mem
    Mem --> BuildGraph
    BuildGraph --> PerArch
    PerArch --> Sched
    Sched --> Compute
    Compute --> Output

Phases

Batch preparation. src/llama-batch.cpp validates and packs the user batch (llama_batch) into an internal representation with sequence ids and per-token positions.
Memory apply. The active memory backend (KV cache or recurrent state) computes which slots will be read/written and produces a "memory state" object the graph builds against. See KV cache and memory.
Graph build. llama-graph.cpp::build dispatches on model->arch and invokes the per-arch builder. The builder emits tensor ops using helpers like build_attn, build_ffn, build_moe_ffn, build_rope, build_rwkv6_time_mix, etc.; the shared helpers are defined in llama-graph.cpp.
Schedule. ggml_backend_sched_alloc_graph splits the graph across the registered backends, allocating per-tensor buffers.
Compute. ggml_backend_sched_graph_compute runs the kernels. Each backend implements its kernels in ggml/src/ggml-<backend>/.
Output. llama-context.cpp reads the logits/embeddings tensor out of the scheduler and exposes them via llama_get_logits / llama_get_embeddings.
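
The Schedule phase can be pictured with a small, self-contained sketch. It assumes each node of a topologically ordered graph has already been assigned a backend, and it groups the nodes into maximal contiguous runs that share a backend. The names toy_node, toy_split, and toy_split_graph are hypothetical and do not exist in ggml; the real scheduler also handles buffer allocation and copies between backends, which this sketch leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; not the ggml scheduler's real types.
struct toy_node {
    std::string op;       // op name, e.g. "mul_mat"
    int         backend;  // index of the backend assumed to run this node
};

struct toy_split {
    int         backend;  // backend that executes this split
    std::size_t i_start;  // first node index (inclusive)
    std::size_t i_end;    // last node index (exclusive)
};

// Group a topologically ordered node list into maximal contiguous runs
// that share a backend. Each run can be handed to one backend as a unit.
std::vector<toy_split> toy_split_graph(const std::vector<toy_node> & nodes) {
    std::vector<toy_split> splits;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (splits.empty() || splits.back().backend != nodes[i].backend) {
            splits.push_back({nodes[i].backend, i, i + 1});  // start a new run
        } else {
            splits.back().i_end = i + 1;                     // extend the current run
        }
    }
    return splits;
}
```

In this model, a graph whose nodes alternate between two backends yields one split per change of backend, which is why keeping consecutive layers on the same device reduces the number of hand-offs.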

Shared layer primitives

src/llama-graph.cpp exposes building blocks that almost every architecture uses:

Primitive	Purpose
`build_attn`	Multi-head attention with optional GQA, ALiBi, sliding window
`build_attn_qkv`	Combined QKV projection variants
`build_rope`	RoPE position encoding (vanilla, NeoX, GLM, frequency scaling)
`build_ffn`	Dense feed-forward (gate/up/down)
`build_moe_ffn`	Mixture-of-Experts routing + expert selection
`build_norm`	RMSNorm / LayerNorm wrapper
`build_lora_mm`	LoRA-aware matmul that applies adapters when present
`build_rwkv*_time_mix` / `build_mamba_layer`	Recurrent / SSM cores
`build_inp_*`	Helpers to materialize per-batch input tensors
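
As an illustration of the "routing + expert selection" step listed for build_moe_ffn, the following is a minimal sketch of top-k routing on plain float vectors. It assumes one common gating scheme: softmax over all router logits, pick the k largest, then renormalize the selected weights. Real models differ in gating function and normalization, and toy_route_top_k is a hypothetical name, not a llama.cpp function.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Returns (expert index, weight) pairs for the k highest-scoring experts.
// Assumption: softmax over all router logits, then top-k, then renormalize.
std::vector<std::pair<int, float>> toy_route_top_k(const std::vector<float> & logits, std::size_t k) {
    assert(k >= 1 && k <= logits.size());

    // Numerically stable softmax over all experts.
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (float & p : probs) p /= sum;

    // Indices of the k largest probabilities, highest first.
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                      [&](int a, int b) { return probs[a] > probs[b]; });

    // Renormalize the selected weights so they sum to 1.
    float sel_sum = 0.0f;
    for (std::size_t i = 0; i < k; ++i) sel_sum += probs[idx[i]];

    std::vector<std::pair<int, float>> out;
    for (std::size_t i = 0; i < k; ++i) out.emplace_back(idx[i], probs[idx[i]] / sel_sum);
    return out;
}
```

The token's output is then the weighted sum of the selected experts' feed-forward outputs, so only k experts are evaluated per token.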

A typical src/models/<arch>.cpp is roughly 200–800 lines long and consists of calls to these primitives in the right order.
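
The relationship between a dispatcher and the per-arch builders can be sketched as follows. This is a toy model only: the "graph" is an ordered list of op names, and toy_arch, toy_builder, toy_build_llama, toy_build_mamba, and toy_build_graph are hypothetical names chosen for illustration. The layer layouts (pre-norm attention plus FFN, and a norm plus SSM block) are assumptions, not a transcription of any real builder.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are the real llama.cpp types.
enum class toy_arch { llama, mamba };

// A "graph" here is just the ordered list of op names a builder emitted.
using toy_graph = std::vector<std::string>;

struct toy_builder {
    virtual ~toy_builder() = default;
    virtual toy_graph build(int n_layer) const = 0;
};

// Attention-style layer: norm -> attn -> residual add -> norm -> ffn -> residual add.
struct toy_build_llama : toy_builder {
    toy_graph build(int n_layer) const override {
        toy_graph g = {"inp_embd"};
        for (int il = 0; il < n_layer; ++il) {
            for (const char * op : {"norm", "attn", "add", "norm", "ffn", "add"}) g.push_back(op);
        }
        g.push_back("output");
        return g;
    }
};

// Recurrent-style layer: norm -> ssm -> residual add.
struct toy_build_mamba : toy_builder {
    toy_graph build(int n_layer) const override {
        toy_graph g = {"inp_embd"};
        for (int il = 0; il < n_layer; ++il) {
            for (const char * op : {"norm", "ssm", "add"}) g.push_back(op);
        }
        g.push_back("output");
        return g;
    }
};

// Dispatch on the architecture tag, then let the selected builder emit ops.
toy_graph toy_build_graph(toy_arch arch, int n_layer) {
    std::unique_ptr<toy_builder> b;
    switch (arch) {
        case toy_arch::llama: b = std::make_unique<toy_build_llama>(); break;
        case toy_arch::mamba: b = std::make_unique<toy_build_mamba>(); break;
    }
    if (!b) throw std::runtime_error("unknown architecture");
    return b->build(n_layer);
}
```

The point of the split is that adding an architecture means adding one builder and one dispatch case, while the shared primitives stay untouched.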

Output paths

A graph can produce one of three things, controlled by cparams.embeddings and similar flags:

Logits for the last token of each sequence — the standard generation case.
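All-token logits, when every token in the batch is marked for output through the per-token logits flags in llama_batch (older builds exposed this as --logits-all).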
Embeddings, either pooled (mean / cls / last) or per-token, when the context was created with embeddings=true. The pooling type is encoded in GGUF metadata for embedding models.
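
The three pooling modes named above reduce a matrix of per-token embeddings to one vector. A minimal sketch on plain vectors, assuming a single sequence whose rows are in token order, might look like this; toy_pooling and toy_pool are hypothetical names, and the real implementation pools per sequence id inside the graph.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class toy_pooling { mean, cls, last };

// embd holds one row per token; all rows are assumed to have equal length.
std::vector<float> toy_pool(const std::vector<std::vector<float>> & embd, toy_pooling type) {
    assert(!embd.empty());
    switch (type) {
        case toy_pooling::cls:  return embd.front();  // first token of the sequence
        case toy_pooling::last: return embd.back();   // last token of the sequence
        case toy_pooling::mean: break;                // handled below
    }
    // Mean pooling: element-wise average over all token rows.
    std::vector<float> out(embd.front().size(), 0.0f);
    for (const auto & row : embd) {
        for (std::size_t j = 0; j < out.size(); ++j) out[j] += row[j];
    }
    for (float & v : out) v /= static_cast<float>(embd.size());
    return out;
}
```

Per-token embeddings correspond to skipping this reduction and returning every row.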

Integration points

llama-context. Owns the graph build, the GGML scheduler, and the per-call buffers. Most callers only see llama_decode.
ggml-backend-sched. Decides which tensors live on which backend; tunable via --tensor-split, --main-gpu, --n-gpu-layers.
Memory backends. The graph reads from and writes to whichever llama_memory was selected at context init. See KV cache and memory.
Adapters. When LoRA adapters are loaded, build_lora_mm injects them. See Adapters.
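
What "LoRA-aware matmul" means can be shown with a small numeric sketch: the base product W x plus, for each loaded adapter, a scaled low-rank correction B (A x). The types and functions below (toy_mat, toy_lora, toy_matvec, toy_lora_mm) are hypothetical, operate on plain float vectors rather than ggml tensors, and assume the scale field already combines the user-chosen adapter scale with the adapter's alpha / rank factor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using toy_vec = std::vector<float>;
using toy_mat = std::vector<toy_vec>;  // row-major: mat[row][col]

// Plain matrix-vector product.
toy_vec toy_matvec(const toy_mat & m, const toy_vec & x) {
    toy_vec y(m.size(), 0.0f);
    for (std::size_t i = 0; i < m.size(); ++i) {
        assert(m[i].size() == x.size());
        for (std::size_t j = 0; j < x.size(); ++j) y[i] += m[i][j] * x[j];
    }
    return y;
}

struct toy_lora {
    toy_mat a;      // rank x n_in
    toy_mat b;      // n_out x rank
    float   scale;  // assumed to already combine the user scale and alpha / rank
};

// y = W x + sum over adapters of scale * B (A x).
toy_vec toy_lora_mm(const toy_mat & w, const toy_vec & x, const std::vector<toy_lora> & loras) {
    toy_vec y = toy_matvec(w, x);
    for (const toy_lora & l : loras) {
        const toy_vec delta = toy_matvec(l.b, toy_matvec(l.a, x));
        for (std::size_t i = 0; i < y.size(); ++i) y[i] += l.scale * delta[i];
    }
    return y;
}
```

With no adapters loaded the function reduces to the ordinary matmul, which is why builders can route every weight multiplication through one helper without special-casing the adapter-free path.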

Entry points for modification

Add a new layer primitive. Add a method to llm_graph_context in src/llama-graph.h/.cpp. Existing primitives are the right reference for how to expose backend-portable ops.
Tweak a single architecture. Edit src/models/<arch>.cpp only — the shared primitives and dispatcher should rarely change.
Add a new ggml op. Declare it in ggml/include/ggml.h, define it in ggml/src/ggml.c, implement the CPU reference in ggml/src/ggml-cpu/, and then implement it in each other backend that should support it. Add a test case to tests/test-backend-ops.cpp.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.