ggml-org/llama.cpp

Computation graph

Active contributors: Sigbjørn Skjæret, Georgi Gerganov

Every call to llama_decode builds a fresh ggml_cgraph for the given batch and hands it to ggml_backend_sched, which splits it across backends and executes it. The resulting logits are then read back through llama_get_logits. src/llama-graph.cpp is where that graph is constructed.

Purpose

  • Build a computation graph that evaluates the model for the tokens in a llama_batch.
  • Connect each architecture's per-model builder (src/models/<arch>.cpp) to shared layer primitives (RoPE, attention, MoE routing, layer norms).
  • Provide a uniform interface to the active memory backend (llama-kv-cache, llama-kv-cache-iswa, llama-memory-recurrent, llama-memory-hybrid).

Directory layout

src/
├── llama-graph.h         # llm_build_*, helpers for attention/RoPE/FFN/MoE/SSM
├── llama-graph.cpp       # ~101 KB; entry point + shared primitives
├── llama-context.cpp     # owns the graph build + scheduler + KV cache
└── models/<arch>.cpp     # per-arch builder using llama-graph helpers

Key abstractions

| Type | Role |
| --- | --- |
| llm_graph_input_* | Per-batch input tensors (token ids, positions, mask, MoE expert mapping, ...) |
| llm_graph_context | Holds the current graph, the model's hparams/cparams, and the memory state |
| llm_build_<arch> (in src/models/) | Subclass that wires layers together |
| llm_graph_result | Outputs: logits, embeddings, hidden states |
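The relationship between these types can be pictured with a small standalone sketch. It does not use llama.cpp or ggml: every `toy_*` name is hypothetical, and the "graph" is only a list of op names. It assumes, as the table suggests, that a per-architecture builder derives from the shared context and calls its helpers layer by layer.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for llm_graph_result: only the recorded op names.
struct toy_graph_result {
    std::vector<std::string> ops;
};

// Hypothetical stand-in for llm_graph_context: shared state plus shared helpers.
struct toy_graph_context {
    int              n_layer;
    toy_graph_result res;

    explicit toy_graph_context(int n_layer) : n_layer(n_layer) { assert(n_layer > 0); }

    void build_norm(int il) { res.ops.push_back("norm." + std::to_string(il)); }
    void build_attn(int il) { res.ops.push_back("attn." + std::to_string(il)); }
    void build_ffn (int il) { res.ops.push_back("ffn."  + std::to_string(il)); }
};

// Hypothetical per-architecture builder: the constructor wires the layers together.
struct toy_build_dense : toy_graph_context {
    explicit toy_build_dense(int n_layer) : toy_graph_context(n_layer) {
        for (int il = 0; il < n_layer; ++il) {
            build_norm(il); // pre-attention norm
            build_attn(il);
            build_norm(il); // pre-FFN norm
            build_ffn(il);
        }
    }
};
```

Constructing `toy_build_dense b(2)` leaves eight op names in `b.res.ops`, four per layer, in the order the builder emitted them.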

How it works

graph TD
    Decode[llama_decode]
    Ctx[llama_context::decode]
    PrepBatch[llama_batch_allocr]
    Mem[memory_state apply / commit]
    BuildGraph[llama-graph.cpp build_graph]
    PerArch[src/models/<arch>.cpp llm_build_<arch>]
    Sched[ggml_backend_sched_alloc_graph]
    Compute[ggml_backend_sched_graph_compute]
    Output[get logits / embeddings]

    Decode --> Ctx
    Ctx --> PrepBatch
    PrepBatch --> Mem
    Mem --> BuildGraph
    BuildGraph --> PerArch
    PerArch --> Sched
    Sched --> Compute
    Compute --> Output

Phases

  1. Batch preparation. src/llama-batch.cpp validates and packs the user batch (llama_batch) into an internal representation with sequence ids and per-token positions.
  2. Memory apply. The active memory backend (KV cache or recurrent state) computes which slots will be read/written and produces a "memory state" object the graph builds against. See KV cache and memory.
  3. Graph build. The build_graph step dispatches on model->arch and invokes the per-arch builder. The builder emits tensor ops using helpers such as build_attn, build_ffn, build_moe_ffn, build_rope, and build_rwkv6_time_mix, all defined in llama-graph.cpp.
  4. Schedule. ggml_backend_sched_alloc_graph splits the graph across the registered backends, allocating per-tensor buffers.
  5. Compute. ggml_backend_sched_graph_compute runs the kernels. Each backend implements its kernels in ggml/src/ggml-<backend>/.
  6. Output. llama-context.cpp reads the logits/embeddings tensor out of the scheduler and exposes them via llama_get_logits / llama_get_embeddings.
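The scheduling phase above can be pictured as grouping the ordered node list into runs that share a backend. The sketch below is a simplified illustration, not the ggml_backend_sched implementation: `toy_split` and `toy_split_graph` are hypothetical names, backend assignment is assumed to have happened already, and cross-backend tensor copies are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical split record: nodes [begin, end) all run on one backend.
struct toy_split {
    int         backend;
    std::size_t begin;
    std::size_t end;
};

// Group a topologically ordered node list into contiguous same-backend runs.
// node_backend[i] is the (assumed, already chosen) backend id of node i.
std::vector<toy_split> toy_split_graph(const std::vector<int> & node_backend) {
    std::vector<toy_split> splits;
    for (std::size_t i = 0; i < node_backend.size(); ++i) {
        assert(node_backend[i] >= 0);
        if (splits.empty() || splits.back().backend != node_backend[i]) {
            splits.push_back({node_backend[i], i, i + 1}); // start a new split
        } else {
            splits.back().end = i + 1;                     // extend the current split
        }
    }
    return splits;
}
```

For the assignment `{0, 0, 1, 1, 1, 0}` this yields three splits: nodes 0-1 on backend 0, nodes 2-4 on backend 1, and node 5 back on backend 0. Each boundary is a point where data would have to move between backends.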

Shared layer primitives

src/llama-graph.cpp exposes building blocks that almost every architecture uses:

| Primitive | Purpose |
| --- | --- |
| build_attn | Multi-head attention with optional GQA, ALiBi, sliding window |
| build_attn_qkv | Combined QKV projection variants |
| build_rope | RoPE position encoding (vanilla, NeoX, GLM, frequency scaling) |
| build_ffn | Dense feed-forward (gate/up/down) |
| build_moe_ffn | Mixture-of-Experts routing + expert selection |
| build_norm | RMSNorm / LayerNorm wrapper |
| build_lora_mm | LoRA-aware matmul that applies adapters when present |
| build_rwkv*_time_mix / build_mamba_layer | Recurrent / SSM cores |
| build_inp_* | Helpers to materialize per-batch input tensors |

A typical src/models/<arch>.cpp runs to roughly 200–800 lines and calls these primitives in the right order.
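To make one of these primitives concrete, here is a minimal numeric sketch of what an "RMSNorm / LayerNorm wrapper" computes on a single vector. It is a plain C++ illustration rather than the ggml graph code: `toy_norm_type` and `toy_build_norm` are hypothetical names, and the learned weight and bias that a real layer would apply afterwards are omitted.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

enum class toy_norm_type { rms, layer };

// Normalize one vector. RMSNorm divides by the root mean square;
// LayerNorm first subtracts the mean, then divides by the standard deviation.
std::vector<float> toy_build_norm(const std::vector<float> & x, toy_norm_type type, float eps) {
    assert(!x.empty());
    const float n = static_cast<float>(x.size());

    float mean = 0.0f; // stays 0 for RMSNorm, so no centering happens
    if (type == toy_norm_type::layer) {
        for (float v : x) mean += v;
        mean /= n;
    }

    float var = 0.0f; // mean of squared (optionally centered) values
    for (float v : x) var += (v - mean) * (v - mean);
    var /= n;

    const float scale = 1.0f / std::sqrt(var + eps);
    std::vector<float> y;
    y.reserve(x.size());
    for (float v : x) y.push_back((v - mean) * scale);
    return y;
}
```

With `eps = 0`, the input `{3, 4}` under RMSNorm is divided by `sqrt(12.5)`, and the input `{1, 3}` under LayerNorm becomes `{-1, 1}`.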

Output paths

A graph can produce one of three things, controlled by cparams.embeddings and similar flags:

  • Logits for the last token of each sequence — the standard generation case.
  • All-token logits when --logits-all is set.
  • Embeddings, either pooled (mean / cls / last) or per-token, when the context was created with embeddings=true. The pooling type is encoded in GGUF metadata for embedding models.
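The three pooling modes named above reduce a sequence's per-token embeddings to a single vector. The following standalone sketch shows the arithmetic only: `toy_pooling` and `toy_pool` are hypothetical names, and it assumes that cls pooling selects the first token of the sequence and last pooling selects the final one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class toy_pooling { mean, cls, last };

// tok[i] is the embedding of token i in one sequence; all rows share one width.
std::vector<float> toy_pool(const std::vector<std::vector<float>> & tok, toy_pooling type) {
    assert(!tok.empty());
    switch (type) {
        case toy_pooling::cls:  return tok.front(); // assumed: first token
        case toy_pooling::last: return tok.back();  // final token
        case toy_pooling::mean: break;              // handled below
    }
    std::vector<float> out(tok.front().size(), 0.0f);
    for (const auto & row : tok) {
        assert(row.size() == out.size());
        for (std::size_t j = 0; j < out.size(); ++j) out[j] += row[j];
    }
    for (float & v : out) v /= static_cast<float>(tok.size());
    return out;
}
```

For token embeddings `{1, 2}`, `{3, 4}`, `{5, 6}`, mean pooling gives `{3, 4}`, cls gives `{1, 2}`, and last gives `{5, 6}`.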

Integration points

  • llama-context. Owns the graph build, the GGML scheduler, and the per-call buffers. Most callers only see llama_decode.
  • ggml-backend-sched. Decides which tensors live on which backend; tunable via --tensor-split, --main-gpu, --n-gpu-layers.
  • Memory backends. The graph reads from and writes to whichever llama_memory was selected at context init. See KV cache and memory.
  • Adapters. When LoRA adapters are loaded, build_lora_mm injects them. See Adapters.
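The adapter hook can be understood through the usual LoRA formulation, y = W x + scale * B (A x), applied once per loaded adapter. The sketch below works on plain vectors and is not the build_lora_mm source: `toy_vec`, `toy_mat`, `toy_matvec`, `toy_lora`, and `toy_lora_mm` are hypothetical names, and how the scale is derived (for example from alpha and rank) is left as an assumption outside the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using toy_vec = std::vector<float>;
using toy_mat = std::vector<toy_vec>; // row-major: m[row][col]

toy_vec toy_matvec(const toy_mat & m, const toy_vec & x) {
    toy_vec y(m.size(), 0.0f);
    for (std::size_t r = 0; r < m.size(); ++r) {
        assert(m[r].size() == x.size());
        for (std::size_t c = 0; c < x.size(); ++c) y[r] += m[r][c] * x[c];
    }
    return y;
}

// Hypothetical low-rank adapter for one weight matrix.
struct toy_lora {
    toy_mat a;     // rank x n_in
    toy_mat b;     // n_out x rank
    float   scale; // assumed to be precomputed by the caller
};

// Base matmul plus one low-rank correction per adapter.
toy_vec toy_lora_mm(const toy_mat & w, const toy_vec & x, const std::vector<toy_lora> & adapters) {
    toy_vec y = toy_matvec(w, x);
    for (const toy_lora & lora : adapters) {
        const toy_vec delta = toy_matvec(lora.b, toy_matvec(lora.a, x));
        assert(delta.size() == y.size());
        for (std::size_t i = 0; i < y.size(); ++i) y[i] += lora.scale * delta[i];
    }
    return y;
}
```

With no adapters the function reduces to an ordinary matrix-vector product, which matches the "applies adapters when present" behaviour described for build_lora_mm.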

Entry points for modification

  • Add a new layer primitive. Add a method to llm_graph_context in src/llama-graph.h/.cpp. Existing primitives are the right reference for how to expose backend-portable ops.
  • Tweak a single architecture. Edit src/models/<arch>.cpp only — the shared primitives and dispatcher should rarely change.
  • Add a new ggml op. Add it to ggml/include/ggml.h + ggml/src/ggml.c (CPU reference) + each backend that should support it. Add a test case to tests/test-backend-ops.cpp.
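When a new op gains a backend implementation, its output is typically checked against the CPU reference within a tolerance. The sketch below shows one common way to do that comparison, a normalized mean squared error. It is an assumption for illustration: `toy_nmse` and `toy_op_matches` are hypothetical names, and the exact metric and thresholds used by tests/test-backend-ops.cpp are not stated in this page.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Normalized mean squared error: sum((out - ref)^2) / sum(ref^2).
double toy_nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    assert(ref.size() == out.size());
    double err = 0.0;
    double den = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = static_cast<double>(out[i]) - static_cast<double>(ref[i]);
        err += d * d;
        den += static_cast<double>(ref[i]) * static_cast<double>(ref[i]);
    }
    assert(den > 0.0); // an all-zero reference would need a separate check
    return err / den;
}

// True when the candidate backend output is close enough to the reference.
bool toy_op_matches(const std::vector<float> & ref, const std::vector<float> & out, double max_nmse) {
    return toy_nmse(ref, out) <= max_nmse;
}
```

A relative metric like this keeps the tolerance meaningful across ops whose outputs differ widely in magnitude, which is why it is a reasonable default for cross-backend comparisons.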

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
