ggml-org/llama.cpp

Architecture

llama.cpp is layered. At the bottom is ggml, a portable tensor library with one compute backend per accelerator family. On top of ggml sits libllama, which loads GGUF model files, builds the per-architecture computation graph, manages the KV cache, and exposes a stable C API in include/llama.h. On top of libllama sits common/, a collection of helpers (argument parsing, sampling presets, chat templates, downloading) that the in-tree command-line tools share. Each tool under tools/ is a separate binary built against libllama plus common.

graph TD
    subgraph Tools["tools/  (binaries)"]
        CLI[llama-cli]
        Server[llama-server]
        Quantize[llama-quantize]
        Bench[llama-bench]
        Mtmd[llama-mtmd-cli]
        Perp[llama-perplexity]
        Imatrix[llama-imatrix]
        Other[gguf-split, tokenize, tts, ...]
    end

    subgraph Common["common/  (helpers)"]
        Args[arg.cpp]
        Sampling[sampling.cpp]
        Chat[chat.cpp + jinja/]
        Download[download.cpp + hf-cache.cpp]
        Console[console.cpp + log.cpp]
    end

    subgraph Llama["src/  (libllama, public API in include/llama.h)"]
        ModelLoader[llama-model-loader]
        Model[llama-model]
        Vocab[llama-vocab]
        Context[llama-context]
        Graph[llama-graph]
        KV[llama-kv-cache + memory-*]
        Sampler[llama-sampler]
        Grammar[llama-grammar]
        Chat2[llama-chat]
    end

    subgraph GGML["ggml/  (libggml)"]
        GgmlCore[ggml.c / ggml-alloc / ggml-backend]
        Quants[ggml-quants.c]
        CPU[ggml-cpu/]
        CUDA[ggml-cuda/]
        Metal[ggml-metal/]
        Vulkan[ggml-vulkan/]
        Other2[sycl, opencl, hexagon, rpc, webgpu, zdnn, ...]
    end

    Tools --> Common
    Tools --> Llama
    Common --> Llama
    Llama --> GGML
    GgmlCore --> CPU
    GgmlCore --> CUDA
    GgmlCore --> Metal
    GgmlCore --> Vulkan
    GgmlCore --> Other2

Inference data flow

A typical generation request walks the stack like this:

sequenceDiagram
    participant U as User / tool
    participant API as libllama (llama.h)
    participant Loader as llama-model-loader
    participant Vocab as llama-vocab
    participant Ctx as llama-context
    participant Graph as llama-graph
    participant KV as llama-kv-cache
    participant Smp as llama-sampler
    participant GGML as ggml backend

    U->>API: llama_model_load_from_file(path)
    API->>Loader: read GGUF header + metadata
    Loader->>Vocab: build tokenizer
    Loader-->>API: llama_model
    U->>API: llama_init_from_model(params)
    API->>Ctx: allocate context, KV cache, scheduler
    U->>API: llama_tokenize(text)
    API->>Vocab: BPE / SPM / WPM / UGM
    U->>API: llama_decode(batch)
    API->>Graph: build_graph(arch, batch)
    Graph->>KV: read / write KV slots
    Graph->>GGML: ggml_backend_sched_graph_compute(sched, graph)
    GGML-->>API: logits
    U->>API: llama_sampler_sample(smpl, ctx, idx)
    API->>Smp: chain (top-k, top-p, temp, grammar, penalties)
    Smp-->>U: next token

The llama_decode call accounts for most of the runtime cost. It builds a ggml computation graph for each batch (the graph differs per architecture; see the per-model files under src/models/), splits it across the registered backends with ggml_backend_sched, runs the kernels, and returns logits or embeddings.
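The final step of the diagram, the sampler chain, can be sketched in isolation. The following is a minimal, self-contained illustration of chaining sampling stages over a logits vector; it is not libllama's implementation, and every name in it (Candidate, Stage, top_k, temperature, softmax, run_chain) is hypothetical. It covers only top-k and temperature and picks greedily at the end.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical types for illustration; none of this is part of llama.h.
struct Candidate {
    int   token;  // token id
    float logit;  // raw score produced by the model
};

// A stage edits the candidate list in place.
using Stage = std::function<void(std::vector<Candidate> &)>;

// Keep only the k highest-scoring candidates, sorted by descending logit.
Stage top_k(std::size_t k) {
    return [k](std::vector<Candidate> & c) {
        assert(k > 0);
        const std::size_t n = std::min(k, c.size());
        std::partial_sort(c.begin(), c.begin() + static_cast<std::ptrdiff_t>(n), c.end(),
            [](const Candidate & a, const Candidate & b) { return a.logit > b.logit; });
        c.resize(n);
    };
}

// Divide logits by a temperature; values below 1 sharpen the distribution.
Stage temperature(float t) {
    return [t](std::vector<Candidate> & c) {
        assert(t > 0.0f);
        for (auto & x : c) x.logit /= t;
    };
}

// Convert logits to probabilities with a numerically stable softmax.
std::vector<float> softmax(const std::vector<Candidate> & c) {
    assert(!c.empty());
    float max_logit = c[0].logit;
    for (const auto & x : c) max_logit = std::max(max_logit, x.logit);
    std::vector<float> p(c.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < c.size(); ++i) {
        p[i] = std::exp(c[i].logit - max_logit);
        sum += p[i];
    }
    for (auto & v : p) v /= sum;
    return p;
}

// Run every stage in order, then pick the most likely remaining token.
int run_chain(std::vector<Candidate> cands, const std::vector<Stage> & chain) {
    assert(!cands.empty());
    for (const auto & stage : chain) stage(cands);
    const std::vector<float> p = softmax(cands);
    const auto best = static_cast<std::size_t>(std::max_element(p.begin(), p.end()) - p.begin());
    return cands[best].token;
}
```

Calling run_chain(candidates, {top_k(40), temperature(0.8f)}) applies the stages left to right. Modeling each stage as a function over a shared candidate list is what lets stages be reordered or dropped without touching the others.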

Backends

GGML backends are loaded at runtime through ggml-backend-reg.cpp. Each backend implements a small interface (allocator, buffer type, device list, kernel dispatch) declared in ggml/src/ggml-backend-impl.h. Some backends (CUDA, Metal, Vulkan) ship as dynamically loaded plugins through ggml-backend-dl.cpp so the same libllama build can pick the right accelerator at runtime.
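To make the registration idea concrete, here is a minimal sketch of a runtime backend registry. It is a toy under stated assumptions, not the code in ggml-backend-reg.cpp: the Backend interface, the Registry class, and CpuBackend are hypothetical names, and the interface is reduced to a name, a device list, and one stand-in kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface for illustration; the real one is declared in
// ggml/src/ggml-backend-impl.h and looks different.
struct Backend {
    virtual ~Backend() = default;
    virtual std::string name() const = 0;
    virtual std::vector<std::string> devices() const = 0;
    // A single stand-in "kernel": element-wise addition.
    virtual std::vector<float> add(const std::vector<float> & a,
                                   const std::vector<float> & b) const = 0;
};

using Factory = std::function<std::unique_ptr<Backend>()>;

// Maps backend names to factories so that backends can be added at runtime.
class Registry {
public:
    void register_backend(const std::string & name, Factory make) {
        assert(static_cast<bool>(make));
        factories_[name] = std::move(make);
    }

    // Returns nullptr when no backend with that name was registered.
    std::unique_ptr<Backend> create(const std::string & name) const {
        auto it = factories_.find(name);
        if (it == factories_.end()) return nullptr;
        return it->second();
    }

    std::vector<std::string> names() const {
        std::vector<std::string> out;
        for (const auto & kv : factories_) out.push_back(kv.first);
        return out;
    }

private:
    std::map<std::string, Factory> factories_;
};

// Trivial CPU implementation of the hypothetical interface.
struct CpuBackend : Backend {
    std::string name() const override { return "CPU"; }
    std::vector<std::string> devices() const override { return {"cpu0"}; }
    std::vector<float> add(const std::vector<float> & a,
                           const std::vector<float> & b) const override {
        assert(a.size() == b.size());
        std::vector<float> out(a.size());
        for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
        return out;
    }
};
```

In this sketch a dynamically loaded plugin would call register_backend once after it is loaded. Callers ask the registry for a backend by name and never link against a specific accelerator.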

The scheduler in ggml/src/ggml-backend.cpp (ggml_backend_sched) decides which backend runs each part of the graph. CPU+GPU hybrid inference relies on it: ggml_backend_sched moves tensors between CPU and GPU buffers as needed, so a model larger than VRAM can keep part of its weights in system RAM.
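The splitting idea can be illustrated with a toy model. This sketch is far simpler than the scheduler in ggml-backend.cpp and all of its names (Device, Node, Split, build_graph, make_splits, count_copies) are hypothetical. It assumes, purely for illustration, that the last n_gpu layers are placed on the GPU and the rest stay on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified types; the real scheduler lives in ggml-backend.cpp.
enum class Device { CPU, GPU };

struct Node {
    std::string name;    // operation label
    Device      device;  // where this node is assumed to run
};

struct Split {
    Device      device;  // backend that runs this run of nodes
    std::size_t begin;   // index of the first node in the split
    std::size_t end;     // one past the last node in the split
};

// Build a toy graph of n_layers nodes where the last n_gpu layers run on the
// GPU and the rest stay on the CPU (an assumed placement policy for this sketch).
std::vector<Node> build_graph(std::size_t n_layers, std::size_t n_gpu) {
    assert(n_gpu <= n_layers);
    std::vector<Node> graph;
    for (std::size_t i = 0; i < n_layers; ++i) {
        const Device d = (i >= n_layers - n_gpu) ? Device::GPU : Device::CPU;
        graph.push_back({"layer_" + std::to_string(i), d});
    }
    return graph;
}

// Group consecutive nodes assigned to the same device into one split.
std::vector<Split> make_splits(const std::vector<Node> & graph) {
    std::vector<Split> splits;
    for (std::size_t i = 0; i < graph.size(); ++i) {
        if (splits.empty() || splits.back().device != graph[i].device) {
            splits.push_back({graph[i].device, i, i + 1});
        } else {
            splits.back().end = i + 1;
        }
    }
    return splits;
}

// Each boundary between adjacent splits implies one cross-device copy of
// the activations flowing through the graph.
std::size_t count_copies(const std::vector<Split> & splits) {
    return splits.empty() ? 0 : splits.size() - 1;
}
```

With 4 layers and 2 of them on the GPU, the toy graph becomes one CPU split followed by one GPU split, with a single copy between them. Fewer, longer splits mean fewer transfers, which is why the placement of layers matters for hybrid inference.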

See Backends for per-backend details.

Process layout

llama.cpp is intentionally single-process. There is no daemon, no IPC layer, and no shared scheduler. Each tool binary is self-contained: it loads a model, holds it in memory, and exits when it finishes. The HTTP server (tools/server) is the one long-running exception. It still runs as a single process that owns one in-process model, and it serves multiple HTTP clients through a task queue (tools/server/server-queue.cpp) and a per-slot server_context (tools/server/server-context.cpp).
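The queue-plus-slots pattern can be sketched without any networking. The following toy is single-threaded and is not the server's actual code: Task, Slot, and SlotScheduler are hypothetical names, and "generating a token" is reduced to decrementing a counter.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical request: an id plus the number of tokens still to generate.
struct Task {
    int id;
    int tokens_left;
};

// Hypothetical slot: holds at most one in-flight task.
struct Slot {
    bool busy = false;
    Task task{0, 0};
};

class SlotScheduler {
public:
    explicit SlotScheduler(std::size_t n_slots) : slots_(n_slots) {
        assert(n_slots > 0);
    }

    // Enqueue a request; it waits until a slot is free.
    void post(Task t) {
        assert(t.tokens_left > 0);
        pending_.push_back(t);
    }

    // One loop iteration: fill free slots from the queue, then advance every
    // busy slot by one token. Returns the ids of tasks that finished.
    std::vector<int> step() {
        for (auto & s : slots_) {
            if (!s.busy && !pending_.empty()) {
                s.task = pending_.front();
                pending_.pop_front();
                s.busy = true;
            }
        }
        std::vector<int> done;
        for (auto & s : slots_) {
            if (s.busy && --s.task.tokens_left == 0) {
                s.busy = false;
                done.push_back(s.task.id);
            }
        }
        return done;
    }

    // True when nothing is queued and no slot is busy.
    bool idle() const {
        if (!pending_.empty()) return false;
        for (const auto & s : slots_) {
            if (s.busy) return false;
        }
        return true;
    }

private:
    std::vector<Slot> slots_;
    std::deque<Task>  pending_;
};
```

With two slots and three queued tasks, the third task waits until one of the first two finishes. The number of slots bounds how many requests share the single in-process model at once.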

File and directory map

| Top-level path | Purpose |
| --- | --- |
| include/llama.h | Public C API |
| include/llama-cpp.h | C++ smart-pointer convenience header |
| src/ | libllama implementation |
| src/models/ | Per-architecture graph builders (LLaMA, Gemma, Qwen, Phi, Mamba, RWKV, ...) |
| ggml/include/, ggml/src/ | libggml and its backends |
| common/ | Shared helpers used by every binary tool |
| tools/ | Standalone command-line programs |
| examples/ | Smaller demos and platform integrations (Android, Swift, vim plugin) |
| tests/ | C++ unit and integration tests |
| convert_*.py, gguf-py/ | Python tooling for GGUF conversion |
| docs/ | Markdown user docs (build, backends, multimodal, function calling, ops) |
| vendor/ | Third-party single-header libraries |
| cmake/, CMakeLists.txt, CMakePresets.json | CMake build system |
| .github/workflows/ | CI definitions (lint, build, server, release) |
| ci/ | Long-form CI scripts run on ggml-ci self-hosted runners |
| models/templates/ | Reference Jinja chat templates |
| grammars/ | Sample GBNF grammars used by the constrained sampler |

For statistics on size and churn, see By the numbers.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
