ggml-org/llama.cpp

Architecture

llama.cpp is layered. At the bottom is ggml, a portable tensor library with one compute backend per accelerator family. On top of ggml sits libllama, which loads GGUF model files, builds the per-architecture computation graph, manages the KV cache, and exposes a stable C API in include/llama.h. On top of libllama sits common/, a collection of helpers (argument parsing, sampling presets, chat templates, downloading) that the in-tree command-line tools share. Each tool under tools/ is a separate binary built against libllama plus common.

graph TD
    subgraph Tools["tools/  (binaries)"]
        CLI[llama-cli]
        Server[llama-server]
        Quantize[llama-quantize]
        Bench[llama-bench]
        Mtmd[llama-mtmd-cli]
        Perp[llama-perplexity]
        Imatrix[llama-imatrix]
        Other[gguf-split, tokenize, tts, ...]
    end

    subgraph Common["common/  (helpers)"]
        Args[arg.cpp]
        Sampling[sampling.cpp]
        Chat[chat.cpp + jinja/]
        Download[download.cpp + hf-cache.cpp]
        Console[console.cpp + log.cpp]
    end

    subgraph Llama["src/  (libllama, public API in include/llama.h)"]
        ModelLoader[llama-model-loader]
        Model[llama-model]
        Vocab[llama-vocab]
        Context[llama-context]
        Graph[llama-graph]
        KV[llama-kv-cache + memory-*]
        Sampler[llama-sampler]
        Grammar[llama-grammar]
        Chat2[llama-chat]
    end

    subgraph GGML["ggml/  (libggml)"]
        GgmlCore[ggml.c / ggml-alloc / ggml-backend]
        Quants[ggml-quants.c]
        CPU[ggml-cpu/]
        CUDA[ggml-cuda/]
        Metal[ggml-metal/]
        Vulkan[ggml-vulkan/]
        Other2[sycl, opencl, hexagon, rpc, webgpu, zdnn, ...]
    end

    Tools --> Common
    Tools --> Llama
    Common --> Llama
    Llama --> GGML
    GgmlCore --> CPU
    GgmlCore --> CUDA
    GgmlCore --> Metal
    GgmlCore --> Vulkan
    GgmlCore --> Other2

Inference data flow

A typical generation request walks the stack like this:

sequenceDiagram
    participant U as User / tool
    participant API as libllama (llama.h)
    participant Loader as llama-model-loader
    participant Vocab as llama-vocab
    participant Ctx as llama-context
    participant Graph as llama-graph
    participant KV as llama-kv-cache
    participant Smp as llama-sampler
    participant GGML as ggml backend

    U->>API: llama_model_load_from_file(path)
    API->>Loader: read GGUF header + metadata
    Loader->>Vocab: build tokenizer
    Loader-->>API: llama_model
    U->>API: llama_init_from_model(params)
    API->>Ctx: allocate context, KV cache, scheduler
    U->>API: llama_tokenize(text)
    API->>Vocab: BPE / SPM / WPM / UGM
    U->>API: llama_decode(batch)
    API->>Graph: build_graph(arch, batch)
    Graph->>KV: read / write KV slots
    Graph->>GGML: ggml_backend_sched_graph_compute(sched, graph)
    GGML-->>API: logits
    U->>API: llama_sampler_sample(smpl, ctx, idx)
    API->>Smp: chain (top-k, top-p, temp, grammar, penalties)
    Smp-->>U: next token

The llama_decode call is where most of the cost lives. It builds a per-batch ggml computation graph (the graph differs per architecture; see the per-model files under src/models/), splits it across the registered backends with ggml_backend_sched, and runs the kernels. The resulting logits or embeddings are then read back from the context (for example with llama_get_logits_ith).
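
The last step of the flow, the sampler chain, can be illustrated with a small standalone sketch. Everything here (`Candidate`, `SamplerStage`, `top_k`, `temperature`, `sample_greedy`) is a hypothetical name invented for illustration and is not the libllama API; it only shows the idea of passing one candidate list through an ordered series of stages before a token is chosen.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical types for illustration; these are NOT the libllama structs.
struct Candidate {
    int   token;  // token id
    float logit;  // raw score produced by the model
};

using Candidates   = std::vector<Candidate>;
using SamplerStage = std::function<void(Candidates &)>;

// Stage: keep only the k highest-logit candidates, sorted best first.
// k == 0 (or k >= size) leaves the list untouched.
inline SamplerStage top_k(std::size_t k) {
    return [k](Candidates & c) {
        if (k == 0 || k >= c.size()) return;
        std::partial_sort(c.begin(), c.begin() + static_cast<std::ptrdiff_t>(k), c.end(),
            [](const Candidate & a, const Candidate & b) { return a.logit > b.logit; });
        c.resize(k);
    };
}

// Stage: divide every logit by the temperature t (assumed to be > 0).
inline SamplerStage temperature(float t) {
    return [t](Candidates & c) {
        for (auto & x : c) x.logit /= t;
    };
}

// Run the stages in order, then pick the highest-logit survivor (greedy).
// Returns -1 if no candidates remain.
inline int sample_greedy(Candidates c, const std::vector<SamplerStage> & chain) {
    for (const auto & stage : chain) stage(c);
    auto best = std::max_element(c.begin(), c.end(),
        [](const Candidate & a, const Candidate & b) { return a.logit < b.logit; });
    return best == c.end() ? -1 : best->token;
}
```

Calling `sample_greedy(candidates, {top_k(40), temperature(0.8f)})` applies the two stages in order and returns one token id. The final pick is greedy here to keep the sketch deterministic; a real chain would typically end in a probabilistic draw.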

Backends

GGML backends are registered at runtime through ggml-backend-reg.cpp. Each backend implements a small interface (allocator, buffer type, device list, kernel dispatch) declared in ggml/src/ggml-backend-impl.h. Some backends (CUDA, Metal, Vulkan) can ship as dynamically loaded plugins, handled by ggml-backend-dl.cpp, so the same libllama build can pick the right accelerator at runtime.
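
As a rough mental model of runtime registration, the sketch below keeps a list of backends and picks the first one that reports a usable device. `ToyBackend` and `ToyRegistry` are hypothetical names made up for this example; the real interface in ggml/src/ggml-backend-impl.h is considerably larger and is not reproduced here.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified stand-in for a backend descriptor.
struct ToyBackend {
    std::string name;          // e.g. "CPU" or "CUDA"
    std::size_t device_count;  // devices the backend found at startup
};

class ToyRegistry {
public:
    // Called once per backend, whether compiled in or loaded as a plugin.
    // Note: pointers returned earlier are invalidated by later registrations.
    void register_backend(ToyBackend b) { backends_.push_back(std::move(b)); }

    // First registered backend with at least one device, or nullptr.
    const ToyBackend * first_usable() const {
        for (const auto & b : backends_) {
            if (b.device_count > 0) return &b;
        }
        return nullptr;
    }

    // Look a backend up by name, or return nullptr if it was never registered.
    const ToyBackend * find(const std::string & name) const {
        for (const auto & b : backends_) {
            if (b.name == name) return &b;
        }
        return nullptr;
    }

    std::size_t size() const { return backends_.size(); }

private:
    std::vector<ToyBackend> backends_;
};
```

Registration order doubles as priority in this sketch: registering accelerators before the CPU means the CPU is chosen only when no accelerator reports a device. That ordering rule is an assumption of the example, not a statement about ggml.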

The scheduler (ggml_backend_sched) in ggml/src/ggml-backend.cpp decides which tensors live on which backend. CPU+GPU hybrid inference is implemented by letting it move tensors between CPU and GPU buffers, which allows models larger than VRAM to spill to RAM.
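
The spill-to-RAM idea can be illustrated with a greedy placement under a memory budget. This is a deliberately simplified sketch and not how ggml_backend_sched works internally: it assumes one size per layer and a single GPU, and `Device` and `place_layers` are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

enum class Device { GPU, CPU };

// Greedy prefix placement: put layers on the GPU in order while they fit in
// the VRAM budget; from the first layer that does not fit, everything goes to
// the CPU. Sizes and budget are in bytes.
inline std::vector<Device> place_layers(const std::vector<std::uint64_t> & layer_bytes,
                                        std::uint64_t vram_budget) {
    std::vector<Device> placement;
    placement.reserve(layer_bytes.size());

    std::uint64_t used    = 0;      // invariant: used <= vram_budget
    bool          spilled = false;  // set once a layer fails to fit

    for (std::uint64_t bytes : layer_bytes) {
        if (!spilled && bytes <= vram_budget - used) {
            used += bytes;
            placement.push_back(Device::GPU);
        } else {
            spilled = true;
            placement.push_back(Device::CPU);
        }
    }
    return placement;
}
```

With four layers of 4 bytes each and a 10-byte budget, the first two layers land on the GPU and the last two on the CPU. Keeping the split contiguous is a design choice of the sketch: it limits the toy model to a single CPU/GPU boundary.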

See Backends for per-backend details.

Process layout

llama.cpp is intentionally single-process. There is no daemon, no IPC layer, and no shared scheduler. Each tool binary is self-contained: it loads a model, holds it in memory, and exits when it finishes. The HTTP server (tools/server) is the one long-running exception: it stays resident, owns a single in-process model, and serves multiple HTTP clients through a task queue (tools/server/server-queue.cpp) and a slot-based server_context (tools/server/server-context.cpp).
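
The queue-plus-slots arrangement can be pictured with a single-threaded toy model. `ToyServer` and its methods are hypothetical names for illustration only; the actual server code is multi-threaded and tracks much more per-slot state than a task id.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical toy model: a FIFO of pending task ids and a fixed set of slots.
class ToyServer {
public:
    static constexpr int kFree = -1;  // marker for an idle slot

    explicit ToyServer(std::size_t n_slots) : slots_(n_slots, kFree) {}

    // An HTTP handler would call this to enqueue a request.
    void post(int task_id) { queue_.push_back(task_id); }

    // Move queued tasks into idle slots in FIFO order.
    // Returns how many tasks were started.
    std::size_t schedule() {
        std::size_t started = 0;
        for (int & slot : slots_) {
            if (queue_.empty()) break;
            if (slot == kFree) {
                slot = queue_.front();
                queue_.pop_front();
                ++started;
            }
        }
        return started;
    }

    // Mark the task in a slot as finished, making the slot idle again.
    void finish(std::size_t slot_index) { slots_.at(slot_index) = kFree; }

    std::size_t pending() const { return queue_.size(); }
    int task_in_slot(std::size_t slot_index) const { return slots_.at(slot_index); }

private:
    std::deque<int>  queue_;
    std::vector<int> slots_;
};
```

With two slots and three posted tasks, one call to `schedule()` starts the first two tasks and leaves the third queued until `finish()` frees a slot. Requests beyond the slot count wait in the queue rather than being rejected, which is the behaviour the sketch is meant to show.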

File and directory map

Top-level path	Purpose
`include/llama.h`	Public C API
`include/llama-cpp.h`	C++ smart-pointer convenience header
`src/`	`libllama` implementation
`src/models/`	Per-architecture graph builders (LLaMA, Gemma, Qwen, Phi, Mamba, RWKV, ...)
`ggml/include/`, `ggml/src/`	`libggml` and its backends
`common/`	Shared helpers used by every binary tool
`tools/`	Standalone command-line programs
`examples/`	Smaller demos and platform integrations (Android, Swift, vim plugin)
`tests/`	C++ unit and integration tests
`convert_*.py`, `gguf-py/`	Python tooling for GGUF conversion
`docs/`	Markdown user docs (build, backends, multimodal, function calling, ops)
`vendor/`	Third-party single-header libraries
`cmake/`, `CMakeLists.txt`, `CMakePresets.json`	CMake build system
`.github/workflows/`	CI definitions (lint, build, server, release)
`ci/`	Long-form CI scripts run on `ggml-ci` self-hosted runners
`models/templates/`	Reference Jinja chat templates
`grammars/`	Sample GBNF grammars used by the constrained sampler

For statistics on size and churn, see By the numbers.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.