ggml-org/llama.cpp

Systems

The libllama library is built from src/. Logically, it is organized into a handful of internal subsystems that together turn a GGUF file plus a sequence of tokens into logits, samples, and chat-formatted text. Each subsystem is roughly one or two .cpp files plus a matching header.

This wiki uses the systems lens because llama.cpp's main library is best understood as architectural building blocks, not as a workspace of packages or independent applications. The deployable units of the project (the binaries) are documented under Tools.

Map

graph LR
    subgraph Loading
        Loader[llama-model-loader]
        Mmap[llama-mmap]
        Adapter[llama-adapter]
        Saver[llama-model-saver]
    end

    subgraph Model
        Arch[llama-arch]
        Hparams[llama-hparams]
        Cparams[llama-cparams]
        Model2[llama-model]
        Vocab[llama-vocab]
        Models[src/models/*]
    end

    subgraph Runtime
        Context[llama-context]
        Batch[llama-batch]
        Graph[llama-graph]
        IO[llama-io]
    end

    subgraph MemoryBackends[Memory]
        Memory[llama-memory]
        KV[llama-kv-cache]
        ISWA[llama-kv-cache-iswa]
        Recurrent[llama-memory-recurrent]
        Hybrid[llama-memory-hybrid + iswa]
    end

    subgraph Generation
        Sampler[llama-sampler]
        Grammar[llama-grammar]
        Chat[llama-chat]
    end

    Loader --> Model2
    Model2 --> Models
    Model2 --> Context
    Context --> Batch
    Context --> Graph
    Graph --> Memory
    Memory --> KV
    Memory --> ISWA
    Memory --> Recurrent
    Memory --> Hybrid
    Context --> Sampler
    Sampler --> Grammar
    Context --> Chat

Pages in this section

  • Library entry point — what src/llama.cpp actually contains and how the pieces are wired.
  • Model loader — reading GGUF files and constructing llama_model.
  • Architecture switch — how LLM_ARCH_* enum values map to per-model graph builders under src/models/.
  • Vocab and tokenizer — the BPE/SPM/WPM/UGM/RWKV implementations.
  • Computation graph — how llama-graph.cpp builds per-batch tensor graphs.
  • KV cache and memory — the standard, sliding-window, recurrent, and hybrid memory backends.
  • Sampler — the chained sampler API and built-in samplers.
  • Grammar — GBNF parser and llguidance integration.
  • Chat templates — llama-chat (built-in) plus common/chat.cpp (Jinja-based).
  • Quantization — the llama-quantize path and supported quant types.
  • Adapters — LoRA and control-vector loading.
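
The architecture switch listed above can be pictured as a lookup from an enum value to a builder function. The following is a minimal, self-contained sketch of that idea only. The names `toy_arch`, `toy_graph`, `build_llama_like`, `build_recurrent_like`, and `build_graph_for` are hypothetical and invented for illustration; they are not llama.cpp's actual types or functions, and the real builders produce ggml graphs rather than lists of strings.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an LLM_ARCH_*-style enum.
enum class toy_arch { llama_like, recurrent_like, unknown };

// Hypothetical stand-in for a compute graph: just a list of op names.
struct toy_graph {
    std::vector<std::string> ops;
};

// One builder per architecture, analogous in spirit to one file per model.
static toy_graph build_llama_like() {
    return toy_graph{{"embed", "attention", "ffn", "output"}};
}

static toy_graph build_recurrent_like() {
    return toy_graph{{"embed", "recurrent_state_update", "output"}};
}

// The "switch": pick a builder from the enum value, reject anything else.
static toy_graph build_graph_for(toy_arch arch) {
    switch (arch) {
        case toy_arch::llama_like:     return build_llama_like();
        case toy_arch::recurrent_like: return build_recurrent_like();
        default:
            throw std::invalid_argument("unsupported architecture");
    }
}
```

Calling `build_graph_for(toy_arch::llama_like)` returns the four-op toy graph, while an unmapped value throws. Under this design, adding an architecture would mean adding one enum value, one builder, and one `case`.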

Key abstractions

| Type | Role | File |
|---|---|---|
| llama_model | Loaded weights + vocab + hparams | src/llama-model.h |
| llama_vocab | Tokenizer state | src/llama-vocab.h |
| llama_context | Per-conversation runtime state | src/llama-context.h |
| llama_batch | Tokens to decode in one call | include/llama.h, src/llama-batch.cpp |
| llama_kv_cache | KV slots for attention | src/llama-kv-cache.h |
| llama_sampler | A single sampling step | src/llama-sampler.cpp (chain in include/llama.h) |
| llama_grammar | Constrained-decode rule set | src/llama-grammar.h |
| llama_adapter_lora | LoRA adapter | src/llama-adapter.h |
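
To make the "single sampling step" and "chain" roles in the table concrete, here is a toy sketch of a sampler chain: each step transforms a candidate list, and the chain applies the steps in order before picking a token. All names (`toy_candidate`, `toy_sampler`, `toy_top_k`, `toy_temperature`, `toy_chain_sample`) are hypothetical, and the final greedy pick is an assumption made to keep the example deterministic. It does not reproduce the library's sampler API.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical candidate: a token id with its logit.
struct toy_candidate {
    int   id;
    float logit;
};

// A single sampling step: transforms the candidate list in place.
using toy_sampler = std::function<void(std::vector<toy_candidate> &)>;

// Keep only the k highest-logit candidates.
static toy_sampler toy_top_k(std::size_t k) {
    return [k](std::vector<toy_candidate> & cands) {
        std::sort(cands.begin(), cands.end(),
                  [](const toy_candidate & a, const toy_candidate & b) {
                      return a.logit > b.logit;
                  });
        if (cands.size() > k) {
            cands.resize(k);
        }
    };
}

// Divide every logit by a temperature (assumed > 0).
static toy_sampler toy_temperature(float t) {
    return [t](std::vector<toy_candidate> & cands) {
        for (auto & c : cands) {
            c.logit /= t;
        }
    };
}

// Run each step in order, then pick the highest-logit survivor (greedy).
// Returns -1 when no candidates remain.
static int toy_chain_sample(const std::vector<toy_sampler> & chain,
                            std::vector<toy_candidate> cands) {
    for (const auto & step : chain) {
        step(cands);
    }
    auto best = std::max_element(cands.begin(), cands.end(),
        [](const toy_candidate & a, const toy_candidate & b) {
            return a.logit < b.logit;
        });
    return best == cands.end() ? -1 : best->id;
}
```

A chain such as `{toy_top_k(2), toy_temperature(0.5f)}` first narrows the candidates and then rescales the logits of the survivors. Keeping each step as an independent transform is what lets steps be reordered or swapped without touching the others.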

For per-tool usage of these subsystems, see Tools. For the public C API surface that exposes them, see API.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
