ggml-org/llama.cpp

ggml

Active contributors: Georgi Gerganov

ggml is the tensor library that llama.cpp is built on. It lives under ggml/ in this repo and is also published as a standalone project at ggml-org/ggml. The two are kept in sync via scripts/sync-ggml*.

README.md describes the relationship: "The llama.cpp project is the main playground for developing new features for the ggml library."

Purpose

  • Provide a portable tensor representation (ggml_tensor).
  • Build computation graphs (ggml_cgraph).
  • Run those graphs on a pluggable set of backends (the CPU backend plus accelerator backends).
  • Carry the GGUF format reader/writer.

Directory layout

ggml/
├── CMakeLists.txt
├── include/                      # Public headers
│   ├── ggml.h                    # Core types, ops, helpers
│   ├── ggml-alloc.h              # Tensor allocators
│   ├── ggml-backend.h            # Backend & scheduler API
│   ├── ggml-cpu.h                # CPU-specific helpers (threadpool, cpuinfo)
│   ├── ggml-opt.h                # Optimizer (used by examples/training)
│   ├── ggml-rpc.h, ggml-cuda.h, ggml-metal.h, ggml-vulkan.h, ggml-sycl.h, ...
│   └── gguf.h                    # GGUF reader/writer
├── src/
│   ├── ggml.c                    # Core tensor, op, and graph construction (~248 KB)
│   ├── ggml.cpp                  # Thin C++ wrappers
│   ├── ggml-alloc.c              # Memory allocators (~48 KB)
│   ├── ggml-backend.cpp          # Scheduler + cross-backend buffer copy (~93 KB)
│   ├── ggml-backend-reg.cpp      # Backend registry (built-in + dynamic)
│   ├── ggml-backend-dl.cpp       # Dynamic plugin loader
│   ├── ggml-backend-impl.h       # Interface every backend implements
│   ├── ggml-backend-meta.cpp     # Per-tensor metadata
│   ├── ggml-quants.c             # Reference per-quant-type kernels (~222 KB)
│   ├── ggml-common.h             # Block layouts shared with kernels (~135 KB)
│   ├── ggml-impl.h               # Internal types
│   ├── ggml-opt.cpp              # Adam, SGD, gradient bookkeeping
│   ├── ggml-threading.{cpp,h}    # Thread sync primitives
│   ├── gguf.cpp                  # GGUF reader/writer (~53 KB)
│   ├── ggml-cpu/                 # CPU backend
│   ├── ggml-cuda/                # CUDA backend
│   ├── ggml-metal/               # Metal backend
│   ├── ggml-vulkan/              # Vulkan backend
│   ├── ggml-sycl/                # SYCL backend
│   ├── ggml-opencl/              # OpenCL backend
│   ├── ggml-hip/                 # AMD HIP wrapper around ggml-cuda
│   ├── ggml-musa/                # Moore Threads wrapper
│   ├── ggml-hexagon/             # Qualcomm Hexagon
│   ├── ggml-cann/                # Huawei CANN
│   ├── ggml-rpc/                 # RPC backend
│   ├── ggml-webgpu/              # WebGPU
│   ├── ggml-openvino/            # OpenVINO
│   ├── ggml-zdnn/                # IBM Z
│   ├── ggml-zendnn/              # AMD ZenDNN
│   ├── ggml-virtgpu/             # virtio-gpu
│   └── ggml-blas/                # System BLAS shim
└── cmake/

Key abstractions

Type                   Role                                                                    File
ggml_tensor            n-dim tensor with type, shape, strides, data pointer                    ggml/include/ggml.h
ggml_context           Allocation arena                                                        ggml/include/ggml.h
ggml_cgraph            A computation graph                                                     ggml/include/ggml.h
ggml_op                The op enum (MUL_MAT, ROPE, FLASH_ATTN_EXT, RMS_NORM, ...)              ggml/include/ggml.h
ggml_type              Element type / quant format (F32, F16, BF16, Q4_K, IQ2_XS, MXFP4, ...)  ggml/include/ggml.h
ggml_backend_t         A handle to one device on one backend                                   ggml/include/ggml-backend.h
ggml_backend_buffer_t  Memory region on a backend                                              ggml/include/ggml-backend.h
ggml_backend_sched_t   Multi-backend scheduler                                                 ggml/include/ggml-backend.h
gguf_context           GGUF file handle                                                        ggml/include/gguf.h
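The "shape, strides" part of ggml_tensor can be illustrated with a small standalone model. This is only a sketch: ToyTensor is a hypothetical stand-in rather than the real struct, it borrows the ne (elements per dimension) and nb (byte stride per dimension) field names from ggml.h, and it assumes a non-quantized element type where one element is one block.

```python
from dataclasses import dataclass


@dataclass
class ToyTensor:
    """Hypothetical stand-in for ggml_tensor (illustration only)."""
    type_size: int   # bytes per element, e.g. 4 for F32
    ne: tuple        # number of elements per dimension, innermost first
    nb: tuple = ()   # stride in bytes per dimension

    def __post_init__(self):
        if not self.nb:
            # Contiguous layout: each stride is the previous stride times
            # the extent of the previous dimension.
            nb = [self.type_size]
            for n in self.ne[:-1]:
                nb.append(nb[-1] * n)
            self.nb = tuple(nb)

    def offset(self, *idx):
        """Byte offset of the element at index (i0, i1, ...)."""
        return sum(i * s for i, s in zip(idx, self.nb))

    def transposed(self):
        """Swap the first two dimensions without moving any data."""
        ne = (self.ne[1], self.ne[0]) + tuple(self.ne[2:])
        nb = (self.nb[1], self.nb[0]) + tuple(self.nb[2:])
        return ToyTensor(self.type_size, ne, nb)


a = ToyTensor(type_size=4, ne=(8, 3))   # 3 rows of 8 floats
print(a.nb)                             # (4, 32)
```

Because a transposed view only swaps ne and nb, element (2, 1) of the original and element (1, 2) of the view resolve to the same byte offset. That is why strides are stored explicitly instead of being derived from the shape.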

How a graph runs

graph LR
    A[ggml_init alloc context] --> B[Build tensors and ops]
    B --> C[ggml_new_graph + ggml_build_forward_expand]
    C --> D[ggml_backend_sched_alloc_graph]
    D --> E[ggml_backend_sched_graph_compute]
    E --> F[results in tensor->data]

For inference, llama.cpp builds a fresh graph on every llama_decode (one per batch), schedules it, runs it, and reads logits out. See Computation graph.
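The build-then-compute split can be sketched without ggml at all. The toy below is an assumption-laden model, not the library's code: Node, leaf, op, build_forward_expand, and graph_compute are hypothetical names, and the real graph also tracks leaf tensors, allocation, and backend assignment. It shows only the core idea that expanding a graph from its output tensor yields the ops in dependency order.

```python
class Node:
    """Toy tensor: either a leaf holding a value or the result of an op."""
    def __init__(self, op, src=(), value=None):
        self.op = op
        self.src = tuple(src)
        self.value = value


def leaf(value):
    return Node("leaf", value=value)


def op(name, *src):
    # Building an op only records it; nothing is computed yet.
    return Node(name, src)


def build_forward_expand(graph, node):
    """Append node and everything it depends on, sources first."""
    seen = {id(n) for n in graph}

    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for s in n.src:
            visit(s)
        if n.op != "leaf":
            graph.append(n)

    visit(node)


OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


def graph_compute(graph):
    """Run the ops in the recorded order."""
    for n in graph:
        n.value = OPS[n.op](*(s.value for s in n.src))


x, w, b = leaf(2.0), leaf(3.0), leaf(1.0)
y = op("add", op("mul", x, w), b)
graph = []
build_forward_expand(graph, y)
graph_compute(graph)
print([n.op for n in graph], y.value)   # ['mul', 'add'] 7.0
```

Keeping construction separate from execution is what lets a scheduler inspect the whole graph and decide where each op runs before any kernel is launched.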

GGUF

ggml/include/gguf.h and ggml/src/gguf.cpp are the in-tree GGUF reader/writer. They handle:

  • Magic + version validation.
  • Typed key/value metadata.
  • Tensor headers (name, shape, type, offset).
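  • Optional loading of tensor data. With no_alloc set only the metadata is read, and callers such as llama.cpp map the tensor bodies themselves.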

The Python equivalent lives in gguf-py/. See gguf-py.
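As a rough illustration of the magic and version check, the fixed-size start of a GGUF file can be decoded with the standard library alone. This sketch assumes the little-endian v2/v3 layout (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit key/value count). read_gguf_header is a hypothetical helper, not part of gguf-py or gguf.cpp, and the set of accepted versions is an assumption.

```python
import struct

GGUF_MAGIC = b"GGUF"
SUPPORTED_VERSIONS = (2, 3)   # assumption for this sketch
HEADER = struct.Struct("<4sIQQ")   # magic, version, n_tensors, n_kv


def read_gguf_header(buf: bytes) -> dict:
    """Validate and decode the fixed-size header at the start of buf."""
    if len(buf) < HEADER.size:
        raise ValueError("buffer too short for a GGUF header")
    magic, version, n_tensors, n_kv = HEADER.unpack_from(buf, 0)
    if magic != GGUF_MAGIC:
        raise ValueError(f"bad magic: {magic!r}")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported GGUF version: {version}")
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}


# Synthetic header: version 3, 2 tensors, 5 metadata entries.
sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 2, 5)
print(read_gguf_header(sample))
```

The typed key/value pairs and tensor headers follow this prefix. Decoding them needs the per-type rules from the format definition, which this sketch deliberately leaves out.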

Optimizer

ggml-opt.cpp provides Adam / SGD plus gradient-tape machinery. It is used by examples/training/ and was used by the (now removed) standalone training tools. Inference does not use it.
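For readers unfamiliar with the two update rules, here is a textbook sketch of one SGD step and one Adam step on a flat parameter list. It is not taken from ggml-opt.cpp: the function names, default hyperparameters, and the optional decoupled weight-decay term (the AdamW variant) are assumptions made for illustration.

```python
import math


def sgd_step(w, g, lr):
    """Plain gradient descent: move each weight against its gradient."""
    return [wi - lr * gi for wi, gi in zip(w, g)]


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, wd=0.0):
    """One Adam update at step t (1-based). Returns (w, m, v)."""
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        mi = beta1 * mi + (1.0 - beta1) * gi          # first moment
        vi = beta2 * vi + (1.0 - beta2) * gi * gi     # second moment
        m_hat = mi / (1.0 - beta1 ** t)               # bias correction
        v_hat = vi / (1.0 - beta2 ** t)
        # Decoupled weight decay, then the adaptive step.
        wi = wi * (1.0 - lr * wd) - lr * m_hat / (math.sqrt(v_hat) + eps)
        new_w.append(wi)
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v
```

On the very first step the bias-corrected moments reduce to the gradient and its square, so each weight moves by roughly lr in the direction opposite to its gradient, regardless of the gradient's magnitude.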

Sync with the standalone repo

Changes to ggml/ here are mirrored to/from ggml-org/ggml using:

  • scripts/sync-ggml.sh
  • scripts/sync-ggml-am.sh
  • scripts/sync-ggml.last (records the last synced commit)

The standalone repo holds extra examples (Whisper, GPT-2, MNIST) but otherwise tracks the same code.

Integration points

  • libllama is the largest in-tree consumer.
  • Backends. Each backend lives under ggml/src/ggml-<name>/ and ships its own kernels.
  • Tools. llama-quantize, llama-perplexity, etc. all use ggml indirectly through libllama.
  • Conformance. tests/test-backend-ops.cpp is the ground truth for ggml correctness: it runs each op on the available backends and compares the results against the CPU backend.
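A conformance check of this kind boils down to comparing two output buffers under a tolerance. The sketch below uses a normalized mean squared error as the metric. The function names and the default threshold are assumptions for illustration and are not copied from tests/test-backend-ops.cpp.

```python
def nmse(reference, candidate):
    """Squared error between the buffers, normalized by the reference energy."""
    err = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    norm = sum(r * r for r in reference)
    if norm == 0.0:
        # All-zero reference: any non-zero error counts as a mismatch.
        return 0.0 if err == 0.0 else float("inf")
    return err / norm


def outputs_match(reference, candidate, max_nmse=1e-7):
    """True when the candidate backend agrees with the reference output."""
    if len(reference) != len(candidate):
        return False
    return nmse(reference, candidate) <= max_nmse
```

A relative metric is used rather than exact equality because backends are free to reorder floating-point accumulation, which changes the low-order bits of the result.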

Entry points for modification

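  • New op. Add to enum ggml_op, add the builder function in ggml.c, implement the CPU kernel under ggml-cpu/ and in at least one accelerator backend, report support through each backend's supports_op callback (the interface is declared in ggml-backend-impl.h), and add a tests/test-backend-ops.cpp case.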
  • New ggml_type. Edit enum ggml_type, add the block layout to ggml-common.h, implement reference kernels in ggml-quants.c, then add per-backend specializations.
  • New backend. Implement the interface in ggml-backend-impl.h. The CUDA, Vulkan, and Metal backends are the canonical references, ranging from heaviest (CUDA) to most self-contained (Metal).
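To make "block layout" and "reference kernel" concrete, here is a simplified model of an 8-bit block format in the style of Q8_0: 32 values share one half-precision scale, and each value is stored as a signed 8-bit integer. This is a sketch under stated assumptions, not the code in ggml-quants.c. The helper names are hypothetical, and Python computes in double precision where the C kernels use single precision.

```python
import math
import struct

QK = 32   # values per block


def _to_fp16(x):
    """Round a float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def _round_half_away(x):
    """Round to nearest integer, ties away from zero (like C roundf)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize_block(xs):
    """Return (scale, quants) for one block of QK floats."""
    if len(xs) != QK:
        raise ValueError(f"expected {QK} values, got {len(xs)}")
    amax = max(abs(x) for x in xs)
    d = amax / 127.0                    # map the largest magnitude to 127
    inv = 1.0 / d if d else 0.0
    qs = [_round_half_away(x * inv) for x in xs]
    return _to_fp16(d), qs              # scale is stored as fp16


def dequantize_block(d, qs):
    """Reconstruct approximate floats from one block."""
    return [d * q for q in qs]
```

A new ggml_type needs the same three pieces in C: a block struct, a quantize routine, and a dequantize routine. Per-backend kernels then operate on the packed blocks directly instead of dequantizing first.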

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
