ggml-org/llama.cpp

ggml

Active contributors: Georgi Gerganov

ggml is the tensor library that llama.cpp is built on. It lives under ggml/ in this repo and is also published as a standalone project at ggml-org/ggml. The two are kept in sync via scripts/sync-ggml*.

README.md describes the relationship: "The llama.cpp project is the main playground for developing new features for the ggml library."

Purpose

Provide a portable tensor representation (ggml_tensor).
Build computation graphs (ggml_cgraph).
Run those graphs on a pluggable set of accelerator backends.
Carry the GGUF format reader/writer.

Directory layout

ggml/
├── CMakeLists.txt
├── include/                      # Public headers
│   ├── ggml.h                    # Core types, ops, helpers
│   ├── ggml-alloc.h              # Tensor allocators
│   ├── ggml-backend.h            # Backend & scheduler API
│   ├── ggml-cpu.h                # CPU-specific helpers (threadpool, cpuinfo)
│   ├── ggml-opt.h                # Optimizer (used by examples/training)
│   ├── ggml-rpc.h, ggml-cuda.h, ggml-metal.h, ggml-vulkan.h, ggml-sycl.h, ...
│   └── gguf.h                    # GGUF reader/writer
├── src/
│   ├── ggml.c                    # Core tensor/op definitions + graph construction (~248 KB)
│   ├── ggml.cpp                  # Thin C++ wrappers
│   ├── ggml-alloc.c              # Memory allocators (~48 KB)
│   ├── ggml-backend.cpp          # Scheduler + cross-backend buffer copy (~93 KB)
│   ├── ggml-backend-reg.cpp      # Backend registry (built-in + dynamic)
│   ├── ggml-backend-dl.cpp       # Dynamic plugin loader
│   ├── ggml-backend-impl.h       # Interface every backend implements
│   ├── ggml-backend-meta.cpp     # Per-tensor metadata
│   ├── ggml-quants.c             # Reference per-quant-type kernels (~222 KB)
│   ├── ggml-common.h             # Block layouts shared with kernels (~135 KB)
│   ├── ggml-impl.h               # Internal types
│   ├── ggml-opt.cpp              # AdamW, SGD, gradient bookkeeping
│   ├── ggml-threading.{cpp,h}    # Thread sync primitives
│   ├── gguf.cpp                  # GGUF reader/writer (~53 KB)
│   ├── ggml-cpu/                 # CPU backend
│   ├── ggml-cuda/                # CUDA backend
│   ├── ggml-metal/               # Metal backend
│   ├── ggml-vulkan/              # Vulkan backend
│   ├── ggml-sycl/                # SYCL backend
│   ├── ggml-opencl/              # OpenCL backend
│   ├── ggml-hip/                 # AMD HIP wrapper around ggml-cuda
│   ├── ggml-musa/                # Moore Threads wrapper
│   ├── ggml-hexagon/             # Qualcomm Hexagon
│   ├── ggml-cann/                # Huawei CANN
│   ├── ggml-rpc/                 # RPC backend
│   ├── ggml-webgpu/              # WebGPU
│   ├── ggml-openvino/            # OpenVINO
│   ├── ggml-zdnn/                # IBM Z
│   ├── ggml-zendnn/              # AMD ZenDNN
│   ├── ggml-virtgpu/             # virtio-gpu
│   └── ggml-blas/                # System BLAS shim
└── cmake/

Key abstractions

| Type | Role | File |
|---|---|---|
| `ggml_tensor` | n-dimensional tensor with type, shape, strides, and data pointer | `ggml/include/ggml.h` |
| `ggml_context` | Allocation arena for tensors and graphs | `ggml/include/ggml.h` |
| `ggml_cgraph` | A computation graph | `ggml/include/ggml.h` |
| `ggml_op` | The op enum (`MUL_MAT`, `ROPE`, `FLASH_ATTN_EXT`, `RMS_NORM`, ...) | `ggml/include/ggml.h` |
| `ggml_type` | Element type / quant format (`F32`, `F16`, `BF16`, `Q4_K`, `IQ2_XS`, `MXFP4`, ...) | `ggml/include/ggml.h` |
| `ggml_backend_t` | A handle to one device on one backend | `ggml/include/ggml-backend.h` |
| `ggml_backend_buffer_t` | Memory region on a backend | `ggml/include/ggml-backend.h` |
| `ggml_backend_sched_t` | Multi-backend scheduler | `ggml/include/ggml-backend.h` |
| `gguf_context` | GGUF file handle | `ggml/include/gguf.h` |
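To make the "shape, strides, data pointer" description concrete, here is a minimal sketch of strided addressing. The `toy_tensor` struct and its helpers are hypothetical and much smaller than the real `ggml_tensor`; only the idea of per-dimension element counts (`ne`) and byte strides (`nb`) is borrowed from ggml's naming.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_DIMS 4 /* hypothetical fixed maximum rank for this sketch */

/* Toy stand-in for a tensor header (not the real ggml_tensor). */
struct toy_tensor {
    int64_t ne[TOY_MAX_DIMS]; /* number of elements per dimension */
    size_t  nb[TOY_MAX_DIMS]; /* stride in bytes per dimension */
    void   *data;             /* caller-owned storage */
};

/* Describe a contiguous float32 buffer; dimension 0 varies fastest. */
static void toy_tensor_init_f32(struct toy_tensor *t, float *buf,
                                int64_t ne0, int64_t ne1,
                                int64_t ne2, int64_t ne3) {
    t->ne[0] = ne0; t->ne[1] = ne1; t->ne[2] = ne2; t->ne[3] = ne3;
    t->nb[0] = sizeof(float);
    for (int i = 1; i < TOY_MAX_DIMS; i++) {
        t->nb[i] = t->nb[i - 1] * (size_t) t->ne[i - 1];
    }
    t->data = buf;
}

/* Address of one element: base pointer plus the sum of index * byte stride. */
static float *toy_tensor_at(const struct toy_tensor *t,
                            int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    assert(i0 >= 0 && i0 < t->ne[0] && i1 >= 0 && i1 < t->ne[1]);
    assert(i2 >= 0 && i2 < t->ne[2] && i3 >= 0 && i3 < t->ne[3]);
    size_t off = (size_t) i0 * t->nb[0] + (size_t) i1 * t->nb[1]
               + (size_t) i2 * t->nb[2] + (size_t) i3 * t->nb[3];
    return (float *) ((char *) t->data + off);
}

/* A transposed view swaps the first two shapes and strides; no data moves. */
static struct toy_tensor toy_transpose(const struct toy_tensor *t) {
    struct toy_tensor r = *t;
    r.ne[0] = t->ne[1]; r.ne[1] = t->ne[0];
    r.nb[0] = t->nb[1]; r.nb[1] = t->nb[0];
    return r;
}
```

Because the strides are stored explicitly, a view such as a transpose can share the original buffer and only change how indices map to bytes.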

How a graph runs

graph LR
    A[ggml_init alloc context] --> B[Build tensors and ops]
    B --> C[ggml_new_graph + ggml_build_forward_expand]
    C --> D[ggml_backend_sched_alloc_graph]
    D --> E[ggml_backend_sched_graph_compute]
    E --> F[results in tensor->data]

For inference, llama.cpp builds a fresh graph on every llama_decode (one per batch), schedules it, runs it, and reads logits out. See Computation graph.
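The build-then-run flow above can be modeled with a toy deferred-evaluation graph. This is only a sketch of the idea: every `toy_*` name is hypothetical, there is no allocator or scheduler, and values are single floats rather than tensors. It shows that building ops records dependencies without computing anything, that expanding from an output node yields a dependency-ordered node list, and that a separate compute pass fills in the results.

```c
#include <assert.h>
#include <stddef.h>

enum toy_op { TOY_OP_NONE, TOY_OP_ADD, TOY_OP_MUL };

/* One node: either a leaf holding a value or an op over two sources. */
struct toy_node {
    enum toy_op      op;
    struct toy_node *src[2];
    float            value;
    int              visited;
};

#define TOY_GRAPH_MAX 64

/* The graph is just the op nodes in an order that is safe to execute. */
struct toy_graph {
    struct toy_node *nodes[TOY_GRAPH_MAX];
    int              n_nodes;
};

static struct toy_node toy_leaf(float v) {
    struct toy_node n = { TOY_OP_NONE, { NULL, NULL }, v, 0 };
    return n;
}

/* Building an op only records its inputs; nothing is computed yet. */
static struct toy_node toy_binary(enum toy_op op,
                                  struct toy_node *a, struct toy_node *b) {
    struct toy_node n = { op, { a, b }, 0.0f, 0 };
    return n;
}

/* Post-order walk from an output: sources are appended before their users. */
static void toy_build_forward_expand(struct toy_graph *g, struct toy_node *n) {
    if (n == NULL || n->visited) {
        return;
    }
    n->visited = 1;
    toy_build_forward_expand(g, n->src[0]);
    toy_build_forward_expand(g, n->src[1]);
    if (n->op != TOY_OP_NONE) {
        assert(g->n_nodes < TOY_GRAPH_MAX);
        g->nodes[g->n_nodes++] = n;
    }
}

/* Execute the nodes in recorded order; each input is ready when needed. */
static void toy_graph_compute(struct toy_graph *g) {
    for (int i = 0; i < g->n_nodes; i++) {
        struct toy_node *n = g->nodes[i];
        const float a = n->src[0]->value;
        const float b = n->src[1]->value;
        n->value = (n->op == TOY_OP_ADD) ? a + b : a * b;
    }
}
```

Separating construction from execution is what lets a scheduler inspect the whole graph and decide where each node runs before any work starts.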

GGUF

ggml/include/gguf.h and ggml/src/gguf.cpp are the in-tree GGUF reader/writer. They handle:

Magic + version validation.
Typed key/value metadata.
Tensor headers (name, shape, type, offset).
Optional deferred loading of tensor data (the no_alloc init parameter); memory-mapping the file is left to the caller.

The Python equivalent lives in gguf-py/. See gguf-py.
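As an illustration of the magic and version check, the sketch below parses the fixed-size start of a GGUF file from a byte buffer. It is not the code in gguf.cpp: the `toy_gguf_*` names and the return codes are hypothetical, and accepting only versions 2 and 3 is an assumption of this sketch. The layout it assumes is the 4-byte magic `GGUF`, a little-endian 32-bit version, then 64-bit tensor and key/value counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the fixed-size header fields. */
struct toy_gguf_header {
    uint32_t version;
    uint64_t n_tensors;
    uint64_t n_kv;
};

/* Read little-endian integers byte by byte so host endianness is irrelevant. */
static uint32_t toy_read_u32_le(const uint8_t *p) {
    uint32_t v = 0;
    for (int i = 3; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}

static uint64_t toy_read_u64_le(const uint8_t *p) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Returns 0 on success, -1 if truncated, -2 on bad magic, -3 on bad version. */
static int toy_gguf_parse_header(const uint8_t *buf, size_t len,
                                 struct toy_gguf_header *out) {
    assert(buf != NULL && out != NULL);
    if (len < 24) {
        return -1; /* 4 (magic) + 4 (version) + 8 + 8 (counts) */
    }
    if (memcmp(buf, "GGUF", 4) != 0) {
        return -2;
    }
    const uint32_t version = toy_read_u32_le(buf + 4);
    if (version < 2 || version > 3) {
        return -3; /* assumption: this sketch accepts versions 2 and 3 only */
    }
    out->version   = version;
    out->n_tensors = toy_read_u64_le(buf + 8);
    out->n_kv      = toy_read_u64_le(buf + 16);
    return 0;
}
```

A real reader would continue from byte 24 into the key/value pairs and tensor headers; this sketch stops at the counts.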

Optimizer

ggml-opt.cpp provides the AdamW and SGD optimizers plus the gradient bookkeeping they need. It is used by examples/training/ and was used by the (now removed) standalone training tools. Inference does not use it.
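For readers unfamiliar with what an SGD update does, here is a minimal sketch on plain float arrays. It does not use or mirror the ggml-opt API; `toy_sgd_step` and `toy_quadratic_grad` are hypothetical helpers, and the quadratic loss is chosen only because its gradient is easy to write by hand.

```c
#include <assert.h>
#include <stddef.h>

/* One plain SGD update: move each parameter against its gradient. */
static void toy_sgd_step(float *params, const float *grads, size_t n, float lr) {
    assert(params != NULL && grads != NULL);
    for (size_t i = 0; i < n; i++) {
        params[i] -= lr * grads[i];
    }
}

/* Gradient of f(p) = sum_i (p_i - target_i)^2, used here as a stand-in loss. */
static void toy_quadratic_grad(const float *params, const float *target,
                               float *grads, size_t n) {
    for (size_t i = 0; i < n; i++) {
        grads[i] = 2.0f * (params[i] - target[i]);
    }
}
```

Calling the two helpers in a loop (compute gradients, then step) drives the parameters toward the target; with a learning rate of 0.1 the distance to the target shrinks by a factor of 0.8 per step on this loss.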

Sync with the standalone repo

Changes to ggml/ here are mirrored to/from ggml-org/ggml using:

scripts/sync-ggml.sh
scripts/sync-ggml-am.sh
scripts/sync-ggml.last

The standalone repo holds extra examples (Whisper, GPT-2, MNIST) but otherwise tracks the same code.

Integration points

libllama is the largest in-tree consumer.
Backends. Each backend lives under ggml/src/ggml-<name>/ and ships its own kernels.
Tools. llama-quantize, llama-perplexity, etc. all use ggml indirectly through libllama.
Conformance. tests/test-backend-ops.cpp is the ground truth for ggml correctness.
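The backend and conformance points above can be sketched together: a backend is a table of function pointers, and a conformance check runs the same op on a reference backend and on the backend under test, then compares the results. Everything here is a hypothetical miniature (`mini_*` names, two ops, float arrays); it is not the interface in ggml-backend-impl.h nor the harness in tests/test-backend-ops.cpp.

```c
#include <assert.h>
#include <stddef.h>

enum mini_op { MINI_OP_ADD, MINI_OP_MUL };

/* Hypothetical backend interface: a C-style table of function pointers. */
struct mini_backend_i {
    const char *name;
    int  (*supports_op)(enum mini_op op);
    void (*compute)(enum mini_op op, const float *a, const float *b,
                    float *dst, size_t n);
};

/* Reference backend: supports every op with a straightforward loop. */
static int mini_ref_supports_op(enum mini_op op) { (void) op; return 1; }
static void mini_ref_compute(enum mini_op op, const float *a, const float *b,
                             float *dst, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = (op == MINI_OP_ADD) ? a[i] + b[i] : a[i] * b[i];
    }
}

/* A second backend that only implements ADD and says so via supports_op. */
static int mini_fast_supports_op(enum mini_op op) { return op == MINI_OP_ADD; }
static void mini_fast_compute(enum mini_op op, const float *a, const float *b,
                              float *dst, size_t n) {
    assert(op == MINI_OP_ADD);
    for (size_t i = 0; i < n; i++) {
        dst[i] = b[i] + a[i];
    }
}

static const struct mini_backend_i mini_ref  =
    { "ref",  mini_ref_supports_op,  mini_ref_compute  };
static const struct mini_backend_i mini_fast =
    { "fast", mini_fast_supports_op, mini_fast_compute };

/* Returns 1 if the op is unsupported (skipped) or matches the reference. */
static int mini_check_op(const struct mini_backend_i *dev, enum mini_op op,
                         const float *a, const float *b, size_t n) {
    float want[16], got[16];
    assert(n <= 16);
    if (!dev->supports_op(op)) {
        return 1; /* unsupported ops are skipped, not failed */
    }
    mini_ref.compute(op, a, b, want, n);
    dev->compute(op, a, b, got, n);
    for (size_t i = 0; i < n; i++) {
        float d = want[i] - got[i];
        if (d < 0.0f) d = -d;
        if (d > 1e-6f) return 0;
    }
    return 1;
}
```

Skipping unsupported ops rather than failing them is a design choice of this sketch: it lets a partial backend pass while a scheduler routes the remaining ops elsewhere.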

Entry points for modification

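New op. Add it to enum ggml_op and add its builder function in ggml.c, implement the CPU reference under ggml/src/ggml-cpu/ and at least one accelerator backend, report support through each backend's supports_op callback (declared in ggml-backend-impl.h), and add a tests/test-backend-ops.cpp case.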
New ggml_type. Edit enum ggml_type, add the block layout to ggml-common.h, implement reference kernels in ggml-quants.c, then add per-backend specializations.
New backend. Implement the interface structs declared in ggml-backend-impl.h. The CUDA, Vulkan, and Metal backends are the canonical references, ranging from heaviest (CUDA) to most self-contained (Metal).
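To show what "a block layout plus reference kernels" means for a new quantized type, here is a minimal sketch of an 8-bit block format with a quantize and dequantize pair. It is loosely modeled on a per-block-scale scheme with 32 values per block; the `toy_*` names are hypothetical, and storing the scale as a `float` is a simplifying assumption of this sketch rather than a claim about any layout in ggml-common.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_QK 32 /* values per block (assumption for this sketch) */

/* Hypothetical block layout: one scale plus TOY_QK signed 8-bit quants. */
struct toy_block_q8 {
    float  d;          /* per-block scale, kept as float for simplicity */
    int8_t qs[TOY_QK]; /* quantized values in [-127, 127] */
};

/* Reference quantizer: scale each block so its largest magnitude maps to 127. */
static void toy_quantize_row(const float *x, struct toy_block_q8 *y, size_t n) {
    assert(n % TOY_QK == 0);
    for (size_t b = 0; b < n / TOY_QK; b++) {
        float amax = 0.0f;
        for (int j = 0; j < TOY_QK; j++) {
            float v = x[b * TOY_QK + j];
            if (v < 0.0f) v = -v;
            if (v > amax) amax = v;
        }
        const float d  = amax / 127.0f;
        const float id = (d > 0.0f) ? 1.0f / d : 0.0f;
        y[b].d = d;
        for (int j = 0; j < TOY_QK; j++) {
            const float v = x[b * TOY_QK + j] * id;
            /* round half away from zero without needing libm */
            y[b].qs[j] = (int8_t) (v >= 0.0f ? v + 0.5f : v - 0.5f);
        }
    }
}

/* Reference dequantizer: multiply each quant by its block scale. */
static void toy_dequantize_row(const struct toy_block_q8 *x, float *y, size_t n) {
    assert(n % TOY_QK == 0);
    for (size_t b = 0; b < n / TOY_QK; b++) {
        for (int j = 0; j < TOY_QK; j++) {
            y[b * TOY_QK + j] = x[b].d * (float) x[b].qs[j];
        }
    }
}
```

A round trip through these two functions loses at most about half a quantization step (`d / 2`) per value, which is the kind of property a reference kernel lets per-backend specializations be checked against.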

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.