ggml-org/llama.cpp
ggml
Active contributors: Georgi Gerganov
ggml is the tensor library that llama.cpp is built on. It lives under ggml/ in this repo and is also published as a standalone project at ggml-org/ggml. The two are kept in sync via scripts/sync-ggml*.
README.md describes the relationship: "The llama.cpp project is the main playground for developing new features for the ggml library."
Purpose
- Provide a portable tensor representation (ggml_tensor).
- Build computation graphs (ggml_cgraph).
- Run those graphs on a pluggable set of accelerator backends.
- Carry the GGUF format reader/writer.
Directory layout
ggml/
├── CMakeLists.txt
├── include/ # Public headers
│ ├── ggml.h # Core types, ops, helpers
│ ├── ggml-alloc.h # Tensor allocators
│ ├── ggml-backend.h # Backend & scheduler API
│ ├── ggml-cpu.h # CPU-specific helpers (threadpool, cpuinfo)
│ ├── ggml-opt.h # Optimizer (used by examples/training)
│ ├── ggml-rpc.h, ggml-cuda.h, ggml-metal.h, ggml-vulkan.h, ggml-sycl.h, ...
│ └── gguf.h # GGUF reader/writer
├── src/
│ ├── ggml.c # Core compute graph + many CPU ops (~248 KB)
│ ├── ggml.cpp # Thin C++ wrappers
│ ├── ggml-alloc.c # Memory allocators (~48 KB)
│ ├── ggml-backend.cpp # Scheduler + cross-backend buffer copy (~93 KB)
│ ├── ggml-backend-reg.cpp # Backend registry (built-in + dynamic)
│ ├── ggml-backend-dl.cpp # Dynamic plugin loader
│ ├── ggml-backend-impl.h # Interface every backend implements
│ ├── ggml-backend-meta.cpp # Per-tensor metadata
│ ├── ggml-quants.c # Reference per-quant-type kernels (~222 KB)
│ ├── ggml-common.h # Block layouts shared with kernels (~135 KB)
│ ├── ggml-impl.h # Internal types
│ ├── ggml-opt.cpp # Adam, SGD, gradient bookkeeping
│ ├── ggml-threading.{cpp,h} # Thread sync primitives
│ ├── gguf.cpp # GGUF reader/writer (~53 KB)
│ ├── ggml-cpu/ # CPU backend
│ ├── ggml-cuda/ # CUDA backend
│ ├── ggml-metal/ # Metal backend
│ ├── ggml-vulkan/ # Vulkan backend
│ ├── ggml-sycl/ # SYCL backend
│ ├── ggml-opencl/ # OpenCL backend
│ ├── ggml-hip/ # AMD HIP wrapper around ggml-cuda
│ ├── ggml-musa/ # Moore Threads wrapper
│ ├── ggml-hexagon/ # Qualcomm Hexagon
│ ├── ggml-cann/ # Huawei CANN
│ ├── ggml-rpc/ # RPC backend
│ ├── ggml-webgpu/ # WebGPU
│ ├── ggml-openvino/ # OpenVINO
│ ├── ggml-zdnn/ # IBM Z
│ ├── ggml-zendnn/ # AMD ZenDNN
│ ├── ggml-virtgpu/ # virtio-gpu
│ └── ggml-blas/ # System BLAS shim
└── cmake/
Key abstractions
| Type | Role | File |
|---|---|---|
| ggml_tensor | n-dim tensor with type, shape, strides, data pointer | ggml/include/ggml.h |
| ggml_context | Allocation arena | ggml/include/ggml.h |
| ggml_cgraph | A computation graph | ggml/include/ggml.h |
| ggml_op | The op enum (MUL_MAT, ROPE, ATTN_*, RMS_NORM, ...) | ggml/include/ggml.h |
| ggml_type | Element type / quant format (F32, F16, BF16, Q4_K, IQ2_XS, MXFP4, ...) | ggml/include/ggml.h |
| ggml_backend_t | A handle to one device on one backend | ggml/include/ggml-backend.h |
| ggml_backend_buffer_t | Memory region on a backend | ggml/include/ggml-backend.h |
| ggml_backend_sched_t | Multi-backend scheduler | ggml/include/ggml-backend.h |
| gguf_context | GGUF file handle | ggml/include/gguf.h |
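To make the "shape, strides, data pointer" description concrete, here is a minimal sketch of strided addressing. It is not the ggml_tensor definition: the toy_* names and the fixed rank of 4 are assumptions for illustration, and only the ne (element counts) and nb (byte strides) field names loosely follow ggml's convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_DIMS 4 /* assumption: small fixed rank for this sketch */

/* Hypothetical, simplified tensor descriptor (not the real ggml_tensor). */
struct toy_tensor {
    int64_t ne[TOY_MAX_DIMS]; /* number of elements in each dimension */
    size_t  nb[TOY_MAX_DIMS]; /* stride in bytes for each dimension   */
    void   *data;             /* base pointer of the element storage  */
};

/* Set up contiguous float storage: dimension 0 varies fastest. */
static void toy_init_f32(struct toy_tensor *t, const int64_t ne[TOY_MAX_DIMS], float *data) {
    assert(t != NULL && data != NULL);
    size_t stride = sizeof(float);
    for (int i = 0; i < TOY_MAX_DIMS; i++) {
        assert(ne[i] > 0);
        t->ne[i] = ne[i];
        t->nb[i] = stride;
        stride *= (size_t) ne[i];
    }
    t->data = data;
}

/* Element address = base + sum over dimensions of (index * byte stride). */
static float *toy_at_f32(const struct toy_tensor *t, int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    const int64_t idx[TOY_MAX_DIMS] = { i0, i1, i2, i3 };
    char *p = (char *) t->data;
    for (int i = 0; i < TOY_MAX_DIMS; i++) {
        assert(idx[i] >= 0 && idx[i] < t->ne[i]);
        p += (size_t) idx[i] * t->nb[i];
    }
    return (float *) p;
}

/* Transposing dims 0 and 1 swaps shape and strides; the data is not moved. */
static void toy_transpose01(struct toy_tensor *t) {
    int64_t ne0 = t->ne[0]; t->ne[0] = t->ne[1]; t->ne[1] = ne0;
    size_t  nb0 = t->nb[0]; t->nb[0] = t->nb[1]; t->nb[1] = nb0;
}
```

Keeping strides separate from shape is what lets a view such as a transpose reuse the same buffer: only the descriptor changes.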
How a graph runs
graph LR
A[ggml_init alloc context] --> B[Build tensors and ops]
B --> C[ggml_new_graph + ggml_build_forward_expand]
C --> D[ggml_backend_sched_alloc_graph]
D --> E[ggml_backend_sched_graph_compute]
E --> F[results in tensor->data]
For inference, llama.cpp builds a fresh graph on every llama_decode (one per batch), schedules it, runs it, and reads logits out. See Computation graph.
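The flow above separates describing a computation from executing it. The toy below sketches that idea with scalar nodes: ops are recorded first, a post-order walk turns them into an ordered node list, and a separate pass computes values. Everything here (toy_node, toy_graph, toy_build_forward_expand, toy_graph_compute) is hypothetical and only borrows the naming pattern; it does not model contexts, backends, allocation, or the scheduler.

```c
#include <assert.h>
#include <stddef.h>

enum toy_op { TOY_OP_NONE, TOY_OP_ADD, TOY_OP_MUL };

#define TOY_MAX_NODES 16

/* Hypothetical scalar "tensor": a value plus the op and inputs that produce it. */
struct toy_node {
    enum toy_op      op;
    struct toy_node *src[2];
    float            value;
    int              visited;
};

/* Ordered list of op nodes; leaves (inputs) are not stored here. */
struct toy_graph {
    struct toy_node *nodes[TOY_MAX_NODES];
    int              n_nodes;
};

static struct toy_node toy_leaf(float v) {
    struct toy_node n = { TOY_OP_NONE, { NULL, NULL }, v, 0 };
    return n;
}

/* Records the op only; nothing is computed at this point. */
static struct toy_node toy_binary(enum toy_op op, struct toy_node *a, struct toy_node *b) {
    struct toy_node n = { op, { a, b }, 0.0f, 0 };
    return n;
}

/* Post-order walk: every input is appended before the node that consumes it. */
static void toy_build_forward_expand(struct toy_graph *g, struct toy_node *n) {
    if (n == NULL || n->visited) {
        return;
    }
    n->visited = 1;
    toy_build_forward_expand(g, n->src[0]);
    toy_build_forward_expand(g, n->src[1]);
    if (n->op != TOY_OP_NONE) {
        assert(g->n_nodes < TOY_MAX_NODES);
        g->nodes[g->n_nodes++] = n;
    }
}

/* Execute the recorded ops in list order. */
static void toy_graph_compute(struct toy_graph *g) {
    for (int i = 0; i < g->n_nodes; i++) {
        struct toy_node *n = g->nodes[i];
        float a = n->src[0]->value;
        float b = n->src[1]->value;
        n->value = (n->op == TOY_OP_ADD) ? a + b : a * b;
    }
}
```

A caller would create leaves with toy_leaf, combine them with toy_binary, call toy_build_forward_expand on the final node, and then read that node's value after toy_graph_compute.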
GGUF
ggml/include/gguf.h and ggml/src/gguf.cpp are the in-tree GGUF reader/writer. They handle:
- Magic + version validation.
- Typed key/value metadata.
- Tensor headers (name, shape, type, offset).
- Lazy mmap of tensor bodies.
The Python equivalent lives in gguf-py/. See gguf-py.
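As an illustration of the "magic + version validation" step, the sketch below checks the fixed-size start of a GGUF file held in memory. It is not the gguf.cpp implementation: the toy_* names are hypothetical, and it assumes a little-endian version 3 file whose header is the 4-byte magic "GGUF", a 32-bit version, then 64-bit tensor and key/value counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_GGUF_VERSION    3  /* assumption: only version 3 is accepted here */
#define TOY_GGUF_HEADER_LEN 24 /* 4 (magic) + 4 (version) + 8 + 8 (counts)    */

/* Hypothetical parsed view of the fixed-size header fields. */
struct toy_gguf_header {
    uint32_t version;
    uint64_t n_tensors;
    uint64_t n_kv;
};

/* Assemble little-endian integers byte by byte, independent of host endianness. */
static uint32_t toy_read_u32_le(const uint8_t *p) {
    return (uint32_t) p[0] | ((uint32_t) p[1] << 8) |
           ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

static uint64_t toy_read_u64_le(const uint8_t *p) {
    return (uint64_t) toy_read_u32_le(p) | ((uint64_t) toy_read_u32_le(p + 4) << 32);
}

/* Returns 0 on success, -1 on a short buffer, wrong magic, or unsupported version. */
static int toy_gguf_parse_header(const uint8_t *buf, size_t len, struct toy_gguf_header *out) {
    assert(buf != NULL && out != NULL);
    if (len < TOY_GGUF_HEADER_LEN) {
        return -1;
    }
    if (memcmp(buf, "GGUF", 4) != 0) {
        return -1;
    }
    uint32_t version = toy_read_u32_le(buf + 4);
    if (version != TOY_GGUF_VERSION) {
        return -1;
    }
    out->version   = version;
    out->n_tensors = toy_read_u64_le(buf + 8);
    out->n_kv      = toy_read_u64_le(buf + 16);
    return 0;
}
```

A real reader would go on to parse n_kv typed key/value pairs and n_tensors tensor headers, which this sketch leaves out.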
Optimizer
ggml-opt.cpp provides Adam / SGD plus gradient-tape machinery. Used by examples/training/ and the (now removed) standalone training tools. Inference does not use it.
Sync with the standalone repo
Changes to ggml/ here are mirrored to/from ggml-org/ggml using:
- scripts/sync-ggml.sh
- scripts/sync-ggml-am.sh
- scripts/sync-ggml.last
The standalone repo holds extra examples (Whisper, GPT-2, MNIST) but otherwise tracks the same code.
Integration points
- libllama is the largest in-tree consumer.
- Backends. Each backend lives under ggml/src/ggml-<name>/ and ships its own kernels.
- Tools. llama-quantize, llama-perplexity, etc. all use ggml indirectly through libllama.
- Conformance. tests/test-backend-ops.cpp is the ground truth for ggml correctness.
Entry points for modification
- New op. Add to enum ggml_op, implement in ggml.c (CPU reference) and at least one accelerator backend, register in the ggml-backend-impl.h op-support table, and add a tests/test-backend-ops.cpp case.
- New ggml_type. Edit enum ggml_type, add the block layout to ggml-common.h, implement reference kernels in ggml-quants.c, then add per-backend specializations.
- New backend. Subclass the interface in ggml-backend-impl.h. The CUDA, Vulkan, and Metal backends are the canonical references, ranging from heaviest (CUDA) to most self-contained (Metal).
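For the new ggml_type path, it may help to see what a "block layout" plus "reference kernels" can look like in miniature. The sketch below is an assumption-laden toy, not a layout from ggml-common.h: toy_block_q8 is a hypothetical block of 32 signed 8-bit values sharing one float scale, with a quantize and a dequantize routine standing in for the reference kernels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_QK 32 /* assumption: 32 values per block */

/* Hypothetical block layout: one shared scale plus 32 quantized values. */
struct toy_block_q8 {
    float  d;          /* per-block scale */
    int8_t qs[TOY_QK]; /* quantized values in [-127, 127] */
};

/* Reference quantizer: scale so the largest magnitude maps to 127, then round. */
static void toy_quantize_block(const float *x, struct toy_block_q8 *b) {
    assert(x != NULL && b != NULL);
    float amax = 0.0f;
    for (int i = 0; i < TOY_QK; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) {
            amax = a;
        }
    }
    b->d = amax / 127.0f;
    /* An all-zero block has scale 0; avoid dividing by it. */
    float inv = (b->d != 0.0f) ? 1.0f / b->d : 0.0f;
    for (int i = 0; i < TOY_QK; i++) {
        float v = x[i] * inv;
        /* Round half away from zero without needing libm. */
        b->qs[i] = (int8_t) (v + (v >= 0.0f ? 0.5f : -0.5f));
    }
}

/* Reference dequantizer: multiply each stored value by the block scale. */
static void toy_dequantize_block(const struct toy_block_q8 *b, float *y) {
    assert(b != NULL && y != NULL);
    for (int i = 0; i < TOY_QK; i++) {
        y[i] = b->d * (float) b->qs[i];
    }
}
```

With this scheme the round-trip error per element is bounded by roughly half the block scale, which is the kind of property a tests/test-backend-ops.cpp style comparison against a reference would rely on.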
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.