ggml-org/llama.cpp

Backends

A "backend" in llama.cpp is a ggml_backend_* implementation that knows how to allocate buffers, schedule compute, and run kernels for a particular accelerator family. The CPU backend is always present; everything else is opt-in via CMake flags. The runtime registry in ggml-backend-reg.cpp discovers loaded backends, and ggml-backend-sched decides which tensors run where.

Per-backend build instructions and capability matrices live under docs/backend/ and docs/build.md. This wiki summarizes the architecture.

Backend registry

Every backend exports a ggml_backend_reg_t describing its devices. ggml_backend_load_all walks the linked-in or dynamically-loaded set and registers them. The same libllama build can therefore pick its accelerator at runtime, which is useful for distributable binaries that target multiple GPUs.
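The registration and discovery pattern described above can be pictured with a small standalone model. This is only a sketch: ToyDevice, ToyBackendReg, ToyRegistry, and make_example_registry are hypothetical names invented for illustration and are not part of the ggml API. The device names and descriptions are made up as well.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the ggml types.
struct ToyDevice {
    std::string name;         // made-up device name, e.g. "CUDA0"
    std::string description;  // human-readable description
};

struct ToyBackendReg {
    std::string            name;     // backend family, e.g. "CUDA"
    std::vector<ToyDevice> devices;  // devices this backend exposes
};

class ToyRegistry {
public:
    // Register one backend; a backend without a name is a programming error.
    void add(ToyBackendReg reg) {
        assert(!reg.name.empty());
        regs_.push_back(std::move(reg));
    }

    // Flatten every registered backend's devices into one list,
    // in registration order.
    std::vector<ToyDevice> devices() const {
        std::vector<ToyDevice> out;
        for (const auto & reg : regs_) {
            out.insert(out.end(), reg.devices.begin(), reg.devices.end());
        }
        return out;
    }

    // Look up a device by name; returns nullptr when no backend exposes it.
    // The pointer stays valid only while the registry is not modified.
    const ToyDevice * find(const std::string & name) const {
        for (const auto & reg : regs_) {
            for (const auto & dev : reg.devices) {
                if (dev.name == name) {
                    return &dev;
                }
            }
        }
        return nullptr;
    }

private:
    std::vector<ToyBackendReg> regs_;
};

// Assumed scenario: the CPU backend is always registered, an accelerator
// backend only when one was found at startup.
inline ToyRegistry make_example_registry(bool have_gpu) {
    ToyRegistry r;
    r.add(ToyBackendReg{"CPU", {ToyDevice{"CPU", "host processor"}}});
    if (have_gpu) {
        r.add(ToyBackendReg{"CUDA", {ToyDevice{"CUDA0", "first GPU"},
                                     ToyDevice{"CUDA1", "second GPU"}}});
    }
    return r;
}
```

In this model the caller only ever asks the registry for devices, so the same binary behaves sensibly whether or not an accelerator backend was registered.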

Key files:

File	Role
`ggml/include/ggml-backend.h`	Public backend API
`ggml/src/ggml-backend.cpp`	Scheduler + cross-backend buffer copy
`ggml/src/ggml-backend-reg.cpp`	Built-in registry, env-var overrides
`ggml/src/ggml-backend-dl.cpp`	Dynamic plugin loader for shared-lib backends
`ggml/src/ggml-backend-impl.h`	Interface that each backend implements
`ggml/src/ggml-backend-meta.cpp`	Per-tensor metadata used by the scheduler

Scheduler (`ggml_backend_sched`)

The scheduler in ggml/src/ggml-backend.cpp owns one buffer per backend and decides, for each graph node, which backend executes it. The decision rules are roughly:

Tensors explicitly placed by the user (e.g. via --tensor-split) stay there.
Children inherit their parent's backend unless cost analysis suggests moving them.
Inputs that don't fit in the chosen backend's buffer are split via cross-backend copies.

This is what enables CPU+GPU hybrid inference: large weights live on the GPU, smaller per-batch tensors flow through CPU when GPU memory is full.

graph LR
    Graph[ggml_cgraph] --> Sched[ggml_backend_sched]
    Sched --> Split[per-node backend assignment]
    Split --> CPU[ggml-cpu]
    Split --> GPU1[ggml-cuda / metal / vulkan / ...]
    CPU -->|buffer copy when needed| GPU1
    GPU1 -->|buffer copy when needed| CPU
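The assignment rules above can be approximated with a toy pass over a topologically ordered graph. This is a rough sketch under stated assumptions, not the logic in ggml-backend.cpp: ToyNode, ToyPlan, and assign_backends are hypothetical, each node has at most one parent, and a single GPU memory budget stands in for the real cost analysis.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Backend { CPU, GPU };

// Hypothetical graph node; real ggml tensors carry far more state.
struct ToyNode {
    int         parent;     // index of the producing node, or -1 for a graph input
    std::size_t bytes;      // memory needed for this node's output
    bool        pinned;     // true if the user fixed the placement
    Backend     pinned_to;  // only meaningful when pinned is true
};

struct ToyPlan {
    std::vector<Backend> assignment;  // one backend per node
    int                  copies = 0;  // edges whose endpoints differ in backend
};

// Nodes must be in topological order (a parent appears before its child).
inline ToyPlan assign_backends(const std::vector<ToyNode> & nodes,
                               std::size_t gpu_budget) {
    ToyPlan plan;
    plan.assignment.reserve(nodes.size());
    std::size_t gpu_used = 0;

    for (std::size_t i = 0; i < nodes.size(); ++i) {
        const ToyNode & n = nodes[i];
        assert(n.parent < static_cast<int>(i));  // enforce topological order

        Backend choice;
        if (n.pinned) {
            choice = n.pinned_to;  // user placement wins
        } else if (n.parent >= 0) {
            choice = plan.assignment[static_cast<std::size_t>(n.parent)];  // inherit
        } else {
            choice = Backend::GPU;  // assumed default preference
        }

        // Simplified spill rule: unpinned nodes that exceed the GPU budget
        // fall back to the CPU.
        if (!n.pinned && choice == Backend::GPU && gpu_used + n.bytes > gpu_budget) {
            choice = Backend::CPU;
        }
        if (choice == Backend::GPU) {
            gpu_used += n.bytes;
        }

        // A parent on a different backend implies a cross-backend copy.
        if (n.parent >= 0 &&
            plan.assignment[static_cast<std::size_t>(n.parent)] != choice) {
            ++plan.copies;
        }
        plan.assignment.push_back(choice);
    }
    return plan;
}
```

With a 1000-byte budget and a chain of nodes sized 600 (pinned to GPU), 300, 200, and 50, this toy pass keeps the first two nodes on the GPU, spills the third to the CPU, and lets the fourth inherit the CPU placement, for one cross-backend copy in total.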

CMake flags

The complete list is in ggml/CMakeLists.txt. The most common:

Flag	Effect
`-DGGML_CUDA=ON`	NVIDIA CUDA
`-DGGML_HIP=ON`	AMD HIP/ROCm (uses CUDA sources via `ggml/src/ggml-hip/`)
`-DGGML_MUSA=ON`	Moore Threads MUSA
`-DGGML_METAL=ON`	Apple Metal (default on macOS)
`-DGGML_VULKAN=ON`	Vulkan
`-DGGML_SYCL=ON`	Intel oneAPI / SYCL
`-DGGML_OPENCL=ON`	OpenCL (Qualcomm Adreno path)
`-DGGML_HEXAGON=ON`	Qualcomm Hexagon DSP
`-DGGML_CANN=ON`	Huawei Ascend CANN
`-DGGML_RPC=ON`	RPC client + server
`-DGGML_BLAS=ON`	System BLAS for prompt processing
`-DGGML_OPENVINO=ON`	Intel OpenVINO
`-DGGML_WEBGPU=ON`	WebGPU
`-DGGML_ZDNN=ON`	IBM Z

BUILD_SHARED_LIBS=ON produces one shared library per backend so they can be loaded dynamically. The default is to link the selected backends statically into libggml.

Backend ops conformance

tests/test-backend-ops.cpp is the canonical conformance suite. It runs every ggml_op through every loaded backend and compares against the CPU reference. PRs that add or change kernel code are expected to run this test against at least two backends.
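Comparing a backend against the CPU reference comes down to an error metric and a tolerance. The sketch below uses normalized mean squared error as one plausible metric. The helper names (nmse, outputs_match) and the 1e-7 default threshold are assumptions chosen for illustration, not a description of what the test suite requires for any particular op.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Normalized mean squared error of `actual` against `reference`:
//   sum((actual - reference)^2) / sum(reference^2)
// Falls back to the raw squared error when the reference is all zeros.
inline double nmse(const std::vector<float> & reference,
                   const std::vector<float> & actual) {
    assert(reference.size() == actual.size());
    double err = 0.0;
    double ref = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double r = static_cast<double>(reference[i]);
        const double d = static_cast<double>(actual[i]) - r;
        err += d * d;
        ref += r * r;
    }
    return ref > 0.0 ? err / ref : err;
}

// Hypothetical pass/fail check: the backend output is accepted when its
// error relative to the CPU reference stays under the threshold.
inline bool outputs_match(const std::vector<float> & reference,
                          const std::vector<float> & actual,
                          double max_nmse = 1e-7) {
    return nmse(reference, actual) <= max_nmse;
}
```

A relative metric like this tolerates small rounding differences between kernels while still flagging outputs that diverge in a meaningful way.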

docs/ops.md (auto-generated by examples/gen-docs/) is the per-op coverage table — which backends support which ops at which precisions.

Code ownership

Backend ownership is split across CODEOWNERS groups:

Backend	Owners group
`ggml-cuda`	@ggml-org/ggml-cuda (JohannesGaessler, am17an, IMbackK, ORippler)
`ggml-metal`	@ggml-org/ggml-metal (ggerganov)
`ggml-vulkan`	@ggml-org/ggml-vulkan (0cc4m, jeffbolznv)
`ggml-sycl`	@ggml-org/ggml-sycl (arthw)
`ggml-opencl`	@ggml-org/ggml-opencl (lhez, max-krasnyansky)
`ggml-hexagon`	@ggml-org/ggml-hexagon (lhez, max-krasnyansky)
`ggml-cann`	@ggml-org/ggml-cann (hipudding)
`ggml-rpc`	@ggml-org/ggml-rpc (rgerganov)
`ggml-webgpu`	@ggml-org/ggml-webgpu (reeselevine)
`ggml-zdnn`	@ggml-org/ggml-zdnn (taronaeo, Andreas-Krebbel, AlekseiNikiforovIBM)
`ggml-openvino`	cavusmustafa, wine99
`ggml-virtgpu`	kpouget
`ggml-cpu`	@ggerganov
`ggml-cpu/spacemit`	@alex-spacemit

For ownership of libllama itself see Maintainers.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
