ggml-org/llama.cpp

Backends

A "backend" in llama.cpp is a ggml_backend_* implementation that knows how to allocate buffers, schedule compute, and run kernels for a particular accelerator family. The CPU backend is always present; every other backend is opt-in via CMake flags. At runtime, the registry in ggml-backend-reg.cpp discovers the loaded backends, and the scheduler (ggml_backend_sched) decides which tensors run where.

Per-backend build instructions and capability matrices live under docs/backend/ and docs/build.md. This wiki summarizes the architecture.

Backend registry

Every backend exports a ggml_backend_reg_t describing its devices. ggml_backend_load_all walks the linked-in or dynamically loaded set of backends and registers them. The same libllama build can therefore pick its accelerator at runtime, which is useful for distributable binaries that target multiple GPUs.
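The registration pattern described above can be illustrated with a small, self-contained C++ sketch. It does not use the ggml API: ToyDevice, ToyBackendReg, and ToyRegistry are hypothetical stand-ins, and the "prefer a GPU, otherwise take the first device" policy is an assumption made for the example.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not ggml types.
struct ToyDevice {
    std::string name;    // e.g. "CUDA0"
    bool        is_gpu;  // true for accelerator devices
};

struct ToyBackendReg {
    std::string            name;     // e.g. "CUDA"
    std::vector<ToyDevice> devices;  // devices this backend exposes
};

class ToyRegistry {
public:
    // Called once per backend, whether it was linked in or loaded at runtime.
    void register_backend(ToyBackendReg reg) { regs_.push_back(std::move(reg)); }

    // Flattened view of every device across all registered backends.
    std::vector<ToyDevice> all_devices() const {
        std::vector<ToyDevice> out;
        for (const auto & reg : regs_) {
            out.insert(out.end(), reg.devices.begin(), reg.devices.end());
        }
        return out;
    }

    // Assumed policy: return the first GPU device, else the first device of
    // any kind, else an empty string when nothing is registered.
    std::string pick_device() const {
        std::string fallback;
        for (const auto & dev : all_devices()) {
            if (dev.is_gpu) {
                return dev.name;
            }
            if (fallback.empty()) {
                fallback = dev.name;
            }
        }
        return fallback;
    }

private:
    std::vector<ToyBackendReg> regs_;
};
```

Registering a CPU entry and then a two-device GPU entry makes pick_device() switch from the CPU device to the first GPU device without rebuilding the caller, which is the property the registry gives a single libllama build.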

Key files:

| File | Role |
| --- | --- |
| ggml/include/ggml-backend.h | Public backend API |
| ggml/src/ggml-backend.cpp | Scheduler + cross-backend buffer copy |
| ggml/src/ggml-backend-reg.cpp | Built-in registry, env-var overrides |
| ggml/src/ggml-backend-dl.cpp | Dynamic plugin loader for shared-lib backends |
| ggml/src/ggml-backend-impl.h | Interface that each backend implements |
| ggml/src/ggml-backend-meta.cpp | Per-tensor metadata used by the scheduler |

Scheduler (ggml_backend_sched)

The scheduler in ggml/src/ggml-backend.cpp owns one buffer per backend and decides, for each graph node, which backend executes it. The decision rules are roughly:

  1. Tensors explicitly placed by the user (e.g. via --tensor-split) stay there.
  2. Children inherit their parent's backend unless cost analysis suggests moving them.
  3. Inputs that don't fit in the chosen backend's buffer are split via cross-backend copies.

This is what enables hybrid CPU+GPU inference: large weights live on the GPU, and smaller per-batch tensors flow through the CPU when GPU memory is full.
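The three rules above can be sketched as a toy placement pass. This is a simplification under stated assumptions, not the algorithm in ggml-backend.cpp: ToyNode, assign_backends, and count_copies are hypothetical names, each node has at most one parent, the cost analysis in rule 2 is omitted, and "does not fit" is modelled as a simple byte budget that only unpinned nodes are charged against.

```cpp
#include <cstddef>
#include <vector>

enum class Backend { CPU, GPU };

// Hypothetical graph node for illustration; this is not a ggml_tensor.
struct ToyNode {
    int         parent;     // index of an earlier node, or -1 for a graph input
    bool        pinned;     // true if the user fixed the placement
    Backend     pinned_to;  // only meaningful when pinned is true
    std::size_t bytes;      // size of the node's output
};

// Assign a backend to each node, visiting nodes in topological order.
// gpu_budget is the GPU memory (in bytes) left for unpinned nodes.
std::vector<Backend> assign_backends(const std::vector<ToyNode> & nodes,
                                     std::size_t gpu_budget) {
    std::vector<Backend> placed(nodes.size(), Backend::CPU);
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        const ToyNode & n = nodes[i];
        if (n.pinned) {
            placed[i] = n.pinned_to;  // rule 1: user placement always wins
            continue;
        }
        // Rule 2 (without cost analysis): inherit the parent's backend.
        // Unpinned graph inputs try the GPU first (an assumption of this toy).
        Backend want = n.parent >= 0
            ? placed[static_cast<std::size_t>(n.parent)]
            : Backend::GPU;
        // Rule 3 (simplified): fall back to the CPU when the GPU is full.
        if (want == Backend::GPU) {
            if (n.bytes <= gpu_budget) {
                gpu_budget -= n.bytes;
            } else {
                want = Backend::CPU;
            }
        }
        placed[i] = want;
    }
    return placed;
}

// A cross-backend copy is needed wherever a node and its parent differ.
std::size_t count_copies(const std::vector<ToyNode> & nodes,
                         const std::vector<Backend> & placed) {
    std::size_t copies = 0;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        const int p = nodes[i].parent;
        if (p >= 0 && placed[static_cast<std::size_t>(p)] != placed[i]) {
            ++copies;
        }
    }
    return copies;
}
```

With a pinned GPU weight followed by a chain of unpinned nodes, the chain stays on the GPU until the budget runs out, then moves to the CPU and stays there, so the graph pays for a single cross-backend copy at the boundary.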

graph LR
    Graph[ggml_cgraph] --> Sched[ggml_backend_sched]
    Sched --> Split[per-node backend assignment]
    Split --> CPU[ggml-cpu]
    Split --> GPU1[ggml-cuda / metal / vulkan / ...]
    CPU -->|buffer copy when needed| GPU1
    GPU1 -->|buffer copy when needed| CPU

CMake flags

The complete list is in ggml/CMakeLists.txt. The most common flags are:

| Flag | Effect |
| --- | --- |
| -DGGML_CUDA=ON | NVIDIA CUDA |
| -DGGML_HIP=ON | AMD HIP/ROCm (uses CUDA sources via ggml/src/ggml-hip/) |
| -DGGML_MUSA=ON | Moore Threads MUSA |
| -DGGML_METAL=ON | Apple Metal (default on macOS) |
| -DGGML_VULKAN=ON | Vulkan |
| -DGGML_SYCL=ON | Intel oneAPI / SYCL |
| -DGGML_OPENCL=ON | OpenCL (Qualcomm Adreno path) |
| -DGGML_HEXAGON=ON | Qualcomm Hexagon DSP |
| -DGGML_CANN=ON | Huawei Ascend CANN |
| -DGGML_RPC=ON | RPC client + server |
| -DGGML_BLAS=ON | System BLAS for prompt processing |
| -DGGML_OPENVINO=ON | Intel OpenVINO |
| -DGGML_WEBGPU=ON | WebGPU |
| -DGGML_ZDNN=ON | IBM Z |

BUILD_SHARED_LIBS=ON produces one shared library per backend so they can be loaded dynamically. The default is to link the selected backends statically into libggml.

Backend ops conformance

tests/test-backend-ops.cpp is the canonical conformance suite. It runs every ggml_op through every loaded backend and compares against the CPU reference. PRs that add or change kernel code are expected to run this test against at least two backends.
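The core idea of comparing a backend's output against the CPU reference can be sketched in a few lines of C++. This is an illustration, not code from tests/test-backend-ops.cpp: the function names are hypothetical, and both the error metric (normalized mean squared error) and any tolerance passed in are assumptions chosen for the example.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Normalized mean squared error of `out` relative to `ref`.
// Both vectors must have the same length. If the reference is all zeros,
// the result is 0 when the output is also all zeros and infinity otherwise.
double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    double err  = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double r = static_cast<double>(ref[i]);
        const double d = static_cast<double>(out[i]) - r;
        err  += d * d;
        norm += r * r;
    }
    if (norm == 0.0) {
        return err == 0.0 ? 0.0 : std::numeric_limits<double>::infinity();
    }
    return err / norm;
}

// True when the backend output has the same shape as the reference and its
// error is within the caller-supplied tolerance.
bool matches_reference(const std::vector<float> & ref,
                       const std::vector<float> & out,
                       double max_err) {
    return ref.size() == out.size() && !ref.empty() && nmse(ref, out) <= max_err;
}
```

A relative metric like this keeps one tolerance meaningful across ops whose outputs differ by orders of magnitude; an absolute per-element threshold would need tuning per op.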

docs/ops.md (auto-generated by examples/gen-docs/) is the per-op coverage table — which backends support which ops at which precisions.

Code ownership

Backend ownership is split across CODEOWNERS groups:

| Backend | Owners group (members) |
| --- | --- |
| ggml-cuda | @ggml-org/ggml-cuda (JohannesGaessler, am17an, IMbackK, ORippler) |
| ggml-metal | @ggml-org/ggml-metal (ggerganov) |
| ggml-vulkan | @ggml-org/ggml-vulkan (0cc4m, jeffbolznv) |
| ggml-sycl | @ggml-org/ggml-sycl (arthw) |
| ggml-opencl | @ggml-org/ggml-opencl (lhez, max-krasnyansky) |
| ggml-hexagon | @ggml-org/ggml-hexagon (lhez, max-krasnyansky) |
| ggml-cann | @ggml-org/ggml-cann (hipudding) |
| ggml-rpc | @ggml-org/ggml-rpc (rgerganov) |
| ggml-webgpu | @ggml-org/ggml-webgpu (reeselevine) |
| ggml-zdnn | @ggml-org/ggml-zdnn (taronaeo, Andreas-Krebbel, AlekseiNikiforovIBM) |
| ggml-openvino | cavusmustafa, wine99 |
| ggml-virtgpu | kpouget |
| ggml-cpu | @ggerganov |
| ggml-cpu/spacemit | @alex-spacemit |

For ownership of libllama itself see Maintainers.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
