ggml-org/llama.cpp
Backends
A "backend" in llama.cpp is a ggml_backend_* implementation that knows how to allocate buffers, schedule compute, and run kernels for a particular accelerator family. The CPU backend is always present; everything else is opt-in via CMake flags. The runtime registry in ggml-backend-reg.cpp discovers loaded backends, and ggml-backend-sched decides which tensors run where.
Per-backend build instructions and capability matrices live under docs/backend/ and docs/build.md. This wiki summarizes the architecture.
Pages
- CPU backend: ggml-cpu/, the always-present reference.
- CUDA / HIP / MUSA: NVIDIA, AMD, and Moore Threads.
- Metal: Apple Silicon.
- Vulkan: cross-vendor GPU.
- SYCL / OpenCL / OpenVINO: Intel and friends.
- Other backends: Hexagon, CANN, RPC, WebGPU, BLAS, zDNN, ZenDNN, virtgpu.
Backend registry
Every backend exports a ggml_backend_reg_t describing its devices. ggml_backend_load_all walks the linked-in or dynamically loaded set and registers each backend it finds. The same libllama build can therefore pick its accelerator at runtime, which is useful for distributable binaries that target multiple GPUs.
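To make the registry idea concrete, here is a minimal toy model in Python. It is a sketch under assumptions, not the ggml API: the `Device`, `BackendReg`, and `Registry` classes and the `pick` policy are hypothetical stand-ins invented for illustration, and only the names ggml_backend_reg_t and ggml_backend_load_all come from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    name: str  # illustrative device name, e.g. "CUDA0"
    kind: str  # "cpu" or "gpu" (hypothetical classification)


@dataclass
class BackendReg:
    """Toy stand-in for ggml_backend_reg_t: a named backend plus its devices."""
    name: str
    devices: list = field(default_factory=list)


class Registry:
    """Toy registry; the real one lives in ggml-backend-reg.cpp."""

    def __init__(self):
        self._regs = []

    def register(self, reg):
        # Ignore duplicate registrations of the same backend name.
        if all(r.name != reg.name for r in self._regs):
            self._regs.append(reg)

    def load_all(self, available):
        # Stand-in for ggml_backend_load_all: register every backend that
        # is linked in or was found as a loadable library.
        for reg in available:
            self.register(reg)

    def devices(self):
        return [d for reg in self._regs for d in reg.devices]

    def pick(self, prefer="gpu"):
        # Assumed policy: first device of the preferred kind, else the CPU.
        devs = self.devices()
        for d in devs:
            if d.kind == prefer:
                return d
        return next(d for d in devs if d.kind == "cpu")


cpu = BackendReg("CPU", [Device("CPU", "cpu")])
cuda = BackendReg("CUDA", [Device("CUDA0", "gpu"), Device("CUDA1", "gpu")])

# Same "binary", two different hosts: one has a GPU backend, one does not.
gpu_host = Registry()
gpu_host.load_all([cpu, cuda])

cpu_only_host = Registry()
cpu_only_host.load_all([cpu])

print(gpu_host.pick().name)
print(cpu_only_host.pick().name)
```

The point of the sketch is that device selection happens after registration, so the same code path works whether or not an accelerator backend was found.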
Key files:
| File | Role |
|---|---|
| ggml/include/ggml-backend.h | Public backend API |
| ggml/src/ggml-backend.cpp | Scheduler + cross-backend buffer copy |
| ggml/src/ggml-backend-reg.cpp | Built-in registry, env-var overrides |
| ggml/src/ggml-backend-dl.cpp | Dynamic plugin loader for shared-lib backends |
| ggml/src/ggml-backend-impl.h | Interface that each backend implements |
| ggml/src/ggml-backend-meta.cpp | Per-tensor metadata used by the scheduler |
Scheduler (ggml_backend_sched)
The scheduler in ggml/src/ggml-backend.cpp owns one buffer per backend and decides per-graph-node which backend executes it. The decision rules are roughly:
- Tensors explicitly placed by the user (e.g. via --tensor-split) stay there.
- Children inherit their parent's backend unless cost analysis suggests moving them.
- Inputs that don't fit in the chosen backend's buffer are split via cross-backend copies.
This is what enables CPU+GPU hybrid inference: large weights live on the GPU, smaller per-batch tensors flow through CPU when GPU memory is full.
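The rules above can be sketched as a small Python function. This is a simplified illustration, not the algorithm in ggml/src/ggml-backend.cpp: the `assign_backends` function, the single-parent node tuples, the "prefer GPU" default, and the interpretation of the third rule as "spill to CPU when the GPU buffer is full" are all assumptions made for the example, and the cost analysis is left out entirely.

```python
def assign_backends(nodes, pinned, gpu_capacity):
    """Assign each graph node to "GPU" or "CPU".

    nodes: list of (name, parent_or_None, size) in topological order.
    pinned: dict mapping node name -> backend for user-placed tensors.
    gpu_capacity: free GPU buffer space, in the same units as size.
    Returns (assignment, copies), where copies lists the (parent, child)
    edges that cross a backend boundary and therefore need a buffer copy.
    """
    assignment = {}
    free = gpu_capacity
    for name, parent, size in nodes:
        if name in pinned:
            backend = pinned[name]        # rule 1: explicit placement wins
        elif parent is not None:
            backend = assignment[parent]  # rule 2: inherit from the parent
        else:
            backend = "GPU"               # assumption: prefer the accelerator
        if backend == "GPU" and name not in pinned and size > free:
            backend = "CPU"               # rule 3 (assumed): spill when full
        if backend == "GPU":
            free -= size                  # pinned tensors also use GPU space
        assignment[name] = backend
    copies = [(p, n) for n, p, _ in nodes
              if p is not None and assignment[p] != assignment[n]]
    return assignment, copies


graph = [
    ("weights", None, 6),
    ("attn", "weights", 2),
    ("ffn", "attn", 3),
    ("logits", "ffn", 1),
]
assignment, copies = assign_backends(graph, {"weights": "GPU"}, gpu_capacity=8)
print(assignment)
print(copies)
```

In this run the pinned weights and the first child fill the GPU buffer, so the remaining nodes fall back to the CPU and exactly one edge needs a cross-backend copy.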
graph LR
Graph[ggml_cgraph] --> Sched[ggml_backend_sched]
Sched --> Split[per-node backend assignment]
Split --> CPU[ggml-cpu]
Split --> GPU1[ggml-cuda / metal / vulkan / ...]
CPU -->|buffer copy when needed| GPU1
GPU1 -->|buffer copy when needed| CPU
CMake flags
The complete list is in ggml/CMakeLists.txt. The most common:
| Flag | Effect |
|---|---|
| -DGGML_CUDA=ON | NVIDIA CUDA |
| -DGGML_HIP=ON | AMD HIP/ROCm (uses CUDA sources via ggml/src/ggml-hip/) |
| -DGGML_MUSA=ON | Moore Threads MUSA |
| -DGGML_METAL=ON | Apple Metal (default on macOS) |
| -DGGML_VULKAN=ON | Vulkan |
| -DGGML_SYCL=ON | Intel oneAPI / SYCL |
| -DGGML_OPENCL=ON | OpenCL (Qualcomm Adreno path) |
| -DGGML_HEXAGON=ON | Qualcomm Hexagon DSP |
| -DGGML_CANN=ON | Huawei Ascend CANN |
| -DGGML_RPC=ON | RPC client + server |
| -DGGML_BLAS=ON | System BLAS for prompt processing |
| -DGGML_OPENVINO=ON | Intel OpenVINO |
| -DGGML_WEBGPU=ON | WebGPU |
| -DGGML_ZDNN=ON | IBM Z |
-DBUILD_SHARED_LIBS=ON produces one shared library per backend; for those libraries to be loaded dynamically at runtime, the build also needs -DGGML_BACKEND_DL=ON. The default is to link the selected backends statically into libggml.
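As a small illustration of how these flags combine, the following Python helper assembles a CMake configure command. It is only a sketch: the `BACKEND_FLAGS` short names and the `configure_command` function are hypothetical, the flags themselves are copied from the list above, and `cmake -B build` is assumed as the configure step.

```python
# Flags copied from the list above; the dictionary keys are hypothetical
# short names introduced for this example.
BACKEND_FLAGS = {
    "cuda": "-DGGML_CUDA=ON",
    "hip": "-DGGML_HIP=ON",
    "metal": "-DGGML_METAL=ON",
    "vulkan": "-DGGML_VULKAN=ON",
    "sycl": "-DGGML_SYCL=ON",
    "rpc": "-DGGML_RPC=ON",
    "blas": "-DGGML_BLAS=ON",
}


def configure_command(backends, build_dir="build", shared=False):
    """Return the argument list for a CMake configure step."""
    args = ["cmake", "-B", build_dir]
    for backend in backends:
        args.append(BACKEND_FLAGS[backend])  # KeyError on an unknown name
    if shared:
        args.append("-DBUILD_SHARED_LIBS=ON")
    return args


cmd = configure_command(["cuda", "vulkan"], shared=True)
print(" ".join(cmd))
```

Returning a list rather than a string keeps the result safe to pass to `subprocess.run` without shell quoting.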
Backend ops conformance
tests/test-backend-ops.cpp is the canonical conformance suite. It runs every ggml_op through every loaded backend and compares against the CPU reference. PRs that add or change kernel code are expected to run this test against at least two backends.
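Comparing a backend against the CPU reference needs a tolerance, because floating-point kernels rarely match bit for bit. A minimal Python sketch of such a check follows. The metric (normalized mean squared error), the `nmse` and `conforms` helpers, and the `1e-7` threshold are assumptions chosen for illustration; they are not taken from tests/test-backend-ops.cpp.

```python
def nmse(reference, candidate):
    """Normalized mean squared error: sum((c - r)^2) / sum(r^2)."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    err = sum((c - r) ** 2 for r, c in zip(reference, candidate))
    norm = sum(r ** 2 for r in reference)
    # Fall back to the raw squared error when the reference is all zeros.
    return err / norm if norm > 0 else err


def conforms(reference, candidate, max_err=1e-7):
    """True when the candidate output is within tolerance of the reference."""
    return nmse(reference, candidate) <= max_err


cpu_out = [1.0, -2.0, 3.0, 0.5]          # pretend CPU reference output
close_out = [1.0, -2.0, 3.00001, 0.5]    # tiny rounding difference
wrong_out = [1.0, -2.0, 3.1, 0.5]        # a real kernel bug

print(conforms(cpu_out, close_out))
print(conforms(cpu_out, wrong_out))
```

Normalizing by the reference magnitude keeps one threshold usable across ops whose outputs differ by orders of magnitude.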
docs/ops.md (auto-generated by examples/gen-docs/) is the per-op coverage table — which backends support which ops at which precisions.
Code ownership
Backend ownership is split across CODEOWNERS groups:
| Backend | Owners group |
|---|---|
| ggml-cuda | @ggml-org/ggml-cuda (JohannesGaessler, am17an, IMbackK, ORippler) |
| ggml-metal | @ggml-org/ggml-metal (ggerganov) |
| ggml-vulkan | @ggml-org/ggml-vulkan (0cc4m, jeffbolznv) |
| ggml-sycl | @ggml-org/ggml-sycl (arthw) |
| ggml-opencl | @ggml-org/ggml-opencl (lhez, max-krasnyansky) |
| ggml-hexagon | @ggml-org/ggml-hexagon (lhez, max-krasnyansky) |
| ggml-cann | @ggml-org/ggml-cann (hipudding) |
| ggml-rpc | @ggml-org/ggml-rpc (rgerganov) |
| ggml-webgpu | @ggml-org/ggml-webgpu (reeselevine) |
| ggml-zdnn | @ggml-org/ggml-zdnn (taronaeo, Andreas-Krebbel, AlekseiNikiforovIBM) |
| ggml-openvino | cavusmustafa, wine99 |
| ggml-virtgpu | kpouget |
| ggml-cpu | @ggerganov |
| ggml-cpu/spacemit | @alex-spacemit |
For ownership of libllama itself see Maintainers.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.