ggml-org/llama.cpp

CUDA, HIP, MUSA backends

Active contributors: Johannes Gäßler, am17an, IMbackK, ORippler

The CUDA backend is the largest GPU backend by file count and the one most contributors interact with. It also doubles as the source for the AMD HIP/ROCm backend (which wraps CUDA sources with HIP shims) and for the Moore Threads MUSA backend.

Where it lives

ggml/src/
├── ggml-cuda/                  # The backend itself
│   ├── CMakeLists.txt
│   ├── ggml-cuda.cu            # Backend entry, dispatch
│   ├── common.cuh              # Shared device helpers
│   ├── *.cu, *.cuh             # Per-op kernels (mmvq, mmq, rope, attn, fattn-*, norm, ...)
│   ├── template-instances/     # Explicit per-quant-type instantiations
│   ├── vendors/                # hip.h, musa.h shims
│   └── fattn-wmma*             # WMMA-based flash attention paths
├── ggml-hip/                   # AMD HIP wrapper
└── ggml-musa/                  # Moore Threads wrapper

The HIP and MUSA wrappers compile the same .cu sources with a vendor shim layer (ggml-cuda/vendors/).
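
The shim idea can be illustrated with a minimal, self-contained sketch. Everything named vendor* below is hypothetical and stands in for a real vendor runtime (for example the HIP runtime); only the pattern of mapping CUDA-style names onto vendor names with macros reflects what a shim header such as vendors/hip.h does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// --- Hypothetical vendor runtime (stand-in for a real GPU runtime) ---
typedef int vendorError_t;
static const vendorError_t vendorSuccess = 0;

static vendorError_t vendorMalloc(void ** ptr, std::size_t size) {
    assert(ptr != nullptr);
    *ptr = std::malloc(size);
    return *ptr != nullptr ? vendorSuccess : 1;
}

static vendorError_t vendorFree(void * ptr) {
    std::free(ptr);
    return vendorSuccess;
}

// --- Shim layer: redirect CUDA-style names to the vendor runtime ---
// A real shim header would contain many such lines and be included
// instead of the CUDA runtime headers when building for that vendor.
#define cudaError_t vendorError_t
#define cudaSuccess vendorSuccess
#define cudaMalloc  vendorMalloc
#define cudaFree    vendorFree

// --- "Shared source": written once against CUDA-style names ---
static bool alloc_roundtrip(std::size_t n_bytes) {
    void * p = nullptr;
    cudaError_t err = cudaMalloc(&p, n_bytes);
    if (err != cudaSuccess) {
        return false;
    }
    return cudaFree(p) == cudaSuccess;
}
```

Because the shared function only ever mentions the CUDA-style names, a vendor-specific fix can usually be made in the shim rather than in the shared sources.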

Key abstractions

Name	Role
`ggml_backend_cuda_init(int device)`	Construct a CUDA backend for a specific device
`ggml_backend_cuda_context`	Per-backend state for one device (streams, cuBLAS handles, memory pools)
`mmvq` / `mmq` kernels	Quantized matmul (vector × matrix and matrix × matrix)
`fattn-*`	Fused flash-attention kernels
Stream pool	Multi-stream concurrent execution
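
To make "quantized matmul" concrete, here is a CPU-only sketch of a block-quantized dot product in the style of an 8-bit block format. It is an illustration under assumptions, not the backend's kernel code: the block_q8 layout is hypothetical and simplified (a float scale instead of a half-precision one), and the real mmvq / mmq kernels run on the GPU and handle many formats.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int QK = 32; // values per block (assumed block size)

// Hypothetical simplified block: one scale plus QK signed 8-bit values.
struct block_q8 {
    float  d;      // per-block scale
    int8_t qs[QK]; // quantized values, x[i] is approximately d * qs[i]
};

// Quantize a row of floats block by block (absmax scaling).
std::vector<block_q8> quantize_row(const std::vector<float> & x) {
    assert(x.size() % QK == 0);
    std::vector<block_q8> out(x.size() / QK);
    for (std::size_t b = 0; b < out.size(); ++b) {
        float amax = 0.0f;
        for (int i = 0; i < QK; ++i) {
            amax = std::fmax(amax, std::fabs(x[b * QK + i]));
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        out[b].d = d;
        for (int i = 0; i < QK; ++i) {
            out[b].qs[i] = (int8_t) std::lround(x[b * QK + i] * id);
        }
    }
    return out;
}

// Dot product of two quantized rows: integer accumulation inside each
// block, then one floating-point multiply by both block scales.
float dot_q8(const std::vector<block_q8> & a, const std::vector<block_q8> & b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t k = 0; k < a.size(); ++k) {
        int32_t acc = 0;
        for (int i = 0; i < QK; ++i) {
            acc += (int32_t) a[k].qs[i] * (int32_t) b[k].qs[i];
        }
        sum += a[k].d * b[k].d * (float) acc;
    }
    return sum;
}
```

A matrix × vector product repeats dot_q8 once per weight row; a matrix × matrix product repeats it per row and column pair, which is where tiling and tensor-core paths matter on a GPU.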

Capabilities

All ggml_ops used by the standard transformer graph.
Quantized matmul for every supported ggml_type (legacy block, k-quants, IQ-quants, MXFP4, FP8 where applicable).
Flash attention (-fa 1).
KV-cache quantization (-ctk q8_0 -ctv q8_0, etc.).
Multi-GPU via layer split (the default, -sm layer) or row split (-sm row), with per-device proportions set by -ts a,b,...; -ngl N controls how many layers are offloaded to the GPU.
Asynchronous copies between host pinned memory and device.
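
Fused flash attention generally relies on a streaming ("online") softmax, which lets a kernel process keys and values in chunks without materializing the full score row. The sketch below shows that technique on the CPU for a single query with scalar values. It is an assumption-laden illustration, not the fattn-* code: the function names are hypothetical and scores are assumed to be finite.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Reference: softmax(s) weighted sum of v, computed in two passes.
float attend_reference(const std::vector<float> & s, const std::vector<float> & v) {
    assert(s.size() == v.size() && !s.empty());
    const float m = *std::max_element(s.begin(), s.end());
    float l = 0.0f, o = 0.0f;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const float p = std::exp(s[i] - m);
        l += p;
        o += p * v[i];
    }
    return o / l;
}

// Streaming version: one pass over chunks, rescaling the running
// numerator and denominator whenever the running maximum grows.
float attend_streaming(const std::vector<float> & s, const std::vector<float> & v, std::size_t chunk) {
    assert(s.size() == v.size() && !s.empty() && chunk > 0);
    float m = -std::numeric_limits<float>::infinity(); // running max
    float l = 0.0f;                                    // running denominator
    float o = 0.0f;                                    // running numerator
    for (std::size_t start = 0; start < s.size(); start += chunk) {
        const std::size_t end = std::min(s.size(), start + chunk);
        float m_new = m;
        for (std::size_t i = start; i < end; ++i) {
            m_new = std::fmax(m_new, s[i]);
        }
        const float scale = std::exp(m - m_new); // 0 on the first chunk
        l *= scale;
        o *= scale;
        for (std::size_t i = start; i < end; ++i) {
            const float p = std::exp(s[i] - m_new);
            l += p;
            o += p * v[i];
        }
        m = m_new;
    }
    return o / l;
}
```

Both functions return the same value up to rounding; the streaming form only needs one chunk of scores at a time, which is what makes fusing the attention steps into one kernel practical.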

Build flags

cmake -B build -DGGML_CUDA=ON               # NVIDIA
cmake -B build -DGGML_HIP=ON                # AMD ROCm
cmake -B build -DGGML_MUSA=ON               # Moore Threads

Important sub-flags:

-DCMAKE_CUDA_ARCHITECTURES=...: target SM list.
-DGGML_CUDA_FORCE_MMQ=ON: force the custom MMQ kernels for quantized matmul instead of cuBLAS.
-DGGML_CUDA_F16=ON: enable FP16 paths.
-DGGML_CUDA_FA_ALL_QUANTS=ON: compile flash attention kernels for all KV-cache quantization type combinations (longer compile time).

For full details, see docs/build.md, docs/backend/CUDA-FEDORA.md, docs/backend/HIP.md, and docs/backend/MUSA.md.

Multi-GPU

-ts a,b,... sets the proportion of the model assigned to each device; -mg N picks the "main" device (where small intermediate tensors land). Layer split (-sm layer) is the default and places whole layers on each device; -sm row instead splits matmul weight rows across devices. Cross-GPU communication uses peer access where supported, falling back to host-pinned copies otherwise.
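
The proportional split can be sketched as follows. This is a simplified illustration, not the project's code: split_units and unit_range are hypothetical names, a "unit" stands for a layer or a weight row depending on the split mode, and the real implementation applies additional rounding and memory-based defaults.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Half-open range [begin, end) of units assigned to one device.
struct unit_range {
    int64_t begin;
    int64_t end;
};

// Divide n_units across devices in proportion to the given ratios,
// e.g. ratios {3, 1} gives the first device three quarters of the units.
std::vector<unit_range> split_units(int64_t n_units, const std::vector<float> & ratios) {
    assert(!ratios.empty() && n_units >= 0);
    const double total = std::accumulate(ratios.begin(), ratios.end(), 0.0);
    assert(total > 0.0);

    std::vector<unit_range> out(ratios.size());
    double  cum   = 0.0; // cumulative share of the devices seen so far
    int64_t begin = 0;
    for (std::size_t i = 0; i < ratios.size(); ++i) {
        cum += ratios[i];
        // The last device takes whatever remains so every unit is assigned.
        const int64_t end = (i + 1 == ratios.size())
            ? n_units
            : (int64_t) ((double) n_units * (cum / total));
        out[i] = {begin, end};
        begin = end;
    }
    return out;
}
```

With ratios {3, 1} and 4096 units, the first device receives units 0 to 3071 and the second receives 3072 to 4095. Only the ratios matter, so {3, 1} and {0.75, 0.25} produce the same split.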

Integration points

Scheduler. Like every backend, the CUDA backend exposes buffer types and op support to ggml_backend_sched.
HIP / MUSA. Built from the same sources via the vendor shims; each build can be checked against the CPU backend with tests/test-backend-ops.
Server. tools/server is the heaviest user; --n-gpu-layers, --tensor-split, --main-gpu, -fa, -ctk, -ctv are exposed through common/arg.cpp.

Entry points for modification

New kernel. Add ggml/src/ggml-cuda/<op>.cu plus a matching .cuh header, wire the op into the dispatch switch (ggml_cuda_compute_forward) and the supports-op check in ggml-cuda.cu, and add a test case in tests/test-backend-ops.cpp.
New quant support. Add an instantiation under ggml-cuda/template-instances/ and update the MMQ / MMVQ dispatch.
HIP-only fix. Use vendors/hip.h to redirect or stub APIs; avoid forking the source files.
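
When adding a kernel, the practical question is how close its output must be to a reference implementation. A common criterion, and to my understanding the kind of check tests/test-backend-ops applies, is the normalized mean squared error (NMSE) against the CPU result. The sketch below is a standalone illustration; outputs_match and the 1e-7 default threshold are assumptions, and real tests may use per-op thresholds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Normalized mean squared error: sum((out - ref)^2) / sum(ref^2).
// Assumes ref is not all zeros; otherwise the result is undefined.
double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    assert(ref.size() == out.size());
    double err  = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double d = (double) out[i] - (double) ref[i];
        err  += d * d;
        norm += (double) ref[i] * (double) ref[i];
    }
    return err / norm;
}

// Hypothetical helper: accept a backend result if it is close enough
// to the reference under the NMSE criterion.
bool outputs_match(const std::vector<float> & ref, const std::vector<float> & out, double max_err = 1e-7) {
    return ref.size() == out.size() && nmse(ref, out) <= max_err;
}
```

A relative measure like this tolerates the small rounding differences expected between CPU and GPU arithmetic while still catching indexing or stride bugs, which tend to produce large errors.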

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.