ggml-org/llama.cpp
CUDA, HIP, MUSA backends
Active contributors: Johannes Gäßler, am17an, IMbackK, ORippler
The CUDA backend is the largest GPU backend by file count and the one most contributors interact with. It also doubles as the source for the AMD HIP/ROCm backend (which wraps CUDA sources with HIP shims) and for the Moore Threads MUSA backend.
Where it lives
ggml/src/
├── ggml-cuda/ # The backend itself
│ ├── CMakeLists.txt
│ ├── ggml-cuda.cu # Backend entry, dispatch
│ ├── common.cuh # Shared device helpers
│ ├── *.cu, *.cuh # Per-op kernels (mmvq, mmq, rope, attn, fattn-*, norm, ...)
│ ├── template-instances/ # Explicit per-quant-type instantiations
│ ├── vendors/ # hip.h, musa.h shims
│ └── fattn-wmma* # WMMA-based flash attention paths
├── ggml-hip/ # AMD HIP wrapper
└── ggml-musa/ # Moore Threads wrapper

The HIP and MUSA wrappers compile the same .cu sources with a vendor shim layer (ggml-cuda/vendors/).
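The shim idea can be illustrated without any GPU toolkit. The sketch below is not the content of `vendors/hip.h`; every name in it (`nativeGetRuntimeName`, `otherGetRuntimeName`, `SKETCH_USE_OTHER_VENDOR`) is hypothetical. It only shows the general pattern, assuming the real header works along similar lines: the shared source is written once against one vendor's names, and a header redirects those names with macros when building for another vendor.

```cpp
#include <string>

#if defined(SKETCH_USE_OTHER_VENDOR)
// Hypothetical second vendor: same behaviour, different symbol name.
inline const char * otherGetRuntimeName() { return "other"; }
// The shim itself: redirect the name used by the shared sources.
#define nativeGetRuntimeName otherGetRuntimeName
#else
// Hypothetical "native" runtime the shared sources are written against.
inline const char * nativeGetRuntimeName() { return "native"; }
#endif

// Shared source: written once, only ever mentions the native name.
inline std::string describe_runtime() {
    return std::string("running on ") + nativeGetRuntimeName();
}
```

Compiling with `-DSKETCH_USE_OTHER_VENDOR` switches the implementation without touching `describe_runtime`, which is the property that lets a single set of kernel sources serve several vendors.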
Key abstractions
| Name | Role |
|---|---|
| `ggml_backend_cuda_init(int device)` | Construct a CUDA backend for a specific device |
| `ggml_cuda_context` | Per-device state (streams, scratch buffers) |
| `mmvq` / `mmq` kernels | Quantized matmul (vector × matrix and matrix × matrix) |
| `fattn-*` | Fused flash-attention kernels |
| Stream pool | Multi-stream concurrent execution |
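To make the quantized matmul row more concrete, here is a CPU-only sketch of the arithmetic such kernels are built around: integer products accumulated per block, then scaled once. It is a simplification and not the `mmvq` / `mmq` code. The struct `BlockQ8` and both functions are hypothetical names, and the scale is kept as a `float` for brevity (ggml's `q8_0` block stores a half-precision scale next to 32 `int8` values).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int QK = 32; // elements per block, as in q8_0

// Hypothetical simplified block: one scale plus 32 signed 8-bit values.
struct BlockQ8 {
    float d;
    std::array<int8_t, QK> qs;
};

// Quantize QK floats so that x[i] is approximately d * qs[i].
inline BlockQ8 quantize_block(const float * x) {
    assert(x != nullptr);
    float amax = 0.0f;
    for (int i = 0; i < QK; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ8 b{};
    b.d = amax / 127.0f;
    const float inv = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < QK; ++i) {
        b.qs[i] = static_cast<int8_t>(std::lround(x[i] * inv));
    }
    return b;
}

// Dot product of two blocks: integer accumulation, one float scale at the end.
inline float dot_block(const BlockQ8 & a, const BlockQ8 & b) {
    int32_t acc = 0;
    for (int i = 0; i < QK; ++i) {
        acc += static_cast<int32_t>(a.qs[i]) * static_cast<int32_t>(b.qs[i]);
    }
    return a.d * b.d * static_cast<float>(acc);
}
```

A GPU kernel would distribute the inner loop across threads and use packed integer instructions where available, but the scale-after-accumulate structure stays the same.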
Capabilities
- All `ggml_op`s used by the standard transformer graph.
- Quantized matmul for every supported `ggml_type` (legacy block, k-quants, IQ-quants, MXFP4, FP8 where applicable).
- Flash attention (`-fa 1`).
- KV-cache quantization (`-ctk q8_0 -ctv q8_0`, etc.).
- Multi-GPU via tensor split (`-ts a,b,...`) and pipeline parallel via layer placement (`-ngl N`).
- Asynchronous copies between host pinned memory and device.
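The effect of KV-cache quantization on memory can be estimated with simple arithmetic. The sketch below assumes the usual storage costs (2 bytes per element for `f16`; 34 bytes per 32-element block for `q8_0`, i.e. a 2-byte scale plus 32 one-byte values) and a cache holding one K and one V entry per layer, position, KV head, and head dimension. The struct, constants, and function name are hypothetical, and real allocations may include padding that this ignores.

```cpp
#include <cassert>
#include <cstdint>

// Storage cost of a cache element type, expressed per block.
struct CacheType {
    uint64_t block_bytes;
    uint64_t block_elems;
};

constexpr CacheType kF16 {2, 1};   // 2 bytes per element
constexpr CacheType kQ8_0{34, 32}; // 2-byte scale + 32 int8 values

// Approximate K+V cache size in bytes (hypothetical helper, ignores padding).
inline uint64_t kv_cache_bytes(uint64_t n_layers, uint64_t n_ctx,
                               uint64_t n_kv_heads, uint64_t head_dim,
                               CacheType k, CacheType v) {
    // Rows must divide evenly into blocks for block-quantized types.
    assert(head_dim % k.block_elems == 0 && head_dim % v.block_elems == 0);
    const uint64_t elems = n_layers * n_ctx * n_kv_heads * head_dim; // per K or V
    return elems / k.block_elems * k.block_bytes
         + elems / v.block_elems * v.block_bytes;
}
```

For a hypothetical model with 32 layers, 8 KV heads, head size 128, and a 4096-token context, this gives 512 MiB with `f16` for both caches and 272 MiB with `q8_0` for both, a ratio of 34/64.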
Build flags
cmake -B build -DGGML_CUDA=ON # NVIDIA
cmake -B build -DGGML_HIP=ON # AMD ROCm
cmake -B build -DGGML_MUSA=ON # Moore Threads

Important sub-flags:
- `-DCMAKE_CUDA_ARCHITECTURES=...`: target SM list.
- `-DGGML_CUDA_FORCE_MMQ=ON`: force the MMQ matmul path.
- `-DGGML_CUDA_F16=ON`: enable FP16 paths.
- `-DGGML_CUDA_FA_ALL_QUANTS=ON`: compile flash attention for every quant type.
For the long form, see docs/build.md, docs/backend/CUDA-FEDORA.md, docs/backend/HIP.md, and docs/backend/MUSA.md.
Multi-GPU
`-ts a,b,...` splits weights across devices in the given proportions; `-mg N` selects the main device, where small intermediate tensors are placed. Layer-wise (pipeline) splitting is the default; `-sm row` switches to row-parallel matmul. Cross-GPU transfers use peer access where supported and fall back to copies through host-pinned memory otherwise.
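One way to picture a proportional split is as a mapping from layer index to device. The sketch below is an assumption about how such a mapping can be computed, not a copy of the loader's logic: it normalizes the `-ts` values into cumulative fractions and assigns each layer to the first device whose cumulative share exceeds the layer's relative position. The function name `assign_layers` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Map each of n_layers layers to a device index, proportionally to `split`.
inline std::vector<int> assign_layers(const std::vector<float> & split, int n_layers) {
    assert(!split.empty() && n_layers >= 0);
    const float total = std::accumulate(split.begin(), split.end(), 0.0f);
    assert(total > 0.0f);

    // Cumulative, normalized boundaries; the last entry is 1.0.
    std::vector<float> cum(split.size());
    float running = 0.0f;
    for (std::size_t i = 0; i < split.size(); ++i) {
        running += split[i];
        cum[i] = running / total;
    }

    std::vector<int> device(static_cast<std::size_t>(n_layers));
    const int last = static_cast<int>(split.size()) - 1;
    for (int il = 0; il < n_layers; ++il) {
        const float pos = static_cast<float>(il) / static_cast<float>(n_layers);
        // First boundary strictly greater than this layer's position.
        const auto it = std::upper_bound(cum.begin(), cum.end(), pos);
        device[static_cast<std::size_t>(il)] =
            std::min(static_cast<int>(it - cum.begin()), last);
    }
    return device;
}
```

With a split of `3,1` and 8 layers, the first six layers land on device 0 and the last two on device 1.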
Integration points
- Scheduler. Like every backend, the CUDA backend exposes buffer types and op support to `ggml_backend_sched`.
- HIP / MUSA. Built from the same sources via the vendor shims; cross-checked by `tests/test-backend-ops` when both are present.
- Server. `tools/server` is the heaviest user; `--n-gpu-layers`, `--tensor-split`, `--main-gpu`, `-fa`, `-ctk`, `-ctv` are exposed through `common/arg.cpp`.
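The role of "op support" in scheduling can be sketched with a toy model. This is a heavy simplification and not the `ggml_backend_sched` algorithm, which also has to consider where tensors are allocated. All types and names here (`Op`, `Backend`, `pick_backend`) are hypothetical. The sketch only shows the basic contract: each backend answers whether it supports an op, and an unsupported op falls through to a lower-priority backend such as the CPU.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical op identifiers for the sketch.
enum class Op { MatMul, Rope, Softmax, CpuOnlyOp };

// Hypothetical backend description: a name and an op-support predicate.
struct Backend {
    std::string name;
    std::function<bool(Op)> supports_op;
};

// Return the index of the first backend (in priority order) that supports
// `op`, or -1 if none does.
inline int pick_backend(const std::vector<Backend> & backends, Op op) {
    for (std::size_t i = 0; i < backends.size(); ++i) {
        if (backends[i].supports_op(op)) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Listing the GPU backend first and the CPU backend last means an op the GPU declines is still executed, at the cost of a transfer between devices.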
Entry points for modification
- New kernel. Add `ggml/src/ggml-cuda/<op>.cu` plus `.cuh` header, register in the dispatch table inside `ggml-cuda.cu`, add a test in `tests/test-backend-ops.cpp`.
- New quant support. Add an instantiation under `ggml-cuda/template-instances/` and update the MMQ / MMVQ dispatch.
- HIP-only fix. Use `vendors/hip.h` to redirect or stub APIs; avoid forking the source files.
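When adding a test for a new kernel, the core check is a comparison of the backend's output against a reference result. The sketch below shows one common way to do that, a normalized mean squared error with a tolerance. It is an illustration only: the helper names and the `1e-7` default threshold are assumptions, and the actual test harness should be consulted for its own metric and limits.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Normalized mean squared error of `out` relative to the reference `ref`.
inline double nmse(const std::vector<float> & ref, const std::vector<float> & out) {
    assert(ref.size() == out.size());
    double err  = 0.0;
    double norm = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        const double r = static_cast<double>(ref[i]);
        const double d = r - static_cast<double>(out[i]);
        err  += d * d;
        norm += r * r;
    }
    if (norm == 0.0) {
        // All-zero reference: only an exact match counts as zero error.
        return err == 0.0 ? 0.0 : std::numeric_limits<double>::infinity();
    }
    return err / norm;
}

// Hypothetical pass/fail helper; the default threshold is an assumption.
inline bool outputs_match(const std::vector<float> & ref,
                          const std::vector<float> & out,
                          double max_nmse = 1e-7) {
    return nmse(ref, out) <= max_nmse;
}
```

A relative metric like this tolerates the small rounding differences expected between CPU and GPU arithmetic while still catching indexing or stride bugs, which tend to produce large errors.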
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.