ggml-org/llama.cpp

Vulkan backend

Active contributors: 0cc4m, Jeff Bolz

The Vulkan backend is the cross-vendor GPU path: it runs on NVIDIA, AMD, Intel, and any other Vulkan-capable hardware with a single binary. It is the most portable GPU option, especially on Linux and Windows where vendor stacks differ.

Where it lives

ggml/src/ggml-vulkan/
├── CMakeLists.txt
├── ggml-vulkan.cpp        # Backend entry: device, queues, dispatch, buffers
├── ggml-vulkan-shaders.cpp / .hpp  # Generated shader registry
├── vulkan-shaders/        # GLSL compute shader sources (.comp), compiled to SPIR-V
└── (cmake fragments that compile shaders at build time)

Shader sources are compiled to SPIR-V at build time (glslc from the Vulkan SDK) and embedded into the binary. The C++ side handles device discovery, descriptor sets, command buffers, and pipeline creation.
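To make the "embedded into the binary" step concrete, here is a minimal sketch of what a build-generated shader table could look like on the C++ side. The struct, the table, and the names find_shader and looks_like_spirv are hypothetical and do not reflect the actual layout of ggml-vulkan-shaders.hpp. The only fixed facts used are the SPIR-V magic number and the five-word module header from the SPIR-V specification.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a build-generated table; NOT the real layout of
// ggml-vulkan-shaders.hpp.
struct EmbeddedShader {
    const char*     name;        // lookup key, e.g. the shader's base name
    const uint32_t* words;       // SPIR-V module as 32-bit words
    std::size_t     word_count;  // number of words in the module
};

// First word of every SPIR-V module (from the SPIR-V specification).
constexpr uint32_t kSpirvMagic = 0x07230203u;

// Placeholder module: only the 5-word SPIR-V header
// (magic, version, generator, bound, schema), no instructions.
static const uint32_t kAddSpv[] = {kSpirvMagic, 0x00010000u, 0u, 1u, 0u};

// In a real build this table would be emitted by the shader generator.
static const EmbeddedShader kShaders[] = {
    {"add_f32", kAddSpv, sizeof(kAddSpv) / sizeof(kAddSpv[0])},
};

// Linear search by name; returns nullptr when the shader is not embedded.
const EmbeddedShader* find_shader(const char* name) {
    for (const EmbeddedShader& s : kShaders) {
        if (std::strcmp(s.name, name) == 0) return &s;
    }
    return nullptr;
}

// Cheap sanity check before handing the blob to pipeline creation.
bool looks_like_spirv(const EmbeddedShader& s) {
    return s.word_count >= 5 && s.words[0] == kSpirvMagic;
}
```

With a table like this, pipeline creation would look up a module by name and pass its words and byte size to the Vulkan shader-module call. No file I/O is needed at runtime because the bytes are already part of the executable.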

Capabilities

Full transformer op set on most modern GPUs.
All ggml_type values used at inference time (k-quants, IQ-quants, MXFP4, legacy block quants, FP16, BF16).
Multi-GPU via tensor split.
Shader-based flash attention path on supporting GPUs.
Cross-vendor: same binary runs on NVIDIA, AMD, Intel, Adreno (mobile), Mali, etc.

Build

cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j

Requires the Vulkan SDK at build time. Detailed instructions are in docs/build.md#vulkan and docs/backend/.

Performance notes

Vulkan competes well with vendor-specific backends on flagship GPUs and is often the most practical option on AMD hardware, since ROCm supports a narrower range of GPUs.
On older or mobile Vulkan implementations, expect slower flash-attention paths or fallbacks.
Pipeline cache files (pipeline_cache.bin) speed up subsequent launches; the backend writes them under the user's cache dir by default.
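A cache blob is only useful on the device and driver that produced it. As an illustration, the sketch below parses the fixed 32-byte header that the Vulkan specification defines for pipeline cache data (VkPipelineCacheHeaderVersionOne) and compares it with the current device. This is a generic technique, not a description of what ggml-vulkan.cpp does. The struct is declared locally so the code builds without Vulkan headers, and the function names are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Mirrors the fields of VkPipelineCacheHeaderVersionOne from the Vulkan spec,
// declared locally so the sketch builds without Vulkan headers.
struct CacheHeader {
    uint32_t header_size;
    uint32_t header_version;       // 1 == VK_PIPELINE_CACHE_HEADER_VERSION_ONE
    uint32_t vendor_id;
    uint32_t device_id;
    std::array<uint8_t, 16> uuid;  // VK_UUID_SIZE bytes
};

// Header fields are stored least significant byte first.
static uint32_t read_u32_le(const uint8_t* p) {
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) |
           (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}

// Returns false when the blob is too short or is not a version-one header.
bool parse_cache_header(const std::vector<uint8_t>& blob, CacheHeader& out) {
    constexpr std::size_t kHeaderBytes = 32;  // 4 * 4 bytes + 16-byte UUID
    if (blob.size() < kHeaderBytes) return false;
    out.header_size    = read_u32_le(&blob[0]);
    out.header_version = read_u32_le(&blob[4]);
    out.vendor_id      = read_u32_le(&blob[8]);
    out.device_id      = read_u32_le(&blob[12]);
    std::memcpy(out.uuid.data(), &blob[16], out.uuid.size());
    return out.header_size >= kHeaderBytes && out.header_version == 1;
}

// A cache is reusable only if it was produced by the same device and driver.
bool cache_matches_device(const CacheHeader& h, uint32_t vendor_id,
                          uint32_t device_id,
                          const std::array<uint8_t, 16>& uuid) {
    return h.vendor_id == vendor_id && h.device_id == device_id && h.uuid == uuid;
}
```

The values to compare against come from the physical device properties (vendorID, deviceID, pipelineCacheUUID). If the check fails, the simplest recovery is to discard the file and start from an empty cache.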

Integration points

Scheduler. Standard ggml_backend_sched integration.
Server. Same flags as CUDA: --n-gpu-layers, --tensor-split, --main-gpu, -fa.
Mobile. Vulkan is increasingly viable on Adreno/Mali; see also the Hexagon backend for Qualcomm DSP offload.
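To show how a value such as --tensor-split 3,1 can translate into a placement, the sketch below maps relative weights to a per-layer device index. It assumes a layer-wise split and uses a hypothetical helper, assign_layers. The real logic in llama.cpp also depends on the split mode and on available device memory, so treat this as an approximation of the idea rather than the actual algorithm.

```cpp
#include <cstddef>
#include <vector>

// Assign each of n_layers to a device index according to relative weights,
// e.g. weights {3, 1} puts about three quarters of the layers on device 0.
// Hypothetical helper; not a function from llama.cpp.
std::vector<int> assign_layers(const std::vector<float>& weights, int n_layers) {
    if (n_layers <= 0) return {};
    std::vector<int> device_of_layer(static_cast<std::size_t>(n_layers), 0);

    float total = 0.0f;
    for (float w : weights) total += w;
    if (weights.empty() || total <= 0.0f) return device_of_layer;  // all on device 0

    // Cumulative normalized boundaries: device d owns the fraction
    // range [cum[d-1], cum[d]) of the layer stack.
    std::vector<float> cum(weights.size());
    float running = 0.0f;
    for (std::size_t d = 0; d < weights.size(); ++d) {
        running += weights[d];
        cum[d] = running / total;
    }

    for (int il = 0; il < n_layers; ++il) {
        const float pos = static_cast<float>(il) / static_cast<float>(n_layers);
        std::size_t d = 0;
        while (d + 1 < cum.size() && pos >= cum[d]) ++d;  // first range containing pos
        device_of_layer[static_cast<std::size_t>(il)] = static_cast<int>(d);
    }
    return device_of_layer;
}
```

For weights {3, 1} and 8 layers, this places layers 0 to 5 on device 0 and layers 6 and 7 on device 1.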

Entry points for modification

New shader. Add vulkan-shaders/<op>.comp (GLSL), then wire up the dispatch in ggml-vulkan.cpp. The shader registry is regenerated at build time.
Pipeline tuning. ggml-vulkan.cpp owns the pipeline-creation logic and cache.
New device extension. Probe for it during device initialization in ggml-vulkan.cpp (the public backend constructor is ggml_backend_vk_init), then gate kernels accordingly.
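The probe-then-gate pattern for a device extension can be sketched as follows. The struct, enum, and function names are hypothetical. The extension strings are real Vulkan extension names, used here only as examples. A real backend would fill the extension list from vkEnumerateDeviceExtensionProperties and would also query and enable the matching feature structs, which this sketch omits.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Feature flags derived from the device's reported extensions.
// Hypothetical struct; a real backend tracks many more properties.
struct DeviceFeatures {
    bool coopmat  = false;  // VK_KHR_cooperative_matrix
    bool coopmat2 = false;  // VK_NV_cooperative_matrix2
};

static bool has_ext(const std::vector<std::string>& exts, const char* name) {
    return std::find(exts.begin(), exts.end(), name) != exts.end();
}

// Probe step: run once at device initialization.
DeviceFeatures probe_features(const std::vector<std::string>& exts) {
    DeviceFeatures f;
    f.coopmat  = has_ext(exts, "VK_KHR_cooperative_matrix");
    f.coopmat2 = has_ext(exts, "VK_NV_cooperative_matrix2");
    return f;
}

enum class MatmulPath { Scalar, Coopmat, Coopmat2 };

// Gate step: pick the most capable kernel variant the device supports,
// falling back to a plain scalar shader otherwise.
MatmulPath pick_matmul_path(const DeviceFeatures& f) {
    if (f.coopmat2) return MatmulPath::Coopmat2;
    if (f.coopmat)  return MatmulPath::Coopmat;
    return MatmulPath::Scalar;
}
```

Keeping the probe separate from the gate means the extension query runs once per device, and every later dispatch decision reads a plain boolean.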

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.