ggml-org/llama.cpp

Metal backend

Active contributors: Georgi Gerganov

The Metal backend runs on Apple Silicon (M-series chips) and Intel Macs with Metal-capable GPUs. It is enabled by default on macOS and is one of the project's most polished accelerator paths; Apple Silicon has been a first-class target since early in the project.

Where it lives

ggml/src/ggml-metal/
├── CMakeLists.txt
├── ggml-metal.h, ggml-metal.m, ggml-metal.mm   # Backend header and Objective-C/Objective-C++ entry
├── ggml-metal-impl.h                            # Internal declarations
├── ggml-metal.metal                             # All Metal compute shaders (~10k LOC)
├── ggml-metal-common.h
└── (ancillary helpers)

Most kernels live in a single large file, ggml-metal.metal. The Objective-C side wraps the Metal API: device initialization, command-queue setup, buffer management, and per-op dispatch.

Capabilities

  • Full transformer op set including flash attention.
  • All quant types relevant for inference (k-quants, IQ-quants, MXFP4, legacy block, FP16, BF16).
  • Quantized KV cache.
  • Unified memory on Apple Silicon is exploited: large weights stay in memory shared by the CPU and GPU rather than being copied across PCIe.
  • macOS, iOS/iPadOS, and Apple TV builds (via the build-xcframework.sh script).

Build

cmake -B build                     # Metal is auto-enabled on Apple platforms
cmake --build build --config Release -j

For iOS/iPadOS distribution, build-xcframework.sh produces an XCFramework. The SwiftUI demo at examples/llama.swiftui/ shows Apple-platform integration patterns; docs/android.md covers mobile builds on Android, where Metal is not available.

Performance notes

  • Apple Silicon has unified memory: once a model is loaded, the CPU and GPU address the same bytes, so no cudaMemcpy-style host-to-device copy is required.
  • M-series Neural Engine is not used — the backend runs on the GPU. Some prompt-processing paths use the BLAS backend (Apple Accelerate) when CMake finds it.
  • -fa 1 enables flash attention.
  • -ctk q8_0 -ctv q8_0 roughly halves KV-cache memory relative to the default F16 cache, at a small quality cost.
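
To make the KV-cache saving concrete, here is a minimal sizing sketch. It assumes the cache holds one K and one V tensor per layer, that F16 uses 2 bytes per element, and that q8_0 packs 32 int8 values plus one 2-byte scale into a 34-byte block. The model dimensions and the helper name kv_cache_bytes are hypothetical, chosen only for illustration; real allocations may include padding or other overhead not modeled here.

```python
# Illustrative KV-cache sizing; the model dimensions below are hypothetical.

# (elements per block, bytes per block) for each cache type.
# f16: 2 bytes per element. q8_0: 32 int8 values + one 2-byte scale = 34 bytes.
CACHE_TYPES = {
    "f16": (1, 2),
    "q8_0": (32, 34),
}


def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, cache_type):
    """Estimate the combined K and V cache size in bytes for one sequence."""
    block_elems, block_bytes = CACHE_TYPES[cache_type]
    elems_per_tensor = n_layers * n_ctx * n_kv_heads * head_dim
    if elems_per_tensor % block_elems != 0:
        raise ValueError("element count must be a multiple of the block size")
    per_tensor = elems_per_tensor // block_elems * block_bytes
    return 2 * per_tensor  # one K tensor and one V tensor


# Hypothetical dimensions: 32 layers, 8 KV heads of size 128, 8192-token context.
dims = dict(n_layers=32, n_ctx=8192, n_kv_heads=8, head_dim=128)
f16 = kv_cache_bytes(**dims, cache_type="f16")
q8 = kv_cache_bytes(**dims, cache_type="q8_0")

print(f"f16 : {f16 / 2**20:.0f} MiB")
print(f"q8_0: {q8 / 2**20:.0f} MiB")
print(f"ratio: {q8 / f16:.3f}")
```

For these dimensions the estimate is 1024 MiB for F16 and 544 MiB for q8_0, a ratio of 17/32 (about 53%). The cache is slightly more than half the F16 size because each q8_0 block carries its own scale.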

Integration points

  • Scheduler. Single-device by default; multi-Metal-device setups are uncommon but supported.
  • build-xcframework.sh — produces a redistributable XCFramework for iOS/macOS apps.
  • examples/llama.swiftui/, examples/batched.swift/ — Swift integration references.

Entry points for modification

  • New shader. Add a kernel to ggml-metal.metal, declare it in ggml-metal-impl.h, dispatch in ggml-metal.m. Test against CPU via tests/test-backend-ops.
  • API change. Edit the Obj-C/Obj-C++ .m/.mm files; keep them ARC-clean.
  • iOS-specific. build-xcframework.sh is the single source of truth for the iOS build; changes go there.
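
Testing a new kernel against the CPU comes down to running the same op on both backends and comparing the outputs within a tolerance. The sketch below shows one common way to do that, using normalized mean squared error. It is an illustration of the idea, not the code in tests/test-backend-ops; the function names, the sample values, and the 1e-7 tolerance are all assumptions, and Python is used only for brevity.

```python
def nmse(reference, candidate):
    """Normalized mean squared error of candidate relative to reference."""
    if len(reference) != len(candidate):
        raise ValueError("length mismatch")
    err = sum((r - c) ** 2 for r, c in zip(reference, candidate))
    norm = sum(r * r for r in reference)
    if norm == 0.0:
        raise ValueError("reference is all zeros; NMSE is undefined")
    return err / norm


def outputs_match(reference, candidate, max_nmse=1e-7):
    # max_nmse is an assumed tolerance for this sketch.
    return nmse(reference, candidate) <= max_nmse


# Hypothetical outputs of the same op from the CPU and from a new GPU kernel.
cpu_out = [0.5, -1.25, 2.0, 3.75]
gpu_ok = [0.5, -1.25, 2.0, 3.7500005]   # differs only by rounding noise
gpu_bad = [0.5, -1.25, 2.0, 3.80]       # a genuinely wrong element

print(outputs_match(cpu_out, gpu_ok))
print(outputs_match(cpu_out, gpu_bad))
```

Normalizing by the reference's energy makes the tolerance independent of the output's scale, which matters when the same check is applied across ops with very different value ranges.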

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
