ggml-org/llama.cpp

CPU backend

Active contributors: Georgi Gerganov

The CPU backend is the reference implementation: tests/test-backend-ops.cpp checks every other backend's results against it. It also runs ops that another backend has not implemented yet; the scheduler falls back to the CPU and copies buffers across backends as needed.
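Checking a backend against a reference can be sketched as an error metric plus a tolerance. The sketch below assumes a normalized mean squared error; the function names and the tolerance are illustrative and are not copied from tests/test-backend-ops.cpp.

```cpp
#include <cstddef>
#include <vector>

// Normalized mean squared error between a reference result and a candidate.
// 0.0 means identical; small values mean the candidate tracks the reference.
// Both vectors are assumed to have the same length.
double nmse(const std::vector<float>& reference, const std::vector<float>& candidate) {
    double err = 0.0;
    double ref_energy = 0.0;
    for (std::size_t i = 0; i < reference.size(); ++i) {
        const double d = double(candidate[i]) - double(reference[i]);
        err += d * d;
        ref_energy += double(reference[i]) * double(reference[i]);
    }
    // Avoid dividing by zero when the reference is all zeros.
    return ref_energy > 0.0 ? err / ref_energy : err;
}

// Hypothetical helper: accept the candidate if shapes match and the error
// stays under a tolerance (the default value is an assumption).
bool matches_reference(const std::vector<float>& reference,
                       const std::vector<float>& candidate,
                       double max_nmse = 1e-7) {
    return reference.size() == candidate.size() &&
           nmse(reference, candidate) <= max_nmse;
}
```

A relative metric like this tolerates small floating-point differences between backends while still catching wrong results.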

Where it lives

ggml/src/
├── ggml.c                  # Core compute graph + many CPU op implementations
├── ggml-quants.c           # Per-quant-type CPU dot products / dequant
├── ggml-cpu/
│   ├── CMakeLists.txt
│   ├── (architecture-specific dirs and files)
│   ├── ops/                # Per-op implementations
│   ├── llamafile/, x86/, arm/, riscv/, powerpc/, loongarch/, spacemit/, ...
│   └── ggml-cpu-impl.h     # Internal types

ggml.c and ggml-quants.c together are the largest pair of files in the repo (~248 KB and ~222 KB). They are kept as plain, portable C; per-architecture SIMD specializations live in subdirectories under ggml-cpu/.

SIMD targets

The CPU backend ships hand-tuned kernels for many ISAs:

| ISA | Notes |
|---|---|
| x86 AVX, AVX2, AVX512, AMX | Most desktop / server Intel and AMD silicon |
| ARM NEON / SVE / SVE2 | Apple Silicon, server ARM, mobile |
| Apple Accelerate | Used on macOS via ggml-blas/ for matmul |
| RISC-V RVV / ZVFH / ZFH / ZICBOP / ZIHINTPAUSE | Per README.md, supported with extensions |
| PowerPC VSX, LoongArch LSX, s390x vector | Per-architecture subdirs |
| SpacemiT (RISC-V) | ggml-cpu/spacemit/ |

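CPU feature selection depends on the build configuration. A default build targets the instruction set of the build host (GGML_NATIVE). When built with -DGGML_BACKEND_DL=ON and -DGGML_CPU_ALL_VARIANTS=ON, the build produces several CPU backend variants and the best match for the running CPU is chosen at startup.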
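Choosing a kernel at startup can be sketched as a function pointer selected once from a CPU feature probe. This is a generic illustration, not ggml's mechanism: dot_unrolled4 stands in for a hand-tuned kernel, and cpu_has_wide_simd is a hypothetical probe that relies on the GCC/Clang builtin __builtin_cpu_supports on x86 and reports false elsewhere.

```cpp
#include <cstddef>

using dot_fn = float (*)(const float*, const float*, std::size_t);

// Portable baseline kernel.
float dot_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

// Stand-in for a tuned variant: four independent accumulators that a
// compiler can map onto SIMD lanes. A real kernel would use intrinsics.
float dot_unrolled4(const float* a, const float* b, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; ++i) sum += a[i] * b[i];  // leftover elements
    return sum;
}

// Hypothetical feature probe (assumption: AVX2 is the feature of interest).
bool cpu_has_wide_simd() {
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

// Called once at startup; every later call goes through the returned pointer.
dot_fn select_dot_kernel() {
    return cpu_has_wide_simd() ? dot_unrolled4 : dot_scalar;
}
```

Resolving the pointer once keeps the feature check out of the hot loop.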

Key abstractions

| Symbol | Role | File |
|---|---|---|
| ggml_backend_cpu_init | Constructor for the CPU backend | ggml/src/ggml-cpu/ggml-cpu.c |
| ggml_compute_forward_* | Per-op implementations | ggml/src/ggml.c, ggml-cpu/ops/ |
| ggml_quantize_chunk / dequant tables | Per-ggml_type blob conversions | ggml/src/ggml-quants.c |
| ggml_threadpool | Lightweight in-house thread pool | ggml/src/ggml-cpu/ggml-cpu.c, ggml/src/ggml-threading.cpp |
| llamafile_* SGEMM | High-perf fp16/bf16/fp32 matmul kernels | ggml/src/ggml-cpu/llamafile/ |
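To make "per-type blob conversions" concrete, here is a minimal sketch of block-wise 8-bit quantization in the spirit of an absmax scheme. The struct layout, the block size of 32, and the float scale are simplifying assumptions for illustration; the real formats and routines live in ggml-quants.c.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBlockSize = 32;  // elements per block (assumed here)

// Hypothetical block layout: one scale plus kBlockSize signed 8-bit values.
struct BlockQ8 {
    float       scale;
    std::int8_t q[kBlockSize];
};

// Quantize kBlockSize floats: scale so the largest magnitude maps to 127.
BlockQ8 quantize_block(const float* x) {
    float amax = 0.0f;
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        amax = std::max(amax, std::fabs(x[i]));
    }
    BlockQ8 out{};
    out.scale = amax / 127.0f;
    const float inv = out.scale > 0.0f ? 1.0f / out.scale : 0.0f;
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        out.q[i] = static_cast<std::int8_t>(std::lround(x[i] * inv));
    }
    return out;
}

// Dequantize back to floats; each value is off by roughly half a step at most.
void dequantize_block(const BlockQ8& b, float* y) {
    for (std::size_t i = 0; i < kBlockSize; ++i) {
        y[i] = b.scale * static_cast<float>(b.q[i]);
    }
}
```

Storing one scale per small block keeps the quantization error proportional to the local magnitude of the data rather than to the largest value in the whole tensor.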

Threading

ggml_threadpool is a custom thread pool built on pthreads (or Win32 threads on Windows), with a low-overhead barrier and per-thread compute splits. The CLI flags -t (threads), -tb (threads for batch processing), --cpu-mask, and --cpu-range map to threadpool configuration in common/arg.cpp.
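A per-thread compute split can be sketched as each worker deriving its own row range from its index and the thread count. The helper names below are hypothetical, and the example runs the slices one after another purely to show that they tile the matrix; in a threadpool each worker would call the slice function with its own index.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct RowRange {
    std::size_t begin;
    std::size_t end;  // exclusive
};

// Rows handled by thread `ith` out of `nth`, using ceiling division so the
// ranges cover every row; trailing threads may receive an empty range.
RowRange thread_row_range(std::size_t n_rows, std::size_t ith, std::size_t nth) {
    const std::size_t per_thread = (n_rows + nth - 1) / nth;
    const std::size_t begin = std::min(per_thread * ith, n_rows);
    const std::size_t end   = std::min(begin + per_thread, n_rows);
    return {begin, end};
}

// Work one thread would do: scale only its own rows of a row-major matrix.
void scale_rows_slice(std::vector<float>& m, std::size_t n_rows, std::size_t n_cols,
                      float factor, std::size_t ith, std::size_t nth) {
    const RowRange r = thread_row_range(n_rows, ith, nth);
    for (std::size_t row = r.begin; row < r.end; ++row) {
        for (std::size_t col = 0; col < n_cols; ++col) {
            m[row * n_cols + col] *= factor;
        }
    }
}
```

Because the ranges are disjoint, workers can write their rows without locking and only need to synchronize at a barrier once the op is finished.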

The threadpool can be persistent: a binary that calls ggml_threadpool_new once and reuses it across many decodes avoids the thread-creation overhead on each call.
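The benefit of a persistent pool can be illustrated with a small standalone sketch: threads are created once and woken for each batch of work. This is not ggml's implementation (which is written in C with its own barrier); PersistentPool and its members are hypothetical names, and the sketch uses a mutex and condition variables for clarity rather than speed.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class PersistentPool {
public:
    using Task = std::function<void(std::size_t ith, std::size_t nth)>;

    // Threads are created once, here, and reused by every run() call.
    explicit PersistentPool(std::size_t n_threads) {
        for (std::size_t i = 0; i < n_threads; ++i) {
            workers_.emplace_back([this, i, n_threads] { worker_loop(i, n_threads); });
        }
    }

    ~PersistentPool() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stop_ = true;
        }
        wake_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // Run `task` once on every worker and block until all have finished.
    void run(const Task& task) {
        std::unique_lock<std::mutex> lock(mu_);
        task_ = &task;
        pending_ = workers_.size();
        ++generation_;  // signals "new work" to every worker
        wake_.notify_all();
        done_.wait(lock, [this] { return pending_ == 0; });
        task_ = nullptr;
    }

private:
    void worker_loop(std::size_t ith, std::size_t nth) {
        std::size_t seen = 0;  // last generation this worker executed
        std::unique_lock<std::mutex> lock(mu_);
        for (;;) {
            wake_.wait(lock, [&] { return stop_ || generation_ != seen; });
            if (stop_) return;
            seen = generation_;
            const Task* task = task_;
            lock.unlock();
            (*task)(ith, nth);  // run outside the lock
            lock.lock();
            if (--pending_ == 0) done_.notify_one();
        }
    }

    std::mutex mu_;
    std::condition_variable wake_;
    std::condition_variable done_;
    const Task* task_ = nullptr;
    std::size_t pending_ = 0;
    std::size_t generation_ = 0;
    bool stop_ = false;
    std::vector<std::thread> workers_;
};
```

Calling run() repeatedly on one pool pays the thread-creation cost only in the constructor, which is the saving described above for a binary that creates its threadpool once and reuses it across decodes.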

Optional integrations

  • BLAS (-DGGML_BLAS=ON) — link OpenBLAS / MKL / Accelerate for prompt-processing matmul. See ggml/src/ggml-blas/.
  • llamafile SGEMM — vendored high-performance matmul, automatically used on supported targets.
  • OpenMP — optional threading model on platforms that benefit from it (-DGGML_OPENMP=ON).

Integration points

  • Scheduler: CPU is the fallback for ops a chosen backend doesn't support; cross-backend buffer copies happen automatically.
  • Quantization driver: llama-quantize runs CPU-only.
  • Tests: test-backend-ops uses CPU as the reference.
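The scheduler fallback in the first bullet can be pictured with a toy placement pass: each op goes to the preferred backend if that backend supports it, otherwise to the CPU, and every change of backend between consecutive ops marks a point where tensor data would have to be copied. The types, function names, and op strings below are hypothetical; the real scheduler works on graph splits and is considerably more involved.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical description of a non-CPU backend and the ops it implements.
struct Backend {
    std::string name;
    std::set<std::string> supported_ops;
};

// Use the preferred backend when it supports the op, otherwise fall back to CPU.
std::string assign_backend(const Backend& preferred, const std::string& op) {
    return preferred.supported_ops.count(op) ? preferred.name : "CPU";
}

struct Plan {
    std::vector<std::string> placement;  // backend chosen for each op, in order
    std::size_t n_copies;                // backend switches between consecutive ops
};

// Place a linear chain of ops and count where cross-backend copies would occur.
Plan plan_graph(const Backend& preferred, const std::vector<std::string>& ops) {
    Plan plan{{}, 0};
    for (const std::string& op : ops) {
        const std::string target = assign_backend(preferred, op);
        if (!plan.placement.empty() && plan.placement.back() != target) {
            ++plan.n_copies;
        }
        plan.placement.push_back(target);
    }
    return plan;
}
```

In this model a single unsupported op in the middle of a chain costs two copies, one into CPU memory and one back out.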

Entry points for modification

  • New op. Add a CPU reference implementation in ggml.c (or ggml-cpu/ops/), then add a test case in test-backend-ops.cpp.
  • New SIMD target. Add a directory ggml-cpu/<arch>/ and gate it with CMake feature checks.
  • Threadpool tuning. ggml/src/ggml-cpu/ggml-cpu.c and ggml/src/ggml-threading.cpp.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
