ggml-org/llama.cpp

CPU backend

Active contributors: Georgi Gerganov

The CPU backend is the reference implementation: tests/test-backend-ops.cpp checks every other backend's results against it. It also runs ops that other backends have not implemented yet, with the scheduler handling the cross-backend buffer copies.
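
As an illustration of what "judged against the CPU backend" can mean numerically, here is a minimal sketch of an output comparison based on normalized mean squared error. The function names and the `max_err` tolerance are hypothetical and chosen for this example; the actual metric and thresholds are defined in tests/test-backend-ops.cpp.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Normalized mean squared error between a reference output `ref`
 * (e.g. produced by the CPU backend) and a candidate output `out`
 * (e.g. produced by another backend). 0.0 means identical results. */
double nmse(const float *ref, const float *out, size_t n) {
    assert(ref != NULL && out != NULL);
    double err  = 0.0;  /* sum of squared differences      */
    double norm = 0.0;  /* sum of squared reference values */
    for (size_t i = 0; i < n; i++) {
        double d = (double)out[i] - (double)ref[i];
        err  += d * d;
        norm += (double)ref[i] * (double)ref[i];
    }
    if (norm == 0.0) {
        /* All-zero reference: only an all-zero candidate matches. */
        return err == 0.0 ? 0.0 : HUGE_VAL;
    }
    return err / norm;
}

/* Hypothetical acceptance check: the caller picks the tolerance. */
int outputs_match(const float *ref, const float *out, size_t n, double max_err) {
    return nmse(ref, out, n) <= max_err;
}
```

Normalizing by the reference energy makes the tolerance independent of the tensor's scale, which matters when the same check is applied to ops with very different output magnitudes.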

Where it lives

ggml/src/
├── ggml.c                  # Core compute graph + many CPU op implementations
├── ggml-quants.c           # Per-quant-type CPU dot products / dequant
├── ggml-cpu/
│   ├── CMakeLists.txt
│   ├── (architecture-specific dirs and files)
│   ├── ops/                # Per-op implementations
│   ├── llamafile/, x86/, arm/, riscv/, powerpc/, loongarch/, spacemit/, ...
│   └── ggml-cpu-impl.h     # Internal types

ggml.c and ggml-quants.c together are the largest pair of files in the repo (~248 KB and ~222 KB). They are kept as plain, portable C; per-architecture SIMD specializations live in subdirectories under ggml-cpu/.

SIMD targets

The CPU backend ships hand-tuned kernels for many ISAs:

ISA / library	Notes
x86 AVX, AVX2, AVX512, AMX	Most desktop / server Intel and AMD silicon
ARM NEON / SVE / SVE2	Apple Silicon, server ARM, mobile
Apple Accelerate	Used on macOS via `ggml-blas/` for matmul
RISC-V RVV / ZVFH / ZFH / ZICBOP / ZIHINTPAUSE	Extensions listed as supported in `README.md`
PowerPC VSX, LoongArch LSX, s390x vector	Per-architecture subdirs
SpacemiT (RISC-V)	`ggml-cpu/spacemit/`

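How CPU features are selected depends on the build configuration. By default (GGML_NATIVE=ON) the build targets the host CPU's instruction set at compile time. When built with GGML_BACKEND_DL and GGML_CPU_ALL_VARIANTS, several variants of the CPU backend are built and the best one the running CPU supports is chosen at startup.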
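
Choosing a kernel at startup can be sketched with a function pointer that is set once from a CPU feature query. This is a simplified illustration, not how ggml implements it: `dot_unrolled4` is a plain-C stand-in for an ISA-specific kernel, all names are hypothetical, and the feature check relies on the GCC/Clang x86 builtin `__builtin_cpu_supports`, falling back to the scalar path elsewhere.

```c
#include <assert.h>
#include <stddef.h>

typedef float (*dot_fn)(const float *a, const float *b, size_t n);

/* Portable scalar baseline. */
float dot_scalar(const float *a, const float *b, size_t n) {
    assert(a != NULL && b != NULL);
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += a[i] * b[i];
    return sum;
}

/* Stand-in for an ISA-specific kernel: four independent accumulators,
 * the shape a SIMD version takes. A real kernel would use intrinsics. */
float dot_unrolled4(const float *a, const float *b, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    float sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++) sum += a[i] * b[i];  /* leftover elements */
    return sum;
}

/* Runtime feature query; returns 0 on compilers/targets without the builtin. */
int cpu_has_avx2(void) {
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return 0;
#endif
}

/* Pick the kernel once at startup and reuse the pointer afterwards. */
dot_fn select_dot_kernel(void) {
    return cpu_has_avx2() ? dot_unrolled4 : dot_scalar;
}
```

Both kernels must produce matching results on any machine, which is the same property the scalar reference path provides for the real SIMD specializations.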

Key abstractions

Symbol	Role	File
`ggml_backend_cpu_init`	Constructor for the CPU backend	`ggml/src/ggml-cpu/ggml-cpu.c`
`ggml_compute_forward_*`	Per-op implementations	`ggml/src/ggml.c`, `ggml-cpu/ops/`
`ggml_quantize_chunk` / dequant tables	Conversions between float data and each `ggml_type`'s quantized block format	`ggml/src/ggml-quants.c`
`ggml_threadpool`	Lightweight in-house thread pool	`ggml/src/ggml-cpu/ggml-cpu.c`, `ggml/src/ggml-threading.cpp`
`llamafile_*` SGEMM	High-perf fp16/bf16/fp32 matmul kernels	`ggml/src/ggml-cpu/llamafile/`
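
To make the quantize / dequantize / dot-product roles concrete, here is a minimal sketch of an 8-bit block format in the spirit of ggml's Q8_0: 32 values share one scale and are stored as signed bytes. The struct and function names are hypothetical, and the scale is kept as a `float` for simplicity (ggml stores it as fp16); the real layouts and kernels live in ggml-quants.c.

```c
#include <assert.h>
#include <stdint.h>

#define QK 32  /* values per block (assumption for this sketch) */

typedef struct {
    float  d;       /* per-block scale                 */
    int8_t qs[QK];  /* quantized values in [-127, 127] */
} block_q8_sketch;

/* Quantize QK floats: the scale maps the largest magnitude to 127. */
void quantize_block(const float *x, block_q8_sketch *out) {
    assert(x != 0 && out != 0);
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        float ax = x[i] < 0.0f ? -x[i] : x[i];
        if (ax > amax) amax = ax;
    }
    out->d = amax / 127.0f;
    float id = out->d != 0.0f ? 1.0f / out->d : 0.0f;  /* inverse scale */
    for (int i = 0; i < QK; i++) {
        float v = x[i] * id;
        /* round half away from zero without needing libm */
        out->qs[i] = (int8_t)(v + (v >= 0.0f ? 0.5f : -0.5f));
    }
}

/* Dequantize: multiply each stored byte by the block scale. */
void dequantize_block(const block_q8_sketch *in, float *y) {
    for (int i = 0; i < QK; i++) y[i] = in->d * (float)in->qs[i];
}

/* Dot product of two quantized blocks: integer accumulate, scale once. */
float dot_block(const block_q8_sketch *a, const block_q8_sketch *b) {
    int32_t isum = 0;
    for (int i = 0; i < QK; i++) isum += (int32_t)a->qs[i] * (int32_t)b->qs[i];
    return a->d * b->d * (float)isum;
}
```

Accumulating in integers and applying the two scales once per block is what makes the quantized dot product cheap; the per-element rounding error is bounded by half the block scale.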

Threading

ggml_threadpool is a custom implementation built on pthreads (or Win32 threads on Windows), with a low-overhead barrier and per-thread compute splits. The CLI flags -t (threads), -tb (threads for batch processing), and --cpu-mask / --cpu-range are parsed in common/arg.cpp and mapped to threadpool configuration.

The threadpool can be persistent: a binary that calls ggml_threadpool_new once and reuses it across many decodes avoids the thread-creation overhead on each call.
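
The "per-thread compute splits" mentioned above can be sketched as follows: each worker receives its index `ith` and the worker count `nth`, and derives a disjoint range of rows to process. The helper names here are hypothetical and the example contains no actual threading; it only shows the partitioning arithmetic, under the assumption that work is divided by rows.

```c
#include <assert.h>

typedef struct { int begin; int end; } row_range;  /* half-open [begin, end) */

/* Rows assigned to worker `ith` out of `nth` workers, for `nr` rows in total. */
row_range thread_rows(int nr, int ith, int nth) {
    assert(nr >= 0 && nth > 0 && ith >= 0 && ith < nth);
    int dr = (nr + nth - 1) / nth;  /* rows per worker, rounded up */
    row_range r;
    r.begin = dr * ith;
    if (r.begin > nr) r.begin = nr; /* more workers than rows: empty range */
    r.end = r.begin + dr;
    if (r.end > nr) r.end = nr;
    return r;
}

/* Example kernel: each worker scales only the rows it owns.
 * `data` is a row-major nr x nc matrix. */
void scale_rows(float *data, int nr, int nc, float s, int ith, int nth) {
    row_range r = thread_rows(nr, ith, nth);
    for (int row = r.begin; row < r.end; row++) {
        for (int col = 0; col < nc; col++) {
            data[row * nc + col] *= s;
        }
    }
}
```

Because the ranges are disjoint and cover every row exactly once, workers need no locking while computing; they only have to meet at a barrier before the next op starts.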

Optional integrations

BLAS (-DGGML_BLAS=ON) — link OpenBLAS / MKL / Accelerate for prompt-processing matmul. See ggml/src/ggml-blas/.
llamafile SGEMM — vendored high-performance matmul, automatically used on supported targets.
OpenMP — optional threading model on platforms that benefit from it (-DGGML_OPENMP=ON).

Integration points

Scheduler — CPU is the fallback for ops a chosen backend doesn't support; cross-backend buffer copies happen automatically.
Quantization driver — llama-quantize runs CPU-only.
Tests — test-backend-ops uses CPU as the reference.
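
The fallback behaviour described for the scheduler can be sketched as a priority search with the CPU as the last resort. Everything below is hypothetical (the op set, the `backend` struct, and both helper functions); the real logic lives in ggml's backend scheduler and is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_ADD, OP_MUL_MAT, OP_CUSTOM } op_kind;  /* hypothetical op set */

typedef struct {
    const char *name;
    int (*supports_op)(op_kind op);  /* nonzero if the backend can run `op` */
} backend;

int cpu_supports(op_kind op)   { (void)op; return 1; }  /* reference: runs everything */
int accel_supports(op_kind op) { return op == OP_ADD || op == OP_MUL_MAT; }

/* Index of the backend that should run `op`: the first one, in priority
 * order, that supports it. The last entry is assumed to be the CPU backend. */
size_t assign_backend(const backend *backends, size_t n, op_kind op) {
    assert(backends != NULL && n > 0);
    for (size_t i = 0; i + 1 < n; i++) {
        if (backends[i].supports_op(op)) return i;
    }
    return n - 1;  /* CPU fallback */
}

/* Number of backend switches along a linear op sequence; each switch
 * implies a cross-backend buffer copy. */
size_t count_copies(const backend *backends, size_t n,
                    const op_kind *ops, size_t n_ops) {
    size_t copies = 0;
    for (size_t i = 1; i < n_ops; i++) {
        if (assign_backend(backends, n, ops[i]) !=
            assign_backend(backends, n, ops[i - 1])) {
            copies++;
        }
    }
    return copies;
}
```

Counting the switches shows the cost of a fallback: a single unsupported op in the middle of a graph forces two copies, one to the CPU and one back.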

Entry points for modification

New op. Add CPU reference in ggml.c (or ggml-cpu/ops/). Add a test case in test-backend-ops.cpp.
New SIMD target. Add a directory under ggml-cpu/<arch>/ and gate with CMake feature checks.
Threadpool tuning. ggml/src/ggml-cpu/ggml-cpu.c and ggml/src/ggml-threading.cpp.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.