ggml-org/llama.cpp
CPU backend
Active contributors: Georgi Gerganov
The CPU backend is the reference implementation. Every other backend's correctness is judged against it (tests/test-backend-ops.cpp). It also handles fall-through ops that other backends haven't implemented yet, via cross-backend buffer copies in the scheduler.
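Judging another backend against the CPU reference comes down to running the same op on both and comparing the outputs numerically. The sketch below uses a normalized mean squared error for that comparison. The function names and the tolerance passed by the caller are illustrative assumptions, not code taken from tests/test-backend-ops.cpp.

```c
#include <stddef.h>

/* Normalized mean squared error between a reference output `ref`
 * (for example, produced by the CPU backend) and a candidate output `out`.
 * Computes sum((out - ref)^2) / sum(ref^2); 0.0 means the outputs are identical. */
double nmse(const float *ref, const float *out, size_t n) {
    double err = 0.0;
    double norm = 0.0;
    for (size_t i = 0; i < n; i++) {
        const double d = (double)out[i] - (double)ref[i];
        err  += d * d;
        norm += (double)ref[i] * (double)ref[i];
    }
    /* If the reference is all zeros, fall back to the raw squared error. */
    return norm > 0.0 ? err / norm : err;
}

/* Hypothetical pass/fail helper: the caller picks the tolerance. */
int outputs_match(const float *ref, const float *out, size_t n, double max_nmse) {
    return nmse(ref, out, n) <= max_nmse;
}
```

A relative metric like this tolerates the small rounding differences that arise when a backend accumulates in a different order, while still flagging a genuinely wrong result.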
Where it lives
ggml/src/
├── ggml.c # Core compute graph + many CPU op implementations
├── ggml-quants.c # Per-quant-type CPU dot products / dequant
├── ggml-cpu/
│ ├── CMakeLists.txt
│ ├── (architecture-specific dirs and files)
│ ├── ops/ # Per-op implementations
│ ├── llamafile/, x86/, arm/, riscv/, powerpc/, loongarch/, spacemit/, ...
│ └── ggml-cpu-impl.h # Internal types
ggml.c and ggml-quants.c together are the largest pair of files in the repo (~248 KB and ~222 KB). They are kept as plain, portable C; per-architecture SIMD specializations live in subdirectories under ggml-cpu/.
SIMD targets
The CPU backend ships hand-tuned kernels for many ISAs:
| ISA | Notes |
|---|---|
| x86 AVX, AVX2, AVX512, AMX | Most desktop / server Intel and AMD silicon |
| ARM NEON / SVE / SVE2 | Apple Silicon, server ARM, mobile |
| Apple Accelerate | Used on macOS via ggml-blas/ for matmul |
| RISC-V RVV / ZVFH / ZFH / ZICBOP / ZIHINTPAUSE | Per README.md, supported with extensions |
| PowerPC VSX, LoongArch LSX, s390x vector | Per-architecture subdirs |
| SpacemiT (RISC-V) | ggml-cpu/spacemit/ |
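CPU feature detection can happen at build time or at run time. A native build (GGML_NATIVE) targets the host CPU's instruction set when compiling. When the CPU backend is built as loadable variants (GGML_BACKEND_DL together with GGML_CPU_ALL_VARIANTS), several variants are built and the best one the running CPU supports is chosen at startup.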
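Choosing a kernel at run time usually means probing the CPU once and keeping a function pointer to the best available implementation. The following is a minimal sketch of that pattern, assuming GCC or Clang on x86 for the feature probe. All names here (dot_fn, dot_wide, select_dot_kernel) are hypothetical, and dot_wide is a plain-C stand-in for a real SIMD kernel, not how ggml implements its variants.

```c
#include <stddef.h>

typedef float (*dot_fn)(const float *a, const float *b, size_t n);

/* Portable fallback: one multiply-add per iteration. */
static float dot_scalar(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* Stand-in for a SIMD kernel: four independent accumulators in plain C.
 * A real variant would use intrinsics and be compiled with matching flags. */
static float dot_wide(const float *a, const float *b, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++) {
        s0 += a[i] * b[i];   /* leftover elements */
    }
    return (s0 + s1) + (s2 + s3);
}

/* Returns 1 when the running CPU reports AVX2, 0 otherwise
 * (including on non-x86 targets or unsupported compilers). */
static int cpu_has_avx2(void) {
#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;
#endif
}

/* Probe once and return the kernel to use for the rest of the process. */
dot_fn select_dot_kernel(void) {
    return cpu_has_avx2() ? dot_wide : dot_scalar;
}
```

A caller would store the result of select_dot_kernel() at startup and call through the pointer afterwards, so the probe cost is paid only once.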
Key abstractions
| Type | Role | File |
|---|---|---|
| ggml_backend_cpu_init | Constructor for the CPU backend | ggml/src/ggml-cpu/ggml-cpu.c |
| ggml_compute_forward_* | Per-op implementations | ggml/src/ggml.c, ggml-cpu/ops/ |
| ggml_quantize_chunk / dequant tables | Per-ggml_type blob conversions | ggml/src/ggml-quants.c |
| ggml_threadpool | Lightweight in-house thread pool | ggml/src/ggml-cpu/ggml-cpu.c, ggml/src/ggml-threading.cpp |
| llamafile_* SGEMM | High-perf fp16/bf16/fp32 matmul kernels | ggml/src/ggml-cpu/llamafile/ |
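To make the per-type quantize/dequantize conversions more concrete, here is a simplified sketch of 8-bit block quantization: each block of values shares one scale and each value is stored as a signed byte. The block size QK, the block_q8 struct, and the function names are illustrative assumptions. The real formats in ggml-quants.c differ in detail (for example, in how the scale is stored).

```c
#include <stdint.h>

#define QK 32  /* values per block (assumed for this sketch) */

typedef struct {
    float  d;        /* per-block scale, kept as float here for simplicity */
    int8_t qs[QK];   /* quantized values in [-127, 127] */
} block_q8;

/* Quantize QK floats: scale so the largest magnitude maps to +/-127. */
void quantize_block(const float *x, block_q8 *out) {
    float amax = 0.0f;
    for (int i = 0; i < QK; i++) {
        const float v = x[i] < 0.0f ? -x[i] : x[i];
        if (v > amax) {
            amax = v;
        }
    }
    const float d  = amax / 127.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;  /* all-zero block stays zero */
    out->d = d;
    for (int i = 0; i < QK; i++) {
        const float v = x[i] * id;
        /* Round half away from zero without depending on libm. */
        out->qs[i] = (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
    }
}

/* Dequantize: multiply each stored byte by the block scale. */
void dequantize_block(const block_q8 *in, float *y) {
    for (int i = 0; i < QK; i++) {
        y[i] = in->d * (float)in->qs[i];
    }
}
```

With this scheme the round-trip error per value is bounded by about half the block scale, which is why a block with one large outlier loses precision on its small values.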
Threading
ggml_threadpool is custom — it uses pthreads/Win32 threads with a low-overhead barrier and per-thread compute splits. CLI flags -t (threads), -tb (threads for batch processing), and --cpu-mask / --cpu-range map to threadpool configuration in common/arg.cpp.
The threadpool can be persistent: a binary that calls ggml_threadpool_new once and reuses it across many decodes avoids the thread-creation overhead on each call.
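The "per-thread compute splits" mentioned above can be pictured as each worker deriving its own slice of rows from its thread index. The sketch below shows one common way to do that split. The function names are hypothetical, and the threads are represented only by the ith/nth arguments so that the example stays free of any threading library.

```c
/* Half-open row range [*ir0, *ir1) for thread `ith` of `nth`, given `nr` rows.
 * Rows are divided into equal chunks, rounded up; trailing threads may get
 * fewer rows or none at all. */
void thread_row_range(int nr, int ith, int nth, int *ir0, int *ir1) {
    const int dr = (nr + nth - 1) / nth;  /* rows per thread, rounded up */
    const int lo = dr * ith;
    const int hi = lo + dr;
    *ir0 = lo < nr ? lo : nr;
    *ir1 = hi < nr ? hi : nr;
}

/* Work that one thread would do: scale only its own rows of an
 * nr x nc row-major matrix. No locking is needed because ranges are disjoint. */
void scale_rows(float *data, int nr, int nc, float s, int ith, int nth) {
    int ir0;
    int ir1;
    thread_row_range(nr, ith, nth, &ir0, &ir1);
    for (int r = ir0; r < ir1; r++) {
        for (int c = 0; c < nc; c++) {
            data[r * nc + c] *= s;
        }
    }
}
```

Because every worker computes its range from (ith, nth) alone, a persistent pool only needs to wake its threads and synchronize at a barrier; no per-call work queue has to be built.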
Optional integrations
- BLAS (-DGGML_BLAS=ON): link OpenBLAS / MKL / Accelerate for prompt-processing matmul. See ggml/src/ggml-blas/.
- llamafile SGEMM: vendored high-performance matmul, automatically used on supported targets.
- OpenMP: optional threading model on platforms that benefit from it (-DGGML_OPENMP=ON).
Integration points
- Scheduler: CPU is the fallback for ops a chosen backend doesn't support; cross-backend buffer copies happen automatically.
- Quantization driver: llama-quantize runs CPU-only.
- Tests: test-backend-ops uses CPU as the reference.
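The scheduler fallback described in the first bullet can be illustrated with a toy model: each op goes to the accelerator if it is supported there, otherwise to the CPU, and every change of backend between consecutive ops implies a buffer copy. Everything in this sketch (the op and backend enums, demo_accel_supports, assign_backends) is hypothetical. The real scheduler works on graph splits and is considerably more involved.

```c
#include <stddef.h>

enum { BACKEND_CPU = 0, BACKEND_ACCEL = 1 };
enum { OP_ADD = 0, OP_MATMUL = 1, OP_CUSTOM = 2 };

/* Capability callback: nonzero if the accelerator implements `op`. */
typedef int (*supports_fn)(int op);

/* Example capability table: this made-up accelerator lacks OP_CUSTOM. */
int demo_accel_supports(int op) {
    return op != OP_CUSTOM;
}

/* Assign each op to the accelerator when supported, else to the CPU.
 * Returns the number of backend switches between consecutive ops; in this
 * toy model each switch stands for one cross-backend buffer copy. */
size_t assign_backends(const int *ops, size_t n, supports_fn accel_supports,
                       int *assignment) {
    size_t switches = 0;
    for (size_t i = 0; i < n; i++) {
        assignment[i] = accel_supports(ops[i]) ? BACKEND_ACCEL : BACKEND_CPU;
        if (i > 0 && assignment[i] != assignment[i - 1]) {
            switches++;
        }
    }
    return switches;
}
```

In this model a single unsupported op in the middle of a graph costs two copies (to the CPU and back), which suggests why implementing a missing op natively can matter more than the op's own cost.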
Entry points for modification
- New op. Add CPU reference in ggml.c (or ggml-cpu/ops/). Add a test case in test-backend-ops.cpp.
- New SIMD target. Add a directory under ggml-cpu/<arch>/ and gate with CMake feature checks.
- Threadpool tuning. ggml/src/ggml-cpu/ggml-cpu.c and ggml/src/ggml-threading.cpp.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.