ggml-org/llama.cpp
llama.cpp
llama.cpp is a C/C++ inference engine for large language models. It runs transformer-style models on commodity hardware — laptops, desktops, phones, edge devices, and servers — with support for CPU, NVIDIA, AMD, Apple Silicon, Intel, Qualcomm, Vulkan, WebGPU, and several other accelerator backends. The project is the upstream playground for the ggml tensor library and is the reference implementation for the GGUF model format.
The project ships:
- A C library (
libllama) with a stable public header atinclude/llama.h - A tensor compute library (
libggml) underggml/with one backend per accelerator family - A set of standalone command-line tools under
tools/(chat CLI, OpenAI-compatible HTTP server, quantizer, perplexity calculator, benchmarks, multimodal CLI, etc.) - Python conversion utilities (
convert_hf_to_gguf.py,gguf-py/) that turn HuggingFace checkpoints into GGUF files - Bindings, Docker images, and packaging shims for many languages and platforms
What this wiki covers
- Architecture — how the C library, tensor backends, and command-line tools fit together.
- Getting started — clone, build, download a model, run inference.
- Glossary — project-specific vocabulary (GGUF, imatrix, mtmd, KV cache, MoE, etc.).
- By the numbers — codebase size, churn, and history snapshot.
- Lore — the project's timeline and major eras.
- How to contribute — contributor workflow, testing, debugging, and conventions.
- Systems — internal building blocks of
libllama(model loader, KV cache, graph, sampler, vocab, grammar, chat). - Tools — the command-line binaries the project ships (
llama-cli,llama-server,llama-quantize,llama-bench,llama-mtmd-cli, ...). - Backends — the GGML compute backends (CPU, CUDA, Metal, Vulkan, SYCL, OpenCL, RPC, WebGPU, ...).
- Packages — the in-tree libraries (
common/,ggml/,gguf-py/). - API — the public C API surface.
- Reference — configuration, data models, and dependencies.
- Maintainers — subsystem ownership.
Who uses this
llama.cpp is consumed by:
- End users running models locally via the
llama-cliandllama-serverbinaries or the WebUI shipped undertools/server/webui - Library consumers linking against
libllamafrom C, C++, or via the third-party language bindings listed inREADME.md - Downstream projects that reuse only
libggmlfor tensor compute (Whisper.cpp, stable-diffusion.cpp, the wider GGML ecosystem)
License
llama.cpp is MIT-licensed. See LICENSE. The vendored third-party code under vendor/ carries its own licenses recorded in licenses/.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
Next
Architecture