ggml-org/llama.cpp

llama.cpp

llama.cpp is a C/C++ inference engine for large language models. It runs transformer-style models on commodity hardware — laptops, desktops, phones, edge devices, and servers — with support for CPU, NVIDIA, AMD, Apple Silicon, Intel, Qualcomm, Vulkan, WebGPU, and several other accelerator backends. The project is the upstream playground for the ggml tensor library and is the reference implementation for the GGUF model format.

The project ships:

A C library (libllama) with a stable public header at include/llama.h
A tensor compute library (libggml) under ggml/ with one backend per accelerator family
A set of standalone command-line tools under tools/ (chat CLI, OpenAI-compatible HTTP server, quantizer, perplexity calculator, benchmarks, multimodal CLI, etc.)
Python conversion utilities (convert_hf_to_gguf.py, gguf-py/) that turn HuggingFace checkpoints into GGUF files
Bindings, Docker images, and packaging shims for many languages and platforms

What this wiki covers

Architecture — how the C library, tensor backends, and command-line tools fit together.
Getting started — clone, build, download a model, run inference.
Glossary — project-specific vocabulary (GGUF, imatrix, mtmd, KV cache, MoE, etc.).
By the numbers — codebase size, churn, and history snapshot.
Lore — the project's timeline and major eras.
How to contribute — contributor workflow, testing, debugging, and conventions.
Systems — internal building blocks of libllama (model loader, KV cache, graph, sampler, vocab, grammar, chat).
Tools — the command-line binaries the project ships (llama-cli, llama-server, llama-quantize, llama-bench, llama-mtmd-cli, ...).
Backends — the GGML compute backends (CPU, CUDA, Metal, Vulkan, SYCL, OpenCL, RPC, WebGPU, ...).
Packages — the in-tree libraries (common/, ggml/, gguf-py/).
API — the public C API surface.
Reference — configuration, data models, and dependencies.
Maintainers — subsystem ownership.

Who uses this

llama.cpp is consumed by:

End users running models locally via the llama-cli and llama-server binaries or the WebUI shipped under tools/server/webui
Library consumers linking against libllama from C, C++, or via the third-party language bindings listed in README.md
Downstream projects that reuse only libggml for tensor compute (Whisper.cpp, stable-diffusion.cpp, the wider GGML ecosystem)

License

llama.cpp is MIT-licensed. See LICENSE. The vendored third-party code under vendor/ carries its own licenses recorded in licenses/.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.