ggml-org/llama.cpp
Lore
The story of llama.cpp is the story of one repository that quickly became the standard way to run large language models on consumer hardware. What started as a Sunday-afternoon hack to get LLaMA to run on a MacBook is now the upstream reference for the GGUF format, a multi-backend tensor compute stack, and an OpenAI-compatible server.
Dates below are derived from git tags and commit timestamps in the ggml-org/llama.cpp history.
Eras
The original C/C++ port (Mar 2023)
The first commit, Initial release, lands on 2023-03-10 — only days after Meta's LLaMA weights leak. The repository's purpose at this stage is narrow: read a LLaMA checkpoint, run it on CPU with manual quantization, and prove that 4-bit weights give usable inference on a laptop. Most of ggml.c and llama.cpp are written by Georgi Gerganov in the first weeks. The code is plain C++ with hand-rolled tensor types, no GPU support, and a single executable.
The format wars: GGML → GGJT → GGUF (May–Aug 2023)
As more models join the project (Alpaca, Vicuna, Falcon, MPT, StarCoder), the ad-hoc model file format becomes a maintenance problem. A pair of format migrations follows: the original GGML binary becomes GGJT to add proper alignment and metadata, then GGUF lands in Aug 2023 as a single-file, self-describing, versioned container. GGUF is what survives — every modern checkpoint, every quantization, every multi-modal projector ships as .gguf. The reader/writer at ggml/src/gguf.cpp and the Python writer in gguf-py/ are the still-living artifacts.
Backend explosion (Mid-2023 – 2024)
The CPU-only era gives way to a multi-backend era as community contributors port GGML's compute graph to every accelerator they can find. In rough chronological order: Metal (Apple Silicon, May–Jun 2023), CUDA (NVIDIA, May 2023 onward, eventually growing into the largest backend by file count), OpenCL (deprecated and reborn for Qualcomm), CLBlast / CuBLAS (Jun 2023), HIP / ROCm (AMD), Vulkan (cross-vendor, 2024), SYCL / oneAPI (Intel), Hexagon / Snapdragon (Qualcomm), CANN (Huawei Ascend), MUSA (Moore Threads), WebGPU, OpenVINO, and zDNN (IBM Z). The runtime backend registry in ggml-backend-reg.cpp and the dynamic-loader plumbing in ggml-backend-dl.cpp are direct consequences of this growth.
The HTTP server arrives (mid-2023, ongoing)
The project ships an OpenAI-compatible server (tools/server) early on. Over time it grows a slot-based scheduler (server-queue.cpp, server-context.cpp), tools/function-calling, JSON-schema-constrained output, prompt caching, embeddings and reranking endpoints, and a built-in WebUI under tools/server/webui/. The server is one of the largest single tools in the tree and now drives most production usage.
Architectures, architectures, architectures (2023–present)
src/llama-model.cpp and src/models/ accumulate per-architecture builders for nearly every transformer family released in the last three years: LLaMA 1/2/3, Mistral, Mixtral and other MoEs (DBRX, Qwen-MoE, OLMoE, Snowflake-Arctic), Falcon, MPT, GPT-NeoX, Phi (and PhiMoE), Gemma 1/2/3, Qwen 1/2/3, Command-R, DeepSeek (incl. V2/V3), Mamba, RWKV-6/7, GLM, Yi, Granite, Bitnet, Hunyuan, LFM2, and many more. Each new family typically arrives in three pieces: a Python conversion path in convert_hf_to_gguf.py, an arch entry in src/llama-arch.cpp, and a graph builder in src/models/.
Multimodal goes mainstream (2024–2025)
The multimodal stack starts as the standalone llava tool and matures into the mtmd ("multi-modal data") subsystem under tools/mtmd/. It now supports vision encoders (CLIP-style, derived from LLaVA, MobileVLM, LLaVA-NeXT, Qwen2-VL, MiniCPM-V, Gemma-3, GLM-EDGE, Moondream, LFM2-VL, ...) and an audio path (Whisper-style encoders) used for speech-to-text and TTS. As of mid-2025 the HTTP server gained native multimodal support (PR #12898).
Quantization & quality (continuous)
Quantization is the project's defining feature. Over time the formats grow from naive 4-bit blocks to a zoo: Q2_K through Q8_K (k-quants, 2023), IQ-series (importance-matrix-aware 1.5–4-bit quants), MXFP4 (added 2025 with NVIDIA collaboration for gpt-oss), and per-tensor mixes selected by llama-quantize. The llama-imatrix tool produces per-tensor activation statistics that bias quantization toward important channels, and llama-perplexity is the standard quality yardstick.
Longest-standing features
| Component | First seen | Notes |
|---|---|---|
ggml.c core compute |
2023-03 | The bones of the project. Survived every backend migration. |
llama.cpp (now src/llama.cpp) |
2023-03 | The C API entry point. Has been refactored repeatedly — most logic now lives in llama-context, llama-model, etc. — but the public symbols in include/llama.h remain. |
Q4_0 quantization |
2023-03 | The original 4-bit block format. Still loadable, though k-quants and IQ-series are preferred. |
convert_legacy_llama.py |
2023-03 | Conversion script for the original LLaMA weights. Lives in examples/ as a historical artifact. |
Deprecated features
- GGML / GGJT model files — replaced by GGUF in Aug 2023. The
examples/deprecation-warning/binary exists solely to print a friendly message when someone runs an old executable name. mainandserverlegacy binary names — renamed tollama-cliandllama-server. The deprecation-warning binary points users at the new names; seetools/mtmd/deprecation-warning.cppandexamples/deprecation-warning/.llavastandalone tool — folded into themtmdstack undertools/mtmd/legacy-models/.finetune/train-text-from-scratch— removed; training now lives inexamples/training/on top ofggml-opt.- CLBlast / CuBLAS as separate backends — superseded by
GGML_BLAS(general BLAS) and integrated CUDA / Metal kernels.
Major rewrites
- Library split (2024) — the monolithic
llama.cppsource was split intollama-context,llama-model,llama-model-loader,llama-vocab,llama-graph,llama-kv-cache,llama-sampler,llama-grammar,llama-chat,llama-adapter,llama-batch,llama-mmap, etc. The public C API ininclude/llama.hwas preserved. - Sampler chain — the old
llama_sample_*free functions were replaced by a polymorphicllama_samplerchain (seesrc/llama-sampler.cpp) that composes top-k, top-p, temperature, mirostat, penalty, grammar, and infill samplers. - Backend registry / dynamic loading —
ggml_backend_reg.cppandggml-backend-dl.cppreplaced compile-time backend selection. A singlelibllamabuild can now pick its accelerator at runtime. - KV cache rewrite — the original linear KV buffer became a
llama-kv-cachewith sequence-aware slots, plusllama-kv-cache-iswafor sliding-window models andllama-memory-recurrentfor SSMs. - Server slot scheduler — what started as a single-request HTTP wrapper was rewritten into the slot-based scheduler now in
tools/server/server-context.cpp, supporting parallel decoding, prompt caching, and per-slot KV state.
Growth trajectory
| Signal | Then | Now |
|---|---|---|
| First commit | 2023-03-10 | — |
| Total commits on master | — | 8,991 |
| Unique contributors | 1 | 1,600+ |
| Backends | 1 (CPU) | 14+ (ggml/src/ggml-*) |
| Tools shipped | 1 (main) |
18+ under tools/ |
| Supported model families | LLaMA only | 60+ text + 12+ multimodal (see README.md) |
For day-to-day code patterns and review expectations, see How to contribute.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
Previous
By the numbers
Next
Fun facts