Factory.ai

Open-Source Wikis

/

llama.cpp

/

Lore

ggml-org/llama.cpp

Lore

The story of llama.cpp is the story of one repository that quickly became the standard way to run large language models on consumer hardware. What started as a Sunday-afternoon hack to get LLaMA to run on a MacBook is now the upstream reference for the GGUF format, a multi-backend tensor compute stack, and an OpenAI-compatible server.

Dates below are derived from git tags and commit timestamps in the ggml-org/llama.cpp history.

Eras

The original C/C++ port (Mar 2023)

The first commit, Initial release, lands on 2023-03-10 — only days after Meta's LLaMA weights leak. The repository's purpose at this stage is narrow: read a LLaMA checkpoint, run it on CPU with manual quantization, and prove that 4-bit weights give usable inference on a laptop. Most of ggml.c and llama.cpp are written by Georgi Gerganov in the first weeks. The code is plain C++ with hand-rolled tensor types, no GPU support, and a single executable.

The format wars: GGML → GGJT → GGUF (May–Aug 2023)

As more models join the project (Alpaca, Vicuna, Falcon, MPT, StarCoder), the ad-hoc model file format becomes a maintenance problem. A pair of format migrations follows: the original GGML binary becomes GGJT to add proper alignment and metadata, then GGUF lands in Aug 2023 as a single-file, self-describing, versioned container. GGUF is what survives — every modern checkpoint, every quantization, every multi-modal projector ships as .gguf. The reader/writer at ggml/src/gguf.cpp and the Python writer in gguf-py/ are the still-living artifacts.

Backend explosion (Mid-2023 – 2024)

The CPU-only era gives way to a multi-backend era as community contributors port GGML's compute graph to every accelerator they can find. In rough chronological order: Metal (Apple Silicon, May–Jun 2023), CUDA (NVIDIA, May 2023 onward, eventually growing into the largest backend by file count), OpenCL (deprecated and reborn for Qualcomm), CLBlast / CuBLAS (Jun 2023), HIP / ROCm (AMD), Vulkan (cross-vendor, 2024), SYCL / oneAPI (Intel), Hexagon / Snapdragon (Qualcomm), CANN (Huawei Ascend), MUSA (Moore Threads), WebGPU, OpenVINO, and zDNN (IBM Z). The runtime backend registry in ggml-backend-reg.cpp and the dynamic-loader plumbing in ggml-backend-dl.cpp are direct consequences of this growth.

The HTTP server arrives (mid-2023, ongoing)

The project ships an OpenAI-compatible server (tools/server) early on. Over time it grows a slot-based scheduler (server-queue.cpp, server-context.cpp), tools/function-calling, JSON-schema-constrained output, prompt caching, embeddings and reranking endpoints, and a built-in WebUI under tools/server/webui/. The server is one of the largest single tools in the tree and now drives most production usage.

Architectures, architectures, architectures (2023–present)

src/llama-model.cpp and src/models/ accumulate per-architecture builders for nearly every transformer family released in the last three years: LLaMA 1/2/3, Mistral, Mixtral and other MoEs (DBRX, Qwen-MoE, OLMoE, Snowflake-Arctic), Falcon, MPT, GPT-NeoX, Phi (and PhiMoE), Gemma 1/2/3, Qwen 1/2/3, Command-R, DeepSeek (incl. V2/V3), Mamba, RWKV-6/7, GLM, Yi, Granite, Bitnet, Hunyuan, LFM2, and many more. Each new family typically arrives in three pieces: a Python conversion path in convert_hf_to_gguf.py, an arch entry in src/llama-arch.cpp, and a graph builder in src/models/.

Multimodal goes mainstream (2024–2025)

The multimodal stack starts as the standalone llava tool and matures into the mtmd ("multi-modal data") subsystem under tools/mtmd/. It now supports vision encoders (CLIP-style, derived from LLaVA, MobileVLM, LLaVA-NeXT, Qwen2-VL, MiniCPM-V, Gemma-3, GLM-EDGE, Moondream, LFM2-VL, ...) and an audio path (Whisper-style encoders) used for speech-to-text and TTS. As of mid-2025 the HTTP server gained native multimodal support (PR #12898).

Quantization & quality (continuous)

Quantization is the project's defining feature. Over time the formats grow from naive 4-bit blocks to a zoo: Q2_K through Q8_K (k-quants, 2023), IQ-series (importance-matrix-aware 1.5–4-bit quants), MXFP4 (added 2025 with NVIDIA collaboration for gpt-oss), and per-tensor mixes selected by llama-quantize. The llama-imatrix tool produces per-tensor activation statistics that bias quantization toward important channels, and llama-perplexity is the standard quality yardstick.

Longest-standing features

Component First seen Notes
ggml.c core compute 2023-03 The bones of the project. Survived every backend migration.
llama.cpp (now src/llama.cpp) 2023-03 The C API entry point. Has been refactored repeatedly — most logic now lives in llama-context, llama-model, etc. — but the public symbols in include/llama.h remain.
Q4_0 quantization 2023-03 The original 4-bit block format. Still loadable, though k-quants and IQ-series are preferred.
convert_legacy_llama.py 2023-03 Conversion script for the original LLaMA weights. Lives in examples/ as a historical artifact.

Deprecated features

  • GGML / GGJT model files — replaced by GGUF in Aug 2023. The examples/deprecation-warning/ binary exists solely to print a friendly message when someone runs an old executable name.
  • main and server legacy binary names — renamed to llama-cli and llama-server. The deprecation-warning binary points users at the new names; see tools/mtmd/deprecation-warning.cpp and examples/deprecation-warning/.
  • llava standalone tool — folded into the mtmd stack under tools/mtmd/legacy-models/.
  • finetune / train-text-from-scratch — removed; training now lives in examples/training/ on top of ggml-opt.
  • CLBlast / CuBLAS as separate backends — superseded by GGML_BLAS (general BLAS) and integrated CUDA / Metal kernels.

Major rewrites

  • Library split (2024) — the monolithic llama.cpp source was split into llama-context, llama-model, llama-model-loader, llama-vocab, llama-graph, llama-kv-cache, llama-sampler, llama-grammar, llama-chat, llama-adapter, llama-batch, llama-mmap, etc. The public C API in include/llama.h was preserved.
  • Sampler chain — the old llama_sample_* free functions were replaced by a polymorphic llama_sampler chain (see src/llama-sampler.cpp) that composes top-k, top-p, temperature, mirostat, penalty, grammar, and infill samplers.
  • Backend registry / dynamic loadingggml_backend_reg.cpp and ggml-backend-dl.cpp replaced compile-time backend selection. A single libllama build can now pick its accelerator at runtime.
  • KV cache rewrite — the original linear KV buffer became a llama-kv-cache with sequence-aware slots, plus llama-kv-cache-iswa for sliding-window models and llama-memory-recurrent for SSMs.
  • Server slot scheduler — what started as a single-request HTTP wrapper was rewritten into the slot-based scheduler now in tools/server/server-context.cpp, supporting parallel decoding, prompt caching, and per-slot KV state.

Growth trajectory

Signal Then Now
First commit 2023-03-10
Total commits on master 8,991
Unique contributors 1 1,600+
Backends 1 (CPU) 14+ (ggml/src/ggml-*)
Tools shipped 1 (main) 18+ under tools/
Supported model families LLaMA only 60+ text + 12+ multimodal (see README.md)

For day-to-day code patterns and review expectations, see How to contribute.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.

Lore – llama.cpp wiki | Factory