ggml-org/llama.cpp

Fun facts

A few bits of trivia surfaced while reading the repository.

The project's first commit was a complete LLM inference engine

The very first commit on master, 2023-03-10 by Georgi Gerganov, is titled simply Initial release and already contains a working LLaMA inference loop in C/C++ with 4-bit quantization. The project did not "grow into" being an LLM runtime — it shipped as one on day one.

The largest single file in the repository is a Python script

At 651 KB, convert_hf_to_gguf.py is the largest file in the repository. It is in effect a multi-thousand-line dispatch over every supported Hugging Face model class, mapping each one's weight layout into the GGUF tensor naming convention used by src/llama-arch.cpp. The closest C++ rival is src/llama-model.cpp at 546 KB, which performs the equivalent job at load time.
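To make the idea of a per-model mapping concrete, here is a minimal sketch of one common way to organize it: a registry keyed by architecture name, where each entry knows how to rename tensors. Every name in it (register, BaseConverter, ExampleConverter, ExampleForCausalLM, converter_for) is hypothetical and chosen for illustration; it is not the script's actual API, and the rename table is an assumed example.

```python
# Hypothetical registry sketch; none of these names come from the real script.
_CONVERTERS = {}


def register(*arch_names):
    """Class decorator that files a converter under one or more architecture names."""
    def wrap(cls):
        for name in arch_names:
            _CONVERTERS[name] = cls
        return cls
    return wrap


class BaseConverter:
    # Maps source tensor-name prefixes to target prefixes (illustrative values).
    rename = {}

    def map_name(self, src):
        """Return the target tensor name, or the input unchanged if no prefix matches."""
        for old, new in self.rename.items():
            if src.startswith(old):
                return new + src[len(old):]
        return src


@register("ExampleForCausalLM")
class ExampleConverter(BaseConverter):
    rename = {"model.embed_tokens": "token_embd"}


def converter_for(arch):
    """Look up and instantiate the converter registered for an architecture name."""
    return _CONVERTERS[arch]()
```

With this layout, supporting a new architecture means adding one decorated class rather than touching a central branch, which is one plausible reason a conversion script of this kind grows long rather than deep.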

`ggml.c` is hand-written portable C

ggml/src/ggml.c is ~248 KB of plain C with no template metaprogramming, no STL, and almost no headers beyond the standard library. The CONTRIBUTING guide explicitly tells contributors to "avoid fancy-looking modern STL constructs, use basic for loops, avoid templates, keep it simple", and ggml.c is the canonical example.

Matrix multiplication is unconventional on purpose

Per CONTRIBUTING.md: C = ggml_mul_mat(ctx, A, B) actually computes $C^T = A B^T$. This convention saves a transpose in nearly every transformer kernel, so the API embraces it instead of fighting it. There is a literal hand-drawn diagram (media/matmul.png) in the repo to remind contributors.
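The convention is easy to check numerically. The sketch below emulates it in pure Python, with nested row-major lists standing in for tensors; mul_mat, transpose, and matmul are illustrative helpers written for this page, not ggml functions. It assumes A is m x k and B is n x k, so both operands share their row length and the result is n x m.

```python
def mul_mat(a, b):
    """Emulate the convention: result[j][i] = dot(a[i], b[j]), so the result is n x m."""
    k = len(a[0])
    if any(len(row) != k for row in a + b):
        raise ValueError("a and b must share the same row length")
    return [[sum(x * y for x, y in zip(row_a, row_b)) for row_a in a] for row_b in b]


def transpose(m):
    """Swap rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*m)]


def matmul(x, y):
    """Textbook product of x (p x q) and y (q x r)."""
    return [
        [sum(x[i][t] * y[t][j] for t in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


A = [[1, 2], [3, 4], [5, 6]]  # 3 x 2
B = [[1, 1], [0, 2]]          # 2 x 2
C = mul_mat(A, B)             # 2 x 3

# The identity stated above: C^T = A B^T.
assert transpose(C) == matmul(A, transpose(B))
```

Note that neither operand is transposed inside mul_mat: every output element is a dot product of one row of A with one row of B, which is where the saved transpose comes from.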

llama.cpp does not accept predominantly AI-generated PRs

The project's AGENTS.md is one of the more pointed AI policies in open source: "This project does not accept pull requests that are fully or predominantly AI-generated." Maintainers explicitly close PRs whose authors cannot defend their own code in review. The wiki you are reading is a documentation artifact, not a contribution to the repo, so it falls outside that policy — but it is worth knowing if you plan to send code upstream.

The repo is its own deprecation-warning service

examples/deprecation-warning/ and tools/mtmd/deprecation-warning.cpp exist purely to print a friendly message when someone runs an old binary name (main, server, llava) that has since been renamed. They contain no inference logic.
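The behavior of such a stub can be sketched in a few lines. This is an illustration of the pattern only, not a transcription of the C++ sources: the replacement names in RENAMED and the message wording are assumptions, since the text above names only the old binaries.

```python
# Assumed old-name -> new-name mapping; the replacements are not taken from the source text.
RENAMED = {
    "main": "llama-cli",
    "server": "llama-server",
    "llava": "llama-mtmd-cli",
}


def deprecation_message(old_name):
    """Build the warning a stub binary would print; there is no inference logic here."""
    new_name = RENAMED.get(old_name)
    if new_name is None:
        return f"WARNING: '{old_name}' is deprecated."
    return f"WARNING: '{old_name}' is deprecated. Please use '{new_name}' instead."
```

A real stub would print this message and exit, leaving the user to rerun the command under the new name.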

The grammar engine has two implementations

llama.cpp ships two grammar engines that can constrain sampling: the in-tree GBNF parser (src/llama-grammar.cpp) and an optional Rust-based engine called llguidance (vendored under vendor/, integrated through common/llguidance.cpp). On top of that, JSON Schemas can be compiled to GBNF either in C++ (common/json-schema-to-grammar.cpp) or in Python (examples/json_schema_to_grammar.py). Counting the two engines and the two converters, all four components share the grammars/ sample-grammar directory.
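To give a feel for what compiling a JSON Schema to GBNF means, here is a toy sketch that handles a single case: a schema whose value must be one of a fixed list of strings. The function name enum_to_gbnf is hypothetical, and the code is far simpler than either converter named above; it only assumes the basic GBNF rule form, name ::= alternative | alternative, with double-quoted string literals.

```python
import json


def enum_to_gbnf(values, rule="root"):
    """Emit one GBNF rule that accepts exactly the JSON encoding of each string in values."""
    if not values:
        raise ValueError("enum must be non-empty")
    # The inner dumps produces the JSON text (with its quotes);
    # the outer dumps wraps that text as a quoted, escaped grammar literal.
    alternatives = [json.dumps(json.dumps(v)) for v in values]
    return f"{rule} ::= " + " | ".join(alternatives)


print(enum_to_gbnf(["yes", "no"]))
```

A sampler constrained by the resulting rule could only produce one of the listed JSON strings, quotes included, which is the essence of grammar-constrained decoding.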

The vendored Jinja engine is a single header

common/jinja/ contains a port of google/minja — a header-only Jinja2-compatible template engine — used to render chat templates pulled from GGUF metadata. Without it, the chat tool would have to ship a full Python interpreter or hard-code per-model formatting. With it, every binary tool gains arbitrary chat-template support for the cost of including one header.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.