ggml-org/llama.cpp

Glossary

Project-specific vocabulary you will encounter while reading the code.

Term	Meaning
ggml	The in-tree tensor library under `ggml/`. Provides tensor types, memory allocators, computation graphs, and per-accelerator backends. Originated from the standalone `ggml-org/ggml` repo and is kept in sync via `scripts/sync-ggml*`.
GGUF	"GGML Universal Format". Single-file model container with metadata, vocab, and tensors. Successor to GGML/GGJT. C++ implementation in `ggml/src/gguf.cpp`; Python writer in `gguf-py/`.
libllama	The library built from `src/llama*.cpp`: C++ sources exposing a C API. Public header is `include/llama.h`.
libggml	The C library built from `ggml/src/ggml*.c`. Public headers under `ggml/include/`.
GBNF	"GGML BNF". A BNF-like grammar dialect parsed by `src/llama-grammar.cpp` and used to constrain sampling. Sample grammars in `grammars/`.
llguidance	An optional Rust-based grammar engine (vendored under `vendor/`) integrated through `common/llguidance.cpp`. See `docs/llguidance.md`.
JSON Schema → grammar	Conversion path that turns a JSON Schema into GBNF, implemented in C++ at `common/json-schema-to-grammar.cpp` and in Python at `examples/json_schema_to_grammar.py`.
Jinja	The Jinja2-compatible template engine vendored under `common/jinja/` (the `google/minja` port). Powers `common/chat.cpp` for chat-template rendering.
chat template	Per-model prompt formatting. Either embedded in GGUF metadata (read by `src/llama-chat.cpp`) or supplied through `--chat-template` / `--chat-template-file`. Reference templates in `models/templates/`.
autoparser	Higher-level streaming parser for tool/function calls, built on top of the PEG parser. See `docs/autoparser.md` and `common/chat-auto-parser*.cpp`.
PEG parser	Parsing-Expression-Grammar engine in `common/peg-parser.cpp`. Used for tool calls and structured output. See `docs/development/parsing.md`.
KV cache	The key/value tensor cache for transformer attention. Implemented in `src/llama-kv-cache.cpp` (with `iswa` and `recurrent`/`hybrid` variants for sliding-window and SSM models).
slot	In `tools/server`, a per-request execution unit that owns its KV state and decodes independently. Defined in `tools/server/server-context.cpp`.
batch	A `llama_batch` struct (declared in `include/llama.h`, implemented in `src/llama-batch.cpp`) describing the tokens to decode in a single `llama_decode` call. Multiple sequences can be packed in one batch.
sequence (`seq_id`)	Logical conversation identifier inside a context. The KV cache tracks tokens per `seq_id`; the server uses one `seq_id` per slot.
logits	Unnormalized per-token scores over the vocabulary, computed by `llama_decode` and read back with `llama_get_logits` / `llama_get_logits_ith`. Consumed by samplers.
sampler / sampler chain	A pipeline of `llama_sampler` objects (top-k, top-p, temperature, mirostat, penalty, grammar, ...) implemented in `src/llama-sampler.cpp`.
imatrix	"Importance matrix". Per-tensor activation statistics produced by `llama-imatrix`, consumed by `llama-quantize` to bias quantization toward important channels. See the imatrix tool page.
mtmd	"Multi-modal". The image and audio input stack under `tools/mtmd/`. Includes the CLIP-based vision encoder (`clip.cpp`), audio encoder (`mtmd-audio.cpp`), and the `llama-mtmd-cli` binary.
mmproj	A "multimodal projector" GGUF file containing the vision/audio encoder weights. Loaded alongside a text model.
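MoE	Mixture-of-Experts. An architecture in which a router sends each token to a small subset of expert feed-forward blocks (sparse routing); used by models such as Mixtral, DBRX, and Qwen-MoE. Handled by per-architecture builders under `src/models/`.
SSM	State-Space Model. Mamba is the usual example; RWKV is a related recurrent architecture grouped with it here. Both keep a fixed-size recurrent state instead of a per-token KV cache and use the recurrent memory variant in `src/llama-memory-recurrent.cpp`.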
iSWA	"Interleaved Sliding-Window Attention". Used by Gemma 2, Phi, and others; implemented in `src/llama-kv-cache-iswa.cpp`.
hparams	Hyperparameters describing a model architecture (layer count, head dim, rope settings, ...). Defined in `src/llama-hparams.h`, populated from GGUF metadata.
cparams	Per-context parameters (batch size, threads, defrag policy). Defined in `src/llama-cparams.h`.
arch	Model architecture. Enumerated in `src/llama-arch.h`/`.cpp` (LLM_ARCH_LLAMA, LLM_ARCH_GEMMA, LLM_ARCH_QWEN3, ...).
LoRA / control vector	Low-rank adapters and control vectors loaded via `src/llama-adapter.cpp` and applied on top of base weights at runtime.
speculative decoding	Drafting tokens with a smaller model and verifying with a larger one. Implemented in `common/speculative.cpp`; example tools in `examples/speculative*` and the `lookup`/`lookahead` examples.
fim	"Fill in the middle" code-completion mode used by `tools/completion` and the `llama.vim` / `llama.vscode` editor plugins.
ngram-cache / ngram-map	Lightweight n-gram drafter for speculative decoding. See `common/ngram-cache.cpp`, `common/ngram-map.cpp`.
RPC backend	A network backend that ships in `ggml/src/ggml-rpc/` and pairs with the `rpc-server` binary in `tools/rpc/` to offload tensors across machines.
`-hf` flag	Shortcut accepted by the command-line tools to download a model from Hugging Face (`<owner>/<repo>[:quant]`) into the local Hugging Face cache. Implemented in `common/download.cpp` and `common/hf-cache.cpp`.
`ggml-ci`	Internal label and self-hosted CI runners that exercise the long-form CI scripts under `ci/`. Triggered by maintainers on PRs.
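
The `logits` and `sampler / sampler chain` entries above can be illustrated with a toy pipeline. This is a minimal sketch in plain Python, not the `llama_sampler` API: every function name (`top_k`, `temperature`, `softmax`, `top_p`, `run_chain`) and every default value is hypothetical. It only shows the general idea that each stage narrows or reshapes the candidate scores before one token is drawn.

```python
import math
import random

# Toy sampler chain. All names are illustrative, not llama.cpp API.
# A candidate list is [(token_id, score), ...].

def top_k(cands, k):
    """Keep the k highest-scoring candidates."""
    return sorted(cands, key=lambda c: c[1], reverse=True)[:k]

def temperature(cands, t):
    """Divide logits by t; t < 1 sharpens, t > 1 flattens the distribution."""
    return [(tok, logit / t) for tok, logit in cands]

def softmax(cands):
    """Turn logits into probabilities (max-subtracted for stability)."""
    m = max(logit for _, logit in cands)
    exps = [(tok, math.exp(logit - m)) for tok, logit in cands]
    total = sum(e for _, e in exps)
    return [(tok, e / total) for tok, e in exps]

def top_p(probs, p):
    """Keep the smallest high-probability prefix whose mass reaches p."""
    kept, mass = [], 0.0
    for tok, pr in sorted(probs, key=lambda c: c[1], reverse=True):
        kept.append((tok, pr))
        mass += pr
        if mass >= p:
            break
    return kept

def run_chain(logits, k=3, t=0.8, p=0.9, rng=None):
    """Apply top-k, temperature, softmax, top-p, then draw one token id."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    cands = list(enumerate(logits))
    cands = temperature(top_k(cands, k), t)
    probs = top_p(softmax(cands), p)
    tokens = [tok for tok, _ in probs]
    weights = [pr for _, pr in probs]
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = [0.1, 2.5, -1.0, 2.4, 0.3]  # one score per vocabulary entry
token = run_chain(logits)            # always token 1 or token 3 for these inputs
```

The stage order used here is one arbitrary choice made for the sketch; a grammar or penalty stage would slot into the same list-in, list-out shape.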

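The `speculative decoding` entry describes drafting with a small model and verifying with a large one. The sketch below shows one round of the simplest greedy variant, under loud assumptions: a "model" is just a Python function from a token prefix to its next token, `speculative_step` is a hypothetical name, and the target is called once per position for clarity, whereas a real implementation would verify all drafted tokens in a single batched decode. It is not a rendering of `common/speculative.cpp`.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
Model = Callable[[List[int]], int]

def speculative_step(target: Model, draft: Model,
                     prefix: List[int], n_draft: int) -> List[int]:
    """Return the tokens accepted in one draft-and-verify round (greedy variant)."""
    # 1. Draft: the small model proposes n_draft tokens autoregressively.
    proposed: List[int] = []
    ctx = list(prefix)
    for _ in range(n_draft):
        tok = draft(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2. Verify: keep drafted tokens while the target would have chosen the same.
    accepted: List[int] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target(ctx)
        if expected != tok:
            accepted.append(expected)  # use the target's token at the first mismatch
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. Every draft matched: the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted

def toy_target(ctx: List[int]) -> int:
    """Toy large model: always counts up by one."""
    return ctx[-1] + 1

def toy_draft(ctx: List[int]) -> int:
    """Toy small model: agrees with the target except right after a 3."""
    return 0 if ctx[-1] == 3 else ctx[-1] + 1

out = speculative_step(toy_target, toy_draft, [1], n_draft=4)
print(out)  # [2, 3, 4]: two drafts accepted, third replaced by the target's token
```

The output is identical to what the target alone would produce greedily; the gain in a real system comes from checking several positions per large-model pass.
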
For format-level details see Reference → Data models. For runtime configuration knobs see Reference → Configuration.
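
As a small taste of the format level, the fixed-size prefix of a GGUF file can be read with a few lines of Python. The sketch assumes a little-endian file of GGUF version 2 or later, where the 4-byte magic `GGUF` is followed by a `uint32` version, a `uint64` tensor count, and a `uint64` metadata key/value count. `GGUFHeader` and `read_gguf_header` are hypothetical names for this example and are not part of `gguf-py`; the sample bytes are synthetic.

```python
import struct
from typing import NamedTuple

class GGUFHeader(NamedTuple):
    version: int
    tensor_count: int
    metadata_kv_count: int

def read_gguf_header(data: bytes) -> GGUFHeader:
    """Parse the first 24 bytes of a GGUF file (little-endian, version >= 2 assumed)."""
    if len(data) < 24:
        raise ValueError("need at least 24 bytes")
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file: bad magic")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    return GGUFHeader(version, n_tensors, n_kv)

# Synthetic header for demonstration; a real file would be read via open(path, "rb").
sample = b"GGUF" + struct.pack("<IQQ", 3, 291, 24)
header = read_gguf_header(sample)
print(header)  # GGUFHeader(version=3, tensor_count=291, metadata_kv_count=24)
```

The metadata key/value pairs and tensor descriptors follow this prefix; parsing them is out of scope for the sketch.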

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.