ggml-org/llama.cpp

Glossary

Project-specific vocabulary you will encounter while reading the code.

| Term | Meaning |
| --- | --- |
| ggml | The in-tree tensor library under ggml/. Provides tensor types, memory allocators, computation graphs, and per-accelerator backends. Originated from the standalone ggml-org/ggml repo and is kept in sync via scripts/sync-ggml*. |
| GGUF | "GGML Universal Format". Single-file model container with metadata, vocab, and tensors. Successor to GGML/GGJT. Implemented in ggml/src/gguf.cpp; Python writer in gguf-py/. |
| libllama | The C library built from src/llama*.cpp. Public header is include/llama.h. |
| libggml | The C library built from ggml/src/ggml*.c. Public headers under ggml/include/. |
| GBNF | "GGML BNF". A BNF-like grammar dialect parsed by src/llama-grammar.cpp and used to constrain sampling. Sample grammars in grammars/. |
| llguidance | An optional Rust-based grammar engine (vendored under vendor/) integrated through common/llguidance.cpp. See docs/llguidance.md. |
| JSON Schema → grammar | Conversion path that turns a JSON Schema into GBNF, implemented in C++ at common/json-schema-to-grammar.cpp and in Python at examples/json_schema_to_grammar.py. |
| Jinja | The Jinja2-compatible template engine vendored under common/jinja/ (the google/minja port). Powers common/chat.cpp for chat-template rendering. |
| chat template | Per-model prompt formatting. Either embedded in GGUF metadata (read by src/llama-chat.cpp) or supplied through --chat-template / --chat-template-file. Reference templates in models/templates/. |
| autoparser | Higher-level streaming parser for tool/function calls, built on top of the PEG parser. See docs/autoparser.md and common/chat-auto-parser*.cpp. |
| PEG parser | Parsing-Expression-Grammar engine in common/peg-parser.cpp. Used for tool calls and structured output. See docs/development/parsing.md. |
| KV cache | The key/value tensor cache for transformer attention. Implemented in src/llama-kv-cache.cpp (with iswa and recurrent/hybrid variants for sliding-window and SSM models). |
| slot | In tools/server, a per-request execution unit that owns its KV state and decodes independently. Defined in tools/server/server-context.cpp. |
| batch | A llama_batch struct (declared in include/llama.h, implemented in src/llama-batch.cpp) describing the tokens to decode in a single llama_decode call. Multiple sequences can be packed in one batch. |
| sequence (seq_id) | Logical conversation identifier inside a context. The KV cache tracks tokens per seq_id; the server uses one seq_id per slot. |
| logits | The unnormalized per-token output scores computed by llama_decode and read back with llama_get_logits / llama_get_logits_ith. Consumed by samplers. |
| sampler / sampler chain | A pipeline of llama_sampler objects (top-k, top-p, temperature, mirostat, penalty, grammar, ...) implemented in src/llama-sampler.cpp. |
| imatrix | "Importance matrix". Per-tensor activation statistics produced by llama-imatrix, consumed by llama-quantize to bias quantization toward important channels. See imatrix tool. |
| mtmd | Short for "multi-modal": the image+audio multimodal stack under tools/mtmd/. Includes the CLIP-based vision encoder (clip.cpp), audio encoder (mtmd-audio.cpp), and the llama-mtmd-cli binary. |
| mmproj | A "multimodal projector" GGUF file containing the vision/audio encoder weights. Loaded alongside a text model. |
| MoE | Mixture-of-Experts. Models such as Mixtral, DBRX, and Qwen-MoE that route each token to a sparse subset of expert sub-networks. Handled by per-architecture builders under src/models/. |
| SSM | State-Space Model. Mamba is an SSM, and RWKV is a closely related recurrent architecture; both use the recurrent memory variant in src/llama-memory-recurrent.cpp instead of the standard KV cache. |
| iSWA | "Interleaved Sliding-Window Attention". Used by Gemma 2, Phi, and others; implemented in src/llama-kv-cache-iswa.cpp. |
| hparams | Hyperparameters describing a model architecture (layer count, head dim, rope settings, ...). Defined in src/llama-hparams.h, populated from GGUF metadata. |
| cparams | Per-context parameters (batch size, threads, defrag policy). Defined in src/llama-cparams.h. |
| arch | Model architecture. Enumerated in src/llama-arch.h/.cpp (LLM_ARCH_LLAMA, LLM_ARCH_GEMMA, LLM_ARCH_QWEN3, ...). |
| LoRA / control vector | Low-rank adapters and control vectors loaded via src/llama-adapter.cpp and applied on top of base weights at runtime. |
| speculative decoding | Drafting tokens with a smaller model and verifying them with a larger one. Implemented in common/speculative.cpp; example tools in examples/speculative* and the lookup/lookahead examples. |
| fim | "Fill in the middle" code-completion mode used by tools/completion and the llama.vim / llama.vscode editor plugins. |
| ngram-cache / ngram-map | Lightweight n-gram drafter for speculative decoding. See common/ngram-cache.cpp, common/ngram-map.cpp. |
| RPC backend | A network backend that ships in ggml/src/ggml-rpc/ and pairs with the rpc-server binary in tools/rpc/ to offload tensors across machines. |
| -hf flag | Shortcut accepted by every CLI to download a model from HuggingFace (<owner>/<repo>[:quant]) into the local HuggingFace cache. Implemented in common/download.cpp and common/hf-cache.cpp. |
| ggml-ci | Internal label and self-hosted CI runners that exercise the long-form CI scripts under ci/. Triggered by maintainers on PRs. |
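
The "sampler / sampler chain" entry above describes a pipeline in which each stage narrows or rescales the candidate tokens before one is drawn. The Python sketch below illustrates that idea only; it does not call the llama.cpp API, the names top_k, top_p, temperature, and run_chain are hypothetical, and the stage order shown is an assumption rather than the project's default.

```python
import math
import random

def softmax(cands):
    """Turn (token_id, logit) pairs into (token_id, probability) pairs."""
    peak = max(logit for _, logit in cands)
    exps = [(tok, math.exp(logit - peak)) for tok, logit in cands]
    total = sum(e for _, e in exps)
    return [(tok, e / total) for tok, e in exps]

def top_k(k):
    """Stage: keep only the k candidates with the highest logits."""
    def apply(cands):
        return sorted(cands, key=lambda c: c[1], reverse=True)[:k]
    return apply

def top_p(p):
    """Stage: keep the smallest high-probability set whose mass reaches p."""
    def apply(cands):
        ranked = sorted(softmax(cands), key=lambda c: c[1], reverse=True)
        kept, mass = set(), 0.0
        for tok, prob in ranked:
            kept.add(tok)
            mass += prob
            if mass >= p:
                break
        return [c for c in cands if c[0] in kept]
    return apply

def temperature(t):
    """Stage: divide logits by t (t > 0); t < 1 sharpens, t > 1 flattens."""
    def apply(cands):
        return [(tok, logit / t) for tok, logit in cands]
    return apply

def run_chain(chain, logits, rng):
    """Apply each stage in order, then draw one token from what remains."""
    cands = list(enumerate(logits))  # (token_id, logit) for the whole vocab
    for stage in chain:
        cands = stage(cands)
    probs = softmax(cands)
    tokens = [tok for tok, _ in probs]
    weights = [w for _, w in probs]
    return rng.choices(tokens, weights=weights, k=1)[0]

if __name__ == "__main__":
    chain = [top_k(2), top_p(0.9), temperature(0.8)]
    print(run_chain(chain, [0.1, 5.0, 4.0, -1.0], random.Random(0)))
```

Each stage is a plain function from candidate list to candidate list, so stages can be reordered or dropped without touching the others, which is the property the chain design is after.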

For format-level details see Reference → Data models. For runtime configuration knobs see Reference → Configuration.
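
As a small taste of the format-level view, a GGUF file opens with a fixed header: the 4-byte magic GGUF, a 32-bit version, then 64-bit counts of tensors and metadata key/value pairs. The sketch below parses just that header. It assumes a little-endian file with version 2 or later, and read_gguf_header is a hypothetical helper, not a function from gguf-py.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (KV count)

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the first bytes of a file.

    Assumption: little-endian layout with 64-bit counts (version >= 2).
    """
    if len(data) < HEADER_SIZE:
        raise ValueError("buffer too short for a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file (bad magic)")
    (version,) = struct.unpack_from("<I", data, 4)
    if version < 2:
        raise ValueError("this sketch only handles GGUF version 2 or later")
    tensor_count, metadata_kv_count = struct.unpack_from("<QQ", data, 8)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }
```

To try it on a real model, pass the first 24 bytes of the file, for example `read_gguf_header(open(path, "rb").read(24))`, where path is whatever GGUF file you have locally.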

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
