ggml-org/llama.cpp
Data models
The file and wire formats llama.cpp uses.
GGUF
GGUF ("GGML Universal Format") is the canonical model file format. It is a single-file container with:
- Header — magic, version, tensor count, kv-pair count.
- Metadata KVs — typed key/value pairs (architecture, hparams, tokenizer, chat template, ...).
- Tensor headers — name, shape, type, offset.
- Aligned tensor data — contiguous bytes, typically mmap'd.
| Reader/writer | Language | File |
|---|---|---|
| In-tree C++ | C++ | ggml/include/gguf.h, ggml/src/gguf.cpp |
| In-tree Python | Python | gguf-py/gguf/gguf_reader.py, gguf-py/gguf/gguf_writer.py |
Common metadata keys
A non-exhaustive list (the full set lives in src/llama-arch.cpp and gguf-py/gguf/constants.py):
| Key | Type | Purpose |
|---|---|---|
general.architecture |
string | One of llama, qwen3, gemma3, ... |
general.name, general.author, general.license |
strings | Human-friendly metadata |
general.quantization_version |
uint | Quant scheme version |
general.file_type |
uint | LLAMA_FTYPE_* |
general.size_label |
string | "7B", "70B", ... |
<arch>.context_length, <arch>.embedding_length, <arch>.block_count, <arch>.attention.head_count, <arch>.attention.head_count_kv, <arch>.feed_forward_length, <arch>.rope.*, <arch>.expert_* |
various | Hparams |
tokenizer.ggml.model |
string | llama (SPM), gpt2 (BPE), bert (WPM), ... |
tokenizer.ggml.tokens, ..token_type, ..scores, ..merges |
arrays | Vocab data |
tokenizer.ggml.bos_token_id, ..eos_token_id, ..pad_token_id, ... |
uint | Special tokens |
tokenizer.ggml.pre |
string | Pre-tokenizer family identifier |
tokenizer.chat_template |
string | Jinja chat template |
Tensor naming
Tensor names are templated. Examples:
token_embd.weight
output_norm.weight
output.weight
blk.{i}.attn_norm.weight
blk.{i}.attn_q.weight
blk.{i}.attn_k.weight
blk.{i}.attn_v.weight
blk.{i}.attn_output.weight
blk.{i}.ffn_norm.weight
blk.{i}.ffn_gate.weight
blk.{i}.ffn_down.weight
blk.{i}.ffn_up.weightPer-architecture overrides are encoded in src/llama-arch.cpp::LLM_TENSOR_NAMES.
Splits
A split GGUF set is named <base>-NNNNN-of-MMMMM.gguf (e.g. model-00001-of-00003.gguf). Each split holds a contiguous slice of the tensor stream and an index header. tools/gguf-split produces and merges splits; llama_model_load_from_splits reads them in one go.
Importance matrix (.imatrix)
Produced by llama-imatrix. A GGUF file with one tensor per source matmul, holding per-channel L2 magnitudes. Consumed by llama-quantize --imatrix. See imatrix tool.
LoRA adapter (.gguf adapter)
A GGUF with general.type=adapter, adapter.type=lora, adapter.alpha=<scale>, and per-layer A/B factor pairs named to match the base model's tensor manifest. Loaded by llama_adapter_lora_init. Produced by convert_lora_to_gguf.py. See Adapters.
Control vector (.gguf)
Generated by tools/cvector-generator/. Per-layer additive bias vectors. Loaded with --control-vector.
OpenAI-compatible JSON
tools/server accepts and emits OpenAI-shaped JSON for:
/v1/chat/completions—{ messages, tools, tool_choice, response_format, ... }→{ choices: [{ message: { role, content, tool_calls } }] }(streaming via SSE)./v1/completions—{ prompt }→{ choices: [{ text }] }./v1/embeddings—{ input }→{ data: [{ embedding }] }./v1/rerank—{ query, documents }→{ results: [{ index, relevance_score }] }.
Plus llama.cpp-specific extension fields (grammar, json_schema, mirostat, dry_*, xtc_*, cache_prompt, slot_id, ...) that tools/server/server-task.cpp parses out. See tools/server/README.md.
Slot-save format
POST /slots/{id}?action=save&filename=... writes a binary file containing the serialized KV state for one slot, produced by llama_state_seq_get_data. It is opaque — the receiving server reloads it via llama_state_seq_set_data.
Native completion / FIM JSON
POST /completion and POST /infill accept extension fields beyond OpenAI's (token-level prompt, return-tokens-with-probs, etc.). The full surface is documented in tools/server/README.md.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
Previous
Configuration
Next
Dependencies