ggml-org/llama.cpp

Data models

The file and wire formats llama.cpp uses.

GGUF

GGUF ("GGML Universal Format") is the canonical model file format. It is a single-file container with:

Header — magic, version, tensor count, kv-pair count.
Metadata KVs — typed key/value pairs (architecture, hparams, tokenizer, chat template, ...).
Tensor headers — name, shape, type, offset.
Aligned tensor data — contiguous bytes, typically mmap'd.

Reader/writer	Language	File
In-tree C++	C++	`ggml/include/gguf.h`, `ggml/src/gguf.cpp`
In-tree Python	Python	`gguf-py/gguf/gguf_reader.py`, `gguf-py/gguf/gguf_writer.py`

Common metadata keys

A non-exhaustive list (the full set lives in src/llama-arch.cpp and gguf-py/gguf/constants.py):

Key	Type	Purpose
`general.architecture`	string	One of `llama`, `qwen3`, `gemma3`, ...
`general.name`, `general.author`, `general.license`	strings	Human-friendly metadata
`general.quantization_version`	uint	Quant scheme version
`general.file_type`	uint	`LLAMA_FTYPE_*`
`general.size_label`	string	"7B", "70B", ...
`<arch>.context_length`, `<arch>.embedding_length`, `<arch>.block_count`, `<arch>.attention.head_count`, `<arch>.attention.head_count_kv`, `<arch>.feed_forward_length`, `<arch>.rope.`, `<arch>.expert_`	various	Hparams
`tokenizer.ggml.model`	string	`llama` (SPM), `gpt2` (BPE), `bert` (WPM), ...
`tokenizer.ggml.tokens`, `..token_type`, `..scores`, `..merges`	arrays	Vocab data
`tokenizer.ggml.bos_token_id`, `..eos_token_id`, `..pad_token_id`, ...	uint	Special tokens
`tokenizer.ggml.pre`	string	Pre-tokenizer family identifier
`tokenizer.chat_template`	string	Jinja chat template

Tensor naming

Tensor names are templated. Examples:

token_embd.weight
output_norm.weight
output.weight
blk.{i}.attn_norm.weight
blk.{i}.attn_q.weight
blk.{i}.attn_k.weight
blk.{i}.attn_v.weight
blk.{i}.attn_output.weight
blk.{i}.ffn_norm.weight
blk.{i}.ffn_gate.weight
blk.{i}.ffn_down.weight
blk.{i}.ffn_up.weight

Per-architecture overrides are encoded in src/llama-arch.cpp::LLM_TENSOR_NAMES.

A split GGUF set is named <base>-NNNNN-of-MMMMM.gguf (e.g. model-00001-of-00003.gguf). Each split holds a contiguous slice of the tensor stream and an index header. tools/gguf-split produces and merges splits; llama_model_load_from_splits reads them in one go.

Importance matrix (`.imatrix`)

Produced by llama-imatrix. A GGUF file with one tensor per source matmul, holding per-channel L2 magnitudes. Consumed by llama-quantize --imatrix. See imatrix tool.

LoRA adapter (`.gguf` adapter)

A GGUF with general.type=adapter, adapter.type=lora, adapter.alpha=<scale>, and per-layer A/B factor pairs named to match the base model's tensor manifest. Loaded by llama_adapter_lora_init. Produced by convert_lora_to_gguf.py. See Adapters.

Control vector (`.gguf`)

Generated by tools/cvector-generator/. Per-layer additive bias vectors. Loaded with --control-vector.

OpenAI-compatible JSON

tools/server accepts and emits OpenAI-shaped JSON for:

/v1/chat/completions — { messages, tools, tool_choice, response_format, ... } → { choices: [{ message: { role, content, tool_calls } }] } (streaming via SSE).
/v1/completions — { prompt } → { choices: [{ text }] }.
/v1/embeddings — { input } → { data: [{ embedding }] }.
/v1/rerank — { query, documents } → { results: [{ index, relevance_score }] }.

Plus llama.cpp-specific extension fields (grammar, json_schema, mirostat, dry_*, xtc_*, cache_prompt, slot_id, ...) that tools/server/server-task.cpp parses out. See tools/server/README.md.

Slot-save format

POST /slots/{id}?action=save&filename=... writes a binary file containing the serialized KV state for one slot, produced by llama_state_seq_get_data. It is opaque — the receiving server reloads it via llama_state_seq_set_data.

Native completion / FIM JSON

POST /completion and POST /infill accept extension fields beyond OpenAI's (token-level prompt, return-tokens-with-probs, etc.). The full surface is documented in tools/server/README.md.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.