Factory.ai

Open-Source Wikis

/

llama.cpp

/

Reference

/

Data models

ggml-org/llama.cpp

Data models

The file and wire formats llama.cpp uses.

GGUF

GGUF ("GGML Universal Format") is the canonical model file format. It is a single-file container with:

  • Header — magic, version, tensor count, kv-pair count.
  • Metadata KVs — typed key/value pairs (architecture, hparams, tokenizer, chat template, ...).
  • Tensor headers — name, shape, type, offset.
  • Aligned tensor data — contiguous bytes, typically mmap'd.
Reader/writer Language File
In-tree C++ C++ ggml/include/gguf.h, ggml/src/gguf.cpp
In-tree Python Python gguf-py/gguf/gguf_reader.py, gguf-py/gguf/gguf_writer.py

Common metadata keys

A non-exhaustive list (the full set lives in src/llama-arch.cpp and gguf-py/gguf/constants.py):

Key Type Purpose
general.architecture string One of llama, qwen3, gemma3, ...
general.name, general.author, general.license strings Human-friendly metadata
general.quantization_version uint Quant scheme version
general.file_type uint LLAMA_FTYPE_*
general.size_label string "7B", "70B", ...
<arch>.context_length, <arch>.embedding_length, <arch>.block_count, <arch>.attention.head_count, <arch>.attention.head_count_kv, <arch>.feed_forward_length, <arch>.rope.*, <arch>.expert_* various Hparams
tokenizer.ggml.model string llama (SPM), gpt2 (BPE), bert (WPM), ...
tokenizer.ggml.tokens, ..token_type, ..scores, ..merges arrays Vocab data
tokenizer.ggml.bos_token_id, ..eos_token_id, ..pad_token_id, ... uint Special tokens
tokenizer.ggml.pre string Pre-tokenizer family identifier
tokenizer.chat_template string Jinja chat template

Tensor naming

Tensor names are templated. Examples:

token_embd.weight
output_norm.weight
output.weight
blk.{i}.attn_norm.weight
blk.{i}.attn_q.weight
blk.{i}.attn_k.weight
blk.{i}.attn_v.weight
blk.{i}.attn_output.weight
blk.{i}.ffn_norm.weight
blk.{i}.ffn_gate.weight
blk.{i}.ffn_down.weight
blk.{i}.ffn_up.weight

Per-architecture overrides are encoded in src/llama-arch.cpp::LLM_TENSOR_NAMES.

Splits

A split GGUF set is named <base>-NNNNN-of-MMMMM.gguf (e.g. model-00001-of-00003.gguf). Each split holds a contiguous slice of the tensor stream and an index header. tools/gguf-split produces and merges splits; llama_model_load_from_splits reads them in one go.

Importance matrix (.imatrix)

Produced by llama-imatrix. A GGUF file with one tensor per source matmul, holding per-channel L2 magnitudes. Consumed by llama-quantize --imatrix. See imatrix tool.

LoRA adapter (.gguf adapter)

A GGUF with general.type=adapter, adapter.type=lora, adapter.alpha=<scale>, and per-layer A/B factor pairs named to match the base model's tensor manifest. Loaded by llama_adapter_lora_init. Produced by convert_lora_to_gguf.py. See Adapters.

Control vector (.gguf)

Generated by tools/cvector-generator/. Per-layer additive bias vectors. Loaded with --control-vector.

OpenAI-compatible JSON

tools/server accepts and emits OpenAI-shaped JSON for:

  • /v1/chat/completions{ messages, tools, tool_choice, response_format, ... }{ choices: [{ message: { role, content, tool_calls } }] } (streaming via SSE).
  • /v1/completions{ prompt }{ choices: [{ text }] }.
  • /v1/embeddings{ input }{ data: [{ embedding }] }.
  • /v1/rerank{ query, documents }{ results: [{ index, relevance_score }] }.

Plus llama.cpp-specific extension fields (grammar, json_schema, mirostat, dry_*, xtc_*, cache_prompt, slot_id, ...) that tools/server/server-task.cpp parses out. See tools/server/README.md.

Slot-save format

POST /slots/{id}?action=save&filename=... writes a binary file containing the serialized KV state for one slot, produced by llama_state_seq_get_data. It is opaque — the receiving server reloads it via llama_state_seq_set_data.

Native completion / FIM JSON

POST /completion and POST /infill accept extension fields beyond OpenAI's (token-level prompt, return-tokens-with-probs, etc.). The full surface is documented in tools/server/README.md.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.

Data models – llama.cpp wiki | Factory