ggml-org/llama.cpp

Model loader

Active contributors: Georgi Gerganov, Sigbjørn Skjæret

src/llama-model-loader.cpp reads a GGUF file (or a set of split GGUF files), validates it against the architecture registered in src/llama-arch.cpp, and produces a fully populated llama_model plus llama_vocab. It is the bridge between the on-disk format and the in-memory model.

Purpose

Open and mmap (or read) one or more GGUF files.
Read GGUF metadata (architecture, hparams, vocab, chat template, ...) and validate it.
Resolve every tensor name in the architecture's expected manifest to a tensor in the file.
Allocate buffers on the correct backends and copy/quantize tensor data as needed.
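
The tensor-resolution step can be pictured with a small standalone sketch. Everything here is hypothetical and for illustration only: `tensor_spec`, `expand_name`, and `find_missing` are not names from llama.cpp, and the assumptions that per-layer patterns contain `%d` and that weights carry a `.weight` suffix are labeled as such in the code. The real loader works on `gguf_context` tensor headers rather than a set of strings.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical manifest entry (not a llama.cpp type).
struct tensor_spec {
    std::string pattern;   // e.g. "token_embd" or "blk.%d.attn_q"
    bool        per_layer; // true if "%d" is replaced by the layer index
    bool        required;  // optional tensors may be absent from the file
};

// Expand a pattern for one layer. Assumption: weights end in ".weight".
static std::string expand_name(const tensor_spec & spec, int layer) {
    std::string name = spec.pattern;
    const auto pos = name.find("%d");
    if (spec.per_layer && pos != std::string::npos) {
        name.replace(pos, 2, std::to_string(layer));
    }
    return name + ".weight";
}

// Return every required tensor name that the file does not contain.
static std::vector<std::string> find_missing(
        const std::vector<tensor_spec> & manifest,
        const std::set<std::string>    & file_tensors,
        int                              n_layer) {
    std::vector<std::string> missing;
    for (const auto & spec : manifest) {
        const int count = spec.per_layer ? n_layer : 1;
        for (int il = 0; il < count; ++il) {
            const std::string name = expand_name(spec, il);
            if (spec.required && file_tensors.count(name) == 0) {
                missing.push_back(name);
            }
        }
    }
    return missing;
}
```

A caller would build the manifest for the architecture named in general.architecture, collect the tensor names found in the file, and treat a non-empty result from `find_missing` as a load error.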

Directory layout

src/
├── llama-model-loader.h       # public types: llama_model_loader, helpers for kv lookups
├── llama-model-loader.cpp     # actual loading logic, ~71 KB
├── llama-mmap.cpp / .h        # cross-platform mmap, prefetch, mlock
├── llama-arch.cpp / .h        # LLM_ARCH_* enum + per-arch tensor manifest
├── llama-hparams.cpp / .h     # struct llama_hparams (per-arch hyperparameters)
└── llama-model.cpp / .h       # llama_model itself

Key abstractions

Type	File	Role
`llama_model_loader`	`src/llama-model-loader.h`	Open file(s), expose `gguf_context`, manage tensor enumeration
`llama_model_kv_override`	`src/llama-model-loader.h`	Override a single KV pair on load (`--override-kv` flag)
`LLM_ARCH_*` enum	`src/llama-arch.h`	Architecture identifier (`LLM_ARCH_LLAMA`, `LLM_ARCH_GEMMA3`, ...)
`LLM_TENSOR_*` enum	`src/llama-arch.h`	Logical tensor name (`LLM_TENSOR_TOKEN_EMBD`, `LLM_TENSOR_OUTPUT_NORM`, ...)
`llm_arch_info` table	`src/llama-arch.cpp`	Per-arch mapping from `LLM_TENSOR_*` to GGUF tensor names
`llama_hparams`	`src/llama-hparams.h`	Layer count, head dim, RoPE settings, vocab size, ...
`llama_mmap`, `llama_mlock`	`src/llama-mmap.h`	RAII wrappers over `mmap`/`MapViewOfFile`, `mlock`/`VirtualLock`
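
To make the RAII idea in the last row concrete, here is a minimal POSIX-only sketch of a read-only file mapping. `file_mapping` is a hypothetical class, not the real `llama_mmap`, which per the table above also covers Windows (`MapViewOfFile`) and is paired with `llama_mlock`; prefetching and locking are left out here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical read-only mapping; the destructor unmaps automatically.
class file_mapping {
public:
    explicit file_mapping(const std::string & path) {
        const int fd = open(path.c_str(), O_RDONLY);
        if (fd < 0) {
            throw std::runtime_error("open failed: " + path);
        }
        struct stat st;
        if (fstat(fd, &st) != 0 || st.st_size == 0) {
            close(fd);
            throw std::runtime_error("fstat failed or empty file: " + path);
        }
        size_ = static_cast<size_t>(st.st_size);
        addr_ = mmap(nullptr, size_, PROT_READ, MAP_SHARED, fd, 0);
        close(fd); // the mapping stays valid after the descriptor is closed
        if (addr_ == MAP_FAILED) {
            throw std::runtime_error("mmap failed: " + path);
        }
    }

    ~file_mapping() { munmap(addr_, size_); }

    // Non-copyable: exactly one owner per mapping.
    file_mapping(const file_mapping &) = delete;
    file_mapping & operator=(const file_mapping &) = delete;

    const unsigned char * data() const { return static_cast<const unsigned char *>(addr_); }
    size_t                size() const { return size_; }

private:
    void * addr_ = nullptr;
    size_t size_ = 0;
};
```

Because the mapping is released in the destructor, a loader that throws partway through validation does not leak the mapped region.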

How it works

sequenceDiagram
    participant App
    participant Loader as llama_model_loader
    participant GGUF as gguf_context (ggml/src/gguf.cpp)
    participant Arch as llm_arch_info
    participant Model as llama_model

    App->>Loader: load(path or splits, params)
    Loader->>GGUF: gguf_init_from_file(s)
    GGUF-->>Loader: kv pairs + tensor headers
    Loader->>Loader: read general.architecture
    Loader->>Arch: lookup LLM_ARCH_*
    Arch-->>Loader: expected tensor names + types
    Loader->>Loader: read llama_hparams from kv
    Loader->>Loader: build vocab via llama-vocab.cpp
    Loader->>Model: allocate llama_model with tensors
    Loader->>Model: copy/quantize each tensor into backend buffers
    Model-->>App: ready

GGUF reading itself lives in ggml/src/gguf.cpp. The loader is responsible for the higher-level "is this file consistent with the architecture I claim it is?" validation.
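
For orientation, the fixed-size part of a GGUF file can be decoded in a few lines. This is a sketch based on the published GGUF layout (4-byte magic "GGUF", 32-bit version, 64-bit tensor count, 64-bit metadata KV count, little-endian), not a copy of the code in ggml/src/gguf.cpp. `gguf_header_view`, `read_le`, and `parse_gguf_header` are hypothetical names, and older format versions with narrower counts are not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical view of the fixed GGUF header fields.
struct gguf_header_view {
    uint32_t version;
    uint64_t n_tensors;
    uint64_t n_kv;
};

// Read an unsigned little-endian integer of sizeof(T) bytes.
template <typename T>
static T read_le(const unsigned char * p) {
    T v = 0;
    for (size_t i = 0; i < sizeof(T); ++i) {
        v |= static_cast<T>(p[i]) << (8 * i);
    }
    return v;
}

// Returns false if the buffer is too short or the magic does not match.
static bool parse_gguf_header(const unsigned char * data, size_t size, gguf_header_view & out) {
    const size_t header_size = 4 + 4 + 8 + 8; // magic + version + two counts
    if (size < header_size || std::memcmp(data, "GGUF", 4) != 0) {
        return false;
    }
    out.version   = read_le<uint32_t>(data + 4);
    out.n_tensors = read_le<uint64_t>(data + 8);
    out.n_kv      = read_le<uint64_t>(data + 16);
    return true;
}
```

The metadata pairs and tensor headers follow this fixed part; the loader's architecture checks only begin once gguf.cpp has parsed those.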

Splits

llama-model-loader.cpp natively understands split GGUFs (e.g. model-00001-of-00003.gguf). The split format and naming convention are shared with tools/gguf-split; see gguf-split tool.
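
The naming pattern in the example above can be reproduced with a short helper. `make_split_name` is a hypothetical function written for this page, and treating the split index as 0-based is an assumption of the sketch; only the `-NNNNN-of-NNNNN.gguf` shape is taken from the example file name.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: build "<prefix>-NNNNN-of-NNNNN.gguf".
// Assumption: split_idx is 0-based, so 1 is added for the file name.
static std::string make_split_name(const std::string & prefix, int split_idx, int split_count) {
    char suffix[48];
    std::snprintf(suffix, sizeof(suffix), "-%05d-of-%05d.gguf", split_idx + 1, split_count);
    return prefix + suffix;
}
```

With these assumptions, `make_split_name("model", 0, 3)` yields `model-00001-of-00003.gguf`, matching the example.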

KV overrides

Tools accept --override-kv key=type:value to patch GGUF metadata at load time. The overrides are passed as a list of llama_model_kv_override entries, which the loader consults before it reads each metadata key from the file.
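
A rough sketch of how a `key=type:value` argument could be split into a typed override, and how a list of overrides could be consulted by key. The `kv_override` struct, `parse_kv_override`, and `find_override` are hypothetical stand-ins, not the real `llama_model_kv_override` or the parser in common/arg.cpp, and the accepted type names (`int`, `float`, `bool`, `str`) are an assumption of this sketch.

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>
#include <vector>

enum class kv_type { i64, f64, boolean, str };

// Hypothetical override record: a key, a type tag, and one value slot per type.
struct kv_override {
    std::string key;
    kv_type     type     = kv_type::i64;
    int64_t     val_i64  = 0;
    double      val_f64  = 0.0;
    bool        val_bool = false;
    std::string val_str;
};

// Parse "key=type:value". Returns false on malformed input or unknown type.
static bool parse_kv_override(const std::string & arg, kv_override & out) {
    const auto eq = arg.find('=');
    if (eq == std::string::npos || eq == 0) return false;
    const auto colon = arg.find(':', eq + 1);
    if (colon == std::string::npos) return false;

    const std::string type  = arg.substr(eq + 1, colon - eq - 1);
    const std::string value = arg.substr(colon + 1);
    out.key = arg.substr(0, eq);

    if (type == "int") {
        out.type    = kv_type::i64;
        out.val_i64 = std::strtoll(value.c_str(), nullptr, 10);
    } else if (type == "float") {
        out.type    = kv_type::f64;
        out.val_f64 = std::strtod(value.c_str(), nullptr);
    } else if (type == "bool") {
        if (value != "true" && value != "false") return false;
        out.type     = kv_type::boolean;
        out.val_bool = (value == "true");
    } else if (type == "str") {
        out.type    = kv_type::str;
        out.val_str = value;
    } else {
        return false;
    }
    return true;
}

// Look up an override by key; nullptr means "use the value stored in the file".
static const kv_override * find_override(const std::vector<kv_override> & list, const std::string & key) {
    for (const auto & ov : list) {
        if (ov.key == key) return &ov;
    }
    return nullptr;
}
```

In this model, a metadata getter would call `find_override` first and fall back to the GGUF value only when it returns `nullptr`, which mirrors the "consult the override list before reading the key" behavior described above.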

Integration points

Quantization. llama-quant.cpp reuses the loader to read a source model, then writes a quantized output via a sibling writer. See Quantization.
State save/load. src/llama-model-saver.cpp writes a model back out, used for adapter merging.
Adapters. src/llama-adapter.cpp uses the loader's GGUF helpers to read LoRA adapter files alongside the base model.
CLI. Tools usually call llama_model_load_from_file (or llama_model_load_from_splits) from common/common.cpp, after argument parsing in common/arg.cpp.

Entry points for modification

New architecture. Add an LLM_ARCH_* enum value in src/llama-arch.h, populate the llm_arch_info table in src/llama-arch.cpp with the expected tensor names, define the per-arch graph in src/models/<your-arch>.cpp, and add a Python conversion path in convert_hf_to_gguf.py. The full recipe is docs/development/HOWTO-add-model.md.
New metadata key. Add the constant to src/llama-arch.h (the LLM_KV_* enum) and a getter helper in llama-model-loader.
New tensor naming. Add the canonical name to LLM_TENSOR_* and the per-arch override to llm_arch_info.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.