ggml-org/llama.cpp

Model loader

Active contributors: Georgi Gerganov, Sigbjørn Skjæret

src/llama-model-loader.cpp reads a GGUF file (or a set of split GGUF files), validates it against the architecture registered in src/llama-arch.cpp, and produces a fully populated llama_model plus llama_vocab. It is the bridge between the on-disk format and the in-memory model.

Purpose

  • Open and mmap (or read) one or more GGUF files.
  • Read GGUF metadata (architecture, hparams, vocab, chat template, ...) and validate it.
  • Resolve every tensor name in the architecture's expected manifest to a tensor in the file.
  • Allocate buffers on the correct backends and copy/quantize tensor data as needed.

Directory layout

src/
├── llama-model-loader.h       # public types: llama_model_loader, helpers for kv lookups
├── llama-model-loader.cpp     # actual loading logic, ~71 KB
├── llama-mmap.cpp / .h        # cross-platform mmap, prefetch, mlock
├── llama-arch.cpp / .h        # LLM_ARCH_* enum + per-arch tensor manifest
├── llama-hparams.cpp / .h     # struct llama_hparams (per-arch hyperparameters)
└── llama-model.cpp / .h       # llama_model itself

Key abstractions

Type                     File                      Role
llama_model_loader       src/llama-model-loader.h  Open file(s), expose gguf_context, manage tensor enumeration
llama_model_kv_override  include/llama.h           Override a single KV pair on load (--override-kv flag); consumed by the loader
LLM_ARCH_* enum          src/llama-arch.h          Architecture identifier (LLM_ARCH_LLAMA, LLM_ARCH_GEMMA3, ...)
LLM_TENSOR_* enum        src/llama-arch.h          Logical tensor name (LLM_TENSOR_TOKEN_EMBD, LLM_TENSOR_OUTPUT_NORM, ...)
llm_arch_info table      src/llama-arch.cpp        Per-arch mapping from LLM_TENSOR_* to GGUF tensor names
llama_hparams            src/llama-hparams.h       Layer count, head dim, RoPE settings, vocab size, ...
llama_mmap, llama_mlock  src/llama-mmap.h          RAII wrappers over mmap/MapViewOfFile, mlock/VirtualLock
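
To make the idea of a per-arch tensor manifest concrete, here is a minimal standalone sketch of expanding a manifest into GGUF tensor names and reporting which ones a file lacks. It is not the code in src/llama-arch.cpp: TensorSpec, expand_manifest and missing_tensors are hypothetical names, and the "blk.<layer>.<name>.weight" pattern is assumed from common GGUF tensor naming.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical manifest entry: a base tensor name and whether it repeats per layer.
struct TensorSpec {
    std::string name;       // e.g. "token_embd" or "attn_q"
    bool        per_layer;  // true -> expands to blk.<il>.<name>.weight
};

// Expand the manifest into the concrete tensor names expected in the file.
std::vector<std::string> expand_manifest(const std::vector<TensorSpec> & specs, int n_layer) {
    std::vector<std::string> names;
    for (const auto & s : specs) {
        if (!s.per_layer) {
            names.push_back(s.name + ".weight");
            continue;
        }
        for (int il = 0; il < n_layer; ++il) {
            names.push_back("blk." + std::to_string(il) + "." + s.name + ".weight");
        }
    }
    return names;
}

// Return every expected name that is absent from the set of names found in the file.
std::vector<std::string> missing_tensors(const std::vector<std::string> & expected,
                                         const std::set<std::string> & in_file) {
    std::vector<std::string> missing;
    for (const auto & name : expected) {
        if (in_file.count(name) == 0) {
            missing.push_back(name);
        }
    }
    return missing;
}
```

A loader built this way would refuse to continue when missing_tensors returns a non-empty list, which is roughly the "is this file consistent with its claimed architecture?" check in miniature.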

How it works

sequenceDiagram
    participant App
    participant Loader as llama_model_loader
    participant GGUF as gguf_context (ggml/src/gguf.cpp)
    participant Arch as llm_arch_info
    participant Model as llama_model

    App->>Loader: load(path or splits, params)
    Loader->>GGUF: gguf_init_from_file(s)
    GGUF-->>Loader: kv pairs + tensor headers
    Loader->>Loader: read general.architecture
    Loader->>Arch: lookup LLM_ARCH_*
    Arch-->>Loader: expected tensor names + types
    Loader->>Loader: read llama_hparams from kv
    Loader->>Loader: build vocab via llama-vocab.cpp
    Loader->>Model: allocate llama_model with tensors
    Loader->>Model: copy/quantize each tensor into backend buffers
    Model-->>App: ready

GGUF reading itself lives in ggml/src/gguf.cpp. The loader is responsible for the higher-level "is this file consistent with the architecture I claim it is?" validation.
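
As a rough illustration of the lowest layer the loader sits on, the sketch below decodes the fixed prefix of a GGUF file from a byte buffer. It assumes the version 2 and later layout (4-byte "GGUF" magic, 32-bit version, 64-bit tensor count, 64-bit KV count, all little-endian). GgufHeader, read_le and parse_gguf_header are hypothetical names, not functions from ggml/src/gguf.cpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Fixed-size prefix of a GGUF file (assumed layout for version 2 and later).
struct GgufHeader {
    uint32_t version   = 0;
    uint64_t n_tensors = 0;
    uint64_t n_kv      = 0;
};

// Decode an n-byte little-endian unsigned integer, independent of host endianness.
static uint64_t read_le(const uint8_t * p, size_t n) {
    uint64_t v = 0;
    for (size_t i = 0; i < n; ++i) {
        v |= static_cast<uint64_t>(p[i]) << (8 * i);
    }
    return v;
}

// Returns true and fills `out` if the buffer starts with the "GGUF" magic
// and is long enough to hold the fixed prefix; returns false otherwise.
bool parse_gguf_header(const std::vector<uint8_t> & buf, GgufHeader & out) {
    const size_t header_size = 4 + 4 + 8 + 8;
    if (buf.size() < header_size || std::memcmp(buf.data(), "GGUF", 4) != 0) {
        return false;
    }
    out.version   = static_cast<uint32_t>(read_le(buf.data() + 4, 4));
    out.n_tensors = read_le(buf.data() + 8, 8);
    out.n_kv      = read_le(buf.data() + 16, 8);
    return true;
}
```

Everything after this prefix (the KV pairs and tensor headers) is what the loader then interprets against the architecture's expectations.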

Splits

llama-model-loader.cpp natively understands split GGUFs (e.g. model-00001-of-00003.gguf). The split format and naming convention are shared with tools/gguf-split (see the gguf-split tool page).
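
The naming convention can be sketched as follows, assuming the pattern seen in the example above: a path prefix followed by a 1-based, five-digit, zero-padded split index and total. make_split_path is a hypothetical stand-in written for illustration. The public header is believed to offer llama_split_path for the same job, so prefer that in real code.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: build the file name of split `split_no` (0-based)
// out of `split_count` splits, e.g. ("model", 0, 3) -> model-00001-of-00003.gguf.
std::string make_split_path(const std::string & prefix, int split_no, int split_count) {
    char suffix[64];
    // The on-disk index is 1-based, hence split_no + 1.
    std::snprintf(suffix, sizeof(suffix), "-%05d-of-%05d.gguf", split_no + 1, split_count);
    return prefix + suffix;
}
```

Given the first split, a loader can derive the remaining paths by iterating split_no from 1 to split_count - 1 with the same prefix.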

KV overrides

Tools accept --override-kv key=type:value to patch GGUF metadata at load time. This is implemented as a list of llama_model_kv_override consulted before the loader reads each metadata key.
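
A minimal sketch of parsing one key=type:value argument into a typed record is shown below. KvOverride and parse_kv_override are hypothetical and do not mirror the layout of llama_model_kv_override. The four type tags (int, float, bool, str) are an assumption of this sketch.

```cpp
#include <cstddef>
#include <exception>
#include <string>

enum class OverrideType { Int, Float, Bool, Str };

// Hypothetical parsed form of one --override-kv argument.
struct KvOverride {
    std::string  key;
    OverrideType type     = OverrideType::Str;
    long long    val_i64  = 0;
    double       val_f64  = 0.0;
    bool         val_bool = false;
    std::string  val_str;
};

// Parse "key=type:value". Returns false on any malformed input.
bool parse_kv_override(const std::string & arg, KvOverride & out) {
    const size_t eq = arg.find('=');
    if (eq == std::string::npos || eq == 0) return false;
    const size_t colon = arg.find(':', eq + 1);
    if (colon == std::string::npos) return false;

    const std::string type  = arg.substr(eq + 1, colon - eq - 1);
    const std::string value = arg.substr(colon + 1);

    KvOverride kv;
    kv.key = arg.substr(0, eq);
    try {
        size_t pos = 0;
        if (type == "int") {
            kv.type    = OverrideType::Int;
            kv.val_i64 = std::stoll(value, &pos);
            if (pos != value.size()) return false;  // reject trailing junk
        } else if (type == "float") {
            kv.type    = OverrideType::Float;
            kv.val_f64 = std::stod(value, &pos);
            if (pos != value.size()) return false;
        } else if (type == "bool") {
            if (value != "true" && value != "false") return false;
            kv.type     = OverrideType::Bool;
            kv.val_bool = (value == "true");
        } else if (type == "str") {
            kv.type    = OverrideType::Str;
            kv.val_str = value;
        } else {
            return false;  // unknown type tag
        }
    } catch (const std::exception &) {
        return false;  // stoll/stod could not parse the value
    }
    out = kv;
    return true;
}
```

A loader following the description above would keep a list of such records and check it by key before falling back to the value stored in the file.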

Integration points

  • Quantization. llama-quant.cpp reuses the loader to read a source model, then writes a quantized output via a sibling writer. See Quantization.
  • State save/load. src/llama-model-saver.cpp writes a model back out, used for adapter merging.
  • Adapters. src/llama-adapter.cpp uses the loader's GGUF helpers to read LoRA adapter files alongside the base model.
  • CLI. Tools usually call llama_model_load_from_file (or _from_splits) from common/common.cpp, after argument parsing in common/arg.cpp.

Entry points for modification

  • New architecture. Add an LLM_ARCH_* enum value in src/llama-arch.h, populate the llm_arch_info table in src/llama-arch.cpp with the expected tensor names, define the per-arch graph in src/models/<your-arch>.cpp, and add a Python conversion path in convert_hf_to_gguf.py. The full recipe is docs/development/HOWTO-add-model.md.
  • New metadata key. Add the constant to src/llama-arch.h (the LLM_KV_* enum) and a getter helper in llama-model-loader.
  • New tensor naming. Add the canonical name to LLM_TENSOR_* and the per-arch override to llm_arch_info.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
