ggml-org/llama.cpp

Architecture switch

Active contributors: Sigbjørn Skjæret, Georgi Gerganov

llama.cpp supports a long list of model families. Each one differs in attention layout, position encoding, normalization, MoE routing, vocabulary, and many other details. The "architecture switch" is the project's way of keeping that complexity manageable: a single enum identifies each architecture, and a per-architecture file under src/models/ builds its computation graph.

Purpose

Map the general.architecture GGUF metadata key to a fully-formed ggml_cgraph for any supported model family.

How it works

graph TD
    GGUFKey["GGUF: general.architecture = 'llama'"]
    Enum[LLM_ARCH_LLAMA in src/llama-arch.h]
    Loader[llama-model-loader reads tensors named per arch]
    BuildGraph[llama-graph dispatches by arch]
    Builder[src/models/llama.cpp llm_build_llama]
    Compute[ggml_cgraph executed by ggml_backend_sched]

    GGUFKey --> Enum
    Enum --> Loader
    Enum --> BuildGraph
    BuildGraph --> Builder
    Builder --> Compute

The relevant pieces:

Enum. enum llm_arch in src/llama-arch.h (LLM_ARCH_LLAMA, LLM_ARCH_FALCON, LLM_ARCH_GEMMA3, LLM_ARCH_QWEN3, LLM_ARCH_MAMBA, LLM_ARCH_RWKV6, ...). The mapping from each enum value to its general.architecture string is LLM_ARCH_NAMES in src/llama-arch.cpp.
Tensor manifest. Each enum value has a row in the LLM_TENSOR_NAMES table inside src/llama-arch.cpp describing the canonical GGUF tensor names (token_embd.weight, blk.{N}.attn_norm.weight, ...). The loader uses this to find the right tensors.
Per-architecture graph builder. Files in src/models/ such as llama.cpp, gemma3.cpp, qwen3.cpp, phi3.cpp, mamba.cpp, rwkv6.cpp, bitnet.cpp, and so on. Each defines a llm_build_* class built on the shared graph-building helpers in src/llama-graph.h.
Dispatcher. llama-graph.cpp (and llama-context.cpp for the choice of memory backend) reads model->arch and constructs the matching builder.
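The first two pieces can be pictured with a minimal, self-contained sketch. It is not the project's code: llm_arch_sketch, ARCH_NAMES, arch_from_string, and tensor_name are hypothetical stand-ins, and the printf-style "blk.%d" pattern is an assumption about how the blk.{N} placeholder might be expanded.

```cpp
#include <cassert>
#include <cstdio>
#include <map>
#include <string>

// Illustrative subset of an architecture enum (hypothetical names).
enum llm_arch_sketch {
    ARCH_LLAMA,
    ARCH_MAMBA,
    ARCH_UNKNOWN,
};

// Stand-in for a name table: enum value -> general.architecture string.
static const std::map<llm_arch_sketch, const char *> ARCH_NAMES = {
    { ARCH_LLAMA, "llama" },
    { ARCH_MAMBA, "mamba" },
};

// Reverse lookup: metadata string -> enum value, or ARCH_UNKNOWN if absent.
llm_arch_sketch arch_from_string(const std::string & name) {
    for (const auto & kv : ARCH_NAMES) {
        if (name == kv.second) {
            return kv.first;
        }
    }
    return ARCH_UNKNOWN;
}

// Expand a per-layer pattern such as "blk.%d.attn_norm" and append a suffix.
std::string tensor_name(const char * pattern, int layer, const char * suffix) {
    char buf[128];
    const int n = std::snprintf(buf, sizeof(buf), pattern, layer);
    assert(n >= 0 && n < (int) sizeof(buf)); // pattern must fit the buffer
    return std::string(buf) + "." + suffix;
}
```

Under these assumptions, arch_from_string("llama") yields ARCH_LLAMA, and tensor_name("blk.%d.attn_norm", 3, "weight") yields the name a loader would look up for layer 3.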

Directory layout

src/
├── llama-arch.h           # llm_arch enum, LLM_TENSOR_* enum, LLM_KV_* enum
├── llama-arch.cpp         # arch name strings, LLM_TENSOR_NAMES table, KV name strings
├── llama-graph.h          # graph-building helpers shared by all architectures
├── llama-graph.cpp        # entry point that dispatches on arch
├── llama-model.cpp        # tensor allocation per arch
└── models/
    ├── llama.cpp          # llm_build_llama
    ├── llama4.cpp         # llm_build_llama4
    ├── gemma.cpp / gemma2.cpp / gemma3.cpp
    ├── qwen.cpp / qwen2.cpp / qwen3.cpp / qwen3moe.cpp
    ├── phi2.cpp / phi3.cpp / phimoe.cpp
    ├── mistral.cpp / mixtral.cpp
    ├── falcon.cpp / mpt.cpp / refact.cpp / starcoder.cpp / bloom.cpp / bert.cpp
    ├── deepseek.cpp / deepseek2.cpp
    ├── command-r.cpp / command-r7b.cpp
    ├── granite.cpp / granitemoe.cpp / granitehybrid.cpp
    ├── mamba.cpp / mamba2.cpp / rwkv6.cpp / rwkv7.cpp
    ├── bitnet.cpp / bitnet-b1-58.cpp
    ├── glm4.cpp / glmedge.cpp / chatglm.cpp
    ├── internlm2.cpp / orion.cpp / yi.cpp / xverse.cpp
    ├── jamba.cpp / dbrx.cpp / openelm.cpp / olmo.cpp / olmo2.cpp / olmoe.cpp
    ├── snowflake-arctic.cpp / hunyuan.cpp / lfm2.cpp / lfm2vl.cpp
    └── ... (~70 files total)

Adding a new architecture

The canonical recipe lives in docs/development/HOWTO-add-model.md. The short version:

Pick a name. Add it to the llm_arch enum in src/llama-arch.h and to LLM_ARCH_NAMES in src/llama-arch.cpp.
Define the tensor manifest. Add a row to LLM_TENSOR_NAMES in src/llama-arch.cpp listing every tensor name your model uses.
Add hparams fields if needed. Edit src/llama-hparams.h for any new hyperparameter.
Implement the graph. Create src/models/<your-arch>.cpp defining llm_build_<arch>. Use the helpers in src/llama-graph.h to keep it consistent with neighboring architectures.
Allocate tensors. Add a branch in src/llama-model.cpp::load_tensors that allocates per-layer tensors with the right shapes and types.
Wire the dispatcher. src/llama-graph.cpp and src/llama-context.cpp need new branches selecting your builder and the right memory backend (KV-cache vs recurrent vs hybrid).
Conversion. Add a Model subclass in convert_hf_to_gguf.py that emits the right tensor names and metadata.
Test. Convert a small checkpoint, run llama-cli with it, run llama-perplexity, and post the resulting numbers in the PR.
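The "implement the graph" and "wire the dispatcher" steps can be sketched as one new builder class plus one new switch case. Everything below is hypothetical (graph_builder, build_myarch, make_builder, build_graph_for); the real builders produce a ggml_cgraph rather than a string, and the real dispatch code may be organized differently.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical architecture ids; ARCH_MYARCH plays the newly added family.
enum arch_id { ARCH_LLAMA, ARCH_MYARCH };

// Hypothetical common interface shared by all per-architecture builders.
struct graph_builder {
    virtual ~graph_builder() = default;
    virtual std::string describe() const = 0; // stands in for building a graph
};

// Existing family: untouched when a new architecture is added.
struct build_llama : graph_builder {
    std::string describe() const override { return "llama graph"; }
};

// New family: would live in its own file in the real layout.
struct build_myarch : graph_builder {
    std::string describe() const override { return "myarch graph"; }
};

// The dispatch site: the only existing code that gains a branch.
std::unique_ptr<graph_builder> make_builder(arch_id arch) {
    switch (arch) {
        case ARCH_LLAMA:  return std::make_unique<build_llama>();
        case ARCH_MYARCH: return std::make_unique<build_myarch>();
    }
    throw std::runtime_error("unknown architecture");
}

// Convenience wrapper: select the builder, then run it.
std::string build_graph_for(arch_id arch) {
    const std::unique_ptr<graph_builder> builder = make_builder(arch);
    assert(builder != nullptr);
    return builder->describe();
}
```

In this sketch, supporting ARCH_MYARCH costs one enum entry, one class, and one case label, which mirrors the recipe's split between new files and short dispatch edits.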

CONTRIBUTING.md notes that new model PRs should focus on CPU-only support; backend-specific kernels can follow.

Why this layout works

The architecture switch keeps llama_model and llama_context agnostic of any single architecture. Adding a new family generally means adding files rather than editing existing ones; the exceptions are the dispatch sites, which are intentionally short. Reviewers can read a single src/models/<arch>.cpp and have the entire graph for that family in one place.

Integration points

Loader. Reads general.architecture, looks up llm_arch, asks llama_model to allocate using the matching tensor manifest. See Model loader.
Graph builder. llama-graph.cpp calls into the per-arch builder during llama_decode.
Memory backend. Standard models use llama-kv-cache. Sliding-window models (Gemma 2/3, some Qwen variants, Phi-3) use llama-kv-cache-iswa. SSMs (Mamba, RWKV) use llama-memory-recurrent. Hybrids (Jamba, Granite Hybrid) use llama-memory-hybrid / llama-memory-hybrid-iswa. See KV cache and memory.
Conversion. convert_hf_to_gguf.py is the Python-side counterpart of this subsystem: it emits the tensor names and metadata that the C++ side later reads back.
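The memory backend choice described above can be sketched as a small decision function. The arch_traits flags, the memory_kind enum, and choose_memory are hypothetical; in particular, treating the hybrid and sliding-window properties as two independent booleans is an assumption made for illustration, not a description of how llama-context.cpp decides.

```cpp
#include <cassert>

// Hypothetical per-architecture properties relevant to memory selection.
struct arch_traits {
    bool recurrent;      // pure state-space or RNN-style models (e.g. Mamba, RWKV)
    bool hybrid;         // mix of attention and recurrent layers (e.g. Jamba)
    bool sliding_window; // uses sliding-window attention (e.g. Gemma 2/3)
};

// One value per memory implementation named in the list above.
enum memory_kind {
    MEM_KV_CACHE,       // llama-kv-cache
    MEM_KV_CACHE_ISWA,  // llama-kv-cache-iswa
    MEM_RECURRENT,      // llama-memory-recurrent
    MEM_HYBRID,         // llama-memory-hybrid
    MEM_HYBRID_ISWA,    // llama-memory-hybrid-iswa
};

memory_kind choose_memory(const arch_traits & t) {
    // In this sketch a model is either purely recurrent or hybrid, never both.
    assert(!(t.recurrent && t.hybrid));
    if (t.hybrid) {
        return t.sliding_window ? MEM_HYBRID_ISWA : MEM_HYBRID;
    }
    if (t.recurrent) {
        return MEM_RECURRENT;
    }
    return t.sliding_window ? MEM_KV_CACHE_ISWA : MEM_KV_CACHE;
}
```

With these assumptions, a plain transformer maps to MEM_KV_CACHE, a sliding-window transformer to MEM_KV_CACHE_ISWA, and a hybrid without sliding-window attention to MEM_HYBRID.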
