ggml-org/llama.cpp
Architecture switch
Active contributors: Sigbjørn Skjæret, Georgi Gerganov
llama.cpp supports a long list of model families. Each one differs in attention layout, position encoding, normalization, MoE routing, vocabulary, and many other details. The "architecture switch" is the project's way of keeping that complexity manageable: a single enum identifies each architecture, and a per-architecture file under src/models/ builds its computation graph.
Purpose
Map the general.architecture GGUF metadata key to a fully-formed ggml_cgraph for any supported model family.
How it works
graph TD
GGUFKey["GGUF: general.architecture = 'llama'"]
Enum[LLM_ARCH_LLAMA in src/llama-arch.h]
Loader[llama-model-loader reads tensors named per arch]
BuildGraph[llama-graph dispatches by arch]
Builder[src/models/llama.cpp llm_build_llama]
Compute[ggml_cgraph executed by ggml_backend_sched]
GGUFKey --> Enum
Enum --> Loader
Enum --> BuildGraph
BuildGraph --> Builder
Builder --> Compute
The relevant pieces:
- Enum. enum llm_arch in src/llama-arch.h (LLM_ARCH_LLAMA, LLM_ARCH_FALCON, LLM_ARCH_GEMMA3, LLM_ARCH_QWEN3, LLM_ARCH_MAMBA, LLM_ARCH_RWKV6, ...). String mapping LLM_ARCH_NAMES in src/llama-arch.cpp.
- Tensor manifest. Each enum value has a row in the LLM_TENSOR_NAMES table inside src/llama-arch.cpp describing the canonical GGUF tensor names (token_embd.weight, blk.{N}.attn_norm.weight, ...). The loader uses this to find the right tensors.
- Per-architecture graph builder. Files in src/models/ such as llama.cpp, gemma3.cpp, qwen3.cpp, phi3.cpp, mamba.cpp, rwkv6.cpp, bitnet.cpp, and so on. Each defines an llm_build_* class derived from helpers in src/llama-graph.h.
- Dispatcher. llama-graph.cpp (and llama-context.cpp for the choice of memory backend) reads model->arch and constructs the matching builder.
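The flow from GGUF string to enum value to builder can be illustrated with a small, self-contained sketch. It is a toy model of the pattern, not the project's code: the enum is cut down to two entries, and arch_from_string, graph_builder, build_llama, build_mamba, and make_builder are hypothetical names chosen for this example. Real builders produce a ggml_cgraph rather than a description string.

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Toy stand-in for the enum in src/llama-arch.h (the real one has many more values).
enum llm_arch { LLM_ARCH_LLAMA, LLM_ARCH_MAMBA, LLM_ARCH_UNKNOWN };

// Analogue of LLM_ARCH_NAMES: enum value -> general.architecture string.
static const std::map<llm_arch, std::string> ARCH_NAMES = {
    { LLM_ARCH_LLAMA, "llama" },
    { LLM_ARCH_MAMBA, "mamba" },
};

// Hypothetical helper: reverse lookup from the GGUF string to the enum value.
llm_arch arch_from_string(const std::string & name) {
    for (const auto & kv : ARCH_NAMES) {
        if (kv.second == name) {
            return kv.first;
        }
    }
    return LLM_ARCH_UNKNOWN;
}

// Hypothetical builder interface; a real builder would construct a ggml_cgraph.
struct graph_builder {
    virtual ~graph_builder() = default;
    virtual std::string describe() const = 0;
};

struct build_llama : graph_builder {
    std::string describe() const override { return "attention + FFN blocks"; }
};

struct build_mamba : graph_builder {
    std::string describe() const override { return "state-space blocks"; }
};

// Dispatcher: a single switch selects the builder for the architecture.
std::unique_ptr<graph_builder> make_builder(llm_arch arch) {
    switch (arch) {
        case LLM_ARCH_LLAMA: return std::make_unique<build_llama>();
        case LLM_ARCH_MAMBA: return std::make_unique<build_mamba>();
        default: throw std::runtime_error("unknown model architecture");
    }
}
```

In this sketch, make_builder(arch_from_string("mamba"))->describe() yields the state-space description, while an unrecognized string maps to LLM_ARCH_UNKNOWN and the dispatcher throws. Keeping the switch as the only place that names concrete builders is what lets everything else stay architecture-agnostic.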
Directory layout
src/
├── llama-arch.h # llm_arch enum, LLM_TENSOR_* enum, LLM_KV_* enum
├── llama-arch.cpp # arch name strings, LLM_TENSOR_NAMES table, KV name strings
├── llama-graph.h # graph-building helpers shared by all architectures
├── llama-graph.cpp # entry point that dispatches on arch
├── llama-model.cpp # tensor allocation per arch
└── models/
    ├── llama.cpp # llm_build_llama
    ├── llama4.cpp # llm_build_llama4
    ├── gemma.cpp / gemma2.cpp / gemma3.cpp
    ├── qwen.cpp / qwen2.cpp / qwen3.cpp / qwen3moe.cpp
    ├── phi2.cpp / phi3.cpp / phimoe.cpp
    ├── mistral.cpp / mixtral.cpp
    ├── falcon.cpp / mpt.cpp / refact.cpp / starcoder.cpp / bloom.cpp / bert.cpp
    ├── deepseek.cpp / deepseek2.cpp
    ├── command-r.cpp / command-r7b.cpp
    ├── granite.cpp / granitemoe.cpp / granitehybrid.cpp
    ├── mamba.cpp / mamba2.cpp / rwkv6.cpp / rwkv7.cpp
    ├── bitnet.cpp / bitnet-b1-58.cpp
    ├── glm4.cpp / glmedge.cpp / chatglm.cpp
    ├── internlm2.cpp / orion.cpp / yi.cpp / xverse.cpp
    ├── jamba.cpp / dbrx.cpp / openelm.cpp / olmo.cpp / olmo2.cpp / olmoe.cpp
    ├── snowflake-arctic.cpp / hunyuan.cpp / lfm2.cpp / lfm2vl.cpp
    └── ... (~70 files total)
Adding a new architecture
The canonical recipe lives in docs/development/HOWTO-add-model.md. The short version:
- Pick a name. Add it to the llm_arch enum in src/llama-arch.h and to LLM_ARCH_NAMES in src/llama-arch.cpp.
- Define the tensor manifest. Add a row to LLM_TENSOR_NAMES in src/llama-arch.cpp listing every tensor name your model uses.
- Add hparams fields if needed. Edit src/llama-hparams.h for any new hyperparameter.
- Implement the graph. Create src/models/<your-arch>.cpp defining llm_build_<arch>. Use the helpers in src/llama-graph.h to keep it consistent with neighboring architectures.
- Allocate tensors. Add a branch in src/llama-model.cpp::load_tensors that allocates per-layer tensors with the right shapes and types.
- Wire the dispatcher. src/llama-graph.cpp and src/llama-context.cpp need new branches selecting your builder and the right memory backend (KV-cache vs recurrent vs hybrid).
- Conversion. Add a Model subclass in convert_hf_to_gguf.py that emits the right tensor names and metadata.
- Test. Convert a small checkpoint, run llama-cli with it, run llama-perplexity, and post the numbers in the PR.
CONTRIBUTING.md notes that new model PRs should focus on CPU-only support; backend-specific kernels can follow.
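The tensor manifest step in the recipe can be pictured as a table of name patterns that get expanded per layer. The following is a rough sketch under stated assumptions, not the layout of LLM_TENSOR_NAMES itself: the tensor_id enum, MY_ARCH_TENSORS, tensor_name, and expected_tensors are hypothetical, the {N} placeholder mirrors the notation used on this page rather than the real format string, and output_norm is included only as an illustrative third entry.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, reduced set of tensor identifiers for one architecture.
enum tensor_id { TENSOR_TOKEN_EMBD, TENSOR_ATTN_NORM, TENSOR_OUTPUT_NORM };

// One manifest "row": tensor id -> name pattern.
// Per-layer tensors carry a {N} placeholder for the block index.
static const std::map<tensor_id, std::string> MY_ARCH_TENSORS = {
    { TENSOR_TOKEN_EMBD,  "token_embd" },
    { TENSOR_ATTN_NORM,   "blk.{N}.attn_norm" },
    { TENSOR_OUTPUT_NORM, "output_norm" },
};

// Expand a pattern into a full tensor name such as "blk.3.attn_norm.weight".
std::string tensor_name(tensor_id id, const std::string & suffix, int layer = -1) {
    std::string name = MY_ARCH_TENSORS.at(id);
    const std::string placeholder = "{N}";
    const std::string::size_type pos = name.find(placeholder);
    if (pos != std::string::npos) {
        // A per-layer pattern is only meaningful with a valid block index.
        assert(layer >= 0 && "per-layer tensor needs a block index");
        name.replace(pos, placeholder.size(), std::to_string(layer));
    }
    return name + "." + suffix;
}

// List every tensor name a loader would look for in a model with n_layer blocks.
std::vector<std::string> expected_tensors(int n_layer) {
    std::vector<std::string> out;
    out.push_back(tensor_name(TENSOR_TOKEN_EMBD, "weight"));
    for (int il = 0; il < n_layer; ++il) {
        out.push_back(tensor_name(TENSOR_ATTN_NORM, "weight", il));
    }
    out.push_back(tensor_name(TENSOR_OUTPUT_NORM, "weight"));
    return out;
}
```

Under these assumptions, the converter and the loader only agree if both sides produce exactly the same strings, which is why the manifest row and the convert_hf_to_gguf.py changes are best reviewed together.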
Why this layout works
The architecture switch keeps llama_model and llama_context agnostic of any single architecture. Adding a new family generally means adding files, not editing existing ones — except for the dispatch sites, which are intentionally short. Reviewers can read a single src/models/<arch>.cpp and have the entire graph for that family in one place.
Integration points
- Loader. Reads general.architecture, looks up llm_arch, asks llama_model to allocate using the matching tensor manifest. See Model loader.
- Graph builder. llama-graph.cpp calls into the per-arch builder during llama_decode.
- Memory backend. Standard models use llama-kv-cache. Sliding-window models (Gemma 2/3, some Qwen variants, Phi-3) use llama-kv-cache-iswa. SSMs (Mamba, RWKV) use llama-memory-recurrent. Hybrids (Jamba, Granite Hybrid) use llama-memory-hybrid / llama-memory-hybrid-iswa. See KV cache and memory.
- Conversion. convert_hf_to_gguf.py is the inverse of this whole subsystem on the Python side.
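The memory backend choice described in the list above can be summarized as a small decision function. This is only a sketch of the decision as this page describes it: arch_traits and choose_memory are hypothetical, the returned strings are the component names from the list, and the actual selection logic in llama.cpp may weigh additional hyperparameters.

```cpp
#include <string>

// Hypothetical per-architecture traits; the real code derives the choice
// from the architecture and its hyperparameters.
struct arch_traits {
    bool recurrent; // carries recurrent/SSM state (e.g. Mamba, RWKV)
    bool attention; // has attention layers that need a KV cache
    bool swa;       // uses sliding-window attention
};

// Pick the memory component named on this page for a given set of traits.
std::string choose_memory(const arch_traits & t) {
    if (t.recurrent && t.attention) {
        return t.swa ? "llama-memory-hybrid-iswa" : "llama-memory-hybrid";
    }
    if (t.recurrent) {
        return "llama-memory-recurrent";
    }
    return t.swa ? "llama-kv-cache-iswa" : "llama-kv-cache";
}
```

For example, traits of {recurrent = false, attention = true, swa = true} select llama-kv-cache-iswa, and {recurrent = true, attention = true, swa = false} select llama-memory-hybrid.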
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.