Factory.ai

Open-Source Wikis

/

llama.cpp

/

Systems

/

Chat templates

ggml-org/llama.cpp

Chat templates

Active contributors: Xuan-Son Nguyen, CISC, Sigbjørn Skjæret

Chat templates turn structured {role, content} messages into the exact prompt format a given model was trained with. llama.cpp has two layers of chat-template support: a small built-in dispatcher in src/llama-chat.cpp for known formats, and a full Jinja engine in common/chat.cpp + common/jinja/ that executes whatever template the model ships in its GGUF metadata.

Purpose

  • Render (messages, tools, system) into a prompt string the model expects.
  • Parse the model's output back into structured assistant messages, including tool/function calls (the "autoparser").

Two-layer design

graph TD
    Input[messages + tools]
    LlamaChat[src/llama-chat.cpp built-in templates]
    CommonChat[common/chat.cpp Jinja-based templates]
    Out[rendered prompt]
    Parser[common/chat-auto-parser*.cpp]
    Structured[parsed assistant message]

    Input -->|known template name| LlamaChat
    LlamaChat --> Out
    Input -->|GGUF jinja template| CommonChat
    CommonChat --> Out
    Out --> Model[LLM generation]
    Model --> Parser
    Parser --> Structured

src/llama-chat.cpp (built-in)

This is the lightweight path. It enumerates a known set of templates (LLM_CHAT_TEMPLATE_* enum) and renders messages directly in C++. It supports the common ones (LLaMA-2 / 3, ChatML, Phi, Vicuna, OpenChat, Falcon, Gemma, Granite, Command-R, ...) without needing Jinja. Used when:

  • The user asks for a known template by name (--chat-template chatml).
  • The model's GGUF embeds a template name rather than full Jinja source.
  • The Jinja path is disabled or fails.

common/chat.cpp + common/jinja/

The full path. The model's GGUF metadata typically embeds a Jinja2 template under the key tokenizer.chat_template. common/chat.cpp feeds that template plus the conversation into the vendored Jinja engine in common/jinja/ (a port of google/minja) and gets back the formatted prompt.

This is what makes "any model with a chat template just works" possible — including custom and rare formats that nobody has hard-coded in C++.

Tool / function calling

Tool support is layered on top of templating. The relevant files are:

File Purpose
common/chat.cpp Renders prompts with tools parameter; dispatches per-format quirks
common/chat-auto-parser.h, .cpp Streaming parser for assistant output; auto-detects model format
common/chat-auto-parser-generator.cpp Generates a model-specific parser from format metadata
common/chat-peg-parser.cpp, .h PEG-based extraction of tool-call payloads
common/chat-diff-analyzer.cpp Diff helper for incremental streaming output
common/peg-parser.cpp, .h The general-purpose PEG engine (~82 KB)
docs/autoparser.md Long-form design doc
docs/function-calling.md User-facing how-to
docs/development/parsing.md PEG parser doc

The autoparser handles incremental, streaming output: it can return partial tool-call JSON to clients before the full call has been emitted, which is what llama-server exposes via OpenAI-compatible streaming.

Reference templates

models/templates/ contains canonical Jinja templates for many model families. They are used by tests (tests/test-chat-template.cpp) and as defaults when a model doesn't ship one.

How a server request flows

sequenceDiagram
    participant Client
    participant Server as tools/server
    participant Chat as common/chat.cpp
    participant Llama as libllama
    participant Parser as autoparser

    Client->>Server: POST /v1/chat/completions {messages, tools}
    Server->>Chat: render(messages, tools, gguf_template)
    Chat-->>Server: prompt string
    Server->>Llama: tokenize + decode (streaming)
    Llama-->>Server: tokens
    Server->>Parser: feed token deltas
    Parser-->>Server: partial / final assistant message + tool calls
    Server-->>Client: SSE chunks

Integration points

  • Server. tools/server/server-chat.cpp is the orchestrator — it owns the conversation, picks the template, and consults the autoparser.
  • CLI. tools/cli/cli.cpp uses common/chat.cpp for -cnv (conversation) mode.
  • Sampling. Tool-calling typically pairs with a grammar (lazy GBNF or llguidance) so the model can only emit syntactically valid calls.
  • GGUF metadata. Templates are read from tokenizer.chat_template and friends by src/llama-model-loader.cpp.

Entry points for modification

  • New built-in template. Add an LLM_CHAT_TEMPLATE_* enum value in src/llama-chat.h and a renderer branch in src/llama-chat.cpp. Add a test fixture in tests/test-chat-template.cpp.
  • Improve Jinja parity. Edit common/jinja/ and add a regression case to tests/test-chat-template.cpp.
  • New tool-call format. Add detection rules to common/chat-auto-parser*.cpp and a fixture in tests/test-chat-parser.cpp. The PEG grammar lives in common/chat-peg-parser.cpp.
  • Reasoning models. Reasoning budget / thinking tags are handled in common/reasoning-budget.cpp; check that file for "thinking-aware" output handling.

Tests

  • tests/test-chat.cpp, test-chat-template.cpp, test-chat-parser.cpp cover the C++ side.
  • tests/peg-parser/ holds snapshot tests for the PEG engine.
  • tools/server/tests/ exercises end-to-end behavior including tool calls.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.

Chat templates – llama.cpp wiki | Factory