ggml-org/llama.cpp

Chat templates

Active contributors: Xuan-Son Nguyen, CISC, Sigbjørn Skjæret

Chat templates turn structured {role, content} messages into the exact prompt format a given model was trained with. llama.cpp has two layers of chat-template support: a small built-in dispatcher in src/llama-chat.cpp for known formats, and a full Jinja engine in common/chat.cpp + common/jinja/ that executes whatever template the model ships in its GGUF metadata.

Purpose

Render (messages, tools, system) into a prompt string the model expects.
Parse the model's output back into structured assistant messages, including tool/function calls (the "autoparser").

Two-layer design

graph TD
    Input[messages + tools]
    LlamaChat[src/llama-chat.cpp built-in templates]
    CommonChat[common/chat.cpp Jinja-based templates]
    Out[rendered prompt]
    Parser[common/chat-auto-parser*.cpp]
    Structured[parsed assistant message]

    Input -->|known template name| LlamaChat
    LlamaChat --> Out
    Input -->|GGUF jinja template| CommonChat
    CommonChat --> Out
    Out --> Model[LLM generation]
    Model --> Parser
    Parser --> Structured

`src/llama-chat.cpp` (built-in)

This is the lightweight path. It enumerates a known set of templates (LLM_CHAT_TEMPLATE_* enum) and renders messages directly in C++. It supports the common ones (LLaMA-2 / 3, ChatML, Phi, Vicuna, OpenChat, Falcon, Gemma, Granite, Command-R, ...) without needing Jinja. Used when:

The user asks for a known template by name (--chat-template chatml).
The model's GGUF embeds a template name rather than full Jinja source.
The Jinja path is disabled or fails.

`common/chat.cpp` + `common/jinja/`

The full path. The model's GGUF metadata typically embeds a Jinja2 template under the key tokenizer.chat_template. common/chat.cpp feeds that template plus the conversation into the vendored Jinja engine in common/jinja/ (a port of google/minja) and gets back the formatted prompt.

This is what makes "any model with a chat template just works" possible — including custom and rare formats that nobody has hard-coded in C++.

Tool / function calling

Tool support is layered on top of templating. The relevant files are:

File	Purpose
`common/chat.cpp`	Renders prompts with `tools` parameter; dispatches per-format quirks
`common/chat-auto-parser.h`, `.cpp`	Streaming parser for assistant output; auto-detects model format
`common/chat-auto-parser-generator.cpp`	Generates a model-specific parser from format metadata
`common/chat-peg-parser.cpp`, `.h`	PEG-based extraction of tool-call payloads
`common/chat-diff-analyzer.cpp`	Diff helper for incremental streaming output
`common/peg-parser.cpp`, `.h`	The general-purpose PEG engine (~82 KB)
`docs/autoparser.md`	Long-form design doc
`docs/function-calling.md`	User-facing how-to
`docs/development/parsing.md`	PEG parser doc

The autoparser handles incremental, streaming output: it can return partial tool-call JSON to clients before the full call has been emitted, which is what llama-server exposes via OpenAI-compatible streaming.

Reference templates

models/templates/ contains canonical Jinja templates for many model families. They are used by tests (tests/test-chat-template.cpp) and as defaults when a model doesn't ship one.

How a server request flows

sequenceDiagram
    participant Client
    participant Server as tools/server
    participant Chat as common/chat.cpp
    participant Llama as libllama
    participant Parser as autoparser

    Client->>Server: POST /v1/chat/completions {messages, tools}
    Server->>Chat: render(messages, tools, gguf_template)
    Chat-->>Server: prompt string
    Server->>Llama: tokenize + decode (streaming)
    Llama-->>Server: tokens
    Server->>Parser: feed token deltas
    Parser-->>Server: partial / final assistant message + tool calls
    Server-->>Client: SSE chunks

Integration points

Server. tools/server/server-chat.cpp is the orchestrator — it owns the conversation, picks the template, and consults the autoparser.
CLI. tools/cli/cli.cpp uses common/chat.cpp for -cnv (conversation) mode.
Sampling. Tool-calling typically pairs with a grammar (lazy GBNF or llguidance) so the model can only emit syntactically valid calls.
GGUF metadata. Templates are read from tokenizer.chat_template and friends by src/llama-model-loader.cpp.

Entry points for modification

New built-in template. Add an LLM_CHAT_TEMPLATE_* enum value in src/llama-chat.h and a renderer branch in src/llama-chat.cpp. Add a test fixture in tests/test-chat-template.cpp.
Improve Jinja parity. Edit common/jinja/ and add a regression case to tests/test-chat-template.cpp.
New tool-call format. Add detection rules to common/chat-auto-parser*.cpp and a fixture in tests/test-chat-parser.cpp. The PEG grammar lives in common/chat-peg-parser.cpp.
Reasoning models. Reasoning budget / thinking tags are handled in common/reasoning-budget.cpp; check that file for "thinking-aware" output handling.

Tests

tests/test-chat.cpp, test-chat-template.cpp, test-chat-parser.cpp cover the C++ side.
tests/peg-parser/ holds snapshot tests for the PEG engine.
tools/server/tests/ exercises end-to-end behavior including tool calls.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.