ggml-org/llama.cpp
Chat templates
Active contributors: Xuan-Son Nguyen, CISC, Sigbjørn Skjæret
Chat templates turn structured {role, content} messages into the exact prompt format a given model was trained with. llama.cpp has two layers of chat-template support: a small built-in dispatcher in src/llama-chat.cpp for known formats, and a full Jinja engine in common/chat.cpp + common/jinja/ that executes whatever template the model ships in its GGUF metadata.
Purpose
- Render
(messages, tools, system)into a prompt string the model expects. - Parse the model's output back into structured assistant messages, including tool/function calls (the "autoparser").
Two-layer design
graph TD
Input[messages + tools]
LlamaChat[src/llama-chat.cpp built-in templates]
CommonChat[common/chat.cpp Jinja-based templates]
Out[rendered prompt]
Parser[common/chat-auto-parser*.cpp]
Structured[parsed assistant message]
Input -->|known template name| LlamaChat
LlamaChat --> Out
Input -->|GGUF jinja template| CommonChat
CommonChat --> Out
Out --> Model[LLM generation]
Model --> Parser
Parser --> Structuredsrc/llama-chat.cpp (built-in)
This is the lightweight path. It enumerates a known set of templates (LLM_CHAT_TEMPLATE_* enum) and renders messages directly in C++. It supports the common ones (LLaMA-2 / 3, ChatML, Phi, Vicuna, OpenChat, Falcon, Gemma, Granite, Command-R, ...) without needing Jinja. Used when:
- The user asks for a known template by name (
--chat-template chatml). - The model's GGUF embeds a template name rather than full Jinja source.
- The Jinja path is disabled or fails.
common/chat.cpp + common/jinja/
The full path. The model's GGUF metadata typically embeds a Jinja2 template under the key tokenizer.chat_template. common/chat.cpp feeds that template plus the conversation into the vendored Jinja engine in common/jinja/ (a port of google/minja) and gets back the formatted prompt.
This is what makes "any model with a chat template just works" possible — including custom and rare formats that nobody has hard-coded in C++.
Tool / function calling
Tool support is layered on top of templating. The relevant files are:
| File | Purpose |
|---|---|
common/chat.cpp |
Renders prompts with tools parameter; dispatches per-format quirks |
common/chat-auto-parser.h, .cpp |
Streaming parser for assistant output; auto-detects model format |
common/chat-auto-parser-generator.cpp |
Generates a model-specific parser from format metadata |
common/chat-peg-parser.cpp, .h |
PEG-based extraction of tool-call payloads |
common/chat-diff-analyzer.cpp |
Diff helper for incremental streaming output |
common/peg-parser.cpp, .h |
The general-purpose PEG engine (~82 KB) |
docs/autoparser.md |
Long-form design doc |
docs/function-calling.md |
User-facing how-to |
docs/development/parsing.md |
PEG parser doc |
The autoparser handles incremental, streaming output: it can return partial tool-call JSON to clients before the full call has been emitted, which is what llama-server exposes via OpenAI-compatible streaming.
Reference templates
models/templates/ contains canonical Jinja templates for many model families. They are used by tests (tests/test-chat-template.cpp) and as defaults when a model doesn't ship one.
How a server request flows
sequenceDiagram
participant Client
participant Server as tools/server
participant Chat as common/chat.cpp
participant Llama as libllama
participant Parser as autoparser
Client->>Server: POST /v1/chat/completions {messages, tools}
Server->>Chat: render(messages, tools, gguf_template)
Chat-->>Server: prompt string
Server->>Llama: tokenize + decode (streaming)
Llama-->>Server: tokens
Server->>Parser: feed token deltas
Parser-->>Server: partial / final assistant message + tool calls
Server-->>Client: SSE chunksIntegration points
- Server.
tools/server/server-chat.cppis the orchestrator — it owns the conversation, picks the template, and consults the autoparser. - CLI.
tools/cli/cli.cppusescommon/chat.cppfor-cnv(conversation) mode. - Sampling. Tool-calling typically pairs with a grammar (lazy GBNF or llguidance) so the model can only emit syntactically valid calls.
- GGUF metadata. Templates are read from
tokenizer.chat_templateand friends bysrc/llama-model-loader.cpp.
Entry points for modification
- New built-in template. Add an
LLM_CHAT_TEMPLATE_*enum value insrc/llama-chat.hand a renderer branch insrc/llama-chat.cpp. Add a test fixture intests/test-chat-template.cpp. - Improve Jinja parity. Edit
common/jinja/and add a regression case totests/test-chat-template.cpp. - New tool-call format. Add detection rules to
common/chat-auto-parser*.cppand a fixture intests/test-chat-parser.cpp. The PEG grammar lives incommon/chat-peg-parser.cpp. - Reasoning models. Reasoning budget / thinking tags are handled in
common/reasoning-budget.cpp; check that file for "thinking-aware" output handling.
Tests
tests/test-chat.cpp,test-chat-template.cpp,test-chat-parser.cppcover the C++ side.tests/peg-parser/holds snapshot tests for the PEG engine.tools/server/tests/exercises end-to-end behavior including tool calls.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
Previous
Grammar
Next
Quantization