ggml-org/llama.cpp

Sampler

Active contributors: Georgi Gerganov

Sampling is the step that turns the logits produced by llama_decode into a chosen token. llama.cpp models samplers as a chain of pluggable nodes: each node takes an array of candidate tokens with their scores and either rescales it, reorders it, prunes it, or finally selects a single token. The default chain reproduces the classic top-k → top-p → temperature pipeline, but every step is interchangeable.

Purpose

Provide a uniform llama_sampler interface for token selection.
Ship the well-known sampling algorithms (top-k, top-p, min-p, typical, mirostat, temperature, repetition penalty, presence penalty, DRY, XTC, ...) as building blocks.
Let users compose them into a chain and run that chain in a single call.

Key abstractions

Type	Role	File
`llama_sampler`	Sampler handle: a `llama_sampler_i` vtable (`name`, `accept`, `apply`, `reset`, `clone`, `free`) plus a context pointer	`include/llama.h`, `src/llama-sampler.cpp`
`llama_sampler_chain`	Sampler that owns a list of child samplers	`src/llama-sampler.cpp`
`llama_token_data`	`(id, logit, p)` triple	`include/llama.h`
`llama_token_data_array`	Mutable view of candidates passed through the chain	`include/llama.h`

Each built-in sampler is a small struct with the vtable filled in (e.g. llama_sampler_top_k, llama_sampler_top_p, llama_sampler_temp, llama_sampler_mirostat_v2, llama_sampler_grammar).
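To make the vtable idea concrete, here is a minimal, self-contained C++ sketch. Every `toy_*` name is a hypothetical stand-in written for this page, not a declaration from include/llama.h; only the six callback names are taken from the table above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for llama_token_data / llama_token_data_array.
struct toy_token_data {
    int32_t id;
    float   logit;
    float   p;
};

struct toy_token_data_array {
    std::vector<toy_token_data> data;
    int selected = -1; // index into data; -1 while nothing is selected
};

struct toy_sampler;

// Function-pointer table with the six callbacks named in the table above.
struct toy_sampler_i {
    const char *  (*name)  (const toy_sampler * s);
    void          (*accept)(toy_sampler * s, int32_t token);
    void          (*apply) (toy_sampler * s, toy_token_data_array * cur);
    void          (*reset) (toy_sampler * s);
    toy_sampler * (*clone) (const toy_sampler * s);
    void          (*free)  (toy_sampler * s);
};

struct toy_sampler {
    const toy_sampler_i * iface;
    void *                ctx; // sampler-specific state
};

// A temperature sampler: its only state is the temperature itself.
struct toy_temp_ctx { float t; };

static const char * temp_name(const toy_sampler *) { return "temp"; }

static void temp_apply(toy_sampler * s, toy_token_data_array * cur) {
    const float t = static_cast<toy_temp_ctx *>(s->ctx)->t;
    assert(t > 0.0f); // this sketch does not handle t <= 0
    for (auto & td : cur->data) td.logit /= t;
}

static toy_sampler * temp_clone(const toy_sampler * s);

static void temp_free(toy_sampler * s) {
    delete static_cast<toy_temp_ctx *>(s->ctx);
    delete s;
}

// Stateless samplers can leave accept and reset empty.
static const toy_sampler_i temp_iface = {
    temp_name, /*accept=*/nullptr, temp_apply, /*reset=*/nullptr, temp_clone, temp_free,
};

toy_sampler * toy_sampler_init_temp(float t) {
    return new toy_sampler{ &temp_iface, new toy_temp_ctx{ t } };
}

static toy_sampler * temp_clone(const toy_sampler * s) {
    return toy_sampler_init_temp(static_cast<const toy_temp_ctx *>(s->ctx)->t);
}
```

A caller would create the sampler with `toy_sampler_init_temp(0.8f)`, invoke `s->iface->apply(s, &cur)` on a candidate array, and release it with `s->iface->free(s)`.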

How it works

graph LR
    Logits["logits from llama_decode"] -->|llama_sampler_apply| Chain
    subgraph Chain["llama_sampler_chain"]
        S1[penalties / DRY] --> S2[top-k]
        S2 --> S3[typical / top-p / min-p / XTC]
        S3 --> S4[temperature / dynamic temp]
        S4 --> S5[grammar]
        S5 --> S6[dist or greedy]
    end
    Chain -->|llama_sampler_sample| Token[next token]

A chain runs every child's apply against the current llama_token_data_array, then the chain's last node finalizes a single token (typically dist for sampling or greedy for argmax).

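After each sampled token, the chain must receive llama_sampler_accept(chain, token) so that stateful samplers (penalty trackers, grammar state, mirostat surprise) can update. llama_sampler_sample already performs this accept internally; callers that drive llama_sampler_apply themselves have to call it explicitly.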
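The apply-then-accept cycle can be sketched in a few lines of standalone C++. `Candidate`, `Stage`, `Chain`, and `make_count_penalty` are hypothetical names for illustration, the additive count penalty is a simplification rather than the library's penalty formula, and the toy keeps `sample` and `accept` as separate calls so the state update is visible.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <unordered_map>
#include <vector>

struct Candidate { int id; float logit; };

// Hypothetical stage: an apply step plus an optional accept hook.
struct Stage {
    std::function<void(std::vector<Candidate> &)> apply;
    std::function<void(int)>                      accept; // empty for stateless stages
};

struct Chain {
    std::vector<Stage> stages;

    // Run every stage in order on a copy of the candidates, then pick the
    // argmax (standing in for a final "greedy" node).
    int sample(std::vector<Candidate> cands) const {
        for (const auto & st : stages) st.apply(cands);
        assert(!cands.empty());
        std::size_t best = 0;
        for (std::size_t i = 1; i < cands.size(); ++i) {
            if (cands[i].logit > cands[best].logit) best = i;
        }
        return cands[best].id;
    }

    // Report the emitted token to every stage that keeps state.
    void accept(int token) {
        for (auto & st : stages) {
            if (st.accept) st.accept(token);
        }
    }
};

// Stateful stage: subtracts `penalty` from a token's logit once per earlier occurrence.
Stage make_count_penalty(float penalty) {
    auto counts = std::make_shared<std::unordered_map<int, int>>();
    Stage st;
    st.apply = [counts, penalty](std::vector<Candidate> & cands) {
        for (auto & c : cands) {
            auto it = counts->find(c.id);
            if (it != counts->end()) c.logit -= penalty * static_cast<float>(it->second);
        }
    };
    st.accept = [counts](int token) { ++(*counts)[token]; };
    return st;
}
```

With logits `{0: 2.0, 1: 1.5}` and a penalty of 1.0, repeated `sample` / `accept` rounds alternate between the two tokens, because each accepted token lowers its own score for the next round. Skipping `accept` would make the chain return token 0 every time.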

Built-in samplers

Sampler	Effect
`top_k(k)`	Keep the top k tokens by logit
`top_p(p)`	Keep the smallest token set whose cumulative probability ≥ p
`min_p(p)`	Keep tokens whose probability ≥ p × p_max
`typical(p)`	Locally typical sampling (Meister et al.)
`temp(t)`, `temp_ext(t, range, exponent)`	Static / dynamic temperature
`xtc(p, t)`	"Exclude top choices" — prune dominant tokens to encourage diversity
`mirostat(...)`, `mirostat_v2(...)`	Mirostat target-perplexity sampling
`penalties(...)`, `dry(...)`	Repetition, frequency and presence penalties; DRY repetition control
`top_n_sigma(n)`	Keep tokens whose logit is within n standard deviations of the maximum logit
`grammar(grammar)`	Mask tokens forbidden by a GBNF/llguidance grammar
`infill(...)`	Mask non-fill tokens for FIM completion
`logit_bias(bias[])`	Per-token additive bias
`dist(seed)`	Final categorical sample
`greedy`	Final argmax
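
The three truncation rules at the top of the table are easy to state in code. The following is a simplified sketch, not the implementation in src/llama-sampler.cpp: `Cand` and the free functions are hypothetical names, each filter re-sorts and recomputes the softmax for clarity rather than speed, and surviving probabilities are not renormalized.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Cand { int id; float logit; float p; };

// Sort by logit (descending) and fill in softmax probabilities.
void softmax_sorted(std::vector<Cand> & c) {
    assert(!c.empty());
    std::sort(c.begin(), c.end(),
              [](const Cand & a, const Cand & b) { return a.logit > b.logit; });
    const float max_logit = c.front().logit;
    float sum = 0.0f;
    for (auto & x : c) { x.p = std::exp(x.logit - max_logit); sum += x.p; }
    for (auto & x : c) x.p /= sum;
}

// top-k: keep the k candidates with the highest logits.
void top_k(std::vector<Cand> & c, std::size_t k) {
    softmax_sorted(c);
    if (k > 0 && k < c.size()) c.resize(k);
}

// top-p: keep the smallest prefix whose cumulative probability reaches p.
void top_p(std::vector<Cand> & c, float p) {
    softmax_sorted(c);
    float cum = 0.0f;
    std::size_t keep = c.size();
    for (std::size_t i = 0; i < c.size(); ++i) {
        cum += c[i].p;
        if (cum >= p) { keep = i + 1; break; }
    }
    c.resize(keep);
}

// min-p: keep candidates whose probability is at least p times the top probability.
void min_p(std::vector<Cand> & c, float p) {
    softmax_sorted(c);
    const float threshold = p * c.front().p;
    std::size_t keep = 0;
    while (keep < c.size() && c[keep].p >= threshold) ++keep;
    c.resize(keep);
}
```

For a distribution with probabilities 0.5, 0.3, 0.15 and 0.05, `top_k(2)` and `top_p(0.7)` both keep the first two candidates, while `min_p(0.2)` keeps three because its threshold is 0.2 × 0.5 = 0.1.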

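User-facing CLI flags map closely onto these samplers (for example --top-k, --top-p, --min-p and --temp), although a single sampler such as penalties can be driven by several flags. Default presets live in common/preset.cpp.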

Default chain

common/sampling.cpp builds the canonical chain used by llama-cli and llama-server. It's roughly:

penalties → DRY → top-k → typical → top-p → min-p → xtc → temp(_ext) → grammar → dist

The order matters: penalties run before any pruning, so truncation steps such as top-k operate on penalty-adjusted logits, and grammar runs at the end so it can mask whatever survived the other filters.
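
A small worked example shows why stage order changes the outcome. This sketch uses hypothetical helpers (`keep_top_k`, `penalize`) and a simplified additive penalty on one token; it is meant only to illustrate the ordering effect, not the library's penalty math.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Tok { int id; float logit; };

// Keep the k highest-logit candidates.
void keep_top_k(std::vector<Tok> & t, std::size_t k) {
    std::stable_sort(t.begin(), t.end(),
                     [](const Tok & a, const Tok & b) { return a.logit > b.logit; });
    if (k < t.size()) t.resize(k);
}

// Hypothetical additive penalty on one recently repeated token.
void penalize(std::vector<Tok> & t, int id, float amount) {
    for (auto & x : t) {
        if (x.id == id) x.logit -= amount;
    }
}

std::vector<int> ids_of(const std::vector<Tok> & t) {
    std::vector<int> out;
    for (const auto & x : t) out.push_back(x.id);
    return out;
}

// Token 0 leads on raw logits but was just repeated.
const std::vector<Tok> kLogits = { {0, 3.0f}, {1, 2.5f}, {2, 2.4f} };

// Penalty first: token 0 drops to 2.0, so top-2 keeps tokens 1 and 2.
std::vector<int> penalty_then_top_k() {
    auto t = kLogits;
    penalize(t, 0, 1.0f);
    keep_top_k(t, 2);
    return ids_of(t);
}

// Top-k first: token 2 is discarded before the penalty can demote token 0.
std::vector<int> top_k_then_penalty() {
    auto t = kLogits;
    keep_top_k(t, 2);
    penalize(t, 0, 1.0f);
    return ids_of(t);
}
```

In the second ordering token 2 is lost even though, after the penalty, it would outrank token 0. Running penalties first avoids that.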

Constrained sampling

Three different routes feed the grammar sampler:

A user-supplied .gbnf file (--grammar-file).
A JSON Schema converted to GBNF via common/json-schema-to-grammar.cpp.
The optional Rust-based llguidance engine, integrated through common/llguidance.cpp (configure the build with the CMake option -DLLAMA_LLGUIDANCE=ON).

See Grammar for the parser side.
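
Whichever route supplies the grammar, the sampler's job reduces to masking candidates the grammar cannot accept next. The sketch below illustrates that idea with a toy grammar (one or more single-digit tokens, then `<eos>`). `Piece` and `DigitsGrammar` are hypothetical; a real GBNF or llguidance grammar tracks far richer state.

```cpp
#include <cassert>
#include <cctype>
#include <limits>
#include <string>
#include <vector>

struct Piece { std::string text; float logit; };

// Toy stand-in for a grammar: one or more digit tokens followed by "<eos>".
struct DigitsGrammar {
    bool seen_digit = false;

    // Is this piece a legal continuation in the current state?
    bool allows(const std::string & piece) const {
        if (piece == "<eos>") return seen_digit;
        return piece.size() == 1 &&
               std::isdigit(static_cast<unsigned char>(piece[0])) != 0;
    }

    // Mask forbidden candidates by pushing their logit to -infinity.
    void apply(std::vector<Piece> & cands) const {
        for (auto & c : cands) {
            if (!allows(c.text)) c.logit = -std::numeric_limits<float>::infinity();
        }
    }

    // Advance the grammar state after a token has been emitted.
    void accept(const std::string & piece) {
        assert(allows(piece));
        if (piece != "<eos>") seen_digit = true;
    }
};
```

Before any digit is accepted, both `"a"` and `"<eos>"` are masked; after `accept("7")`, `"<eos>"` keeps its original logit and becomes selectable.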

Tool / function-call extraction

Sampling produces tokens; turning the model's output into structured tool calls is a separate parsing step in common/chat-auto-parser*.cpp, common/chat-peg-parser.cpp, and common/chat-diff-analyzer.cpp. See Chat templates and docs/autoparser.md.

Integration points

llama-cli — builds a chain from CLI flags via common/sampling.cpp.
llama-server — same path; per-slot samplers are owned by the server's slot struct.
Speculative decoding — needs deterministic comparison; uses dedicated wrappers in common/speculative.cpp plus a dist sampler with a pinned seed.
Embedding mode — disables sampling entirely; only logits/embeddings are returned.
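
The "pinned seed" mentioned for speculative decoding is what makes a final categorical draw reproducible. As a rough illustration using only the C++ standard library (`SeededDist` is a hypothetical class, not the library's dist sampler), two samplers built from the same seed produce the same sequence of draws for the same logits:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical seeded categorical sampler over raw logits.
class SeededDist {
public:
    explicit SeededDist(std::uint32_t seed) : rng_(seed) {}

    // Softmax the logits, then draw one index in proportion to its probability.
    int sample(const std::vector<float> & logits) {
        assert(!logits.empty());
        float max_logit = logits[0];
        for (float l : logits) max_logit = std::max(max_logit, l);

        // Unnormalized softmax weights; discrete_distribution normalizes them.
        std::vector<double> weights;
        for (float l : logits) {
            weights.push_back(std::exp(static_cast<double>(l - max_logit)));
        }
        std::discrete_distribution<int> dist(weights.begin(), weights.end());
        return dist(rng_);
    }

private:
    std::mt19937 rng_;
};
```

A candidate whose logit has been masked to negative infinity gets weight zero and is never drawn, which is how earlier filtering stages constrain the final sample.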

Entry points for modification

New sampler. Add a constructor in src/llama-sampler.cpp, expose it in include/llama.h, and wire it into common/sampling.cpp if you want CLI access. Use the existing simple samplers (e.g. temp) as a reference for the vtable shape.
New chain order. Edit common/sampling.cpp::common_sampler_init.
State migration. Implement clone if your sampler holds state (penalty trackers, mirostat history) so multiple sequences don't share it.
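
The reason clone matters for stateful samplers can be shown with a small standalone sketch. `PenaltyTracker` is a hypothetical class for illustration; the point is only that a clone must deep-copy its history so that two sequences diverge independently.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <memory>

// Hypothetical stateful sampler component: remembers the last `window` tokens.
class PenaltyTracker {
public:
    explicit PenaltyTracker(std::size_t window) : window_(window) {}

    // Record an emitted token, dropping the oldest once the window is full.
    void accept(int token) {
        history_.push_back(token);
        if (history_.size() > window_) history_.pop_front();
    }

    // How many times does `token` appear in the remembered window?
    int count(int token) const {
        int n = 0;
        for (int t : history_) {
            if (t == token) ++n;
        }
        return n;
    }

    // Deep copy: the clone owns its own history from this point on.
    std::unique_ptr<PenaltyTracker> clone() const {
        return std::unique_ptr<PenaltyTracker>(new PenaltyTracker(*this));
    }

private:
    std::size_t     window_;
    std::deque<int> history_;
};
```

After `auto b = a.clone();`, tokens accepted by `b` leave `a.count(...)` unchanged. Handing both sequences the same tracker instead would let one sequence's history penalize the other's tokens.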
