ggml-org/llama.cpp
Sampler
Active contributors: Georgi Gerganov
Sampling is the step that turns logits from llama_decode into a chosen token. llama.cpp models samplers as a chain of pluggable nodes: each node takes a token-array-with-scores and either reorders it, prunes it, or finally selects a single token. The default chain reproduces the classic top-k → top-p → temperature pipeline, but every step is interchangeable.
Purpose
- Provide a uniform `llama_sampler` interface for token selection.
- Ship the well-known sampling algorithms (top-k, top-p, min-p, typical, mirostat, temperature, repetition penalty, presence penalty, DRY, XTC, ...) as building blocks.
- Let users compose them into a chain and run that chain in a single call.
Key abstractions
| Type | Role | File |
|---|---|---|
| `llama_sampler` | Opaque sampler with `name`, `accept`, `apply`, `reset`, `clone`, `free` vtable | `include/llama.h`, `src/llama-sampler.cpp` |
| `llama_sampler_chain` | Sampler that owns a list of child samplers | `src/llama-sampler.cpp` |
| `llama_token_data` | `(id, logit, p)` triple | `include/llama.h` |
| `llama_token_data_array` | Mutable view of candidates passed through the chain | `include/llama.h` |
Each built-in sampler is a small struct with the vtable filled in (e.g. llama_sampler_top_k, llama_sampler_top_p, llama_sampler_temp, llama_sampler_mirostat_v2, llama_sampler_grammar).
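To make the "struct with the vtable filled in" idea concrete, here is a minimal, self-contained sketch. It is a simplified model, not the definitions from `include/llama.h`: the `toy_*` names, the field layout, and the function signatures are all assumptions chosen for illustration, and leaving unused entries null is this sketch's own choice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for llama_token_data / llama_token_data_array.
struct toy_token_data  { int id; float logit; float p; };
struct toy_token_array { std::vector<toy_token_data> data; long selected = -1; };

struct toy_sampler;

// Hypothetical vtable: one function pointer per operation named in the table above.
struct toy_sampler_i {
    const char *  (*name)  (const toy_sampler * s);
    void          (*accept)(toy_sampler * s, int token);
    void          (*apply) (toy_sampler * s, toy_token_array * cur);
    void          (*reset) (toy_sampler * s);
    toy_sampler * (*clone) (const toy_sampler * s);
    void          (*free)  (toy_sampler * s);
};

// A sampler is a vtable pointer plus an opaque per-sampler context.
struct toy_sampler { const toy_sampler_i * iface; void * ctx; };

// A stateless "greedy" sampler: apply() marks the highest-logit candidate.
static const char * greedy_name(const toy_sampler *) { return "greedy"; }

static void greedy_apply(toy_sampler *, toy_token_array * cur) {
    assert(!cur->data.empty());
    size_t best = 0;
    for (size_t i = 1; i < cur->data.size(); ++i) {
        if (cur->data[i].logit > cur->data[best].logit) best = i;
    }
    cur->selected = (long) best;
}

// Only name and apply are filled in; the other entries stay null in this sketch.
static const toy_sampler_i greedy_iface = {
    greedy_name, nullptr, greedy_apply, nullptr, nullptr, nullptr,
};

toy_sampler make_greedy() { return toy_sampler{ &greedy_iface, nullptr }; }
```

Callers only ever go through `s.iface->apply(&s, &cur)`, which is what lets a chain treat every node the same way regardless of the algorithm behind it.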
How it works
graph LR
Logits["logits from llama_decode"] -->|llama_sampler_apply| Chain
subgraph Chain["llama_sampler_chain"]
S1[penalties] --> S2[top-k]
S2 --> S3[top-p / min-p / typical]
S3 --> S4[temperature / dynamic temp]
S4 --> S5[grammar / DRY / XTC]
S5 --> S6[dist or greedy]
end
Chain -->|llama_sampler_sample| Token[next token]

A chain runs every child's apply against the current llama_token_data_array, then the chain's last node finalizes a single token (typically dist for sampling or greedy for argmax).
After each sampled token, callers call llama_sampler_accept(chain, token) so that stateful samplers (penalty trackers, grammar state, mirostat surprise) can update.
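The apply-then-accept cycle described above can be sketched with a toy chain. Everything here (`toy_stage`, `toy_chain`, `make_penalty`, `make_greedy`) is hypothetical and built on `std::function` rather than the real C vtable; it only illustrates how a stateful stage changes the next result once it has been told which token was chosen.

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct toy_candidate { int id; float logit; };
using toy_candidates = std::vector<toy_candidate>;

// Hypothetical child sampler: apply() edits the candidates, accept() records a chosen token.
struct toy_stage {
    std::function<void(toy_candidates &)> apply;
    std::function<void(int)>              accept; // empty for stateless stages
};

struct toy_chain {
    std::vector<toy_stage> stages;

    // Run every stage in order on a copy of the candidates; the last stage must leave one.
    int sample(toy_candidates cur) const {
        for (const auto & s : stages) s.apply(cur);
        assert(cur.size() == 1);
        return cur.front().id;
    }

    // Forward the chosen token to every stage that keeps state.
    void accept(int token) {
        for (auto & s : stages) if (s.accept) s.accept(token);
    }
};

// Stateful stage: subtracts a fixed penalty from every token already in the history.
toy_stage make_penalty(std::vector<int> & history, float penalty) {
    return {
        [&history, penalty](toy_candidates & cur) {
            for (auto & c : cur)
                for (int t : history)
                    if (c.id == t) { c.logit -= penalty; break; }
        },
        [&history](int token) { history.push_back(token); },
    };
}

// Final stage: keep only the highest-logit candidate (argmax).
toy_stage make_greedy() {
    return {
        [](toy_candidates & cur) {
            assert(!cur.empty());
            toy_candidate best = cur.front();
            for (const auto & c : cur) if (c.logit > best.logit) best = c;
            cur.assign(1, best);
        },
        nullptr,
    };
}
```

With logits `{0: 3.0, 1: 2.0, 2: 1.0}` and a penalty of 5.0, the first `sample` returns token 0. After `accept(0)`, the same logits yield token 1, because token 0 is now penalized.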
Built-in samplers
| Sampler | Effect |
|---|---|
| `top_k(k)` | Keep the top k tokens by logit |
| `top_p(p)` | Keep the smallest token set whose cumulative probability ≥ p |
| `min_p(p)` | Keep tokens whose probability ≥ p × p_max |
| `typical(p)` | Locally typical sampling (Meister et al.) |
| `temp(t)`, `temp_ext(t, range, exponent)` | Static / dynamic temperature |
| `xtc(p, t)` | "Exclude top choices": prune dominant tokens to encourage diversity |
| `mirostat(...)`, `mirostat_v2(...)` | Mirostat target-perplexity sampling |
| `penalties(...)`, `dry(...)` | Repetition penalty, DRY repetition control |
| `top_n_sigma(n)` | Keep tokens whose logit is within n standard deviations of the maximum logit |
| `grammar(grammar)` | Mask tokens forbidden by a GBNF/llguidance grammar |
| `infill(...)` | Mask non-fill tokens for FIM completion |
| `logit_bias(bias[])` | Per-token additive bias |
| `dist(seed)` | Final categorical sample |
| `greedy` | Final argmax |
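The three truncation rules at the top of the table can be written out in a few lines each. This is a minimal sketch over a plain `std::vector`, not the code in `src/llama-sampler.cpp`: the `toy_*` names are hypothetical, and details such as a minimum number of tokens to keep are deliberately omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct toy_token { int id; float logit; float p; };

// Sort by logit (descending) and fill in softmax probabilities.
void toy_softmax(std::vector<toy_token> & cur) {
    assert(!cur.empty());
    std::sort(cur.begin(), cur.end(),
              [](const toy_token & a, const toy_token & b) { return a.logit > b.logit; });
    const float max_logit = cur.front().logit;
    float sum = 0.0f;
    for (auto & t : cur) { t.p = std::exp(t.logit - max_logit); sum += t.p; }
    for (auto & t : cur) t.p /= sum;
}

// top_k(k): keep the k highest-logit tokens.
void toy_top_k(std::vector<toy_token> & cur, size_t k) {
    toy_softmax(cur);
    if (k > 0 && k < cur.size()) cur.resize(k);
}

// top_p(p): keep the smallest prefix whose cumulative probability reaches p.
void toy_top_p(std::vector<toy_token> & cur, float p) {
    toy_softmax(cur);
    float cum = 0.0f;
    size_t keep = cur.size();
    for (size_t i = 0; i < cur.size(); ++i) {
        cum += cur[i].p;
        if (cum >= p) { keep = i + 1; break; }
    }
    cur.resize(keep);
}

// min_p(p): keep tokens whose probability is at least p times the top probability.
void toy_min_p(std::vector<toy_token> & cur, float p) {
    toy_softmax(cur);
    const float threshold = p * cur.front().p;
    size_t keep = 0;
    while (keep < cur.size() && cur[keep].p >= threshold) ++keep;
    cur.resize(keep);
}
```

For logits `2, 1, 0, -1` the softmax probabilities are roughly `0.64, 0.24, 0.09, 0.03`, so `toy_top_k(cur, 2)` keeps two tokens, `toy_top_p(cur, 0.9f)` keeps three, and `toy_min_p(cur, 0.2f)` keeps two.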
User-facing CLI flags map onto these one-to-one. Default presets live in common/preset.cpp.
Default chain
common/sampling.cpp builds the canonical chain used by llama-cli and llama-server. It's roughly:
penalties → DRY → top-k → typical → top-p → min-p → xtc → temp(_ext) → grammar → dist

The order matters: penalties run before any pruning so that pruned tokens are still penalized for next time, and grammar runs at the end so it can mask whatever survived the other filters.
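A small experiment shows why reordering stages changes the outcome. The sketch below uses hypothetical helpers (`toy_temp`, `toy_top_p`) over raw logits and compares top-p before temperature, as in the chain above, with temperature before top-p. It is an illustration of the principle, not a reproduction of `common/sampling.cpp`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: divide every logit by the temperature t.
void toy_temp(std::vector<float> & logits, float t) {
    assert(t > 0.0f);
    for (float & l : logits) l /= t;
}

// Hypothetical helper: nucleus (top-p) filter over raw logits.
void toy_top_p(std::vector<float> & logits, float p) {
    assert(!logits.empty());
    std::sort(logits.begin(), logits.end(), [](float a, float b) { return a > b; });
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - logits[0]);
        sum += probs[i];
    }
    float cum = 0.0f;
    size_t keep = logits.size();
    for (size_t i = 0; i < logits.size(); ++i) {
        cum += probs[i] / sum;
        if (cum >= p) { keep = i + 1; break; }
    }
    logits.resize(keep);
}

// Number of candidates that survive when top-p runs before temperature.
size_t survivors_top_p_then_temp(std::vector<float> logits, float p, float t) {
    toy_top_p(logits, p);
    toy_temp(logits, t);
    return logits.size();
}

// Number of candidates that survive when temperature runs before top-p.
size_t survivors_temp_then_top_p(std::vector<float> logits, float p, float t) {
    toy_temp(logits, t);
    toy_top_p(logits, p);
    return logits.size();
}
```

With logits `2, 1, 0`, `p = 0.8` and `t = 0.5`, running top-p first leaves two candidates. Running temperature first sharpens the distribution so that the top token alone exceeds 0.8, leaving one.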
Constrained sampling
Three different routes feed the grammar sampler:
- A user-supplied `.gbnf` file (`--grammar-file`).
- A JSON Schema converted to GBNF via `common/json-schema-to-grammar.cpp`.
- The optional Rust-based `llguidance` engine, integrated through `common/llguidance.cpp` (build with `LLAMA_LLGUIDANCE`).
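Whichever route produces the grammar, its effect on the candidate list is a mask. The sketch below assumes the grammar state has already been reduced to a set of token ids allowed at the current position. That reduction is the hard part and is not shown. `toy_allowed` and `toy_grammar_apply` are hypothetical names, and masking by setting the logit to negative infinity is the convention assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <set>
#include <vector>

struct toy_token { int id; float logit; };

// Hypothetical stand-in for a grammar: the token ids allowed at the current position.
using toy_allowed = std::set<int>;

// Mask every candidate the "grammar" forbids by setting its logit to -infinity.
// Returns the number of candidates that remain selectable.
size_t toy_grammar_apply(std::vector<toy_token> & cur, const toy_allowed & allowed) {
    assert(!allowed.empty());
    size_t remaining = 0;
    for (auto & t : cur) {
        if (allowed.count(t.id) == 0) {
            t.logit = -std::numeric_limits<float>::infinity();
        } else {
            ++remaining;
        }
    }
    return remaining;
}
```

A masked token ends up with probability zero after softmax, so a final `dist` or `greedy` stage can never pick it.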
See Grammar for the parser side.
Tool / function-call extraction
Sampling produces tokens; turning the model's output into structured tool calls is a separate parsing step in common/chat-auto-parser*.cpp, common/chat-peg-parser.cpp, and common/chat-diff-analyzer.cpp. See Chat templates and docs/autoparser.md.
Integration points
- `llama-cli`: builds a chain from CLI flags via `common/sampling.cpp`.
- `llama-server`: same path; per-slot samplers are owned by the server's slot struct.
- Speculative decoding: needs deterministic comparison; uses dedicated wrappers in `common/speculative.cpp` plus a `dist` sampler with a pinned seed.
- Embedding mode: disables sampling entirely; only logits/embeddings are returned.
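The point of pinning the seed for speculative decoding is reproducibility: the same distribution and the same seed must give the same token. A minimal sketch with the C++ standard library follows. `toy_dist_sample` is a hypothetical helper, and using `std::mt19937` with `std::discrete_distribution` is an assumption for illustration, not a statement about how the `dist` sampler is implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Draw one index from a categorical distribution using a generator seeded with `seed`.
// `weights` are unnormalized, non-negative probabilities.
int toy_dist_sample(const std::vector<double> & weights, uint32_t seed) {
    assert(!weights.empty());
    std::mt19937 rng(seed);
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    return dist(rng);
}
```

Two calls with the same weights and the same seed return the same index within one build, which is what a draft/target comparison needs. A different seed may or may not pick a different token.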
Entry points for modification
- New sampler. Add a constructor in `src/llama-sampler.cpp`, expose it in `include/llama.h`, wire it into `common/sampling.cpp` if you want CLI access. Reference the existing simple samplers (e.g. `temp`) for the vtable shape.
- New chain order. Edit `common/sampling.cpp::common_sampler_init`.
- State migration. Implement `clone` if your sampler holds state (penalty trackers, mirostat history) so multiple sequences don't share it.
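The reason for the `clone` advice in the last bullet can be shown with a toy stateful sampler. `toy_history_sampler` is hypothetical and uses C++ member functions instead of the C vtable; the only point is that a deep copy gives each sequence its own history.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stateful sampler: remembers accepted tokens, as a penalty tracker would.
struct toy_history_sampler {
    std::vector<int> accepted;

    void accept(int token) {
        assert(token >= 0);
        accepted.push_back(token);
    }

    void reset() { accepted.clear(); }

    // Deep copy: the clone gets its own history vector, not a shared pointer to ours.
    std::unique_ptr<toy_history_sampler> clone() const {
        return std::unique_ptr<toy_history_sampler>(new toy_history_sampler(*this));
    }
};
```

After cloning, accepting a token on one copy or resetting the other leaves the sibling untouched. If two sequences instead shared one instance, each sequence's penalties would leak into the other.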
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.