ggml-org/llama.cpp

Sampler

Active contributors: Georgi Gerganov

Sampling is the step that turns logits from llama_decode into a chosen token. llama.cpp models samplers as a chain of pluggable nodes: each node takes an array of candidate tokens with their scores (a llama_token_data_array) and either rescales it, reorders it, prunes it, or finally selects a single token. The default chain reproduces the classic top-k → top-p → temperature pipeline, but every step is interchangeable.
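As a rough picture of what "turning logits into a token" involves at the end of the pipeline, the sketch below scales logits by a temperature, converts them to probabilities, and then either draws from that distribution or takes the argmax. It is a standalone illustration, not llama.cpp code: the function names softmax_with_temp, sample_dist, and sample_greedy are hypothetical, and the real library operates on llama_token_data_array rather than plain vectors.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: softmax over logits divided by temperature t (t > 0).
// Subtracting the maximum logit first keeps exp() from overflowing.
inline std::vector<double> softmax_with_temp(const std::vector<float> & logits, float t) {
    std::vector<double> p(logits.size());
    if (logits.empty()) return p;
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        p[i] = std::exp((logits[i] - max_logit) / t);
        sum += p[i];
    }
    for (auto & x : p) x /= sum;
    return p;
}

// Hypothetical "dist"-style final step: draw one index from the categorical distribution.
inline int sample_dist(const std::vector<double> & p, std::mt19937 & rng) {
    std::discrete_distribution<int> d(p.begin(), p.end());
    return d(rng);
}

// Hypothetical "greedy"-style final step: index of the largest logit.
inline int sample_greedy(const std::vector<float> & logits) {
    return static_cast<int>(std::max_element(logits.begin(), logits.end()) - logits.begin());
}
```

Lowering t concentrates probability on the highest-scoring token, so sample_dist behaves more and more like sample_greedy; seeding the std::mt19937 makes the draw reproducible.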

Purpose

  • Provide a uniform llama_sampler interface for token selection.
  • Ship the well-known sampling algorithms (top-k, top-p, min-p, typical, mirostat, temperature, repetition penalty, presence penalty, DRY, XTC, ...) as building blocks.
  • Let users compose them into a chain and run that chain in a single call.

Key abstractions

Type | Role | File
--- | --- | ---
llama_sampler | Sampler handle; its llama_sampler_i interface supplies the name, accept, apply, reset, clone, and free callbacks | include/llama.h, src/llama-sampler.cpp
llama_sampler_chain | Sampler that owns a list of child samplers | src/llama-sampler.cpp
llama_token_data | (id, logit, p) triple | include/llama.h
llama_token_data_array | Mutable view of candidates passed through the chain | include/llama.h

Each built-in sampler is a small struct with the vtable filled in (e.g. llama_sampler_top_k, llama_sampler_top_p, llama_sampler_temp, llama_sampler_mirostat_v2, llama_sampler_grammar).
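The shape of these abstractions can be sketched in standalone C++. Everything below is a hypothetical stand-in written for illustration: TokenData, TokenDataArray, Sampler, and GreedySampler are not the library's types, and the real API uses a C struct of function pointers rather than virtual methods. The sketch only mirrors the six operations listed in the table.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for llama_token_data.
struct TokenData {
    int   id;     // token id
    float logit;  // raw score
    float p;      // probability, filled in by samplers that need it
};

// Hypothetical stand-in for llama_token_data_array.
struct TokenDataArray {
    std::vector<TokenData> data;
    long selected = -1;  // index into data once a token has been chosen
};

// Hypothetical interface exposing the same six operations as the vtable.
struct Sampler {
    virtual ~Sampler() = default;                    // plays the role of "free"
    virtual std::string name() const = 0;
    virtual void accept(int /*token*/) {}            // update internal state
    virtual void apply(TokenDataArray & cur) = 0;    // rescale, prune, or select
    virtual void reset() {}
    virtual std::unique_ptr<Sampler> clone() const = 0;
};

// Simplest possible node: select the highest-logit candidate.
struct GreedySampler : Sampler {
    std::string name() const override { return "greedy"; }
    void apply(TokenDataArray & cur) override {
        cur.selected = -1;
        for (std::size_t i = 0; i < cur.data.size(); ++i) {
            if (cur.selected < 0 ||
                cur.data[i].logit > cur.data[static_cast<std::size_t>(cur.selected)].logit) {
                cur.selected = static_cast<long>(i);
            }
        }
    }
    std::unique_ptr<Sampler> clone() const override {
        return std::make_unique<GreedySampler>(*this);
    }
};
```

A stateless node like this one only needs name, apply, and clone; accept and reset matter once a sampler tracks history.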

How it works

graph LR
    Logits["logits from llama_decode"] -->|llama_sampler_apply| Chain
    subgraph Chain["llama_sampler_chain"]
        S1[penalties] --> S2[top-k]
        S2 --> S3[top-p / min-p / typical]
        S3 --> S4[temperature / dynamic temp]
        S4 --> S5[grammar / DRY / XTC]
        S5 --> S6[dist or greedy]
    end
    Chain -->|llama_sampler_sample| Token[next token]

A chain runs every child's apply against the current llama_token_data_array, then the chain's last node finalizes a single token (typically dist for sampling or greedy for argmax).

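After each sampled token, the chain must be told which token was chosen via llama_sampler_accept(chain, token) so that stateful samplers (penalty trackers, grammar state, mirostat surprise) can update. Callers that drive llama_sampler_apply themselves make this call explicitly; the llama_sampler_sample shorthand is documented in include/llama.h as applying the chain and accepting the selected token in one step.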
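The apply-then-accept cycle can be illustrated with a small standalone chain. This is a sketch under stated assumptions, not the library's implementation: Node, SeenPenalty, Greedy, and Chain are hypothetical names, and SeenPenalty uses a simple presence-style subtraction rather than llama.cpp's actual penalty formulas.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

struct Candidate { int id; float logit; };

struct Candidates {
    std::vector<Candidate> data;
    long selected = -1;  // set by the final node
};

struct Node {
    virtual ~Node() = default;
    virtual void apply(Candidates & cur) = 0;
    virtual void accept(int /*token*/) {}
};

// Stateful node: subtracts a fixed amount from tokens that were already emitted.
struct SeenPenalty : Node {
    float penalty;
    std::vector<int> seen;
    explicit SeenPenalty(float p) : penalty(p) {}
    void apply(Candidates & cur) override {
        for (auto & c : cur.data) {
            if (std::find(seen.begin(), seen.end(), c.id) != seen.end()) c.logit -= penalty;
        }
    }
    void accept(int token) override { seen.push_back(token); }
};

// Final node: argmax over whatever the earlier nodes left behind.
struct Greedy : Node {
    void apply(Candidates & cur) override {
        cur.selected = -1;
        for (std::size_t i = 0; i < cur.data.size(); ++i) {
            if (cur.selected < 0 ||
                cur.data[i].logit > cur.data[static_cast<std::size_t>(cur.selected)].logit) {
                cur.selected = static_cast<long>(i);
            }
        }
    }
};

struct Chain {
    std::vector<std::unique_ptr<Node>> nodes;
    void apply(Candidates & cur) { for (auto & n : nodes) n->apply(cur); }
    void accept(int token)       { for (auto & n : nodes) n->accept(token); }
    // Apply every node in order, read the selection, then feed it back via accept.
    int sample(Candidates cur) {
        apply(cur);
        if (cur.selected < 0) return -1;  // no selecting node or no candidates
        const int token = cur.data[static_cast<std::size_t>(cur.selected)].id;
        accept(token);
        return token;
    }
};
```

With the same logits fed in twice, the second call can return a different token because accept recorded the first choice and the penalty node now lowers its score.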

Built-in samplers

Sampler | Effect
--- | ---
top_k(k) | Keep the top k tokens by logit
top_p(p) | Keep the smallest token set whose cumulative probability ≥ p
min_p(p) | Keep tokens whose probability ≥ p × p_max
typical(p) | Locally typical sampling (Meister et al.)
temp(t), temp_ext(t, range, exponent) | Static / dynamic temperature
xtc(p, t) | "Exclude top choices": prune dominant tokens to encourage diversity
mirostat(...), mirostat_v2(...) | Mirostat target-perplexity sampling
penalties(...), dry(...) | Repetition penalty, DRY repetition control
top_n_sigma(n) | Keep tokens whose logit is within n standard deviations of the maximum logit
grammar(grammar) | Mask tokens forbidden by a GBNF/llguidance grammar
infill(...) | Mask non-fill tokens for FIM completion
logit_bias(bias[]) | Per-token additive bias
dist(seed) | Final categorical sample
greedy | Final argmax
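The three most common truncation filters in the table can be sketched as follows. This is a simplified standalone illustration following the one-line descriptions above; the names Tok, softmax_sorted, top_k, top_p, and min_p are hypothetical, and details of the real samplers (such as a minimum number of tokens to keep) are omitted.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Tok { int id; float logit; float p; };

// Sort candidates by logit (descending) and fill in softmax probabilities.
inline void softmax_sorted(std::vector<Tok> & c) {
    if (c.empty()) return;
    std::sort(c.begin(), c.end(), [](const Tok & a, const Tok & b) { return a.logit > b.logit; });
    const float max_logit = c.front().logit;
    float sum = 0.0f;
    for (auto & t : c) { t.p = std::exp(t.logit - max_logit); sum += t.p; }
    for (auto & t : c) t.p /= sum;
}

// Keep the k highest-scoring candidates.
inline void top_k(std::vector<Tok> & c, std::size_t k) {
    softmax_sorted(c);
    if (k > 0 && k < c.size()) c.resize(k);
}

// Keep the smallest prefix whose cumulative probability reaches p.
inline void top_p(std::vector<Tok> & c, float p) {
    softmax_sorted(c);
    float cum = 0.0f;
    std::size_t keep = c.size();
    for (std::size_t i = 0; i < c.size(); ++i) {
        cum += c[i].p;
        if (cum >= p) { keep = i + 1; break; }
    }
    c.resize(keep);
}

// Keep candidates whose probability is at least p times the top probability.
inline void min_p(std::vector<Tok> & c, float p) {
    softmax_sorted(c);
    if (c.empty()) return;
    const float threshold = p * c.front().p;
    std::size_t keep = 0;
    while (keep < c.size() && c[keep].p >= threshold) ++keep;
    c.resize(keep);
}
```

Each filter recomputes probabilities over the candidates it receives, so the filters can be applied in sequence and each sees a distribution renormalized over the survivors of the previous one.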

User-facing CLI flags map onto these one-to-one. Default presets live in common/preset.cpp.

Default chain

common/sampling.cpp builds the canonical chain used by llama-cli and llama-server. It's roughly:

penalties → DRY → top-k → typical → top-p → min-p → xtc → temp(_ext) → grammar → dist

The order matters: penalties run before any pruning so that truncation operates on the penalty-adjusted scores (a heavily repeated token can drop out of the top-k set instead of occupying a slot), and grammar runs at the end so it can mask whatever survived the other filters.
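To see why the position of penalties relative to pruning changes the outcome, the standalone sketch below applies a simplified repetition penalty and a top-k cut in both orders. The helper names (Cand, penalize, top_k, ids) are hypothetical, and the penalty rule used here (divide positive logits by the penalty, multiply non-positive ones) is a simplification chosen for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

struct Cand { int id; float logit; };

// Simplified repetition penalty applied to recently seen token ids.
inline void penalize(std::vector<Cand> & c, const std::set<int> & recent, float penalty) {
    for (auto & t : c) {
        if (recent.count(t.id) > 0) {
            t.logit = t.logit > 0.0f ? t.logit / penalty : t.logit * penalty;
        }
    }
}

// Keep the k highest-scoring candidates.
inline void top_k(std::vector<Cand> & c, std::size_t k) {
    std::sort(c.begin(), c.end(), [](const Cand & a, const Cand & b) { return a.logit > b.logit; });
    if (k < c.size()) c.resize(k);
}

// Collect the surviving token ids in their current order.
inline std::vector<int> ids(const std::vector<Cand> & c) {
    std::vector<int> out;
    for (const auto & t : c) out.push_back(t.id);
    return out;
}
```

With logits {0: 4.0, 1: 3.0, 2: 2.5}, token 0 marked as recent, a penalty of 2.0, and k = 2, penalizing first leaves tokens 1 and 2 as survivors, while pruning first keeps tokens 0 and 1 and only then lowers token 0's score.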

Constrained sampling

Three different routes feed the grammar sampler:

  1. A user-supplied .gbnf file (--grammar-file).
  2. A JSON Schema converted to GBNF via common/json-schema-to-grammar.cpp.
  3. The optional Rust-based llguidance engine, integrated through common/llguidance.cpp (build with LLAMA_LLGUIDANCE).

See Grammar for the parser side.

Tool / function-call extraction

Sampling produces tokens; turning the model's output into structured tool calls is a separate parsing step in common/chat-auto-parser*.cpp, common/chat-peg-parser.cpp, and common/chat-diff-analyzer.cpp. See Chat templates and docs/autoparser.md.

Integration points

  • llama-cli — builds a chain from CLI flags via common/sampling.cpp.
  • llama-server — same path; per-slot samplers are owned by the server's slot struct.
  • Speculative decoding — needs deterministic comparison; uses dedicated wrappers in common/speculative.cpp plus a dist sampler with a pinned seed.
  • Embedding mode — disables sampling entirely; only logits/embeddings are returned.

Entry points for modification

  • New sampler. Add a constructor in src/llama-sampler.cpp, expose it in include/llama.h, wire it into common/sampling.cpp if you want CLI access. Reference the existing simple samplers (e.g. temp) for the vtable shape.
  • New chain order. Edit common/sampling.cpp::common_sampler_init.
  • State migration. Implement clone if your sampler holds state (penalty trackers, mirostat history) so multiple sequences don't share it.
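The point about clone can be illustrated with a small stateful sampler. The sketch below is standalone and hypothetical (PenaltyTracker is not a llama.cpp type): it keeps a bounded history of accepted tokens and provides a clone that deep-copies that history, so a second sequence can diverge without affecting the first.

```cpp
#include <cstddef>
#include <deque>
#include <memory>

// Hypothetical stateful sampler core: remembers the last `window` accepted tokens.
struct PenaltyTracker {
    std::size_t window;
    std::deque<int> history;

    explicit PenaltyTracker(std::size_t w) : window(w) {}

    // Record a newly accepted token and drop the oldest one if the window is full.
    void accept(int token) {
        history.push_back(token);
        if (history.size() > window) history.pop_front();
    }

    // Forget everything, e.g. when starting a new generation.
    void reset() { history.clear(); }

    // Deep copy: the clone owns its own history from this point on.
    std::unique_ptr<PenaltyTracker> clone() const {
        return std::make_unique<PenaltyTracker>(*this);
    }
};
```

If two sequences instead shared one tracker through a copied pointer, tokens accepted for one sequence would leak into the other's penalty history.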

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
