ggml-org/llama.cpp

Configuration

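Authoritative configuration lives in the C structs declared in include/llama.h. Most CLI flags defined in common/arg.cpp end up setting fields of these structs, but the mapping is not strictly one-to-one: sampling and server flags, for example, have no field in the structs below. This page is a high-level pointer; for exact field semantics, read the header comments.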

llama_model_params (load-time)

Defined in include/llama.h. Notable fields:

Field                   Effect
n_gpu_layers            Number of layers to offload to the GPU. Any value at or above the model's layer count offloads everything; 99 is a common choice.
main_gpu                Which GPU holds shared/intermediate tensors
tensor_split            Array of per-GPU proportions used to split the weights
vocab_only              Load the vocab but skip tensor data; fast for tokenizer experiments
use_mmap                Default true; --no-mmap flips it
use_mlock               Pin model memory; --mlock
check_tensors           Validate tensor data on load; --check-tensors
kv_overrides            Array of llama_model_kv_override to patch GGUF metadata at load time; CLI is --override-kv k=t:v (key=type:value)
tensor_buft_overrides   Force a specific buffer type (CPU vs GPU) per tensor pattern
progress_callback       Optional load-progress callback

Constructor: llama_model_default_params().
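
To make the tensor_split row concrete, here is a minimal sketch of the arithmetic behind a split such as 3,1. It assumes the proportions are normalized by their sum. The helpers normalize_split and layers_per_device are hypothetical, written for this page, and are not part of llama.h; the layer assignment is an illustration of the proportions, not a description of how llama.cpp places layers internally.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper, not part of llama.h: scale per-GPU proportions such as
// {3, 1} so that they sum to 1.
std::vector<float> normalize_split(const std::vector<float> & proportions) {
    const float total = std::accumulate(proportions.begin(), proportions.end(), 0.0f);
    assert(total > 0.0f); // at least one device must receive a share
    std::vector<float> fractions;
    fractions.reserve(proportions.size());
    for (float p : proportions) {
        fractions.push_back(p / total);
    }
    return fractions;
}

// Hypothetical helper: hand out n_layers whole layers by rounding the
// cumulative fraction at each device boundary. Assumption: this only
// illustrates the arithmetic, not llama.cpp's internal placement.
std::vector<int> layers_per_device(const std::vector<float> & fractions, int n_layers) {
    std::vector<int> counts;
    float cumulative = 0.0f;
    int   assigned   = 0;
    for (std::size_t i = 0; i < fractions.size(); ++i) {
        cumulative += fractions[i];
        // The last device takes whatever remains so the counts sum to n_layers.
        const int boundary = (i + 1 == fractions.size())
            ? n_layers
            : static_cast<int>(std::lround(cumulative * n_layers));
        counts.push_back(boundary - assigned);
        assigned = boundary;
    }
    return counts;
}
```

With proportions {3, 1} and a 32-layer model, the sketch yields fractions 0.75 and 0.25, i.e. 24 and 8 layers.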

llama_context_params (runtime)

Notable fields:

Field                                                       Effect
n_ctx                                                       Max context length the cache can hold
n_batch                                                     Logical batch size (max tokens per llama_decode call)
n_ubatch                                                    Physical micro-batch size
n_seq_max                                                   Max concurrent sequences (slots)
n_threads, n_threads_batch                                  Threadpool sizing for generation vs prompt processing
rope_scaling_type, rope_freq_base, rope_freq_scale, yarn_*  Position-encoding overrides
defrag_thold                                                Auto-defrag the KV cache when fragmentation crosses this fraction
pooling_type                                                Embedding pooling: NONE / MEAN / CLS / LAST / RANK
attention_type                                              CAUSAL or NON_CAUSAL (for embeddings/encoders)
cache_type_k, cache_type_v                                  KV-cache element types (-ctk, -ctv)
embeddings                                                  If true, extract embeddings in addition to logits
offload_kqv                                                 Offload the KQV ops, including the KV cache, to the GPU
flash_attn                                                  Use flash-attention kernels (-fa)
no_perf                                                     Disable internal perf accounting

Constructor: llama_context_default_params().
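
The relationship between n_batch and n_ubatch can be sketched as plain arithmetic: a long prompt is submitted in chunks of at most n_batch tokens, and each chunk is processed in micro-batches of at most n_ubatch tokens. The helpers split_sizes and plan_decode below are hypothetical and not part of llama.h; the real splitting also depends on sequence layout and other details that this sketch ignores.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper, not part of llama.h: split `total` tokens into chunks
// of at most `limit` tokens each.
std::vector<int> split_sizes(int total, int limit) {
    assert(total >= 0 && limit > 0);
    std::vector<int> sizes;
    for (int done = 0; done < total; done += limit) {
        sizes.push_back(std::min(limit, total - done));
    }
    return sizes;
}

// Hypothetical helper: for each n_batch-sized submission, list the
// micro-batch sizes it breaks into under n_ubatch.
std::vector<std::vector<int>> plan_decode(int n_tokens, int n_batch, int n_ubatch) {
    std::vector<std::vector<int>> plan;
    for (int batch : split_sizes(n_tokens, n_batch)) {
        plan.push_back(split_sizes(batch, n_ubatch));
    }
    return plan;
}
```

For example, plan_decode(5000, 2048, 512) describes three submissions of 2048, 2048, and 904 tokens; the last one breaks into micro-batches of 512 and 392 tokens.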

llama_sampler_chain_params

Field     Effect
no_perf   Skip internal accounting

Constructor: llama_sampler_chain_default_params(). Most sampler config happens by composing llama_sampler_init_* calls — see Sampler.
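
The idea of configuring sampling by composition can be illustrated with a small self-contained model. This is not the llama.cpp API: Candidate, Stage, make_top_k, make_temp, and run_chain are hypothetical names invented for this sketch, standing in for a chain to which individual llama_sampler_init_* samplers are added in order. The final greedy pick is also an assumption made to keep the example deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical types for illustration only; not the llama.h structs.
struct Candidate {
    int   id;
    float logit;
};

// A stage edits the candidate list in place; a chain is an ordered list of stages.
using Stage = std::function<void(std::vector<Candidate> &)>;

// Keep only the k highest-logit candidates, sorted from highest to lowest.
Stage make_top_k(std::size_t k) {
    return [k](std::vector<Candidate> & cands) {
        const std::size_t keep = std::min(k, cands.size());
        std::partial_sort(cands.begin(),
                          cands.begin() + static_cast<std::ptrdiff_t>(keep),
                          cands.end(),
                          [](const Candidate & a, const Candidate & b) {
                              return a.logit > b.logit;
                          });
        cands.resize(keep);
    };
}

// Divide every logit by the temperature t.
Stage make_temp(float t) {
    assert(t > 0.0f);
    return [t](std::vector<Candidate> & cands) {
        for (Candidate & c : cands) {
            c.logit /= t;
        }
    };
}

// Apply the stages in order, then pick the highest-logit survivor (greedy).
int run_chain(const std::vector<Stage> & chain, std::vector<Candidate> cands) {
    for (const Stage & stage : chain) {
        stage(cands);
    }
    assert(!cands.empty());
    return std::max_element(cands.begin(), cands.end(),
                            [](const Candidate & a, const Candidate & b) {
                                return a.logit < b.logit;
                            })->id;
}
```

The order of stages matters in general: a filter such as top-k placed before temperature scaling sees unscaled logits, which is why the configuration is expressed as an ordered chain rather than a flat set of fields.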

llama_model_quantize_params

Field                                      Effect
nthread                                    Threads used during quantization
ftype                                      Target quant recipe (LLAMA_FTYPE_MOSTLY_Q4_K_M, ...)
output_tensor_type, token_embedding_type   Per-role overrides
allow_requantize                           Allow input that is already quantized
quantize_output_tensor                     If false, leave output.weight at source precision
only_copy                                  Skip quantization, just remap
pure                                       Quantize every tensor to the recipe type without per-tensor mixing
keep_split                                 Preserve a split-GGUF input layout
imatrix                                    Importance matrix file (or in-memory data)
kv_overrides                               Patch metadata on the way out
tensor_types                               Optional per-tensor type overrides
prune_layers                               Drop specific layer indices

Constructor: llama_model_quantize_default_params().

CLI groups

Tool flags are grouped in common/arg.cpp. The major groups:

  • Model: -m, -mu, -hf, -md, --no-mmap, --mlock, --lora, --lora-scaled, --control-vector, --check-tensors, --override-kv.
  • Context: -c, -n, -b, -ub, --keep, --rope-*, --yarn-*, -fa, -ctk, -ctv, --cache-reuse.
  • Threading and GPU: -t, -tb, --cpu-mask, --cpu-range, -ngl, -mg, -ts, -sm, -ot.
  • Sampling: --top-k, --top-p, --min-p, --typical, --temp, --temp-ext, --xtc-*, --repeat-*, --presence-penalty, --frequency-penalty, --dry-*, --mirostat-*, --samplers.
  • Constrained output: --grammar, --grammar-file, --json-schema, --json-schema-file.
  • Conversation: -cnv, -i, -if, --in-prefix, --in-suffix, --reverse-prompt, --chat-template, --chat-template-file.
  • Server: --host, --port, --api-key, --ssl-*, --parallel, --cont-batching, --slot-save-path, --slot-prompt-similarity, --metrics, --no-webui.
  • Logging: -v, --log-disable, --log-file, --log-prefix.

Run any tool with --help for the live list of flags actually compiled into your build.
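
As an illustration of how one of these flags carries structured data, the sketch below splits an --override-kv argument of the form key=type:value into its three parts. It assumes the accepted type names are int, float, bool, and str, and that bool values are spelled true or false. KvOverride and parse_kv_override are hypothetical names for this sketch, not the parser in common/arg.cpp, and the result is kept as strings rather than converted into a llama_model_kv_override.

```cpp
#include <cassert>
#include <string>

// Hypothetical container for the three parts of a key=type:value argument.
struct KvOverride {
    std::string key;
    std::string type;
    std::string value;
};

// Hypothetical parser: returns true and fills `out` when `arg` has the form
// key=type:value with a recognized type name; returns false otherwise.
bool parse_kv_override(const std::string & arg, KvOverride & out) {
    const std::string::size_type eq = arg.find('=');
    if (eq == std::string::npos || eq == 0) {
        return false; // missing '=' or empty key
    }
    const std::string::size_type colon = arg.find(':', eq + 1);
    if (colon == std::string::npos) {
        return false; // missing ':' between type and value
    }
    assert(colon > eq);

    const std::string type  = arg.substr(eq + 1, colon - eq - 1);
    const std::string value = arg.substr(colon + 1);

    // Assumption: these four type names are the only ones accepted.
    const bool known = type == "int" || type == "float" || type == "bool" || type == "str";
    if (!known) {
        return false;
    }
    if (type == "bool" && value != "true" && value != "false") {
        return false;
    }

    out.key   = arg.substr(0, eq);
    out.type  = type;
    out.value = value;
    return true;
}
```

Given the argument tokenizer.ggml.add_bos_token=bool:false, the sketch reports the key tokenizer.ggml.add_bos_token, the type bool, and the value false; an argument without an = separator is rejected.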

Environment variables

Selected env vars honored by the runtime:

  • GGML_LOG_LEVEL, LLAMA_LOG_LEVEL, GGML_BACKEND_LOG_LEVEL — verbosity.
  • GGML_THREADPOOL_* — threadpool tuning.
  • GGML_BACKEND_DL_PATH — extra search path for backend plugins.
  • LLAMA_* and GGML_* build-time defines configured by CMake (visible in the --version banner of any tool, generated from common/build-info.cpp.in).

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
