ggml-org/llama.cpp

Configuration

Authoritative configuration lives in C structs declared in include/llama.h. Most CLI flags defined in common/arg.cpp are ultimately translated into fields of these structs, though not every flag has a direct struct counterpart. This page is a high-level pointer; for exact field semantics, read the header comments.

`llama_model_params` (load-time)

Defined in include/llama.h. Notable fields:

Field	Effect
`n_gpu_layers`	Number of layers to offload to GPU. Any value at or above the model's layer count offloads everything; `99` is a common choice.
`main_gpu`	Which GPU holds shared/intermediate tensors
`tensor_split`	Per-GPU weight split fraction array
`vocab_only`	Load vocab but skip tensor data — fast for tokenizer experiments
`use_mmap`	Memory-map the model file. Default `true`; `--no-mmap` disables it
`use_mlock`	Pin model memory in RAM so it is not swapped out; CLI is `--mlock`
`check_tensors`	Validate tensor data on load; CLI is `--check-tensors`
`kv_overrides`	Array of `llama_model_kv_override` to patch GGUF metadata at load time; CLI is `--override-kv k=t:v`
`tensor_buft_overrides`	Force a specific buffer type (CPU vs GPU) per tensor pattern
`progress_callback`	Optional load-progress callback

Constructor: llama_model_default_params().
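
The `--override-kv k=t:v` syntax above packs a metadata key, a type tag, and a value into one argument. The following is a minimal sketch of how such a string could be split, assuming the type tags `int`, `float`, `bool`, and `str`. The `kv_override` struct and `parse_kv_override` function are hypothetical stand-ins written for this page, not the parser in common/arg.cpp or the real `llama_model_kv_override` type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in types for this sketch; the real override type is
 * llama_model_kv_override in include/llama.h. */
enum kv_type { KV_INT, KV_FLOAT, KV_BOOL, KV_STR };

struct kv_override {
    char         key[128];
    enum kv_type type;
    long long    i;      /* valid when type == KV_INT   */
    double       f;      /* valid when type == KV_FLOAT */
    bool         b;      /* valid when type == KV_BOOL  */
    char         s[128]; /* valid when type == KV_STR   */
};

/* Parse "key=type:value" into *out. Returns true on success. */
static bool parse_kv_override(const char *arg, struct kv_override *out) {
    assert(arg != NULL && out != NULL);

    /* Everything before the first '=' is the metadata key. */
    const char *eq = strchr(arg, '=');
    if (eq == NULL || eq == arg) return false;
    size_t klen = (size_t)(eq - arg);
    if (klen >= sizeof out->key) return false;
    memcpy(out->key, arg, klen);
    out->key[klen] = '\0';

    /* The type tag sits between '=' and the first ':' after it. */
    const char *type  = eq + 1;
    const char *colon = strchr(type, ':');
    if (colon == NULL) return false;
    size_t tlen = (size_t)(colon - type);
    const char *val = colon + 1;

    if (tlen == 3 && strncmp(type, "int", 3) == 0) {
        out->type = KV_INT;
        out->i = strtoll(val, NULL, 10);
    } else if (tlen == 5 && strncmp(type, "float", 5) == 0) {
        out->type = KV_FLOAT;
        out->f = strtod(val, NULL);
    } else if (tlen == 4 && strncmp(type, "bool", 4) == 0) {
        out->type = KV_BOOL;
        if (strcmp(val, "true") == 0)       out->b = true;
        else if (strcmp(val, "false") == 0) out->b = false;
        else return false;
    } else if (tlen == 3 && strncmp(type, "str", 3) == 0) {
        out->type = KV_STR;
        if (strlen(val) >= sizeof out->s) return false;
        strcpy(out->s, val);
    } else {
        return false; /* unknown type tag */
    }
    return true;
}
```

Under these assumptions, an argument such as `tokenizer.ggml.add_bos_token=bool:false` yields the key `tokenizer.ggml.add_bos_token`, the type `KV_BOOL`, and the value `false`; an argument without a `type:` part is rejected.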

`llama_context_params` (runtime)

Notable fields:

Field	Effect
`n_ctx`	Max context length the cache can hold
`n_batch`	Logical batch size (max tokens per `llama_decode`)
`n_ubatch`	Physical micro-batch size
`n_seq_max`	Max concurrent sequences (slots)
`n_threads`, `n_threads_batch`	Threadpool sizing for generation vs prompt processing
`rope_scaling_type`, `rope_freq_base`, `rope_freq_scale`, `yarn_*`	Position-encoding overrides
`defrag_thold`	Auto-defrag KV cache when fragmentation crosses this fraction
`pooling_type`	Embedding pooling: NONE / MEAN / CLS / LAST / RANK
`attention_type`	CAUSAL or NON_CAUSAL (for embeddings/encoders)
`type_k`, `type_v`	KV-cache element types (CLI flags `-ctk`, `-ctv`)
`embeddings`	If `true`, extract embeddings (together with logits)
`offload_kqv`	Offload the KQV operations, including the KV cache, to GPU
`flash_attn`	Use flash-attention kernels (`-fa`)
`no_perf`	Disable internal perf accounting

Constructor: llama_context_default_params().
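
The relationship between `n_batch` (logical) and `n_ubatch` (physical) is easiest to see as arithmetic. The sketch below assumes that a submitted batch of tokens is processed in consecutive slices of at most `n_ubatch` tokens each; the helper names `count_ubatches` and `ubatch_size` are hypothetical and exist only for illustration, and the actual scheduling inside the library may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Number of micro-batches needed to process n_tokens when each
 * micro-batch holds at most n_ubatch tokens (ceiling division). */
static size_t count_ubatches(size_t n_tokens, size_t n_ubatch) {
    assert(n_ubatch > 0);
    return (n_tokens + n_ubatch - 1) / n_ubatch;
}

/* Token count of the i-th micro-batch (0-based).
 * Every slice is full except possibly the last one. */
static size_t ubatch_size(size_t n_tokens, size_t n_ubatch, size_t i) {
    assert(i < count_ubatches(n_tokens, n_ubatch));
    size_t start     = i * n_ubatch;
    size_t remaining = n_tokens - start;
    return remaining < n_ubatch ? remaining : n_ubatch;
}
```

For example, with `n_ubatch` set to 512, a 1000-token batch would be handled as two slices of 512 and 488 tokens. A larger `n_ubatch` means fewer, bigger slices, which generally trades more compute-buffer memory for throughput.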

`llama_sampler_chain_params`

Field	Effect
`no_perf`	Disable internal performance accounting for the chain

Constructor: llama_sampler_chain_default_params(). Most sampler config happens by composing llama_sampler_init_* calls — see Sampler.
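
To illustrate what "composing" samplers means, here is a self-contained sketch of a chain in which each stage transforms the logits in place and the final pick is a greedy argmax. It is a toy model of the idea only: the `chain`, `chain_add`, `stage_top_k`, `stage_temp`, and `chain_sample_greedy` names are hypothetical and do not reproduce the library's sampler types or the behavior of the real `llama_sampler_init_*` functions.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

#define MAX_STAGES 8

/* Hypothetical stage signature: transform n logits in place.
 * Each stage takes a single float parameter to keep the sketch small. */
typedef void (*stage_fn)(float *logits, size_t n, float arg);

struct chain {
    stage_fn fn[MAX_STAGES];
    float    arg[MAX_STAGES];
    size_t   n_stages;
};

/* Append a stage; stages run in the order they were added. */
static void chain_add(struct chain *c, stage_fn fn, float arg) {
    assert(c->n_stages < MAX_STAGES);
    c->fn[c->n_stages]  = fn;
    c->arg[c->n_stages] = arg;
    c->n_stages++;
}

/* Keep the k largest logits and mask the rest with -INFINITY.
 * Masking in place is safe: any logit outside the top k is still
 * beaten by the k winners, which are never masked. */
static void stage_top_k(float *logits, size_t n, float arg) {
    size_t k = (size_t)arg;
    if (k == 0 || k >= n) return; /* nothing to filter */
    for (size_t i = 0; i < n; i++) {
        size_t rank = 0; /* entries ranked ahead of i (ties broken by index) */
        for (size_t j = 0; j < n; j++) {
            if (logits[j] > logits[i] || (logits[j] == logits[i] && j < i)) rank++;
        }
        if (rank >= k) logits[i] = -INFINITY;
    }
}

/* Divide every logit by the temperature. */
static void stage_temp(float *logits, size_t n, float arg) {
    assert(arg > 0.0f);
    for (size_t i = 0; i < n; i++) logits[i] /= arg;
}

/* Run all stages in order, then return the index of the largest logit. */
static size_t chain_sample_greedy(const struct chain *c, float *logits, size_t n) {
    assert(n > 0);
    for (size_t s = 0; s < c->n_stages; s++) c->fn[s](logits, n, c->arg[s]);
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (logits[i] > logits[best]) best = i;
    }
    return best;
}
```

The point of the design is that order matters and each stage is independent: adding `stage_top_k` before `stage_temp` first narrows the candidates and then rescales only the survivors. A real chain would typically end in a probabilistic draw rather than an argmax.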

`llama_model_quantize_params`

Field	Effect
`nthread`	Threads used during quantization
`ftype`	Target quant recipe (`LLAMA_FTYPE_MOSTLY_Q4_K_M`, ...)
`output_tensor_type`, `token_embedding_type`	Per-role overrides
`allow_requantize`	Allow input that is already quantized
`quantize_output_tensor`	If `false`, leave `output.weight` at source precision
`only_copy`	Skip quantization, just remap
`pure`	Quantize every tensor to the recipe type without per-tensor mixing
`keep_split`	Preserve a split-GGUF input layout
`imatrix`	Pointer to importance-matrix data (the CLI tool loads it from a file)
`kv_overrides`	Patch metadata on the way out
`tensor_types`	Optional per-tensor type overrides
`prune_layers`	Drop specific layer indices

Constructor: llama_model_quantize_default_params().

CLI groups

Tool flags are grouped in common/arg.cpp. The major groups:

Model — -m, -mu, -hf, -md, --no-mmap, --mlock, --lora, --lora-scaled, --control-vector, --check-tensors, --override-kv.
Context — -c, -n, -b, -ub, --keep, --rope-*, --yarn-*, -fa, -ctk, -ctv, --cache-reuse.
Threading and GPU — -t, -tb, --cpu-mask, --cpu-range, -ngl, -mg, -ts, -sm, -ot.
Sampling — --top-k, --top-p, --min-p, --typical, --temp, --temp-ext, --xtc-*, --repeat-*, --presence-penalty, --frequency-penalty, --dry-*, --mirostat-*, --samplers.
Constrained output — --grammar, --grammar-file, --json-schema, --json-schema-file.
Conversation — -cnv, -i, -if, --in-prefix, --in-suffix, --reverse-prompt, --chat-template, --chat-template-file.
Server — --host, --port, --api-key, --ssl-*, --parallel, --cont-batching, --slot-save-path, --slot-prompt-similarity, --metrics, --no-webui.
Logging — -v, --log-disable, --log-file, --log-prefix.

Run any tool with --help for the live list of flags actually compiled into your build.

Environment variables

Selected env vars honored by the runtime:

GGML_LOG_LEVEL, LLAMA_LOG_LEVEL, GGML_BACKEND_LOG_LEVEL — verbosity.
GGML_THREADPOOL_* — threadpool tuning.
GGML_BACKEND_DL_PATH — extra search path for backend plugins.
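LLAMA_* and GGML_* build-time defines are a separate category: they are set by CMake when the project is compiled rather than read from the environment at runtime. Build details appear in the --version banner of any tool, generated from common/build-info.cpp.in.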

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
