ggml-org/llama.cpp
Configuration
Authoritative configuration lives in C structs declared in include/llama.h. Most CLI flags in common/arg.cpp map onto fields of these structs; a few (such as the sampling and conversation flags) are handled by the tools themselves. This page is a high-level pointer; for exact field semantics, read the header comments.
llama_model_params (load-time)
Defined in include/llama.h. Notable fields:
| Field | Effect |
|---|---|
| n_gpu_layers | Layers to offload to GPU. 99 typically offloads everything. |
| main_gpu | Which GPU holds shared/intermediate tensors |
| tensor_split | Per-GPU weight split fraction array |
| vocab_only | Load vocab but skip tensor data; fast for tokenizer experiments |
| use_mmap | Default true; --no-mmap flips it |
| use_mlock | Pin model memory; --mlock |
| check_tensors | Validate numerics on load; --check-tensors |
| kv_overrides | Array of llama_model_kv_override to patch GGUF metadata at load time; CLI is --override-kv k=t:v |
| tensor_buft_overrides | Force a specific buffer type (CPU vs GPU) per tensor pattern |
| progress_callback | Optional load-progress callback |
Constructor: llama_model_default_params().
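The tensor_split entry is easiest to read as a list of relative proportions. The sketch below assumes the values are normalized by their sum, so that 3,1 means roughly 75% of the weights on the first GPU and 25% on the second; check the header comments for the exact semantics in your build. normalize_split is a hypothetical helper written for illustration and is not part of llama.h.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of llama.h): turn relative proportions such
// as {3, 1} into fractions that sum to 1.
std::vector<float> normalize_split(const std::vector<float> & proportions) {
    float total = 0.0f;
    for (float p : proportions) {
        assert(p >= 0.0f && "proportions must be non-negative");
        total += p;
    }

    std::vector<float> fractions(proportions.size(), 0.0f);
    if (total <= 0.0f) {
        return fractions; // nothing to split; leave every entry at zero
    }
    for (std::size_t i = 0; i < proportions.size(); ++i) {
        fractions[i] = proportions[i] / total;
    }
    return fractions;
}
```

Calling normalize_split({3.0f, 1.0f}) yields {0.75, 0.25}, which is one way to sanity-check a -ts value before passing it on the command line.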
llama_context_params (runtime)
Notable fields:
| Field | Effect |
|---|---|
| n_ctx | Max context length the cache can hold |
| n_batch | Logical batch size (max tokens per llama_decode) |
| n_ubatch | Physical micro-batch size |
| n_seq_max | Max concurrent sequences (slots) |
| n_threads, n_threads_batch | Threadpool sizing for generation vs prompt processing |
| rope_scaling_type, rope_freq_base, rope_freq_scale, yarn_* | Position-encoding overrides |
| defrag_thold | Auto-defrag KV cache when fragmentation crosses this fraction |
| pooling_type | Embedding pooling: NONE / MEAN / CLS / LAST / RANK |
| attention_type | CAUSAL or NON_CAUSAL (for embeddings/encoders) |
| type_k, type_v | KV-cache element types (-ctk, -ctv); the CLI layer stores these as cache_type_k and cache_type_v |
| embeddings | Set to true for embedding-only mode |
| offload_kqv | Offload the KQV ops (including the KV cache) to GPU when offloading |
| flash_attn | Use flash-attention kernels (-fa) |
| no_perf | Disable internal perf accounting |
Constructor: llama_context_default_params().
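The relationship between n_batch and n_ubatch can be pictured as two levels of chunking. The sketch below assumes a prompt is submitted in logical batches of at most n_batch tokens, and that each logical batch is processed in micro-batches of at most n_ubatch tokens. chunk_sizes and plan_batches are hypothetical helpers for illustration only; the real scheduling lives inside the library and may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: split n_tokens into chunks of at most chunk_size.
std::vector<int> chunk_sizes(int n_tokens, int chunk_size) {
    assert(chunk_size > 0);
    std::vector<int> out;
    for (int done = 0; done < n_tokens; done += chunk_size) {
        out.push_back(std::min(chunk_size, n_tokens - done));
    }
    return out;
}

// For each logical batch (at most n_batch tokens), list the sizes of the
// micro-batches (at most n_ubatch tokens) it would be processed in.
std::vector<std::vector<int>> plan_batches(int n_tokens, int n_batch, int n_ubatch) {
    std::vector<std::vector<int>> plan;
    for (int batch : chunk_sizes(n_tokens, n_batch)) {
        plan.push_back(chunk_sizes(batch, n_ubatch));
    }
    return plan;
}
```

For example, plan_batches(5000, 2048, 512) produces three logical batches (2048, 2048, and 904 tokens); the last one is processed as micro-batches of 512 and 392 tokens.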
llama_sampler_chain_params
| Field | Effect |
|---|---|
| no_perf | Skip internal accounting |
Constructor: llama_sampler_chain_default_params(). Most sampler config happens by composing llama_sampler_init_* calls — see Sampler.
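To illustrate what composing samplers means, here is a self-contained sketch of a chain of stages applied in order to a candidate list. It deliberately does not use llama.h: Candidate, Stage, top_k, temperature, and apply_chain are all hypothetical names that stand in for the llama_sampler_init_* building blocks, and the real samplers do more than this.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-ins for illustration; not the llama.cpp types.
struct Candidate {
    int   id;
    float logit;
};
using Stage = std::function<void(std::vector<Candidate> &)>;

// Stage that keeps only the k highest-logit candidates.
Stage top_k(std::size_t k) {
    return [k](std::vector<Candidate> & c) {
        std::sort(c.begin(), c.end(),
                  [](const Candidate & a, const Candidate & b) { return a.logit > b.logit; });
        if (c.size() > k) {
            c.resize(k);
        }
    };
}

// Stage that divides every logit by the temperature t.
Stage temperature(float t) {
    assert(t > 0.0f);
    return [t](std::vector<Candidate> & c) {
        for (auto & x : c) {
            x.logit /= t;
        }
    };
}

// Run the stages in the order they were added; order matters.
void apply_chain(const std::vector<Stage> & chain, std::vector<Candidate> & c) {
    for (const auto & stage : chain) {
        stage(c);
    }
}
```

With candidates {0: 1.0, 1: 3.0, 2: 2.0}, the chain {top_k(2), temperature(0.5f)} keeps ids 1 and 2 and scales their logits to 6.0 and 4.0. The design point is that each stage only transforms the candidate list, so stages can be reordered or swapped without touching the others.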
llama_model_quantize_params
| Field | Effect |
|---|---|
| nthread | Threads used during quantization |
| ftype | Target quant recipe (LLAMA_FTYPE_MOSTLY_Q4_K_M, ...) |
| output_tensor_type, token_embedding_type | Per-role overrides |
| allow_requantize | Allow input that is already quantized |
| quantize_output_tensor | If false, leave output.weight at source precision |
| only_copy | Skip quantization, just remap |
| pure | Quantize every tensor to the recipe type without per-tensor mixing |
| keep_split | Preserve a split-GGUF input layout |
| imatrix | Importance matrix file (or in-memory data) |
| kv_overrides | Patch metadata on the way out |
| tensor_types | Optional per-tensor type overrides |
| prune_layers | Drop specific layer indices |
Constructor: llama_model_quantize_default_params().
CLI groups
Tool flags are grouped in common/arg.cpp. The major groups:
- Model: -m, -mu, -hf, -md, --no-mmap, --mlock, --lora, --lora-scaled, --control-vector, --check-tensors, --override-kv.
- Context: -c, -n, -b, -ub, --keep, --rope-*, --yarn-*, -fa, -ctk, -ctv, --cache-reuse.
- Threading and GPU: -t, -tb, --cpu-mask, --cpu-range, -ngl, -mg, -ts, -sm, -ot.
- Sampling: --top-k, --top-p, --min-p, --typical, --temp, --temp-ext, --xtc-*, --repeat-*, --presence-penalty, --frequency-penalty, --dry-*, --mirostat-*, --samplers.
- Constrained output: --grammar, --grammar-file, --json-schema, --json-schema-file.
- Conversation: -cnv, -i, -if, --in-prefix, --in-suffix, --reverse-prompt, --chat-template, --chat-template-file.
- Server: --host, --port, --api-key, --ssl-*, --parallel, --cont-batching, --slot-save-path, --slot-prompt-similarity, --metrics, --no-webui.
- Logging: -v, --log-disable, --log-file, --log-prefix.
Run any tool with --help for the live list of flags actually compiled into your build.
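As an illustration of the k=t:v shape accepted by --override-kv, the sketch below splits such an argument into key, type tag, and value. It is not the parser used in common/; KvOverride and parse_kv_override are hypothetical names, and the set of type tags (int, float, bool, str) is an assumption to confirm against --help in your build.

```cpp
#include <cassert>
#include <string>

// Hypothetical holder for one parsed override; not llama_model_kv_override.
struct KvOverride {
    std::string key;
    std::string type;
    std::string value;
};

// Parse "key=type:value". Returns false and leaves `out` untouched on
// malformed input or an unrecognized type tag.
bool parse_kv_override(const std::string & arg, KvOverride & out) {
    const std::string::size_type eq = arg.find('=');
    if (eq == std::string::npos || eq == 0) {
        return false; // missing '=' or empty key
    }
    const std::string::size_type colon = arg.find(':', eq + 1);
    if (colon == std::string::npos) {
        return false; // missing type tag separator
    }

    const std::string type = arg.substr(eq + 1, colon - eq - 1);
    // Assumed set of type tags; adjust to match your build.
    if (type != "int" && type != "float" && type != "bool" && type != "str") {
        return false;
    }

    out.key   = arg.substr(0, eq);
    out.type  = type;
    out.value = arg.substr(colon + 1);
    assert(!out.key.empty()); // guaranteed by the eq == 0 check above
    return true;
}
```

For instance, parsing tokenizer.ggml.add_bos_token=bool:false yields the key tokenizer.ggml.add_bos_token, the type bool, and the value false. Searching for the colon only after the equals sign keeps values that themselves contain colons intact.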
Environment variables
Selected env vars honored by the runtime:
- GGML_LOG_LEVEL, LLAMA_LOG_LEVEL, GGML_BACKEND_LOG_LEVEL: verbosity.
- GGML_THREADPOOL_*: threadpool tuning.
- GGML_BACKEND_DL_PATH: extra search path for backend plugins.
- LLAMA_* and GGML_* build-time defines configured by CMake (visible in the --version banner of any tool, generated from common/build-info.cpp.in).
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.