ggml-org/llama.cpp

Adapters

Active contributors: Sigbjørn Skjæret, Georgi Gerganov

llama.cpp supports two kinds of runtime weight modifications: LoRA adapters and control vectors. Both are loaded as separate GGUF files and applied on top of a base model without rewriting the base weights.

Purpose

  • Load LoRA adapter weights and apply them to the base model's matmuls.
  • Load control vectors (per-layer biases) used for "steering" experiments.
  • Allow multiple adapters to be hot-swapped or scaled per request.
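
To make the "per-layer bias" idea concrete, here is a minimal Python sketch. It is not llama.cpp code: the function name apply_control_vector, the dict-of-lists layout, and the strength parameter are hypothetical and chosen only for illustration.

```python
def apply_control_vector(hidden, cvec, layer, strength=1.0):
    """Add a per-layer steering direction to one hidden-state vector.

    hidden   : list of floats, the hidden state for one token
    cvec     : dict mapping layer index -> direction (list of floats)
    layer    : index of the layer being evaluated
    strength : hypothetical multiplier; 0.0 leaves the state unchanged
    """
    direction = cvec.get(layer)
    if direction is None:
        # No vector stored for this layer: return an unmodified copy.
        return list(hidden)
    return [h + strength * d for h, d in zip(hidden, direction)]


# Assumed toy data: a control vector that only touches layer 1.
cvec = {1: [0.5, -1.0]}
steered = apply_control_vector([1.0, 1.0], cvec, layer=1, strength=2.0)
untouched = apply_control_vector([1.0, 1.0], cvec, layer=0)
```

The base weights are never modified in this sketch; only the activations flowing through the layer are shifted, which matches the description of control vectors as biases rather than weight edits.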

Key abstractions

| Type | Role | File |
| --- | --- | --- |
| llama_adapter_lora | A loaded LoRA: per-layer A/B factor pairs + scale | src/llama-adapter.h |
| llama_adapter_cvec | Per-layer control vector | src/llama-adapter.h |
| llama_adapter_lora_init, _free, _get_alora_invocation_tokens | Public C API | include/llama.h |
| llama_set_adapter_lora(ctx, lora, scale) | Activate (or deactivate with scale=0) a LoRA on a context | include/llama.h |
| build_lora_mm | Graph helper that injects active LoRA matmuls | src/llama-graph.h, src/llama-graph.cpp |

src/llama-adapter.cpp (~18 KB) handles the loading and the per-tensor matching against the base model.

How LoRA application works

```mermaid
graph TD
    Base[Base weight W] --> Mat[ggml_mul_mat with input X]
    Lora[LoRA A and B] --> Apply{any LoRA active?}
    Apply -->|no| Mat
    Apply -->|yes| Add[X@W + scale * X@A@B]
    Mat --> Out[layer output]
    Add --> Out
```

build_lora_mm in src/llama-graph.cpp is called by every per-architecture graph builder wherever it multiplies by a weight. The helper checks whether an active LoRA has tensors for that weight. If so, it emits the extra low-rank matmuls (X @ A, then the result @ B), multiplies that result by the configured scale, and adds it to the base output X @ W.
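
The arithmetic behind this can be sketched in a few lines of plain Python. This illustrates only the formula X@W + scale * X@A@B from the diagram; it is not the ggml graph code, and the names matmul and lora_mm are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_mm(x, w, lora=None, scale=1.0):
    """Return x @ w, plus scale * (x @ a @ b) when a LoRA pair is given."""
    out = matmul(x, w)
    if lora is None:
        return out  # no adapter for this weight: plain matmul
    a, b = lora
    # Low-rank path: two small matmuls, the full-size product a @ b is
    # never materialized.
    delta = matmul(matmul(x, a), b)
    return [[o + scale * d for o, d in zip(o_row, d_row)]
            for o_row, d_row in zip(out, delta)]


# Assumed toy shapes: x is 1x2, w is 2x2, a is 2x1 (rank 1), b is 1x2.
x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0], [1.0]]
b = [[0.5, -0.5]]

base = lora_mm(x, w)
adapted = lora_mm(x, w, (a, b), scale=2.0)
```

Calling lora_mm with scale=0.0 returns the same values as the base path, which is the sense in which a zero scale deactivates an adapter in this sketch.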

File format

LoRA adapter files are GGUFs with a small set of metadata keys: general.type = "adapter", adapter.type = "lora", adapter.alpha, and, for activated LoRA, adapter.lora.tokens. Their per-layer tensors are named to match the tensors of the base model. convert_lora_to_gguf.py converts HuggingFace PEFT exports into this format.
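
As a rough illustration of how a loader might use these keys, the sketch below classifies an already-parsed metadata dict. The key names are copied from the paragraph above and have not been checked against the actual loader; classify_adapter and the dict representation are hypothetical, and no GGUF parsing is performed.

```python
def classify_adapter(metadata):
    """Return 'lora' or 'alora' for a metadata dict, or raise ValueError.

    metadata is assumed to be a plain dict of GGUF key/value pairs that
    some other code has already read from the file.
    """
    if metadata.get("general.type") != "adapter":
        raise ValueError("not an adapter GGUF")
    if metadata.get("adapter.type") != "lora":
        raise ValueError("adapter is not a LoRA")
    if "adapter.alpha" not in metadata:
        raise ValueError("missing adapter.alpha")
    # A non-empty invocation-token list marks an activated LoRA.
    return "alora" if metadata.get("adapter.lora.tokens") else "lora"


# Assumed example metadata; the values are made up.
plain = {"general.type": "adapter", "adapter.type": "lora",
         "adapter.alpha": 16.0}
activated = dict(plain, **{"adapter.lora.tokens": [7, 8]})

kinds = (classify_adapter(plain), classify_adapter(activated))
```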

Control vectors use a similar GGUF layout and are produced by tools/cvector-generator/.

Integration points

  • CLI. Tools accept --lora, --lora-scaled, --lora-init-without-apply, and --control-vector. See common/arg.cpp.
  • Server. tools/server exposes per-request LoRA scaling through its API surface.
  • Graph builder. Every weight matmul goes through build_lora_mm, so no per-architecture code changes are needed to support LoRA.
  • Saver. src/llama-model-saver.cpp can merge LoRA into the base weights and write a new GGUF, which llama-export-lora exposes.
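
Merging and runtime application should give the same layer output, because X @ (W + scale * A@B) equals X@W + scale * X@A@B. The Python sketch below checks this on toy matrices. It only illustrates the algebra; merge_lora and matmul are hypothetical helpers, not the saver's actual code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(w, a, b, scale):
    """Fold a LoRA pair into the base weight: w + scale * (a @ b)."""
    delta = matmul(a, b)
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Assumed toy data: rank-1 adapter on a 2x2 identity weight.
x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0], [1.0]]
b = [[0.5, -0.5]]
scale = 2.0

merged = merge_lora(w, a, b, scale)
merged_out = matmul(x, merged)

# Runtime path for comparison: base output plus the scaled low-rank term.
low_rank = matmul(matmul(x, a), b)
runtime_out = [[o + scale * d for o, d in zip(o_row, d_row)]
               for o_row, d_row in zip(matmul(x, w), low_rank)]
```

The trade-off is that a merged file no longer allows the scale to be changed or the adapter to be swapped at run time, while the runtime path pays for two extra small matmuls per adapted weight.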

Activated LoRA (aLoRA)

Some adapters only fire after specific "invocation tokens" appear in the prompt. llama_adapter_lora_get_alora_invocation_tokens returns the trigger tokens stored in the GGUF metadata; tools/server and llama-cli can use this to switch the LoRA on/off mid-generation.
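
A caller that wants to act on the trigger tokens first has to locate them in the prompt. The Python sketch below shows one way to do that search. It assumes the adapter becomes active at the start of the last occurrence of the invocation sequence; that choice, and the names find_invocation_start and adapter_mask, are assumptions for illustration rather than the behavior of tools/server or llama-cli.

```python
def find_invocation_start(tokens, invocation):
    """Return the start index of the last occurrence of invocation in
    tokens, or None if it does not occur."""
    n = len(invocation)
    if n == 0:
        return None
    # Scan from the end so the most recent occurrence wins.
    for start in range(len(tokens) - n, -1, -1):
        if tokens[start:start + n] == invocation:
            return start
    return None


def adapter_mask(tokens, invocation):
    """Per-token flags: True where the adapter would be active."""
    start = find_invocation_start(tokens, invocation)
    if start is None:
        return [False] * len(tokens)
    return [i >= start for i in range(len(tokens))]


# Assumed token ids; [7, 8] stands in for the stored invocation tokens.
prompt = [5, 9, 7, 8, 3, 7, 8, 4]
start = find_invocation_start(prompt, [7, 8])
mask = adapter_mask(prompt, [7, 8])
```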

Entry points for modification

  • New adapter type. Add the GGUF metadata keys, extend llama_adapter_* types, and inject the application logic into build_lora_mm (or a sibling helper).
  • Per-token scaling. The current API uses a single scalar; adding token-level scaling would require new graph plumbing.
  • Conversion. convert_lora_to_gguf.py is the place to support new HuggingFace PEFT formats.
  • Control vector generation. tools/cvector-generator/ produces control-vector GGUFs.
  • Merging. tools/export-lora/ merges a LoRA into a base GGUF and writes the result.
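
For the per-token scaling idea in the list above, the following Python sketch shows what the math would look like if each input row carried its own scale instead of sharing one scalar. This is a hypothetical extension, not an existing llama.cpp feature; lora_mm_per_token and its scales argument are invented for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_mm_per_token(x, w, a, b, scales):
    """Compute x @ w + diag(scales) @ (x @ a @ b), one scale per row of x."""
    if len(scales) != len(x):
        raise ValueError("need exactly one scale per token (row of x)")
    base = matmul(x, w)
    delta = matmul(matmul(x, a), b)
    return [[o + s * d for o, d in zip(o_row, d_row)]
            for o_row, d_row, s in zip(base, delta, scales)]


# Assumed toy data: two identical tokens, adapter off for the first
# and at scale 2.0 for the second.
x = [[1.0, 2.0], [1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0], [1.0]]
b = [[0.5, -0.5]]

out = lora_mm_per_token(x, w, a, b, scales=[0.0, 2.0])
```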

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
