ggml-org/llama.cpp

Adapters

Active contributors: Sigbjørn Skjæret, Georgi Gerganov

llama.cpp supports two kinds of runtime weight modifications: LoRA adapters and control vectors. Both are loaded as separate GGUF files and applied on top of a base model without rewriting the base weights.

Purpose

Load LoRA adapter weights and apply them to the base model's matmuls.
Load control vectors (per-layer biases) used for "steering" experiments.
Allow multiple adapters to be hot-swapped or scaled per request.

Key abstractions

| Symbol | Role | File |
| --- | --- | --- |
| `llama_adapter_lora` | A loaded LoRA: per-layer A/B factor pairs + scale | `src/llama-adapter.h` |
| `llama_adapter_cvec` | Per-layer control vector | `src/llama-adapter.h` |
| `llama_adapter_lora_init`, `llama_adapter_lora_free`, `llama_adapter_lora_get_alora_invocation_tokens` | Public C API | `include/llama.h` |
| `llama_set_adapter_lora(ctx, lora, scale)` | Activate (or deactivate with `scale=0`) a LoRA on a context | `include/llama.h` |
| `build_lora_mm` | Graph helper that injects active LoRA matmuls | `src/llama-graph.h`, `src/llama-graph.cpp` |

src/llama-adapter.cpp (~18 KB) handles the loading and the per-tensor matching against the base model.

How LoRA application works

graph TD
    Base[Base weight W] --> Mat[ggml_mul_mat with input X]
    Lora[LoRA A and B] --> Apply{any LoRA active?}
    Apply -->|no| Mat
    Apply -->|yes| Add[X@W + scale * X@A@B]
    Mat --> Out[layer output]
    Add --> Out

build_lora_mm in src/llama-graph.cpp is called by every per-architecture builder whenever it performs a weight matmul. The helper checks whether an active LoRA has factors for that weight. If so, it emits the additional low-rank matmuls through A and B and adds their result, multiplied by the configured scale, to the base output.
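The arithmetic behind this step can be illustrated outside ggml. The following is a minimal, self-contained sketch, not the actual build_lora_mm code: `Matrix`, `matvec`, `effective_scale`, and `lora_matvec` are hypothetical names, and the `alpha / rank` factor is the usual LoRA convention, assumed here rather than taken from this page. It uses column-vector notation, so the diagram's `X@A@B` appears as `B(Ax)`.

```cpp
#include <cstddef>
#include <vector>

// Row-major dense matrix; a stand-in for a ggml tensor (illustrative only).
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data; // rows * cols values

    float at(std::size_t r, std::size_t c) const { return data[r * cols + c]; }
};

// y = M * x for a column vector x with M.cols entries.
std::vector<float> matvec(const Matrix & m, const std::vector<float> & x) {
    std::vector<float> y(m.rows, 0.0f);
    for (std::size_t r = 0; r < m.rows; ++r) {
        for (std::size_t c = 0; c < m.cols; ++c) {
            y[r] += m.at(r, c) * x[c];
        }
    }
    return y;
}

// Assumption: the user-facing scale is multiplied by alpha / rank,
// and alpha == 0 means "use the user scale unchanged".
float effective_scale(float alpha, std::size_t rank, float user_scale) {
    return alpha != 0.0f ? user_scale * alpha / static_cast<float>(rank) : user_scale;
}

// y = W x + scale * B (A x), where A is rank x n_in and B is n_out x rank.
std::vector<float> lora_matvec(const Matrix & w, const Matrix & a, const Matrix & b,
                               float scale, const std::vector<float> & x) {
    std::vector<float> y = matvec(w, x);
    if (scale == 0.0f) {
        return y; // adapter inactive: plain base matmul
    }
    const std::vector<float> delta = matvec(b, matvec(a, x));
    for (std::size_t i = 0; i < y.size(); ++i) {
        y[i] += scale * delta[i];
    }
    return y;
}
```

The base weights are never modified in this scheme: the low-rank term is recomputed per matmul, which is what makes changing the scale or swapping adapters cheap compared with merging.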

File format

LoRA adapter files are GGUFs with a small set of metadata keys (general.type = "adapter", adapter.type = "lora", adapter.lora.alpha, and adapter.alora.invocation_tokens for activated LoRA) and per-layer tensors whose names match the base model's tensor names. convert_lora_to_gguf.py is the conversion path from HuggingFace PEFT exports.
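As a rough illustration of how a loader might sanity-check such a file, the sketch below works on an in-memory copy of the metadata and tensor names instead of a real GGUF reader. `Metadata`, `is_lora_adapter`, and `paired_base_tensors` are hypothetical names, and the `.lora_a` / `.lora_b` tensor-name suffixes are an assumption about the naming scheme, not something this page states.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory view of a GGUF file's string metadata (key -> value).
using Metadata = std::map<std::string, std::string>;

// True when the metadata identifies the file as a LoRA adapter.
bool is_lora_adapter(const Metadata & meta) {
    const auto general = meta.find("general.type");
    const auto adapter = meta.find("adapter.type");
    return general != meta.end() && general->second == "adapter" &&
           adapter != meta.end() && adapter->second == "lora";
}

static bool ends_with(const std::string & s, const std::string & suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Returns the base tensor names that have both an A and a B factor.
// Assumption: factors are stored as "<base tensor name>.lora_a" / ".lora_b".
std::vector<std::string> paired_base_tensors(const std::vector<std::string> & names) {
    const std::string suffix_a = ".lora_a";
    const std::string suffix_b = ".lora_b";
    std::set<std::string> has_a;
    std::set<std::string> has_b;
    for (const std::string & name : names) {
        if (ends_with(name, suffix_a)) {
            has_a.insert(name.substr(0, name.size() - suffix_a.size()));
        } else if (ends_with(name, suffix_b)) {
            has_b.insert(name.substr(0, name.size() - suffix_b.size()));
        }
    }
    std::vector<std::string> paired;
    for (const std::string & base : has_a) { // std::set iterates in sorted order
        if (has_b.count(base) != 0) {
            paired.push_back(base);
        }
    }
    return paired;
}
```

Each name returned by `paired_base_tensors` would then be looked up in the base model; a factor pair without a matching base tensor is the kind of mismatch the per-tensor matching step has to reject.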

Control vectors use a similar GGUF layout and are produced by tools/cvector-generator/.
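Once loaded, a control vector acts as a per-layer bias on the hidden state. The following sketch shows that idea in isolation; `ControlVector` and `apply_control_vector` are hypothetical names, and the inclusive layer range is an assumption made for illustration, not a description of `llama_adapter_cvec`.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container: one bias vector per layer, active in [layer_start, layer_end].
struct ControlVector {
    int layer_start;
    int layer_end; // inclusive
    std::vector<std::vector<float>> per_layer; // per_layer[il] is empty or has n_embd values
};

// Adds the layer's bias to the hidden state in place.
// Layers outside the range, or without a matching vector, are left untouched.
void apply_control_vector(const ControlVector & cvec, int il, std::vector<float> & hidden) {
    if (il < 0 || il < cvec.layer_start || il > cvec.layer_end) {
        return;
    }
    if (static_cast<std::size_t>(il) >= cvec.per_layer.size()) {
        return;
    }
    const std::vector<float> & bias = cvec.per_layer[static_cast<std::size_t>(il)];
    if (bias.size() != hidden.size()) {
        return; // no vector for this layer, or shape mismatch
    }
    for (std::size_t i = 0; i < hidden.size(); ++i) {
        hidden[i] += bias[i];
    }
}
```

Unlike a LoRA, this adds no matmuls: the cost is one vector addition per affected layer.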

Integration points

CLI. Tools accept --lora, --lora-scaled, --lora-init-without-apply, and --control-vector. See common/arg.cpp.
Server. tools/server exposes per-request LoRA scaling through its API surface.
Graph builder. Every weight matmul goes through build_lora_mm, so no per-architecture code changes are needed to support LoRA.
Saver. src/llama-model-saver.cpp can merge LoRA into the base weights and write a new GGUF, which llama-export-lora exposes.

Activated LoRA (aLoRA)

Some adapters only fire after specific "invocation tokens" appear in the prompt. llama_adapter_lora_get_alora_invocation_tokens returns the trigger tokens stored in the GGUF metadata; tools/server and llama-cli can use this to switch the LoRA on/off mid-generation.
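One plausible way a caller could use the invocation tokens is to locate them in the tokenized prompt and enable the adapter only from that position onward. The sketch below is an assumption-laden illustration, not the server's actual logic: `find_invocation_start` and `alora_scale_at` are hypothetical helpers, and both the choice of the last occurrence and activation starting at the first invocation token are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using Token = std::int32_t;

// Index of the first token of the LAST occurrence of `invocation` in `prompt`,
// or -1 when the sequence is empty or does not occur.
std::ptrdiff_t find_invocation_start(const std::vector<Token> & prompt,
                                     const std::vector<Token> & invocation) {
    if (invocation.empty()) {
        return -1;
    }
    const auto it = std::find_end(prompt.begin(), prompt.end(),
                                  invocation.begin(), invocation.end());
    return it == prompt.end() ? -1 : it - prompt.begin();
}

// Scale to use for the token at `pos`: zero before the invocation, `scale` from it onward.
float alora_scale_at(std::ptrdiff_t pos, std::ptrdiff_t invocation_start, float scale) {
    return (invocation_start >= 0 && pos >= invocation_start) ? scale : 0.0f;
}
```

Because tokens before the invocation point see only the base weights under this scheme, their cached state would not depend on the adapter.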

Entry points for modification

New adapter type. Add the GGUF metadata keys, extend llama_adapter_* types, and inject the application logic into build_lora_mm (or a sibling helper).
Per-token scaling. The current API uses a single scalar; adding token-level scaling would require new graph plumbing.
Conversion. convert_lora_to_gguf.py is the place to support new HuggingFace PEFT formats.
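To make the per-token scaling point concrete, the sketch below contrasts the single-scalar behaviour described above with a hypothetical per-token variant. Nothing here exists in llama.cpp: `Rows`, `add_scaled`, and `add_scaled_per_token` are illustrative names, and the LoRA contribution is modelled as a plain `[n_tokens][n_out]` array.

```cpp
#include <cstddef>
#include <vector>

using Rows = std::vector<std::vector<float>>; // [n_tokens][n_out]

// Single scalar for every token, as described for the current API.
// Assumes `out` and `delta` have identical shapes.
void add_scaled(Rows & out, const Rows & delta, float scale) {
    for (std::size_t t = 0; t < out.size(); ++t) {
        for (std::size_t i = 0; i < out[t].size(); ++i) {
            out[t][i] += scale * delta[t][i];
        }
    }
}

// Hypothetical extension: one scale per token row.
// Returns false (leaving `out` untouched) when the shapes do not line up.
bool add_scaled_per_token(Rows & out, const Rows & delta, const std::vector<float> & scales) {
    if (delta.size() != out.size() || scales.size() != out.size()) {
        return false;
    }
    for (std::size_t t = 0; t < out.size(); ++t) {
        if (delta[t].size() != out[t].size()) {
            return false;
        }
    }
    for (std::size_t t = 0; t < out.size(); ++t) {
        for (std::size_t i = 0; i < out[t].size(); ++i) {
            out[t][i] += scales[t] * delta[t][i];
        }
    }
    return true;
}
```

In a real graph the per-token scales would have to be supplied as an input tensor and broadcast across the output dimension, which is the extra plumbing the bullet refers to.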

tools/cvector-generator/ — produces control-vector GGUFs.
tools/export-lora/ — merges a LoRA into a base GGUF and writes the result.
