ggml-org/llama.cpp
Adapters
Active contributors: Sigbjørn Skjæret, Georgi Gerganov
llama.cpp supports two kinds of runtime weight modifications: LoRA adapters and control vectors. Both are loaded as separate GGUF files and applied on top of a base model without rewriting the base weights.
Purpose
- Load LoRA adapter weights and apply them to the base model's matmuls.
- Load control vectors (per-layer biases) used for "steering" experiments.
- Allow multiple adapters to be hot-swapped or scaled per request.
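As a rough illustration of what a per-layer bias means for steering, the sketch below adds a control vector to one layer's hidden state. It is plain Python, not llama.cpp code: `apply_control_vector`, the `cvec` dictionary, and the `strength` parameter are hypothetical names chosen for this example.

```python
def apply_control_vector(hidden, layer, cvec, strength=1.0):
    """Return hidden + strength * cvec[layer]; layers without a vector pass through."""
    direction = cvec.get(layer)
    if direction is None:
        return list(hidden)
    if len(direction) != len(hidden):
        raise ValueError("control vector width must match the hidden size")
    return [h + strength * d for h, d in zip(hidden, direction)]


# Hypothetical data: only layer 1 has a steering direction.
cvec = {1: [0.5, -0.5, 0.0]}
hidden = [1.0, 2.0, 3.0]

steered = apply_control_vector(hidden, 1, cvec, strength=2.0)
untouched = apply_control_vector(hidden, 0, cvec)
```

Because the vector is added to activations rather than to weights, the base model stays unchanged, and a strength of zero leaves the hidden state as it was.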
Key abstractions
| Type | Role | File |
|---|---|---|
| `llama_adapter_lora` | A loaded LoRA: per-layer A/B factor pairs + scale | `src/llama-adapter.h` |
| `llama_adapter_cvec` | Per-layer control vector | `src/llama-adapter.h` |
| `llama_adapter_lora_init`, `_free`, `_get_alora_invocation_tokens` | Public C API | `include/llama.h` |
| `llama_set_adapter_lora(ctx, lora, scale)` | Activate (or deactivate with scale=0) a LoRA on a context | `include/llama.h` |
| `build_lora_mm` | Graph helper that injects active LoRA matmuls | `src/llama-graph.h`, `src/llama-graph.cpp` |
`src/llama-adapter.cpp` (~18 KB) handles loading and the per-tensor matching against the base model.
How LoRA application works
graph TD
Base[Base weight W] --> Mat[ggml_mul_mat with input X]
Lora[LoRA A and B] --> Apply{any LoRA active?}
Apply -->|no| Mat
Apply -->|yes| Add[X@W + scale * X@A@B]
Mat --> Out[layer output]
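Add --> Out

`build_lora_mm` in `src/llama-graph.cpp` is called by every per-architecture builder whenever it performs a weight matmul. The helper checks whether an active LoRA has factors for that weight. If so, it emits the two extra low-rank matmuls (the input times A, then times B), multiplies the result by the configured scale, and adds it to the base matmul output. If not, only the base matmul is emitted.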
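The arithmetic in the diagram can be sketched in a few lines of plain Python. This is a conceptual model only, using the diagram's `X@W + scale * X@A@B` notation; the function names `matmul` and `lora_mm` and the `(a, b, scale)` tuple layout are assumptions for illustration and do not mirror ggml's tensor layout or the real `build_lora_mm` signature.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_mm(x, w, adapters):
    """Return x@w plus scale * (x@a)@b for every active (a, b, scale) adapter."""
    out = matmul(x, w)
    for a, b, scale in adapters:
        if scale == 0:
            continue  # a zero scale is treated as "adapter off"
        delta = matmul(matmul(x, a), b)  # low-rank path: never materializes a@b
        out = [[o + scale * d for o, d in zip(orow, drow)]
               for orow, drow in zip(out, delta)]
    return out


x = [[1.0, 2.0]]                # 1 x 2 input
w = [[1.0, 0.0], [0.0, 1.0]]    # 2 x 2 base weight (identity)
a = [[1.0], [1.0]]              # 2 x 1 factor (rank 1)
b = [[2.0, 0.0]]                # 1 x 2 factor

base = lora_mm(x, w, [])
adapted = lora_mm(x, w, [(a, b, 0.5)])
```

Computing `(x@a)@b` rather than `x@(a@b)` keeps the extra work proportional to the adapter rank, which is the usual reason for applying LoRA as two small matmuls at runtime.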
File format
LoRA adapter files are GGUFs with a small set of metadata keys (`general.type = "adapter"`, `adapter.type = "lora"`, `adapter.lora.alpha`, and `adapter.alora.invocation_tokens` for activated LoRA) and per-layer tensors named after the base-model tensors they modify. `convert_lora_to_gguf.py` is the conversion path from HuggingFace PEFT exports.
Control vectors use a similar GGUF layout and are produced by `tools/cvector-generator/`.
Integration points
- CLI. Tools accept `--lora`, `--lora-scaled`, `--lora-init-without-apply`, and `--control-vector`. See `common/arg.cpp`.
- Server. `tools/server` exposes per-request LoRA scaling through its API surface.
- Graph builder. Every weight matmul goes through `build_lora_mm`, so no per-architecture code changes are needed to support LoRA.
- Saver. `src/llama-model-saver.cpp` can merge LoRA into the base weights and write a new GGUF, which `llama-export-lora` exposes.
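Merging a LoRA into the base weights and applying it at runtime are mathematically the same operation, which the following plain-Python sketch checks on a tiny example. The helper names `matmul` and `merge_lora` are hypothetical, and the code says nothing about how the saver or `llama-export-lora` handle quantized tensors or file I/O.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, a, b, scale):
    """Return the merged weight w + scale * (a @ b)."""
    delta = matmul(a, b)
    return [[wv + scale * dv for wv, dv in zip(wrow, drow)]
            for wrow, drow in zip(w, delta)]


x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0], [1.0]]
b = [[2.0, 0.0]]
scale = 0.5

# Offline path: fold the adapter into the weight once, then do one matmul.
merged = merge_lora(w, a, b, scale)
offline = matmul(x, merged)

# Runtime path: keep w untouched and add the scaled low-rank term per call.
runtime = [[base + scale * d for base, d in zip(brow, drow)]
           for brow, drow in zip(matmul(x, w), matmul(matmul(x, a), b))]
```

The trade-off is that a merged file no longer needs the adapter at inference time, but the scale is baked in and the adapter can no longer be swapped or rescaled per request.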
Activated LoRA (aLoRA)
Some adapters only fire after specific "invocation tokens" appear in the prompt. `llama_adapter_lora_get_alora_invocation_tokens` returns the trigger tokens stored in the GGUF metadata; `tools/server` and `llama-cli` can use this to switch the LoRA on/off mid-generation.
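One way a caller might use the invocation tokens is to scan the tokenized prompt for the trigger sequence and enable the adapter from that point onward. The sketch below is a hypothetical illustration in plain Python: `find_invocation_start` and `adapter_mask` are made-up names, and the choice to match the last occurrence and to switch the adapter on at the first token of the trigger sequence is an assumption, not a description of what `tools/server` does.

```python
def find_invocation_start(prompt_tokens, invocation_tokens):
    """Return the index where the last occurrence of invocation_tokens begins, or None."""
    n = len(invocation_tokens)
    if n == 0:
        return None
    # Walk backwards so the most recent trigger wins.
    for start in range(len(prompt_tokens) - n, -1, -1):
        if prompt_tokens[start:start + n] == invocation_tokens:
            return start
    return None


def adapter_mask(prompt_tokens, invocation_tokens):
    """Mark which prompt positions would run with the adapter enabled."""
    start = find_invocation_start(prompt_tokens, invocation_tokens)
    if start is None:
        return [False] * len(prompt_tokens)
    return [i >= start for i in range(len(prompt_tokens))]


# Hypothetical token ids; [7, 8] plays the role of the invocation sequence.
prompt = [5, 9, 7, 8, 3, 7, 8, 4]
start = find_invocation_start(prompt, [7, 8])
mask = adapter_mask(prompt, [7, 8])
```

Tokens before the trigger run on the unmodified base model, so any cached state for that prefix could in principle be shared with requests that do not use the adapter.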
Entry points for modification
- New adapter type. Add the GGUF metadata keys, extend `llama_adapter_*` types, and inject the application logic into `build_lora_mm` (or a sibling helper).
- Per-token scaling. The current API uses a single scalar; adding token-level scaling would require new graph plumbing.
- Conversion. `convert_lora_to_gguf.py` is the place to support new HuggingFace PEFT formats.
Related tools
- `tools/cvector-generator/`: produces control-vector GGUFs.
- `tools/export-lora/`: merges a LoRA into a base GGUF and writes the result.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.