ggml-org/llama.cpp
KV cache and memory
Active contributors: Georgi Gerganov
Transformer inference is dominated by the cost of the attention layer's key/value cache. llama.cpp factors that cost behind a llama_memory interface that the graph builder reads from and writes to. There are four implementations covering standard attention, sliding-window attention, recurrent state-space models, and hybrid layouts.
Purpose
- Keep per-sequence key/value tensors across llama_decode calls.
- Manage slot allocation, eviction, and defragmentation.
- Hide the difference between standard transformers, SWA models, SSMs, and hybrids from the graph code.
The four implementations
| Implementation | Used by | File |
|---|---|---|
| llama_kv_cache | Standard transformers (LLaMA, Qwen, Mistral, Phi, ...) | src/llama-kv-cache.cpp |
| llama_kv_cache_iswa | Models with interleaved sliding-window attention (Gemma 2/3, Phi-3, some Qwen variants) | src/llama-kv-cache-iswa.cpp |
| llama_memory_recurrent | Pure SSMs (Mamba, Mamba2, RWKV-6, RWKV-7, FalconMamba) | src/llama-memory-recurrent.cpp |
| llama_memory_hybrid (+ iswa variant) | Hybrid layouts (Jamba, Granite Hybrid, BailingMoeV2) | src/llama-memory-hybrid.cpp, src/llama-memory-hybrid-iswa.cpp |
All four implement the same llama_memory interface declared in src/llama-memory.h.
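As a rough illustration of what "one interface, four implementations" means, here is a minimal sketch. Every name in it (toy_memory, toy_arch, make_memory) is hypothetical; the real interface in src/llama-memory.h has more methods and different signatures, and the real selection logic lives in llama-context.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for the llama_memory interface.
struct toy_memory {
    virtual ~toy_memory() = default;
    virtual std::string kind() const = 0;
};

struct toy_kv_cache      : toy_memory { std::string kind() const override { return "kv"; } };
struct toy_kv_cache_iswa : toy_memory { std::string kind() const override { return "kv-iswa"; } };
struct toy_recurrent     : toy_memory { std::string kind() const override { return "recurrent"; } };
struct toy_hybrid        : toy_memory { std::string kind() const override { return "hybrid"; } };

// Hypothetical architecture traits used only to pick an implementation.
struct toy_arch {
    bool has_attention;
    bool has_recurrent;
    bool has_swa;
};

// Pick one implementation; callers only ever see the base interface.
std::unique_ptr<toy_memory> make_memory(const toy_arch & arch) {
    assert(arch.has_attention || arch.has_recurrent);
    if (arch.has_attention && arch.has_recurrent) {
        return std::make_unique<toy_hybrid>();
    }
    if (arch.has_recurrent) {
        return std::make_unique<toy_recurrent>();
    }
    if (arch.has_swa) {
        return std::make_unique<toy_kv_cache_iswa>();
    }
    return std::make_unique<toy_kv_cache>();
}
```

The graph code would hold only a toy_memory pointer, which is the property the shared interface is meant to provide.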
Standard KV cache
src/llama-kv-cache.cpp is the reference implementation. It tracks per-token cells (defined in src/llama-kv-cells.h), each recording the position and sequence id(s) of the token stored there; the cell's index locates that token's rows in the K/V tensor buffers. The cache is contiguous in memory but logically partitioned by sequence.
graph LR
Decode[llama_decode batch] --> Apply[memory.apply]
Apply --> Find[find or allocate slots per token]
Find --> Build[graph builder reads/writes K/V at slot]
Build --> Compute[ggml backend computes]
Compute --> Commit[memory.commit]
Commit --> Done[slots are durable]

Operations exposed to users (declared in include/llama.h):
- llama_kv_self_seq_* family: copy, divide, shift, remove sequences.
- llama_kv_self_clear: wipe everything.
- llama_kv_self_defrag: compact fragmented slots.
- llama_kv_self_seq_keep: drop everything except a given sequence (used by the server when a slot is reused).
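The semantics of the sequence operations can be modelled on a toy cell array. This is a sketch under stated assumptions, not the library's code: toy_cell, toy_cells and their methods are hypothetical, and the model assumes a cell may be shared by several sequences and becomes free once no sequence references it.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy cell: a position plus the set of sequences that reference it.
// An empty set means the cell is free.
struct toy_cell {
    int pos = -1;
    std::set<int> seq;
};

struct toy_cells {
    std::vector<toy_cell> cells;

    explicit toy_cells(std::size_t n) : cells(n) {}

    // Store a token at cell i for one sequence.
    void put(std::size_t i, int pos, int seq_id) {
        assert(i < cells.size());
        cells[i].pos = pos;
        cells[i].seq = {seq_id};
    }

    // Remove seq_id from every cell with position in [p0, p1).
    void seq_rm(int seq_id, int p0, int p1) {
        for (auto & c : cells) {
            if (c.pos >= p0 && c.pos < p1 && c.seq.erase(seq_id) > 0 && c.seq.empty()) {
                c.pos = -1; // last reference gone: the cell is free again
            }
        }
    }

    // Let dst share the cells of src in [p0, p1) without copying data.
    void seq_cp(int src, int dst, int p0, int p1) {
        for (auto & c : cells) {
            if (c.pos >= p0 && c.pos < p1 && c.seq.count(src) > 0) {
                c.seq.insert(dst);
            }
        }
    }

    // Drop everything that does not belong to seq_id.
    void seq_keep(int seq_id) {
        for (auto & c : cells) {
            if (c.seq.count(seq_id) > 0) {
                c.seq = {seq_id};
            } else {
                c.seq.clear();
                c.pos = -1;
            }
        }
    }

    std::size_t used() const {
        std::size_t n = 0;
        for (const auto & c : cells) {
            n += c.seq.empty() ? 0 : 1;
        }
        return n;
    }
};
```

In this model a copy is cheap because it only adds a reference, and a remove frees a cell only when the last reference disappears.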
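cparams.cache_type_k / cparams.cache_type_v (-ctk, -ctv in tools) select the storage type of the cache itself: f16 (default), bf16, q8_0, q4_0, etc. Relative to f16, q8_0 roughly halves cache memory (8.5 vs 16 bits per element) and q4_0 cuts it to a bit over a quarter (4.5 bits per element), at a measurable but usually small quality cost.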
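To make the memory trade-off concrete, the following sketch estimates cache size from block layouts. The block sizes are those of ggml's f16, q8_0 and q4_0 formats (2 bytes per element, 34 bytes per 32 elements, 18 bytes per 32 elements); the helper names and the model shape in the usage note are hypothetical, and the estimate ignores any padding or per-backend overhead.

```cpp
#include <cassert>
#include <cstdint>

// Elements per block and bytes per block for a cache storage type.
struct toy_type {
    uint64_t block_elems;
    uint64_t block_bytes;
};

constexpr toy_type TOY_F16  = {1, 2};
constexpr toy_type TOY_Q8_0 = {32, 34}; // 32 int8 values + one f16 scale
constexpr toy_type TOY_Q4_0 = {32, 18}; // 32 4-bit values + one f16 scale

// Bytes needed for K and V across all layers and the whole context.
uint64_t kv_bytes(uint64_t n_layer, uint64_t n_ctx,
                  uint64_t n_head_kv, uint64_t head_dim,
                  toy_type type_k, toy_type type_v) {
    const uint64_t row = n_head_kv * head_dim; // elements per token per layer
    assert(row % type_k.block_elems == 0 && row % type_v.block_elems == 0);
    const uint64_t k_row_bytes = row / type_k.block_elems * type_k.block_bytes;
    const uint64_t v_row_bytes = row / type_v.block_elems * type_v.block_bytes;
    return n_layer * n_ctx * (k_row_bytes + v_row_bytes);
}
```

For a hypothetical model with 32 layers, 8 KV heads of dimension 128 and an 8192-token context, this gives 1 GiB at f16, 544 MiB at q8_0 and 288 MiB at q4_0.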
Defragmentation
After many llama_kv_self_seq_* operations, the cache becomes fragmented. llama_kv_self_defrag and the auto-defrag policy in cparams.defrag_thold rearrange cells to keep large contiguous slots available.
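A simplified picture of what compaction does, assuming a flat array of cells where -1 marks a free cell. The fragmentation metric and the move strategy here are illustrative choices, not the ones llama.cpp uses; a real implementation also has to move the corresponding K/V rows for every recorded move.

```cpp
#include <cassert>
#include <utility>
#include <vector>

constexpr int FREE = -1;

// Fraction of holes inside the occupied span [0, last used cell].
float fragmentation(const std::vector<int> & cells) {
    int used = 0;
    int span = 0;
    for (int i = 0; i < (int) cells.size(); ++i) {
        if (cells[i] != FREE) {
            ++used;
            span = i + 1;
        }
    }
    return span == 0 ? 0.0f : 1.0f - (float) used / (float) span;
}

// Fill holes at the front with used cells taken from the back.
// Returns the (source, destination) moves that were performed.
std::vector<std::pair<int, int>> defrag(std::vector<int> & cells) {
    std::vector<std::pair<int, int>> moves;
    int lo = 0;
    int hi = (int) cells.size() - 1;
    while (true) {
        while (lo < hi && cells[lo] != FREE) ++lo; // first hole
        while (hi > lo && cells[hi] == FREE) --hi; // last used cell
        if (lo >= hi) break;
        cells[lo] = cells[hi];
        cells[hi] = FREE;
        moves.emplace_back(hi, lo);
    }
    return moves;
}
```

An auto-defrag policy in this model would call defrag only when fragmentation(cells) exceeds a threshold, which is the role the text assigns to cparams.defrag_thold.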
Sliding-window (iSWA)
src/llama-kv-cache-iswa.cpp wraps two underlying caches: one for "normal" (full-context) layers and one for "sliding-window" layers. Models like Gemma 2/3 alternate between the two and only need a window-sized cache for the SWA layers, so the memory for those layers stays bounded by the window instead of growing with context length.
The iswa variant tags layers as full vs windowed using hparams.is_swa(layer). The graph builder asks the memory state which kind of slot to use per layer.
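The saving can be estimated with a small sketch. toy_hparams, its swa_period field and the layer pattern in is_swa are hypothetical (real models define their own pattern through hparams), and the count ignores any extra headroom the real windowed cache may reserve beyond the window itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct toy_hparams {
    uint32_t n_layer;
    uint32_t n_swa;      // window size in tokens
    uint32_t swa_period; // hypothetical: every swa_period-th layer uses full attention

    bool is_swa(uint32_t il) const {
        assert(il < n_layer && swa_period > 0);
        return (il + 1) % swa_period != 0;
    }
};

// Total number of per-layer cells needed for a context of n_ctx tokens.
uint64_t total_cells(const toy_hparams & hp, uint32_t n_ctx, bool use_iswa) {
    uint64_t total = 0;
    for (uint32_t il = 0; il < hp.n_layer; ++il) {
        const bool windowed = use_iswa && hp.is_swa(il);
        total += windowed ? std::min(n_ctx, hp.n_swa) : n_ctx;
    }
    return total;
}
```

With 6 layers, a 1024-token window, one full layer per 6 and a 32768-token context, the count drops from 196608 cells to 37888.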
Recurrent (SSM)
src/llama-memory-recurrent.cpp is for pure SSMs, which have no per-token KV entries. Instead it stores the recurrent state tensor(s) per sequence (for Mamba, the convolution and SSM states, the latter sized by ssm_d_state; for RWKV, the hidden state). Slot allocation is sequence-level rather than token-level.
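Sequence-level allocation can be sketched as follows. toy_recurrent_memory is a hypothetical model: each slot holds one fixed-size state vector, so memory use depends on the number of concurrent sequences and not on how many tokens each has processed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct toy_recurrent_memory {
    std::vector<std::vector<float>> state; // state[slot]: fixed-size recurrent state
    std::vector<int> owner;                // owning sequence id per slot, -1 if free

    toy_recurrent_memory(std::size_t n_slots, std::size_t state_size)
        : state(n_slots, std::vector<float>(state_size, 0.0f)), owner(n_slots, -1) {}

    // Return the slot of seq_id, claiming a free one for a new sequence.
    // Returns -1 when every slot is taken by another sequence.
    int find_slot(int seq_id) {
        assert(seq_id >= 0);
        int free_slot = -1;
        for (std::size_t i = 0; i < owner.size(); ++i) {
            if (owner[i] == seq_id) return (int) i;
            if (owner[i] == -1 && free_slot < 0) free_slot = (int) i;
        }
        if (free_slot >= 0) owner[free_slot] = seq_id;
        return free_slot;
    }

    // Release the sequence's slot and reset its state.
    void seq_rm(int seq_id) {
        for (std::size_t i = 0; i < owner.size(); ++i) {
            if (owner[i] == seq_id) {
                owner[i] = -1;
                std::fill(state[i].begin(), state[i].end(), 0.0f);
            }
        }
    }
};
```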
Hybrid
Some recent architectures alternate transformer attention layers with SSM layers (Jamba, Granite Hybrid). src/llama-memory-hybrid.cpp composes a llama_kv_cache for the attention layers with a llama_memory_recurrent for the SSM layers. The iswa variant adds sliding-window support on top.
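One consequence of composing two memories is that sequence operations have to succeed on both or on neither. The sketch below models a single sequence and assumes, as a simplification, that a recurrent state can only be dropped as a whole while attention cells can be removed by range. All type names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Attention part: one cached position per token, removable by range.
struct toy_attn_cache {
    std::vector<int> pos;

    bool seq_rm(int p0, int p1) {
        pos.erase(std::remove_if(pos.begin(), pos.end(),
                                 [=](int p) { return p >= p0 && p < p1; }),
                  pos.end());
        return true;
    }
};

// Recurrent part: tokens are folded into one state and cannot be unfolded,
// so only a removal that covers the whole sequence is accepted.
struct toy_recurrent_state {
    int n_past = 0;

    bool seq_rm(int p0, int p1) {
        if (p0 > 0 || p1 < n_past) return false; // partial removal is impossible
        n_past = 0;
        return true;
    }
};

struct toy_hybrid_memory {
    toy_attn_cache      attn;
    toy_recurrent_state rec;

    void append(int n_tokens) {
        assert(n_tokens >= 0);
        for (int i = 0; i < n_tokens; ++i) attn.pos.push_back(rec.n_past + i);
        rec.n_past += n_tokens;
    }

    // Try the part that can fail first, so a failure leaves both parts untouched.
    bool seq_rm(int p0, int p1) {
        if (!rec.seq_rm(p0, p1)) return false;
        return attn.seq_rm(p0, p1);
    }
};
```

Ordering the calls this way keeps the two children in sync without needing a rollback path.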
Key abstractions
| Type | Role | File |
|---|---|---|
| llama_memory | Polymorphic memory backend interface | src/llama-memory.h |
| llama_memory_state | Pre-decode "what slots will I touch" object | src/llama-memory.h |
| llama_kv_cells | Per-cell metadata for the standard KV cache | src/llama-kv-cells.h |
| llama_kv_cache::slot_info | Result of slot search | src/llama-kv-cache.h |
| cparams.cache_type_k, cache_type_v | Quantization type for KV | src/llama-cparams.h |
| cparams.defrag_thold | Auto-defrag threshold | src/llama-cparams.h |
How a decode interacts with memory
sequenceDiagram
participant Caller
participant Ctx as llama_context
participant Mem as llama_memory
participant Graph as llama-graph
participant GGML
Caller->>Ctx: llama_decode(batch)
Ctx->>Mem: apply(batch) -> memory_state
Mem-->>Ctx: per-token slot mapping
Ctx->>Graph: build_graph(arch, batch, memory_state)
Graph->>GGML: schedule + compute
GGML-->>Ctx: logits
Ctx->>Mem: commit() (durable on success)

apply is reversible (it can be discarded if the graph build fails). commit makes the new slot assignments durable.
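The reversible-then-durable split can be sketched as a two-phase update. toy_two_phase_memory and its methods are hypothetical and much simpler than the real memory state object: slot assignments are staged on the side and only written into the cell array on commit.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct toy_two_phase_memory {
    std::vector<int> cells;                          // committed token per cell, -1 if free
    std::vector<std::pair<std::size_t, int>> pending; // staged (cell index, token)

    explicit toy_two_phase_memory(std::size_t n) : cells(n, -1) {}

    // Stage one free cell per token. Stages nothing and returns false
    // when the batch does not fit.
    bool apply(const std::vector<int> & batch) {
        assert(pending.empty());
        std::vector<std::pair<std::size_t, int>> staged;
        for (std::size_t i = 0; i < cells.size() && staged.size() < batch.size(); ++i) {
            if (cells[i] == -1) {
                const int tok = batch[staged.size()];
                staged.emplace_back(i, tok);
            }
        }
        if (staged.size() < batch.size()) return false;
        pending = std::move(staged);
        return true;
    }

    // Make the staged assignments durable.
    void commit() {
        for (const auto & p : pending) cells[p.first] = p.second;
        pending.clear();
    }

    // Drop the staged assignments, e.g. after a failed graph build.
    void discard() { pending.clear(); }

    std::size_t used() const {
        std::size_t n = 0;
        for (int c : cells) n += c == -1 ? 0 : 1;
        return n;
    }
};
```

Until commit runs, the committed cells are untouched, so a failed decode can simply call discard.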
Integration points
- llama-context picks the memory implementation at init time based on model->arch and hparams.
- Server slots. Each tools/server slot owns a seq_id and uses llama_kv_self_seq_keep / llama_kv_self_seq_rm aggressively to manage prompt caching.
- Speculative decoding. Verification swaps draft and target sequences in and out of the cache via llama_kv_self_seq_*.
- State save/load. llama_state_* serializes the context's KV state, or a single sequence's via llama_state_seq_*, so it can be reloaded later (used by examples/save-load-state/).
Entry points for modification
- New cache layout (e.g. block-sparse). Implement llama_memory and have llama-context select it for the relevant architectures.
- New KV quantization type. Extend ggml_type and ensure backends support it; the cache itself is already type-agnostic.
- Eviction policy. Look at the slot-search functions in src/llama-kv-cache.cpp; they're the natural place to plug in alternative policies.
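To illustrate where an eviction policy could hook into slot search, here is a toy version. It is a sketch only: it assumes a slot is a contiguous run of free cells, and evict_policy, find_slot and evict_largest are hypothetical names that do not correspond to functions in src/llama-kv-cache.cpp.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <vector>

constexpr int FREE_CELL = -1;

// First index of a run of n free cells, or -1 if there is none.
int find_free_run(const std::vector<int> & owner, int n) {
    int run = 0;
    for (int i = 0; i < (int) owner.size(); ++i) {
        run = owner[i] == FREE_CELL ? run + 1 : 0;
        if (run == n) return i - n + 1;
    }
    return -1;
}

// Hypothetical policy hook: returns the sequence id to evict, or -1 to give up.
using evict_policy = std::function<int(const std::vector<int> &)>;

// Search for a slot of n cells, evicting whole sequences until one fits.
int find_slot(std::vector<int> & owner, int n, const evict_policy & pick_victim) {
    assert(n > 0);
    while (true) {
        const int head = find_free_run(owner, n);
        if (head >= 0) return head;
        const int victim = pick_victim(owner);
        if (victim < 0) return -1;
        bool freed = false;
        for (auto & o : owner) {
            if (o == victim) {
                o = FREE_CELL;
                freed = true;
            }
        }
        if (!freed) return -1; // the policy named a sequence that owns nothing
    }
}

// Example policy: evict the sequence that owns the most cells.
int evict_largest(const std::vector<int> & owner) {
    std::map<int, int> count;
    for (int o : owner) {
        if (o != FREE_CELL) ++count[o];
    }
    int best = -1;
    int best_n = 0;
    for (const auto & kv : count) {
        if (kv.second > best_n) {
            best = kv.first;
            best_n = kv.second;
        }
    }
    return best;
}
```

Swapping evict_largest for another callable, such as one that tracks least-recently-used sequences, changes the policy without touching the search loop.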
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.