ggml-org/llama.cpp

KV cache and memory

Active contributors: Georgi Gerganov

At long context lengths, the memory footprint of transformer inference is dominated by the attention layers' key/value (KV) cache. llama.cpp factors that cost behind a llama_memory interface that the graph builder reads from and writes to. There are four implementations covering standard attention, sliding-window attention, recurrent state-space models, and hybrid layouts.

Purpose

Keep per-sequence key/value tensors across llama_decode calls.
Manage slot allocation, eviction, and defragmentation.
Hide the difference between standard transformers, SWA models, SSMs, and hybrids from the graph code.

The four implementations

Implementation	Used by	File
`llama_kv_cache`	Standard transformers (LLaMA, Qwen, Mistral, Phi, ...)	`src/llama-kv-cache.cpp`
`llama_kv_cache_iswa`	Models with interleaved sliding-window attention (Gemma 2/3, Phi-3, some Qwen variants)	`src/llama-kv-cache-iswa.cpp`
`llama_memory_recurrent`	Pure SSMs (Mamba, Mamba2, RWKV-6, RWKV-7, FalconMamba)	`src/llama-memory-recurrent.cpp`
`llama_memory_hybrid` (+ `iswa` variant)	Hybrid layouts (Jamba, Granite Hybrid, BailingMoeV2)	`src/llama-memory-hybrid.cpp`, `src/llama-memory-hybrid-iswa.cpp`

All four implement the same llama_memory interface declared in src/llama-memory.h.

Standard KV cache

src/llama-kv-cache.cpp is the reference implementation. It tracks per-token cells (defined in src/llama-kv-cells.h), each holding the sequence id, position, and pointer into the K/V tensor buffers. The cache is contiguous in memory but logically partitioned by sequence.
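
The cell bookkeeping described above can be sketched with a toy model. Everything here (`toy_cell`, `toy_find_slot`, `toy_fill_slot`) is a hypothetical stand-in, not the actual llama.cpp types or search logic; it only illustrates how a contiguous array of cells can be searched for a run of free slots and then tagged with a sequence id and positions.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy model of one cache cell: a position plus the sequences that own it.
// Hypothetical type; the real metadata lives in src/llama-kv-cells.h.
struct toy_cell {
    int32_t pos = -1;            // -1 marks an empty cell
    std::set<int32_t> seq_ids;   // sequences sharing this cell
    bool is_empty() const { return seq_ids.empty(); }
};

// Find the first run of n_tokens consecutive empty cells.
// Returns the index of the first cell in the run, or -1 if none exists.
int32_t toy_find_slot(const std::vector<toy_cell> & cells, int32_t n_tokens) {
    assert(n_tokens > 0);
    int32_t run = 0;
    for (int32_t i = 0; i < (int32_t) cells.size(); ++i) {
        run = cells[i].is_empty() ? run + 1 : 0;
        if (run == n_tokens) {
            return i - n_tokens + 1;
        }
    }
    return -1;
}

// Mark cells [head, head + n_tokens) as consecutive positions of one sequence.
void toy_fill_slot(std::vector<toy_cell> & cells, int32_t head, int32_t n_tokens,
                   int32_t seq_id, int32_t pos0) {
    assert(head >= 0 && head + n_tokens <= (int32_t) cells.size());
    for (int32_t i = 0; i < n_tokens; ++i) {
        cells[head + i].pos = pos0 + i;
        cells[head + i].seq_ids.insert(seq_id);
    }
}
```

A caller would search with `toy_find_slot`, and on success record the batch with `toy_fill_slot`. A first-fit scan is only one possible search strategy and is an assumption of this sketch.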

graph LR
    Decode[llama_decode batch] --> Apply[memory.apply]
    Apply --> Find[find or allocate slots per token]
    Find --> Build[graph builder reads/writes K/V at slot]
    Build --> Compute[ggml backend computes]
    Compute --> Commit[memory.commit]
    Commit --> Done[slots are durable]

Operations exposed to users (declared in include/llama.h):

llama_kv_self_seq_* family — copy, divide, shift, remove sequences.
llama_kv_self_clear — wipe everything.
llama_kv_self_defrag — compact fragmented slots.
llama_kv_self_seq_keep — drop everything except a given sequence (used by the server when a slot is reused).
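
To make the semantics of the sequence operations concrete, here is a minimal sketch on a toy cell array. The `toy_seq_*` functions are hypothetical and only mimic the behavior described in the list above (remove, copy, keep); they are not the llama.h functions, and their signatures are assumptions of the sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Toy cell: one cached token position, owned by zero or more sequences.
struct seq_cell {
    int32_t pos = -1;
    std::set<int32_t> seq_ids;
};

using seq_cells = std::vector<seq_cell>;

// Remove seq_id from cells with pos in [p0, p1); a negative p1 means "to the end".
void toy_seq_rm(seq_cells & cells, int32_t seq_id, int32_t p0, int32_t p1) {
    for (auto & c : cells) {
        const bool in_range = c.pos >= p0 && (p1 < 0 || c.pos < p1);
        if (in_range && c.seq_ids.erase(seq_id) > 0 && c.seq_ids.empty()) {
            c.pos = -1; // no owner left: the cell becomes free
        }
    }
}

// Share every cell of seq_src with seq_dst; in this toy model no data is copied.
void toy_seq_cp(seq_cells & cells, int32_t seq_src, int32_t seq_dst) {
    assert(seq_src != seq_dst);
    for (auto & c : cells) {
        if (c.seq_ids.count(seq_src) > 0) {
            c.seq_ids.insert(seq_dst);
        }
    }
}

// Drop every sequence except seq_id.
void toy_seq_keep(seq_cells & cells, int32_t seq_id) {
    for (auto & c : cells) {
        const bool keep = c.seq_ids.count(seq_id) > 0;
        c.seq_ids.clear();
        if (keep) { c.seq_ids.insert(seq_id); } else { c.pos = -1; }
    }
}

// Count cells that still have at least one owner.
int32_t toy_n_used(const seq_cells & cells) {
    int32_t n = 0;
    for (const auto & c : cells) { n += c.seq_ids.empty() ? 0 : 1; }
    return n;
}
```

Because a cell can belong to several sequences, removing one owner frees the cell only when no other sequence still references it. That is why a copy followed by a removal of the source leaves the data in place for the copy.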

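cparams.cache_type_k / cparams.cache_type_v (-ctk, -ctv in tools) select the storage type of the cache itself: f16 (default), bf16, q8_0, q4_0, etc. Relative to f16, q8_0 (about 8.5 bits per element) cuts KV memory roughly in half and q4_0 (about 4.5 bits per element) cuts it to a little over a quarter, at a small but measurable quality cost that grows as precision drops.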
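
To put numbers on the cache-type trade-off, the following sketch estimates KV memory from the block layouts of three types. The helper names (`toy_type_traits`, `toy_kv_bytes`) and the model dimensions in the usage note are illustrative assumptions; the estimate ignores padding and any per-backend overhead.

```cpp
#include <cassert>
#include <cstdint>

// Storage cost of one block for a few cache types:
// f16 stores 1 element in 2 bytes, q8_0 stores 32 elements in 34 bytes,
// q4_0 stores 32 elements in 18 bytes (quantized payload plus one f16 scale).
struct toy_type_traits {
    int64_t block_elems;
    int64_t block_bytes;
};

constexpr toy_type_traits TOY_F16  = {  1,  2 };
constexpr toy_type_traits TOY_Q8_0 = { 32, 34 };
constexpr toy_type_traits TOY_Q4_0 = { 32, 18 };

// Bytes needed for K plus V across all layers.
// n_embd_kv is the per-token K (or V) width, i.e. n_head_kv * head_dim.
int64_t toy_kv_bytes(int64_t n_layer, int64_t n_ctx, int64_t n_embd_kv,
                     toy_type_traits type_k, toy_type_traits type_v) {
    assert(n_embd_kv % type_k.block_elems == 0);
    assert(n_embd_kv % type_v.block_elems == 0);
    const int64_t k = n_embd_kv / type_k.block_elems * type_k.block_bytes;
    const int64_t v = n_embd_kv / type_v.block_elems * type_v.block_bytes;
    return n_layer * n_ctx * (k + v);
}
```

For a hypothetical model with 32 layers, a K/V width of 1024, and an 8192-token context, this gives 1 GiB for f16, 544 MiB for q8_0, and 288 MiB for q4_0, which corresponds to ratios of about 0.53 and 0.28.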

Defragmentation

After many llama_kv_self_seq_* operations, the cache can become fragmented: free cells end up scattered between used ones. llama_kv_self_defrag and the auto-defrag policy controlled by cparams.defrag_thold rearrange cells to keep large contiguous slots available.
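
A threshold-driven compaction pass can be sketched as follows. The fragmentation metric, the meaning given to a negative threshold, and all `toy_*` names are assumptions made for illustration; the real policy may measure fragmentation differently, and a real pass also has to move the K/V tensor data, which this toy model omits.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy cache: each entry is the position stored in that cell, or -1 if empty.
using toy_cache = std::vector<int32_t>;

// Fraction of holes inside the occupied span [0, last used cell].
// 0.0 means the used cells are already packed at the front.
double toy_fragmentation(const toy_cache & cells) {
    int32_t n_used = 0;
    int32_t last_used = -1;
    for (int32_t i = 0; i < (int32_t) cells.size(); ++i) {
        if (cells[i] >= 0) { ++n_used; last_used = i; }
    }
    if (last_used < 0) { return 0.0; }
    return 1.0 - (double) n_used / (double) (last_used + 1);
}

// Compact used cells to the front, preserving their order.
// Returns the number of cells that had to move.
int32_t toy_defrag(toy_cache & cells) {
    int32_t dst = 0;
    int32_t n_moved = 0;
    for (int32_t src = 0; src < (int32_t) cells.size(); ++src) {
        if (cells[src] < 0) { continue; }
        assert(dst <= src); // the write cursor never overtakes the read cursor
        if (src != dst) {
            cells[dst] = cells[src];
            cells[src] = -1;
            ++n_moved;
        }
        ++dst;
    }
    return n_moved;
}

// Threshold policy: defragment only when fragmentation exceeds thold.
// A negative thold disables the automatic pass in this sketch.
bool toy_maybe_defrag(toy_cache & cells, double thold) {
    if (thold < 0.0 || toy_fragmentation(cells) <= thold) { return false; }
    toy_defrag(cells);
    return true;
}
```

The threshold exists because compaction is not free: each moved cell implies a data copy, so a policy like this one skips the pass while the holes are still tolerable.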

Sliding-window (iSWA)

src/llama-kv-cache-iswa.cpp wraps two underlying caches: one for "normal" (full-context) layers and one for "sliding-window" layers. Models like Gemma 2/3 alternate between the two and only need a window-sized cache for SWA layers, which dramatically reduces memory for long contexts.

The iswa variant tags layers as full vs windowed using hparams.is_swa(layer). The graph builder asks the memory state which kind of slot to use per layer.
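
The memory saving from the two-cache layout can be estimated with a short sketch. The layer-tagging rule in `toy_is_swa` is a hypothetical pattern, not the logic behind hparams.is_swa, and the cell counts ignore any padding a real cache might add to the window.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical layer tagging: every period-th layer is full attention and the
// rest use a sliding window. period = 2 gives a 1:1 alternation.
bool toy_is_swa(int32_t il, int32_t period) {
    assert(period > 0);
    return (il % period) != (period - 1);
}

// Cells needed by a two-cache layout: full layers keep n_ctx cells,
// windowed layers keep at most n_swa cells.
int64_t toy_iswa_cells(int32_t n_layer, int32_t period, int64_t n_ctx, int64_t n_swa) {
    int64_t total = 0;
    for (int32_t il = 0; il < n_layer; ++il) {
        total += toy_is_swa(il, period) ? std::min(n_ctx, n_swa) : n_ctx;
    }
    return total;
}

// Whether a key at position p_key is visible to a query at position p_query.
bool toy_is_visible(int32_t p_query, int32_t p_key, bool swa, int32_t n_swa) {
    if (p_key > p_query) { return false; }                  // causal mask
    if (swa && p_query - p_key >= n_swa) { return false; }  // outside the window
    return true;
}
```

With 4 layers alternating 1:1, a 32768-token context, and a 4096-token window, the layout needs 73728 cells instead of the 131072 a single full-context cache would need. The visibility check shows why the windowed layers can discard older cells: keys that fall outside the window are masked out anyway.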

Recurrent (SSM)

src/llama-memory-recurrent.cpp serves pure SSMs, which have no per-token KV entries. Instead, it stores the recurrent state tensor(s) per sequence (for example Mamba's SSM state, whose size is set by ssm_d_state, or RWKV's hidden state). Slot allocation is sequence-level rather than token-level.
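
Sequence-level allocation can be illustrated with a toy store that keeps one fixed-size state per sequence. The class `toy_recurrent_memory` and its update rule are hypothetical: the decay-and-add step is a placeholder, not a Mamba or RWKV recurrence. The point of the sketch is that memory use tracks the number of sequences, not the number of tokens.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Toy recurrent memory: one fixed-size state vector per sequence.
class toy_recurrent_memory {
public:
    toy_recurrent_memory(int32_t max_seqs, int32_t state_size)
        : n_seq_max(max_seqs), n_state(state_size) {
        assert(max_seqs > 0 && state_size > 0);
    }

    // Return the state for seq_id, allocating a zeroed slot on first use.
    // Returns nullptr when all sequence slots are taken.
    std::vector<float> * get_or_alloc(int32_t seq_id) {
        auto it = states.find(seq_id);
        if (it != states.end()) { return &it->second; }
        if ((int32_t) states.size() >= n_seq_max) { return nullptr; }
        return &states.emplace(seq_id, std::vector<float>(n_state, 0.0f)).first->second;
    }

    // Fold one token into the state in place (placeholder update rule).
    bool step(int32_t seq_id, float x) {
        std::vector<float> * s = get_or_alloc(seq_id);
        if (s == nullptr) { return false; }
        for (float & v : *s) { v = 0.5f * v + x; }
        return true;
    }

    // Free the slot held by seq_id, if any.
    void seq_rm(int32_t seq_id) { states.erase(seq_id); }

    // Memory use depends on the number of sequences, not on tokens seen.
    int64_t n_floats() const { return (int64_t) states.size() * n_state; }

private:
    int32_t n_seq_max;
    int32_t n_state;
    std::map<int32_t, std::vector<float>> states;
};
```

Feeding a hundred tokens into one sequence leaves `n_floats()` unchanged, whereas opening a second sequence doubles it. That is the opposite scaling behavior from a token-level KV cache.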

Hybrid

Some recent architectures alternate transformer attention layers with SSM layers (Jamba, Granite Hybrid). src/llama-memory-hybrid.cpp composes a llama_kv_cache for the attention layers with a llama_memory_recurrent for the SSM layers. The iswa variant adds sliding-window support on top.
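
The composition idea can be sketched as a wrapper that owns two child memories and forwards sequence operations to both. The interface `toy_memory`, the method `seq_open`, and the per-layer tag vector are all hypothetical simplifications; the actual interface in src/llama-memory.h is richer and is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical minimal interface shared by all toy memories.
struct toy_memory {
    virtual ~toy_memory() = default;
    virtual void seq_open(int32_t seq_id) = 0;
    virtual void seq_rm(int32_t seq_id) = 0;
    virtual void clear() = 0;
    virtual int32_t n_seqs() const = 0;
};

// Stand-in for any concrete backend: it only tracks which sequences are live.
struct toy_backend : toy_memory {
    std::set<int32_t> live;
    void seq_open(int32_t seq_id) override { live.insert(seq_id); }
    void seq_rm(int32_t seq_id) override { live.erase(seq_id); }
    void clear() override { live.clear(); }
    int32_t n_seqs() const override { return (int32_t) live.size(); }
};

// Hybrid memory: owns one attention cache and one recurrent store, and
// forwards every sequence operation to both so they stay in sync.
class toy_hybrid : public toy_memory {
public:
    explicit toy_hybrid(std::vector<bool> tags) : is_recr(std::move(tags)) {}

    // Pick the child that serves layer il.
    toy_memory & for_layer(int32_t il) {
        assert(il >= 0 && il < (int32_t) is_recr.size());
        if (is_recr[il]) { return recr; }
        return attn;
    }

    void seq_open(int32_t seq_id) override { attn.seq_open(seq_id); recr.seq_open(seq_id); }
    void seq_rm(int32_t seq_id) override { attn.seq_rm(seq_id); recr.seq_rm(seq_id); }
    void clear() override { attn.clear(); recr.clear(); }
    int32_t n_seqs() const override { return attn.n_seqs(); }

    toy_backend attn; // serves attention layers
    toy_backend recr; // serves recurrent layers

private:
    std::vector<bool> is_recr; // true where layer il is recurrent
};
```

Forwarding each operation to both children is the design choice that matters here: if a sequence were removed from only one child, the attention and recurrent layers would disagree about which sequences exist.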

Key abstractions

Type	Role	File
`llama_memory`	Polymorphic memory backend interface	`src/llama-memory.h`
`llama_memory_state`	Pre-decode "what slots will I touch" object	`src/llama-memory.h`
`llama_kv_cells`	Per-cell metadata for the standard KV cache	`src/llama-kv-cells.h`
`llama_kv_cache::slot_info`	Result of slot search	`src/llama-kv-cache.h`
`cparams.cache_type_k`, `cache_type_v`	Quantization type for KV	`src/llama-cparams.h`
`cparams.defrag_thold`	Auto-defrag threshold	`src/llama-cparams.h`

How a decode interacts with memory

sequenceDiagram
    participant Caller
    participant Ctx as llama_context
    participant Mem as llama_memory
    participant Graph as llama-graph
    participant GGML

    Caller->>Ctx: llama_decode(batch)
    Ctx->>Mem: apply(batch) -> memory_state
    Mem-->>Ctx: per-token slot mapping
    Ctx->>Graph: build_graph(arch, batch, memory_state)
    Graph->>GGML: schedule + compute
    GGML-->>Ctx: logits
    Ctx->>Mem: commit() (durable on success)

apply is reversible (it can be discarded if the graph build fails). commit makes the new slot assignments durable.
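
The two-phase pattern can be sketched as a small transactional cache. The class `toy_txn_cache` and the name `restore` for the discard step are assumptions of this sketch; only the apply/commit split comes from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy cache with a two-phase update: apply() stages slot assignments,
// commit() makes them durable, restore() throws them away.
class toy_txn_cache {
public:
    explicit toy_txn_cache(int32_t n_cells) : used(n_cells, false) {}

    // Stage n_tokens free cells. Returns false, staging nothing, if the
    // cache cannot hold the batch.
    bool apply(int32_t n_tokens) {
        assert(pending.empty()); // one batch in flight at a time
        std::vector<int32_t> found;
        for (int32_t i = 0; i < (int32_t) used.size() && (int32_t) found.size() < n_tokens; ++i) {
            if (!used[i]) { found.push_back(i); }
        }
        if ((int32_t) found.size() < n_tokens) { return false; }
        for (int32_t i : found) { used[i] = true; }
        pending = found;
        return true;
    }

    // Keep the staged cells: nothing is left to undo.
    void commit() { pending.clear(); }

    // Release the staged cells, e.g. after a failed graph build or compute.
    void restore() {
        for (int32_t i : pending) { used[i] = false; }
        pending.clear();
    }

    int32_t n_used() const {
        int32_t n = 0;
        for (bool u : used) { n += u ? 1 : 0; }
        return n;
    }

private:
    std::vector<bool> used;        // occupancy flag per cell
    std::vector<int32_t> pending;  // cells staged by the last apply()
};
```

A caller would wrap each decode as apply, then compute, then either commit on success or restore on failure. A failed batch therefore never leaves half-assigned cells behind.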

Integration points

llama-context picks the memory implementation at init time based on model->arch and hparams.
Server slots. Each tools/server slot owns a seq_id and uses llama_kv_self_seq_keep / llama_kv_self_seq_rm aggressively to manage prompt caching.
Speculative decoding. Verification swaps draft and target sequences in and out of the cache via llama_kv_self_seq_*.
State save/load. The llama_state_* functions serialize the context state, including the KV cache, and the llama_state_seq_* subset does the same for a single sequence, so it can be reloaded later (used by examples/save-load-state/).
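
The prompt-caching use of the sequence operations can be sketched as a prefix comparison. The helpers below are hypothetical and do not reproduce the server's actual logic; in particular, the rule of always re-decoding at least one token is an assumption made so that the sketch always produces fresh logits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using toy_token = int32_t;

// Length of the shared prefix between the cached tokens and a new prompt.
size_t toy_common_prefix(const std::vector<toy_token> & cached,
                         const std::vector<toy_token> & prompt) {
    size_t n = 0;
    while (n < cached.size() && n < prompt.size() && cached[n] == prompt[n]) { ++n; }
    return n;
}

// Plan for reusing a slot: keep positions [0, n_keep), drop the rest of the
// cached sequence, and decode only the tail of the new prompt.
struct toy_reuse_plan {
    size_t n_keep;
    size_t n_drop;
    size_t n_decode;
};

toy_reuse_plan toy_plan_reuse(const std::vector<toy_token> & cached,
                              const std::vector<toy_token> & prompt) {
    size_t n_keep = toy_common_prefix(cached, prompt);
    // Assumption: re-decode at least one token so new logits are produced.
    if (n_keep == prompt.size() && n_keep > 0) { --n_keep; }
    assert(n_keep <= cached.size() && n_keep <= prompt.size());
    return { n_keep, cached.size() - n_keep, prompt.size() - n_keep };
}
```

In terms of the functions named above, the drop step would correspond to a llama_kv_self_seq_rm call covering positions from n_keep to the end of the slot's sequence, after which only the last n_decode prompt tokens need to be submitted.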

Entry points for modification

New cache layout (e.g. block-sparse). Implement llama_memory and have llama-context select it for the relevant architectures.
New KV quantization type. Extend ggml_type and ensure backends support it; the cache itself is already type-agnostic.
Eviction policy. Look at the slot-search functions in src/llama-kv-cache.cpp — they're the natural place to plug in alternative policies.
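
As an illustration of the eviction entry point, a pluggable policy could take the shape below. None of these names exist in llama.cpp; `toy_seq_info`, `toy_evict_policy`, and the least-recently-used rule are hypothetical choices for the sketch, which frees whole sequences until enough cells are available.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>

// Per-sequence bookkeeping a policy might look at (hypothetical fields).
struct toy_seq_info {
    int32_t n_cells;    // cells currently held by the sequence
    int64_t last_used;  // logical timestamp of the last decode
};

using toy_seq_table = std::map<int32_t, toy_seq_info>;

// A policy picks the sequence to evict, or returns -1 to refuse.
using toy_evict_policy = std::function<int32_t(const toy_seq_table &)>;

// Least-recently-used policy: evict the sequence with the oldest timestamp.
int32_t toy_policy_lru(const toy_seq_table & seqs) {
    int32_t victim = -1;
    int64_t oldest = 0;
    for (const auto & kv : seqs) {
        if (victim < 0 || kv.second.last_used < oldest) {
            victim = kv.first;
            oldest = kv.second.last_used;
        }
    }
    return victim;
}

// Evict sequences until at least n_needed cells are free.
// Returns the number of free cells afterwards.
int32_t toy_make_room(toy_seq_table & seqs, int32_t n_free, int32_t n_needed,
                      const toy_evict_policy & policy) {
    assert(n_free >= 0 && n_needed >= 0);
    while (n_free < n_needed) {
        const int32_t victim = policy(seqs);
        if (victim < 0) { break; } // nothing left to evict
        n_free += seqs.at(victim).n_cells;
        seqs.erase(victim);
    }
    return n_free;
}
```

Passing the policy as a callable keeps the slot search itself unchanged, so an alternative rule (for example, evicting the largest sequence first) can be swapped in without touching the loop.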

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.