ggml-org/llama.cpp

KV cache and memory

Active contributors: Georgi Gerganov

For long contexts, the memory cost of transformer inference is often dominated by the attention layers' key/value (KV) cache. llama.cpp factors that cost behind a llama_memory interface that the graph builder reads from and writes to. Four implementations cover standard attention, sliding-window attention, recurrent state-space models, and hybrid layouts.

Purpose

  • Keep per-sequence key/value tensors across llama_decode calls.
  • Manage slot allocation, eviction, and defragmentation.
  • Hide the difference between standard transformers, SWA models, SSMs, and hybrids from the graph code.

The four implementations

| Implementation | Used by | File |
|---|---|---|
| llama_kv_cache | Standard transformers (LLaMA, Qwen, Mistral, Phi, ...) | src/llama-kv-cache.cpp |
| llama_kv_cache_iswa | Models with interleaved sliding-window attention (Gemma 2/3, Phi-3, some Qwen variants) | src/llama-kv-cache-iswa.cpp |
| llama_memory_recurrent | Pure SSMs (Mamba, Mamba2, RWKV-6, RWKV-7, FalconMamba) | src/llama-memory-recurrent.cpp |
| llama_memory_hybrid (+ iswa variant) | Hybrid layouts (Jamba, Granite Hybrid, BailingMoeV2) | src/llama-memory-hybrid.cpp, src/llama-memory-hybrid-iswa.cpp |

All four implement the same llama_memory interface declared in src/llama-memory.h.
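The idea of one interface with several backends can be pictured with a minimal sketch. Everything below is hypothetical and heavily simplified: memory_sketch, kv_cache_sketch, recurrent_sketch, arch_kind, and make_memory are illustrative names, not the declarations in src/llama-memory.h.

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>

// Hypothetical interface: callers only ever see this base type.
struct memory_sketch {
    virtual ~memory_sketch() = default;
    virtual std::string kind() const = 0;
    virtual void add_seq(int seq_id) = 0;
    virtual bool seq_rm(int seq_id) = 0; // true if the sequence existed
    virtual void clear() = 0;
};

// Shared toy bookkeeping: remember which sequences hold state.
struct seq_set_memory : memory_sketch {
    std::set<int> seqs;
    void add_seq(int seq_id) override {
        assert(seq_id >= 0); // sequence ids are non-negative in this sketch
        seqs.insert(seq_id);
    }
    bool seq_rm(int seq_id) override { return seqs.erase(seq_id) > 0; }
    void clear() override { seqs.clear(); }
};

struct kv_cache_sketch : seq_set_memory {
    std::string kind() const override { return "kv"; }
};

struct recurrent_sketch : seq_set_memory {
    std::string kind() const override { return "recurrent"; }
};

enum class arch_kind { transformer, ssm };

// The backend is chosen once, up front; afterwards only the base pointer is used.
std::unique_ptr<memory_sketch> make_memory(arch_kind arch) {
    if (arch == arch_kind::ssm) {
        return std::make_unique<recurrent_sketch>();
    }
    return std::make_unique<kv_cache_sketch>();
}
```

With this shape, code that removes or clears sequences never needs to know which backend it is talking to.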

Standard KV cache

src/llama-kv-cache.cpp is the reference implementation. It tracks per-token cells (llama_kv_cells, defined in src/llama-kv-cells.h); each cell records a position and the sequence ids that reference it, and the cell's index locates the corresponding rows in the K/V tensor buffers. The cache is contiguous in memory but logically partitioned by sequence.
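A toy version of the cell bookkeeping and slot search might look like the following. It assumes a cell may be shared by several sequences and that a slot is simply the first run of empty cells that is long enough; cell_sketch, find_slot, and fill_slot are hypothetical names, and the real search in src/llama-kv-cache.cpp is more involved.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical per-cell metadata: a position plus the sequences that use it.
struct cell_sketch {
    int32_t pos = -1;          // -1 marks an empty cell
    std::set<int32_t> seq_ids; // a cell may be shared by several sequences
    bool empty() const { return seq_ids.empty(); }
};

// Return the index of the first run of n_tokens empty cells, or -1 if none.
int find_slot(const std::vector<cell_sketch> & cells, int n_tokens) {
    assert(n_tokens > 0);
    int run = 0;
    for (int i = 0; i < (int) cells.size(); ++i) {
        run = cells[i].empty() ? run + 1 : 0;
        if (run == n_tokens) {
            return i - n_tokens + 1;
        }
    }
    return -1;
}

// Mark cells [head, head + n) as holding consecutive positions of seq_id.
void fill_slot(std::vector<cell_sketch> & cells, int head, int n,
               int32_t seq_id, int32_t pos0) {
    assert(head >= 0 && head + n <= (int) cells.size());
    for (int i = 0; i < n; ++i) {
        cells[head + i].pos = pos0 + i;
        cells[head + i].seq_ids.insert(seq_id);
    }
}
```

The cell index returned by find_slot doubles as the row offset into the K/V buffers, which is why no separate pointer is stored in this sketch.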

graph LR
    Decode[llama_decode batch] --> Apply[memory.apply]
    Apply --> Find[find or allocate slots per token]
    Find --> Build[graph builder reads/writes K/V at slot]
    Build --> Compute[ggml backend computes]
    Compute --> Commit[memory.commit]
    Commit --> Done[slots are durable]

Operations exposed to users (declared in include/llama.h):

  • llama_kv_self_seq_* family — copy, divide, shift, remove sequences.
  • llama_kv_self_clear — wipe everything.
  • llama_kv_self_defrag — compact fragmented slots.
  • llama_kv_self_seq_keep — drop everything except a given sequence (used by the server when a slot is reused).
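To make the effect of these operations concrete, here is a sketch of remove, copy, and keep on toy cell metadata. The free functions below are hypothetical stand-ins that operate on a plain vector; they are not the llama.h functions and ignore the K/V tensors entirely.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical cell: a position plus the set of sequences that reference it.
struct cell_sketch {
    int32_t pos = -1;
    std::set<int32_t> seq_ids;
};
using cells_t = std::vector<cell_sketch>;

// Detach seq_id from every cell with a position in [p0, p1).
void seq_rm(cells_t & cells, int32_t seq_id, int32_t p0, int32_t p1) {
    assert(p0 <= p1);
    for (auto & c : cells) {
        if (c.pos < p0 || c.pos >= p1) continue;
        c.seq_ids.erase(seq_id);
        if (c.seq_ids.empty()) c.pos = -1; // the cell becomes free
    }
}

// Let dst share every cell of src instead of duplicating data.
void seq_cp(cells_t & cells, int32_t src, int32_t dst) {
    for (auto & c : cells) {
        if (c.seq_ids.count(src)) c.seq_ids.insert(dst);
    }
}

// Drop every sequence except seq_id.
void seq_keep(cells_t & cells, int32_t seq_id) {
    for (auto & c : cells) {
        const bool keep = c.seq_ids.count(seq_id) > 0;
        c.seq_ids.clear();
        if (keep) c.seq_ids.insert(seq_id);
        else      c.pos = -1;
    }
}

// Number of cells referenced by at least one sequence.
int n_used(const cells_t & cells) {
    int n = 0;
    for (const auto & c : cells) n += c.seq_ids.empty() ? 0 : 1;
    return n;
}
```

In this model a copy is cheap because it only adds a sequence id to existing cells, and a cell is freed only when its last sequence id is removed.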

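cparams.cache_type_k / cparams.cache_type_v (-ctk, -ctv in tools) select the storage type of the cache itself: f16 (default), bf16, q8_0, q4_0, etc. Relative to f16, q8_0 roughly halves cache memory (8.5 vs. 16 bits per element) and q4_0 cuts it to a bit over a quarter (4.5 bits per element), at a measurable quality cost that is small for q8_0 and grows as the bit width shrinks.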
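The memory effect of the cache type can be estimated with simple arithmetic. The sketch below assumes the usual ggml block layouts (q8_0 and q4_0 store one 2-byte scale per block of 32 values) and ignores padding and any per-tensor overhead; cache_type, bytes_per_32, and kv_cache_bytes are hypothetical names used only for this estimate.

```cpp
#include <cassert>
#include <cstdint>

enum class cache_type { f16, bf16, q8_0, q4_0 };

// Bytes occupied by 32 consecutive elements of each type.
uint64_t bytes_per_32(cache_type t) {
    switch (t) {
        case cache_type::f16:  return 64; // 32 * 2 bytes
        case cache_type::bf16: return 64; // 32 * 2 bytes
        case cache_type::q8_0: return 34; // 32 * 1 byte + 2-byte scale
        case cache_type::q4_0: return 18; // 32 * 4 bits + 2-byte scale
    }
    return 0;
}

// Estimated K + V bytes for n_layer layers and n_ctx cached tokens.
// n_embd_k / n_embd_v are the per-token K and V row widths of one layer.
uint64_t kv_cache_bytes(uint64_t n_layer, uint64_t n_ctx,
                        uint64_t n_embd_k, uint64_t n_embd_v,
                        cache_type type_k, cache_type type_v) {
    assert(n_embd_k % 32 == 0 && n_embd_v % 32 == 0);
    const uint64_t k = n_embd_k / 32 * bytes_per_32(type_k);
    const uint64_t v = n_embd_v / 32 * bytes_per_32(type_v);
    return n_layer * n_ctx * (k + v);
}
```

For a hypothetical model with 32 layers and 1024-wide K and V rows at a context of 8192 tokens, this gives 1 GiB for f16, 544 MiB for q8_0, and 288 MiB for q4_0.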

Defragmentation

After many llama_kv_self_seq_* operations, the cache can become fragmented, with free cells scattered between used ones. llama_kv_self_defrag and the auto-defrag policy controlled by cparams.defrag_thold rearrange cells to keep large contiguous slots available.
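A threshold-driven compaction can be sketched as follows. This is a toy model under stated assumptions: fragmentation is measured as the fraction of holes below the highest used cell, a threshold of zero or less disables the policy, and only metadata is moved (a real implementation also has to move the matching K/V rows). The names fragmentation, defrag, and maybe_defrag are hypothetical.

```cpp
#include <cassert>
#include <vector>

using cells_t = std::vector<int>; // -1 = empty cell, otherwise a token id

// Fraction of empty cells below the highest used cell (assumed metric).
float fragmentation(const cells_t & cells) {
    int last = -1, used = 0;
    for (int i = 0; i < (int) cells.size(); ++i) {
        if (cells[i] >= 0) { last = i; ++used; }
    }
    return last < 0 ? 0.0f : 1.0f - float(used) / float(last + 1);
}

// Move used cells toward the front, preserving order; returns moves made.
int defrag(cells_t & cells) {
    int dst = 0, moves = 0;
    for (int src = 0; src < (int) cells.size(); ++src) {
        if (cells[src] < 0) continue;
        if (src != dst) {
            assert(cells[dst] < 0); // the destination is always a hole
            cells[dst] = cells[src];
            cells[src] = -1;
            ++moves;
        }
        ++dst;
    }
    return moves;
}

// Compact only when fragmentation exceeds the threshold.
bool maybe_defrag(cells_t & cells, float thold) {
    if (thold <= 0.0f) return false; // policy disabled
    if (fragmentation(cells) <= thold) return false;
    defrag(cells);
    return true;
}
```

The threshold trades compaction work against the risk that a later slot search fails even though enough free cells exist in total.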

Sliding-window (iSWA)

src/llama-kv-cache-iswa.cpp wraps two underlying caches: one for "normal" (full-context) layers and one for "sliding-window" layers. Models like Gemma 2/3 alternate between the two and only need a window-sized cache for SWA layers, so the memory for those layers stays bounded by the window size instead of growing with the context length.

The iswa variant tags layers as full vs windowed using hparams.is_swa(layer). The graph builder asks the memory state which kind of slot to use per layer.
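The saving from splitting layers into two caches can be estimated by counting cells per layer. The sketch below is an approximation that ignores padding and batch headroom; iswa_plan and plan_iswa are hypothetical names, and the is_swa callback stands in for whatever per-layer predicate the model provides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>

struct iswa_plan {
    uint64_t base_cells; // cells summed over full-context layers
    uint64_t swa_cells;  // cells summed over sliding-window layers
};

// Full layers need n_ctx cells each; windowed layers need at most n_swa.
iswa_plan plan_iswa(int n_layer, uint64_t n_ctx, uint64_t n_swa,
                    const std::function<bool(int)> & is_swa) {
    assert(n_layer > 0 && n_swa > 0);
    iswa_plan p{0, 0};
    for (int il = 0; il < n_layer; ++il) {
        if (is_swa(il)) p.swa_cells  += std::min(n_ctx, n_swa);
        else            p.base_cells += n_ctx;
    }
    return p;
}
```

For example, with six layers that alternate between windowed and full attention, a 32768-token context, and a 4096-token window, the plan needs 98304 + 12288 cells instead of the 196608 a single full-context cache would use.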

Recurrent (SSM)

src/llama-memory-recurrent.cpp serves pure SSMs, which keep no per-token K/V. Instead it stores fixed-size recurrent state tensors per sequence (for example Mamba's convolution and SSM states, whose size depends on hparams such as ssm_d_state, or RWKV's hidden state). Slot allocation is therefore per sequence rather than per token.
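Sequence-level allocation can be pictured as a small pool of fixed-size state slots. This is a sketch under simplifying assumptions (one flat float vector per slot, a plain free list); recurrent_sketch and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

struct recurrent_sketch {
    std::vector<std::vector<float>> states; // one fixed-size state per slot
    std::map<int, int> slot_of;             // seq id -> slot index
    std::vector<int> free_slots;            // stack of unused slot indices

    recurrent_sketch(int n_slots, size_t state_size)
        : states(n_slots, std::vector<float>(state_size, 0.0f)) {
        assert(n_slots > 0);
        for (int i = n_slots - 1; i >= 0; --i) free_slots.push_back(i);
    }

    // Return the slot for seq_id, allocating one on first use; -1 if full.
    int slot_for(int seq_id) {
        auto it = slot_of.find(seq_id);
        if (it != slot_of.end()) return it->second;
        if (free_slots.empty()) return -1;
        const int slot = free_slots.back();
        free_slots.pop_back();
        slot_of[seq_id] = slot;
        return slot;
    }

    // Release the slot and reset its state so the next owner starts clean.
    void seq_rm(int seq_id) {
        auto it = slot_of.find(seq_id);
        if (it == slot_of.end()) return;
        states[it->second].assign(states[it->second].size(), 0.0f);
        free_slots.push_back(it->second);
        slot_of.erase(it);
    }
};
```

Unlike a token-level cache, the memory used here does not grow with the number of tokens processed; it depends only on the number of slots and the state size.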

Hybrid

Some recent architectures alternate transformer attention layers with SSM layers (Jamba, Granite Hybrid). src/llama-memory-hybrid.cpp composes a llama_kv_cache for the attention layers with a llama_memory_recurrent for the SSM layers. The iswa variant adds sliding-window support on top.
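The composition can be sketched as one object that owns both parts and forwards sequence operations to each. The types below are hypothetical toys: the attention part only counts cached tokens, the recurrent part only remembers which sequences own a state, and the layer layout is supplied by the caller.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <utility>
#include <vector>

enum class layer_kind { attention, recurrent };

struct attn_part_sketch {
    std::map<int, int> n_cells; // seq id -> cached tokens (grows with context)
    void store(int seq, int n_tokens) { n_cells[seq] += n_tokens; }
    void seq_rm(int seq) { n_cells.erase(seq); }
};

struct recr_part_sketch {
    std::set<int> seqs;         // seq ids that own a fixed-size state
    void store(int seq) { seqs.insert(seq); }
    void seq_rm(int seq) { seqs.erase(seq); }
};

struct hybrid_sketch {
    std::vector<layer_kind> layers;
    attn_part_sketch attn;
    recr_part_sketch recr;

    explicit hybrid_sketch(std::vector<layer_kind> l) : layers(std::move(l)) {}

    // Which part serves a given layer.
    layer_kind kind_of(int il) const {
        assert(il >= 0 && il < (int) layers.size());
        return layers[il];
    }

    // Every sequence operation is forwarded to both parts.
    void store(int seq, int n_tokens) { attn.store(seq, n_tokens); recr.store(seq); }
    void seq_rm(int seq) { attn.seq_rm(seq); recr.seq_rm(seq); }
};
```

Forwarding each operation to both parts keeps them consistent: a sequence never exists in the attention cache without a matching recurrent state, or the other way around.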

Key abstractions

| Type | Role | File |
|---|---|---|
| llama_memory | Polymorphic memory backend interface | src/llama-memory.h |
| llama_memory_state | Pre-decode "what slots will I touch" object | src/llama-memory.h |
| llama_kv_cells | Per-cell metadata for the standard KV cache | src/llama-kv-cells.h |
| llama_kv_cache::slot_info | Result of slot search | src/llama-kv-cache.h |
| cparams.cache_type_k, cache_type_v | Quantization type for KV | src/llama-cparams.h |
| cparams.defrag_thold | Auto-defrag threshold | src/llama-cparams.h |

How a decode interacts with memory

sequenceDiagram
    participant Caller
    participant Ctx as llama_context
    participant Mem as llama_memory
    participant Graph as llama-graph
    participant GGML

    Caller->>Ctx: llama_decode(batch)
    Ctx->>Mem: apply(batch) -> memory_state
    Mem-->>Ctx: per-token slot mapping
    Ctx->>Graph: build_graph(arch, batch, memory_state)
    Graph->>GGML: schedule + compute
    GGML-->>Ctx: logits
    Ctx->>Mem: commit() (durable on success)

apply is reversible (it can be discarded if the graph build fails). commit makes the new slot assignments durable.
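The reserve-then-commit pattern described here can be illustrated with a toy cache. This is a sketch of the general two-phase idea only; two_phase_cache and its apply, commit, and discard members are hypothetical and do not mirror the real signatures.

```cpp
#include <cassert>
#include <vector>

struct two_phase_cache {
    std::vector<int> cells;   // -1 = empty, otherwise the owning seq id
    std::vector<int> pending; // cell indices reserved by apply()
    int pending_seq = -1;

    explicit two_phase_cache(int n) : cells(n, -1) {}

    // Reserve n free cells for seq without making the assignment durable.
    bool apply(int seq, int n) {
        assert(pending.empty()); // one reservation at a time in this sketch
        for (int i = 0; i < (int) cells.size() && (int) pending.size() < n; ++i) {
            if (cells[i] < 0) pending.push_back(i);
        }
        if ((int) pending.size() < n) { pending.clear(); return false; }
        pending_seq = seq;
        return true;
    }

    // Make the reservation durable (call after a successful compute).
    void commit() {
        for (int i : pending) cells[i] = pending_seq;
        pending.clear();
    }

    // Drop the reservation (call if the graph build or compute fails).
    void discard() { pending.clear(); }

    int used() const {
        int n = 0;
        for (int c : cells) n += c >= 0 ? 1 : 0;
        return n;
    }
};
```

Because apply only records which cells it intends to use, discarding a failed decode leaves the durable cell state exactly as it was.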

Integration points

  • Context initialization. llama-context picks the memory implementation at init time based on model->arch and hparams.
  • Server slots. Each tools/server slot owns a seq_id and uses llama_kv_self_seq_keep / llama_kv_self_seq_rm aggressively to manage prompt caching.
  • Speculative decoding. Verification swaps draft and target sequences in and out of the cache via llama_kv_self_seq_*.
  • State save/load. llama_state_* serializes the context's KV state, and the llama_state_seq_* variants serialize a single sequence, so it can be reloaded later (used by examples/save-load-state/).
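The prompt-caching use of sequence removal can be pictured as a prefix comparison: keep the cached cells that match the start of the new prompt, remove the rest, and decode only the remaining tokens. The sketch below is a simplified model with hypothetical names (tokens_t, common_prefix, reuse_plan, plan_reuse); it is not the server's code, and it ignores that a real server has to decode at least one token to obtain logits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using tokens_t = std::vector<int>;

// Length of the longest common prefix of two token sequences.
size_t common_prefix(const tokens_t & a, const tokens_t & b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
    return n;
}

struct reuse_plan {
    size_t n_keep;   // cached positions [0, n_keep) stay in the cache
    size_t n_remove; // cached positions from n_keep onward are removed
    size_t n_eval;   // new prompt tokens that still need decoding
};

// Decide what to keep, what to remove, and what to decode for a reused slot.
reuse_plan plan_reuse(const tokens_t & cached, const tokens_t & prompt) {
    const size_t n_keep = common_prefix(cached, prompt);
    assert(n_keep <= cached.size() && n_keep <= prompt.size());
    return { n_keep, cached.size() - n_keep, prompt.size() - n_keep };
}
```

For a cached sequence {1, 2, 3, 4, 5} and a new prompt {1, 2, 3, 9, 10, 11}, the plan keeps three positions, removes two, and decodes three new tokens.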

Entry points for modification

  • New cache layout (e.g. block-sparse). Implement llama_memory and have llama-context select it for the relevant architectures.
  • New KV quantization type. Extend ggml_type and ensure backends support it; the cache itself is already type-agnostic.
  • Eviction policy. Look at the slot-search functions in src/llama-kv-cache.cpp — they're the natural place to plug in alternative policies.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
