ggml-org/llama.cpp
Multimodal (mtmd)
Active contributors: Xuan-Son Nguyen
The tools/mtmd/ directory holds llama.cpp's multimodal stack. It includes a CLIP-derived vision encoder, an audio encoder, the llama-mtmd-cli binary, and the helpers used by tools/server to serve image/audio inputs over HTTP.
User-facing docs: tools/mtmd/README.md, docs/multimodal.md, docs/multimodal/ (per-model guides).
What "mtmd" means
"mtmd" stands for "multi-modal data". It is the successor to the older llava standalone tool, which lives on under tools/mtmd/legacy-models/ for compatibility. The current code supports vision-language models (LLaVA 1.5/1.6, MobileVLM, Qwen2-VL, MiniCPM-V, Gemma-3, GLM-EDGE, Moondream, LFM2-VL, ...) and audio-language flows.
Directory layout
tools/mtmd/
├── mtmd.h, mtmd.cpp # Public C API + main implementation (~60 KB)
├── mtmd-cli.cpp # llama-mtmd-cli binary
├── mtmd-helper.h, mtmd-helper.cpp # Higher-level helpers used by the server
├── mtmd-image.h, mtmd-image.cpp # Image preprocessing (resize, tile, normalize)
├── mtmd-audio.h, mtmd-audio.cpp # Audio preprocessing (mel spectrogram, STFT)
├── clip.h, clip.cpp # The vision encoder itself (~188 KB)
├── clip-graph.h # Vision graph builder (analogous to src/llama-graph.h)
├── clip-impl.h # Internal types
├── clip-model.h # Vision model variants enumeration
├── deprecation-warning.cpp # Old `llava` binary forwarder
├── debug/ # Diagnostic helpers
├── legacy-models/ # Older standalone llava / minicpmv binaries
├── models/ # Per-vision-model conversion + quirks
├── tests/ # Test fixtures
└── tests.sh # Shell-driven end-to-end testComponents
CLIP-style vision encoder
clip.cpp is a self-contained encoder that loads a vision-only GGUF (often called an "mmproj" — "multimodal projector"), preprocesses an image, runs the encoder graph, and produces a sequence of vision tokens that get spliced into the text model's prompt embeddings. It supports many CLIP-derived variants (LLaVA, SigLIP, ViT, GLM-EDGE-Vision, Qwen2-VL, ...). The graph builder in clip-graph.h mirrors the structure of src/llama-graph.h but for vision blocks.
Audio encoder
mtmd-audio.cpp adds a Whisper-style encoder path. It runs STFT, mel-spectrogram, and a transformer encoder to produce audio tokens. The same general "encoder produces tokens, text model consumes them" flow as vision.
High-level API
mtmd.h exposes a small public API:
| Function | Purpose |
|---|---|
mtmd_init_from_file |
Load an mmproj GGUF |
mtmd_tokenize |
Run an image/audio through the encoder, get tokens |
mtmd_input_chunks_* |
Walk per-piece chunks (interleave text with media) |
mtmd_helper_eval_chunks |
Convenience: tokenize and decode a full mixed input |
mtmd_get_output_embd |
Retrieve the vision token embeddings as raw floats |
mtmd-helper.cpp wraps these for the common case of "decode a prompt that contains text and images."
llama-mtmd-cli
A single binary that mirrors llama-cli but accepts --image path / --audio path flags. Source: mtmd-cli.cpp.
Server integration
tools/server/server-context.cpp consumes mtmd-helper.h to handle image/audio in OpenAI-compatible chat requests (messages[].content arrays with image_url / input_audio). The server holds an mtmd_context alongside its llama_context and feeds vision/audio tokens into the same slot scheduler.
How a vision request flows
sequenceDiagram
participant Client
participant Srv as llama-server
participant Help as mtmd-helper
participant Clip as clip.cpp
participant Llama as libllama
Client->>Srv: POST /v1/chat/completions {image_url, text}
Srv->>Help: prepare chunks (text + image)
Help->>Clip: tokenize image
Clip-->>Help: vision tokens
Help->>Llama: llama_decode (text + vision tokens)
Llama-->>Srv: text tokens
Srv-->>Client: SSE streamIntegration points
libllama— vision/audio tokens are spliced intollama_decodeas ordinary embeddings.tools/server/— heaviest consumer; wrapsmtmd-helperper request.- Conversion. Vision encoders are converted to mmproj GGUFs by per-model scripts (under
tools/mtmd/models/and the wider model-conversion pipeline). - Tests.
tools/mtmd/tests.shrunsllama-mtmd-cliagainst checked-intest-1.jpegandtest-2.mp3.
Entry points for modification
- New vision model. Add a converter script under
tools/mtmd/models/, updateclip-model.h, add quirks toclip.cpp's graph builder. The legacytools/mtmd/legacy-models/is a reference for the simpler LLaVA-style flow. - New audio path.
mtmd-audio.cppis the place; it is intentionally Whisper-style today. - New mtmd API. Edit
mtmd.hplusmtmd.cpp; updatemtmd-helper.cppand the server consumer.
Tests
tools/mtmd/tests/— C++ tests formtmd.cppand helpers.tools/mtmd/tests.sh— end-to-end driver.tests/test-mtmd-c-api.cpp— public C API smoke test.
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
Previous
llama-bench
Next
Other tools