ggml-org/llama.cpp

Multimodal (mtmd)

Active contributors: Xuan-Son Nguyen

The tools/mtmd/ directory holds llama.cpp's multimodal stack. It includes a CLIP-derived vision encoder, an audio encoder, the llama-mtmd-cli binary, and the helpers used by tools/server to serve image/audio inputs over HTTP.

User-facing docs: tools/mtmd/README.md, docs/multimodal.md, docs/multimodal/ (per-model guides).

What "mtmd" means

"mtmd" stands for "multi-modal data". It is the successor to the older llava standalone tool, which lives on under tools/mtmd/legacy-models/ for compatibility. The current code supports vision-language models (LLaVA 1.5/1.6, MobileVLM, Qwen2-VL, MiniCPM-V, Gemma-3, GLM-EDGE, Moondream, LFM2-VL, ...) and audio-language flows.

Directory layout

tools/mtmd/
├── mtmd.h, mtmd.cpp            # Public C API + main implementation (~60 KB)
├── mtmd-cli.cpp                # llama-mtmd-cli binary
├── mtmd-helper.h, mtmd-helper.cpp  # Higher-level helpers used by the server
├── mtmd-image.h, mtmd-image.cpp    # Image preprocessing (resize, tile, normalize)
├── mtmd-audio.h, mtmd-audio.cpp    # Audio preprocessing (mel spectrogram, STFT)
├── clip.h, clip.cpp            # The vision encoder itself (~188 KB)
├── clip-graph.h                # Vision graph builder (analogous to src/llama-graph.h)
├── clip-impl.h                 # Internal types
├── clip-model.h                # Vision model variants enumeration
├── deprecation-warning.cpp     # Old `llava` binary forwarder
├── debug/                      # Diagnostic helpers
├── legacy-models/              # Older standalone llava / minicpmv binaries
├── models/                     # Per-vision-model conversion + quirks
├── tests/                      # Test fixtures
└── tests.sh                    # Shell-driven end-to-end test

Components

CLIP-style vision encoder

clip.cpp is a self-contained encoder that loads a vision-only GGUF (often called an "mmproj" — "multimodal projector"), preprocesses an image, runs the encoder graph, and produces a sequence of vision tokens that get spliced into the text model's prompt embeddings. It supports many CLIP-derived variants (LLaVA, SigLIP, ViT, GLM-EDGE-Vision, Qwen2-VL, ...). The graph builder in clip-graph.h mirrors the structure of src/llama-graph.h but for vision blocks.
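The preprocessing step described above can be sketched in a few lines. This is a minimal illustration, not the clip.cpp implementation: normalize_rgb and n_patches are hypothetical helper names, and the mean/std values a real model uses come from its mmproj metadata rather than from this code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of clip.cpp): converts interleaved 8-bit RGB
// pixels to floats and applies per-channel (x / 255 - mean) / std.
std::vector<float> normalize_rgb(const std::vector<uint8_t> & pixels,
                                 const float mean[3], const float stddev[3]) {
    assert(pixels.size() % 3 == 0); // expects interleaved R,G,B triples
    std::vector<float> out(pixels.size());
    for (size_t i = 0; i < pixels.size(); ++i) {
        const size_t c = i % 3; // channel index: 0 = R, 1 = G, 2 = B
        out[i] = (pixels[i] / 255.0f - mean[c]) / stddev[c];
    }
    return out;
}

// Hypothetical helper: number of patch embeddings a plain ViT produces for a
// square image, before any model-specific merging or projector downsampling.
int n_patches(int image_size, int patch_size) {
    assert(patch_size > 0 && image_size % patch_size == 0);
    const int per_side = image_size / patch_size;
    return per_side * per_side;
}
```

For example, a 336x336 input with 14x14 patches yields 24 patches per side, so n_patches(336, 14) is 576. Models that tile the image or merge patches will produce a different token count.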

Audio encoder

mtmd-audio.cpp adds a Whisper-style encoder path. It runs STFT, mel-spectrogram, and a transformer encoder to produce audio tokens, following the same general flow as vision: the encoder produces tokens and the text model consumes them.
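To make the STFT step concrete, here is a minimal sketch of the framing arithmetic that precedes a mel spectrogram. The helper names hann_window and n_frames are hypothetical and are not taken from mtmd-audio.cpp; the real code may pad the signal or use different parameters.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: periodic Hann window of length n, a common choice
// for weighting each frame before the FFT.
std::vector<float> hann_window(size_t n) {
    const double pi = std::acos(-1.0);
    std::vector<float> w(n);
    for (size_t i = 0; i < n; ++i) {
        w[i] = static_cast<float>(0.5 * (1.0 - std::cos(2.0 * pi * i / n)));
    }
    return w;
}

// Hypothetical helper: number of full frames when sliding a window of
// frame_len samples by hop samples, with no padding at either end.
size_t n_frames(size_t n_samples, size_t frame_len, size_t hop) {
    assert(frame_len > 0 && hop > 0);
    if (n_samples < frame_len) {
        return 0; // not enough samples for even one frame
    }
    return 1 + (n_samples - frame_len) / hop;
}
```

With the reference Whisper front-end settings (16 kHz audio, 400-sample window, 160-sample hop), one second of audio gives n_frames(16000, 400, 160) = 98 unpadded frames. Whether mtmd-audio.cpp uses exactly these values is an assumption here.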

High-level API

mtmd.h exposes a small public API; the convenience helper is declared in mtmd-helper.h:

Function                   Purpose
mtmd_init_from_file        Load an mmproj GGUF
mtmd_tokenize              Split a prompt containing media markers into text and image/audio chunks
mtmd_input_chunks_*        Walk per-piece chunks (interleave text with media)
mtmd_helper_eval_chunks    Convenience (mtmd-helper.h): encode and decode every chunk of a mixed input
mtmd_get_output_embd       Retrieve the embeddings from the last encoder pass as raw floats

mtmd-helper.cpp wraps these for the common case of "decode a prompt that contains text and images."
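The chunk model behind these functions can be illustrated without linking against the library. The sketch below is a simplified stand-in, not the mtmd implementation: Chunk, ChunkType, and split_chunks are hypothetical names, and the marker string is passed in by the caller rather than taken from the library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum class ChunkType { Text, Media };

// Hypothetical stand-in for an input chunk: either a run of text or a
// placeholder for one image/audio input.
struct Chunk {
    ChunkType   type;
    std::string text; // empty for media chunks
};

// Splits a prompt on every occurrence of `marker`, producing alternating
// text and media chunks in prompt order. Empty text runs are skipped.
std::vector<Chunk> split_chunks(const std::string & prompt, const std::string & marker) {
    assert(!marker.empty());
    std::vector<Chunk> chunks;
    size_t pos = 0;
    while (true) {
        const size_t hit = prompt.find(marker, pos);
        const size_t end = (hit == std::string::npos) ? prompt.size() : hit;
        if (end > pos) {
            chunks.push_back({ChunkType::Text, prompt.substr(pos, end - pos)});
        }
        if (hit == std::string::npos) {
            break; // no more markers
        }
        chunks.push_back({ChunkType::Media, ""});
        pos = hit + marker.size();
    }
    return chunks;
}
```

A caller would then walk the chunks in order, tokenizing text chunks with the text model's tokenizer and handing each media chunk to the encoder. The number of media chunks should match the number of images or audio clips supplied with the prompt.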

llama-mtmd-cli

A single binary that mirrors llama-cli but accepts --image path / --audio path flags. Source: mtmd-cli.cpp.

Server integration

tools/server/server-context.cpp consumes mtmd-helper.h to handle image/audio in OpenAI-compatible chat requests (messages[].content arrays with image_url / input_audio). The server holds an mtmd_context alongside its llama_context and feeds vision/audio tokens into the same slot scheduler.
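As an illustration of the request shape, the sketch below builds a chat request body with one text part and one image_url part carrying a base64 data URI. It is a client-side sketch under stated assumptions: the helper names are hypothetical, the JSON is assembled by hand without escaping (so the text must not contain quotes or backslashes), and whether a given server build accepts data URIs should be checked against the server documentation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Standard base64 encoding with '=' padding.
std::string base64_encode(const std::vector<uint8_t> & data) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve((data.size() + 2) / 3 * 4);
    for (size_t i = 0; i < data.size(); i += 3) {
        const size_t   n  = data.size() - i; // bytes left in the input
        const uint32_t b0 = data[i];
        const uint32_t b1 = n > 1 ? data[i + 1] : 0;
        const uint32_t b2 = n > 2 ? data[i + 2] : 0;
        const uint32_t v  = (b0 << 16) | (b1 << 8) | b2;
        out += tbl[(v >> 18) & 63];
        out += tbl[(v >> 12) & 63];
        out += n > 1 ? tbl[(v >> 6) & 63] : '=';
        out += n > 2 ? tbl[v & 63] : '=';
    }
    return out;
}

// Hypothetical helper: wraps raw image bytes in a data URI.
std::string image_data_uri(const std::vector<uint8_t> & bytes, const std::string & mime) {
    return "data:" + mime + ";base64," + base64_encode(bytes);
}

// Hypothetical helper: one user message whose content array holds a text
// part followed by an image_url part. No JSON escaping is performed.
std::string chat_request_body(const std::string & text, const std::string & image_url) {
    return std::string("{\"messages\":[{\"role\":\"user\",\"content\":[")
        + "{\"type\":\"text\",\"text\":\"" + text + "\"},"
        + "{\"type\":\"image_url\",\"image_url\":{\"url\":\"" + image_url + "\"}}]}]}";
}
```

In real client code a JSON library would be the safer choice, since it handles escaping and nested structures.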

How a vision request flows

sequenceDiagram
    participant Client
    participant Srv as llama-server
    participant Help as mtmd-helper
    participant Clip as clip.cpp
    participant Llama as libllama

    Client->>Srv: POST /v1/chat/completions {image_url, text}
    Srv->>Help: prepare chunks (text + image)
    Help->>Clip: encode image
    Clip-->>Help: vision tokens
    Help->>Llama: llama_decode (text + vision tokens)
    Llama-->>Srv: text tokens
    Srv-->>Client: SSE stream
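The llama_decode step in the diagram receives text and vision inputs as one ordered sequence of embedding rows. The sketch below shows that splice in isolation, assuming every chunk has already been turned into row-major embeddings of the same width. EmbdChunk and splice_embeddings are hypothetical names, and the real code also tracks positions and batching, which this omits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container: the embeddings for one chunk of the prompt,
// whether it came from text tokens or from the vision encoder.
struct EmbdChunk {
    size_t             n_tokens; // number of rows in this chunk
    std::vector<float> embd;     // n_tokens * n_embd floats, row-major
};

// Concatenates chunk embeddings in prompt order into one flat buffer of
// (total tokens) x n_embd floats.
std::vector<float> splice_embeddings(const std::vector<EmbdChunk> & chunks, size_t n_embd) {
    std::vector<float> out;
    for (const EmbdChunk & c : chunks) {
        assert(c.embd.size() == c.n_tokens * n_embd); // shape check per chunk
        out.insert(out.end(), c.embd.begin(), c.embd.end());
    }
    return out;
}
```

The point of the sketch is that the text model does not distinguish the origin of a row once the vision output has the same embedding width as the text model.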

Integration points

  • libllama. Vision/audio tokens are spliced into llama_decode as ordinary embeddings.
  • tools/server/. Heaviest consumer; wraps mtmd-helper per request.
  • Conversion. Vision encoders are converted to mmproj GGUFs by per-model scripts (under tools/mtmd/models/ and the wider model-conversion pipeline).
  • Tests. tools/mtmd/tests.sh runs llama-mtmd-cli against checked-in test-1.jpeg and test-2.mp3.

Entry points for modification

  • New vision model. Add a converter script under tools/mtmd/models/, update clip-model.h, add quirks to clip.cpp's graph builder. The legacy tools/mtmd/legacy-models/ is a reference for the simpler LLaVA-style flow.
  • New audio path. mtmd-audio.cpp is the place; it is intentionally Whisper-style today.
  • New mtmd API. Edit mtmd.h plus mtmd.cpp; update mtmd-helper.cpp and the server consumer.

Tests

  • tools/mtmd/tests/ — C++ tests for mtmd.cpp and helpers.
  • tools/mtmd/tests.sh — end-to-end driver.
  • tests/test-mtmd-c-api.cpp — public C API smoke test.

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
