ggml-org/llama.cpp

Vocab and tokenizer

Active contributors: Sigbjørn Skjæret, Georgi Gerganov

src/llama-vocab.cpp is the in-tree tokenizer. It implements every tokenizer family used by the supported models: Byte-Pair Encoding (BPE), SentencePiece (SPM), WordPiece (WPM), Unigram (UGM), and the byte-level RWKV tokenizer. Vocab data is read from GGUF metadata at load time.

Purpose

Decode the vocab section of a GGUF file into an in-memory llama_vocab.
Tokenize text → token ids and back, matching the reference HuggingFace tokenizer for that model.
Provide special-token bookkeeping (bos, eos, pad, cls, sep, plus per-model tokens like <|im_start|> or <|tool_call|>).

Directory layout

src/
├── llama-vocab.h          # llama_vocab + types
├── llama-vocab.cpp        # ~166 KB; all five tokenizer implementations
├── unicode.cpp / .h       # codepoint helpers, NFD normalization, byte<->UTF-8 mapping, regex split
└── unicode-data.cpp / .h  # generated Unicode tables (categories, lowercasing, etc.)

Key abstractions

Type / value	Role	File
`enum llama_vocab_type`	`LLAMA_VOCAB_TYPE_NONE/SPM/BPE/WPM/UGM/RWKV`	`include/llama.h`
`enum llama_vocab_pre_type`	Per-model pre-tokenizer regex variant (`LLAMA_VOCAB_PRE_TYPE_LLAMA3`, `..._GPT2`, `..._QWEN2`, ...)	`include/llama.h`
`llama_vocab` (private)	Holds vocab data, special tokens, BPE merges, regex pre-tokenizer state	`src/llama-vocab.h`
`llama_token`	`int32_t` typedef for token ids	`include/llama.h`
`llm_tokenizer_*` classes	Per-family tokenizer implementations	`src/llama-vocab.cpp`

How it works

The loader reads three groups of GGUF metadata keys to populate the vocab:

tokenizer.ggml.tokens and friends: the actual vocabulary list, scores, and token types.
tokenizer.ggml.bos_token_id, tokenizer.ggml.eos_token_id, etc.: special token ids.
tokenizer.ggml.pre: a string naming the pre-tokenizer (e.g. "llama3", "gpt-2", "qwen2"). At load time the name is matched against the names known to src/llama-vocab.cpp to select the LLAMA_VOCAB_PRE_TYPE_* variant and its split regexes. The name itself is chosen at conversion time (see Pre-tokenizer registry).
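
The load-time step for tokenizer.ggml.pre amounts to a lookup from a string to an enum value. A minimal sketch of that idea follows. The enum `pre_type` and the function `pre_type_from_name` are hypothetical stand-ins, not names from llama.cpp. The real enum has many more entries, and the way the real loader handles unknown names is not shown here.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for llama_vocab_pre_type (the real enum is much larger).
enum class pre_type { DEFAULT, LLAMA3, GPT2, QWEN2 };

// Map the string stored under tokenizer.ggml.pre to a pre-tokenizer variant.
// Returns false for names this sketch does not know. Treating an empty name
// as DEFAULT is a choice made for this sketch.
inline bool pre_type_from_name(const std::string & name, pre_type & out) {
    if (name.empty() || name == "default") { out = pre_type::DEFAULT; return true; }
    if (name == "llama3")                  { out = pre_type::LLAMA3;  return true; }
    if (name == "gpt-2")                   { out = pre_type::GPT2;    return true; }
    if (name == "qwen2")                   { out = pre_type::QWEN2;   return true; }
    return false;
}

// Usage: resolve the name read from the GGUF metadata.
inline void usage_example() {
    pre_type t = pre_type::DEFAULT;
    assert(pre_type_from_name("qwen2", t) && t == pre_type::QWEN2);
    assert(!pre_type_from_name("not-a-real-name", t));
}
```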

graph LR
    Text -->|llama_tokenize| Pre[Pre-tokenizer regex split]
    Pre --> Tokenizer{vocab_type}
    Tokenizer -->|BPE| BPE[BPE merges]
    Tokenizer -->|SPM| SPM[Sentencepiece + byte fallback]
    Tokenizer -->|WPM| WPM[WordPiece greedy match]
    Tokenizer -->|UGM| UGM[Unigram lattice viterbi]
    Tokenizer -->|RWKV| RWKV[Trie-based byte tokenizer]
    BPE --> Tokens[token ids]
    SPM --> Tokens
    WPM --> Tokens
    UGM --> Tokens
    RWKV --> Tokens
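
To make the "BPE merges" box concrete, here is a naive sketch of the merge loop. It repeatedly joins the adjacent pair of symbols with the lowest merge rank until no known pair remains. The names `merge_ranks`, `add_merge`, and `bpe_merge` are hypothetical. The sketch rescans the whole word on every step for clarity, and it assumes a merge's rank is its position in an ordered merge list. It is not the in-tree implementation.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Merge table: (left, right) -> rank. A lower rank means the merge is applied first.
using merge_ranks = std::map<std::pair<std::string, std::string>, int>;

// Append a merge; its rank is its position in the ordered merge list.
inline void add_merge(merge_ranks & ranks, const std::string & a, const std::string & b) {
    const int rank = static_cast<int>(ranks.size());
    ranks[std::make_pair(a, b)] = rank;
}

// Naive BPE: merge the lowest-ranked adjacent pair until none is left.
inline std::vector<std::string> bpe_merge(std::vector<std::string> symbols,
                                          const merge_ranks & ranks) {
    while (symbols.size() > 1) {
        int         best_rank = INT_MAX;
        std::size_t best_pos  = 0;
        for (std::size_t i = 0; i + 1 < symbols.size(); ++i) {
            const auto it = ranks.find(std::make_pair(symbols[i], symbols[i + 1]));
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_pos  = i;
            }
        }
        if (best_rank == INT_MAX) {
            break; // no mergeable pair left
        }
        symbols[best_pos] += symbols[best_pos + 1];
        symbols.erase(symbols.begin() + static_cast<std::ptrdiff_t>(best_pos) + 1);
    }
    return symbols;
}

// Usage: "lower" split into characters, with three learned merges.
inline void usage_example() {
    merge_ranks ranks;
    add_merge(ranks, "l", "o");
    add_merge(ranks, "lo", "w");
    add_merge(ranks, "e", "r");
    const std::vector<std::string> word = {"l", "o", "w", "e", "r"};
    const std::vector<std::string> out  = bpe_merge(word, ranks);
    assert(out.size() == 2 && out[0] == "low" && out[1] == "er");
}
```

Each resulting symbol would then be looked up in the vocabulary to obtain its token id.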

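Detokenization is the inverse: llama_token_to_piece writes the bytes of a single token's piece into a caller-supplied buffer. Byte-fallback tokens yield one raw byte each, so a single piece is not necessarily valid UTF-8 on its own; concatenating the pieces of a whole sequence restores the text.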
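
The byte-fallback behaviour can be illustrated with a small sketch for an SPM-style vocabulary. It assumes byte tokens are spelled `<0xNN>` and that the word-boundary marker is U+2581. The helpers `is_byte_token`, `piece_from_token_text`, and `detokenize` are hypothetical names, and the sketch operates on token text rather than on token ids.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// True if text looks like a byte-fallback token such as "<0x0A>".
inline bool is_byte_token(const std::string & text) {
    return text.size() == 6 && text.compare(0, 3, "<0x") == 0 && text[5] == '>' &&
           std::isxdigit(static_cast<unsigned char>(text[3])) &&
           std::isxdigit(static_cast<unsigned char>(text[4]));
}

// Convert one token's text into the bytes it contributes to the output.
inline std::string piece_from_token_text(const std::string & text) {
    if (is_byte_token(text)) {
        const int byte = std::stoi(text.substr(3, 2), nullptr, 16);
        return std::string(1, static_cast<char>(byte)); // one raw byte
    }
    static const std::string marker = "\xE2\x96\x81"; // U+2581 encoded as UTF-8
    std::string out;
    std::size_t pos = 0;
    while (pos < text.size()) {
        if (text.compare(pos, marker.size(), marker) == 0) {
            out += ' '; // the marker stands for a space
            pos += marker.size();
        } else {
            out += text[pos++];
        }
    }
    return out;
}

// Concatenate pieces; multi-byte characters split across byte tokens
// become valid UTF-8 again only after concatenation.
inline std::string detokenize(const std::vector<std::string> & token_texts) {
    std::string out;
    for (const std::string & t : token_texts) {
        out += piece_from_token_text(t);
    }
    return out;
}

// Usage: "é" (0xC3 0xA9) arrives as two byte-fallback tokens.
inline void usage_example() {
    const std::vector<std::string> toks = {"\xE2\x96\x81" "caf", "<0xC3>", "<0xA9>"};
    assert(detokenize(toks) == " caf\xC3\xA9");
}
```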

Pre-tokenizer registry

Different BPE models use slightly different regex pre-tokenizers. Rather than parse each model's tokenizer JSON, the conversion script encodes a fixed check string with the upstream tokenizer, hashes the resulting token ids, and looks the hash up in a table of known pre-tokenizers; the matching name is stored in tokenizer.ggml.pre. The regexes themselves live in src/llama-vocab.cpp, selected by pre-type. The Python script convert_hf_to_gguf_update.py keeps the hash table in sync by regenerating the lookup used by convert_hf_to_gguf.py. This is why adding a new pre-tokenizer typically requires touching that script as well.

Integration points

Loader. src/llama-model-loader.cpp reads vocab metadata and constructs the llama_vocab.
Chat templates. src/llama-chat.cpp and common/chat.cpp consult llama_vocab for the model's special tokens.
Sampler. Some samplers (e.g. infill) need vocab-specific token ids; they fetch them from llama_vocab.
Tools. tools/tokenize/ (built as the llama-tokenize binary) is a thin wrapper around llama_tokenize for shell debugging.

Entry points for modification

New tokenizer pre-type. Edit LLAMA_VOCAB_PRE_TYPE_* in include/llama.h, add the regex/handler in src/llama-vocab.cpp, and update convert_hf_to_gguf_update.py to emit the right tokenizer.ggml.pre value.
New special token. Most special-token plumbing is data-driven; add the GGUF metadata key, then read it in the loader.
Tokenizer correctness. Always add a case to tests/test-tokenizer-* with the reference tokenizer's output.
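
The "data-driven" special-token plumbing can be pictured as reading optional integer keys into a struct, with a sentinel for tokens the model does not define. This is a sketch under assumptions: `metadata` is a hypothetical stand-in for GGUF integer metadata, `special_tokens` and `read_optional_id` are illustrative names, and `TOKEN_NULL = -1` is this sketch's choice of sentinel.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using token_id = std::int32_t;
const token_id TOKEN_NULL = -1; // sentinel for "no such token" (assumption of this sketch)

// Hypothetical stand-in for the integer-valued GGUF metadata of a model.
using metadata = std::map<std::string, token_id>;

struct special_tokens {
    token_id bos = TOKEN_NULL;
    token_id eos = TOKEN_NULL;
};

// Return the id stored under key, or TOKEN_NULL if the key is absent.
inline token_id read_optional_id(const metadata & kv, const std::string & key) {
    const auto it = kv.find(key);
    return it == kv.end() ? TOKEN_NULL : it->second;
}

inline special_tokens load_special_tokens(const metadata & kv) {
    special_tokens st;
    st.bos = read_optional_id(kv, "tokenizer.ggml.bos_token_id");
    st.eos = read_optional_id(kv, "tokenizer.ggml.eos_token_id");
    return st;
}

// Usage: a model that defines BOS but not EOS.
inline void usage_example() {
    metadata kv;
    kv["tokenizer.ggml.bos_token_id"] = 1;
    const special_tokens st = load_special_tokens(kv);
    assert(st.bos == 1 && st.eos == TOKEN_NULL);
}
```

In this shape, supporting one more special token means adding one field and one `read_optional_id` call.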

Tests

tests/test-tokenizer-0, test-tokenizer-1-bpe, test-tokenizer-1-spm, test-tokenizer-pre, etc. exercise the implementation against checked-in golden outputs. The models/ directory contains the vocab-only GGUF files (ggml-vocab-*.gguf) and their .inp/.out reference pairs used by these tests; models/templates/ holds chat templates, not vocab data.
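
A golden-output check boils down to comparing the ids a tokenizer produced with the ids recorded from the reference tokenizer. The sketch below assumes the expected ids are stored as whitespace-separated integers on one line; `parse_ids` and `first_mismatch` are hypothetical helpers, not functions from the test suite.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Parse a line of whitespace-separated token ids (format assumed for this sketch).
inline std::vector<std::int32_t> parse_ids(const std::string & line) {
    std::vector<std::int32_t> ids;
    std::istringstream in(line);
    std::int32_t id = 0;
    while (in >> id) {
        ids.push_back(id);
    }
    return ids;
}

// Index of the first position where the sequences differ, or -1 if identical.
// A length difference counts as a mismatch at the end of the shorter sequence.
inline long first_mismatch(const std::vector<std::int32_t> & expected,
                           const std::vector<std::int32_t> & actual) {
    const std::size_t n = std::min(expected.size(), actual.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (expected[i] != actual[i]) {
            return static_cast<long>(i);
        }
    }
    return expected.size() == actual.size() ? -1 : static_cast<long>(n);
}

// Usage: compare recorded ids with ids produced by the tokenizer under test.
inline void usage_example() {
    const std::vector<std::int32_t> expected = parse_ids("1 15043 3186");
    const std::vector<std::int32_t> same     = {1, 15043, 3186};
    const std::vector<std::int32_t> wrong    = {1, 15043, 42};
    assert(first_mismatch(expected, same) == -1);
    assert(first_mismatch(expected, wrong) == 2);
}
```

Reporting the first differing index, rather than a plain pass/fail, makes it easier to see which part of the input text tokenized differently.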

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.