ggml-org/llama.cpp
Vocab and tokenizer
Active contributors: Sigbjørn Skjæret, Georgi Gerganov
src/llama-vocab.cpp is the in-tree tokenizer. It implements every tokenizer family used by the supported models: Byte-Pair Encoding (BPE), SentencePiece (SPM), WordPiece (WPM), Unigram (UGM), and the byte-level RWKV tokenizer. Vocab data is read from GGUF metadata at load time.
Purpose
- Decode the vocab section of a GGUF file into an in-memory llama_vocab.
- Tokenize text → token ids and back, matching the reference HuggingFace tokenizer for that model.
- Provide special-token bookkeeping (bos, eos, pad, cls, sep, plus per-model tokens like <|im_start|> or <|tool_call|>).
Directory layout
src/
├── llama-vocab.h # llama_vocab + types
├── llama-vocab.cpp # ~166 KB; all five tokenizer implementations
├── unicode.cpp / .h # codepoint helpers, NFC normalization, byte-fallback
└── unicode-data.cpp / .h # generated Unicode tables (categories, lowercasing, etc.)
Key abstractions
| Type / value | Role | File |
|---|---|---|
| enum llama_vocab_type | LLAMA_VOCAB_TYPE_NONE/SPM/BPE/WPM/UGM/RWKV | include/llama.h |
| enum llama_vocab_pre_type | Per-model pre-tokenizer regex variant (LLAMA_VOCAB_PRE_TYPE_LLAMA3, ..._GPT2, ..._QWEN2, ...) | include/llama.h |
| llama_vocab (private) | Holds vocab data, special tokens, BPE merges, regex pre-tokenizer state | src/llama-vocab.h |
| llama_token | int32_t typedef for token ids | include/llama.h |
| llm_tokenizer_* classes | Per-family tokenizer implementations | src/llama-vocab.cpp |
How it works
The loader reads three GGUF metadata sections to populate the vocab:
- tokenizer.ggml.tokens and friends: the actual vocabulary list, scores, and token types.
- tokenizer.ggml.bos_token_id, ..._eos_token_id, etc.: special token ids.
- tokenizer.ggml.pre: a string identifying the pre-tokenizer (e.g. "llama3", "gpt-2", "qwen2"). The loader matches this name against the variants it knows in order to select the pre-tokenizer. The name itself is assigned at conversion time, using the hash table that convert_hf_to_gguf_update.py maintains.
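
The following is a minimal Python sketch of how these metadata groups could be folded into one in-memory structure. It is not the loader's actual code: ToyVocab, load_vocab, KNOWN_PRE_NAMES and the "default" fallback are hypothetical names and choices made for illustration, and the metadata is modeled as a plain dict rather than a parsed GGUF file.

```python
from dataclasses import dataclass, field

# Hypothetical subset of accepted names; the authoritative list is in src/llama-vocab.cpp.
KNOWN_PRE_NAMES = {"default", "llama3", "gpt-2", "qwen2"}

# Only the two special-token keys spelled out above; other special tokens have their own keys.
SPECIAL_KEYS = {
    "bos": "tokenizer.ggml.bos_token_id",
    "eos": "tokenizer.ggml.eos_token_id",
}


@dataclass
class ToyVocab:
    tokens: list
    scores: list
    token_types: list
    special: dict = field(default_factory=dict)
    pre_name: str = "default"
    token_to_id: dict = field(default_factory=dict)


def load_vocab(meta):
    """Build a ToyVocab from a dict that mimics GGUF tokenizer metadata."""
    tokens = list(meta["tokenizer.ggml.tokens"])
    # Assumption: scores and token types are optional and default to neutral values.
    scores = list(meta.get("tokenizer.ggml.scores", [0.0] * len(tokens)))
    types = list(meta.get("tokenizer.ggml.token_type", [1] * len(tokens)))
    if not (len(tokens) == len(scores) == len(types)):
        raise ValueError("tokens, scores and token types must have equal length")

    special = {}
    for name, key in SPECIAL_KEYS.items():
        if key in meta:
            tok_id = int(meta[key])
            if not 0 <= tok_id < len(tokens):
                raise ValueError(f"{key}={tok_id} is out of range")
            special[name] = tok_id

    # Assumption: a missing name falls back to "default"; an unknown one is an error.
    pre_name = meta.get("tokenizer.ggml.pre", "default")
    if pre_name not in KNOWN_PRE_NAMES:
        raise ValueError(f"unknown pre-tokenizer type: {pre_name!r}")

    token_to_id = {tok: i for i, tok in enumerate(tokens)}
    return ToyVocab(tokens, scores, types, special, pre_name, token_to_id)


EXAMPLE_META = {
    "tokenizer.ggml.tokens": ["<s>", "</s>", "hello", "world"],
    "tokenizer.ggml.bos_token_id": 0,
    "tokenizer.ggml.eos_token_id": 1,
    "tokenizer.ggml.pre": "llama3",
}
EXAMPLE_VOCAB = load_vocab(EXAMPLE_META)
```

Validating special-token ids and the pre-tokenizer name at load time, as sketched here, surfaces a bad conversion before any text is tokenized.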
graph LR
Text -->|llama_tokenize| Pre[Pre-tokenizer regex split]
Pre --> Tokenizer{vocab_type}
Tokenizer -->|BPE| BPE[BPE merges]
Tokenizer -->|SPM| SPM[Sentencepiece + byte fallback]
Tokenizer -->|WPM| WPM[WordPiece greedy match]
Tokenizer -->|UGM| UGM[Unigram lattice viterbi]
Tokenizer -->|RWKV| RWKV[Trie-based byte tokenizer]
BPE --> Tokens[token ids]
SPM --> Tokens
WPM --> Tokens
UGM --> Tokens
RWKV --> Tokens

Detokenization is the inverse: llama_token_to_piece returns the printable bytes for a single token, with byte-fallback handling for invalid UTF-8 sequences.
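
To make the BPE path of the diagram concrete, here is a toy Python sketch of the two stages: a regex pre-split followed by rank-ordered merges. Everything in it is an assumption made for illustration. PRE_SPLIT is a simplified ASCII-only pattern (real pre-tokenizer regexes are Unicode-aware and differ per model), the merge table and vocabulary are invented, and the byte-to-character mapping used by byte-level BPE is omitted.

```python
import re

# Simplified, ASCII-oriented stand-in for a pre-tokenizer regex.
PRE_SPLIT = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

# Invented merge list; a lower index means a higher-priority merge.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]
RANKS = {pair: rank for rank, pair in enumerate(MERGES)}

# Invented vocabulary covering only the symbols this example produces.
ID_TO_TOKEN = [" ", "low", "er"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(ID_TO_TOKEN)}


def bpe_merge(piece, ranks):
    """Repeatedly merge the adjacent pair with the lowest rank until none applies."""
    symbols = list(piece)
    while len(symbols) > 1:
        best_rank, best_i = None, -1
        for i in range(len(symbols) - 1):
            rank = ranks.get((symbols[i], symbols[i + 1]))
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i < 0:
            break  # no known pair is left
        symbols[best_i:best_i + 2] = [symbols[best_i] + symbols[best_i + 1]]
    return symbols


def tokenize(text, ranks, token_to_id):
    """Pre-split the text, merge each piece, and map the symbols to ids."""
    ids = []
    for piece in PRE_SPLIT.findall(text):
        for symbol in bpe_merge(piece, ranks):
            ids.append(token_to_id[symbol])  # KeyError if the toy vocab lacks it
    return ids


def detokenize(ids, id_to_token):
    """Inverse direction for this toy vocab: concatenate the pieces."""
    return "".join(id_to_token[i] for i in ids)


EXAMPLE_IDS = tokenize("low lower", RANKS, TOKEN_TO_ID)
```

The pre-split matters because merges never cross piece boundaries: "low" and " lower" are merged independently, so the leading space stays a separate symbol in this toy vocabulary.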
Pre-tokenizer registry
Different models, including different LLaMA-3-style models, use slightly different regex pre-tokenizers. Rather than parsing the upstream tokenizer JSON at load time, the conversion step fingerprints the reference tokenizer: it encodes a fixed check string, hashes the resulting token ids, and stores the name of the matching pre-tokenizer in tokenizer.ggml.pre. The Python script convert_hf_to_gguf_update.py keeps the hash table in sync. This is why adding a new pre-tokenizer variant typically requires touching that script as well.
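
The registry idea can be sketched in a few lines of Python. This sketch assumes the fingerprint is a SHA-256 digest over the token ids that a tokenizer produces for a fixed check string. CHECK_TEXT, the two toy encoders, and the names in KNOWN are all hypothetical; the real table holds precomputed hex digests rather than values computed on the fly.

```python
import hashlib

# Hypothetical check string; a real one would be much longer and exercise
# whitespace runs, digits, punctuation, and non-Latin scripts.
CHECK_TEXT = "Hello world 123  \n\tnew-line, don't"


def fingerprint(encode):
    """Return the SHA-256 hex digest of the token ids produced for CHECK_TEXT."""
    ids = encode(CHECK_TEXT)
    return hashlib.sha256(str(ids).encode("utf-8")).hexdigest()


def toy_encode_words(text):
    # Stand-in tokenizer A: one id per whitespace-separated word (its length).
    return [len(word) for word in text.split()]


def toy_encode_chars(text):
    # Stand-in tokenizer B: one id per character (its code point).
    return [ord(ch) for ch in text]


# Fingerprint -> pre-tokenizer name. Computed here only to keep the sketch
# self-contained; a maintained table would store literal digests.
KNOWN = {
    fingerprint(toy_encode_words): "toy-words",
    fingerprint(toy_encode_chars): "toy-chars",
}


def resolve_pre_name(encode):
    """Look up the pre-tokenizer name that would be written to tokenizer.ggml.pre."""
    digest = fingerprint(encode)
    if digest not in KNOWN:
        raise ValueError(
            f"unrecognized tokenizer fingerprint {digest[:12]}; add it to the table"
        )
    return KNOWN[digest]
```

Hashing observed behavior rather than configuration means two tokenizers that split the check string identically share a name, while any behavioral difference the check string exercises forces a new table entry.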
Integration points
- Loader. src/llama-model-loader.cpp reads vocab metadata and constructs the llama_vocab.
- Chat templates. src/llama-chat.cpp and common/chat.cpp consult llama_vocab for the model's special tokens.
- Sampler. Some samplers (e.g. infill) need vocab-specific token ids; they fetch them from llama_vocab.
- Tools. tools/tokenize/ (now llama-tokenize) is a thin wrapper around llama_tokenize for shell debugging.
Entry points for modification
- New tokenizer pre-type. Edit LLAMA_VOCAB_PRE_TYPE_* in include/llama.h, add the regex/handler in src/llama-vocab.cpp, and update convert_hf_to_gguf_update.py to emit the right tokenizer.ggml.pre value.
- New special token. Most special-token plumbing is data-driven; add the GGUF metadata key, then read it in the loader.
- Tokenizer correctness. Always add a case to tests/test-tokenizer-* with the reference tokenizer's output.
Tests
tests/test-tokenizer-0, test-tokenizer-1-bpe, test-tokenizer-1-spm, test-tokenizer-pre, etc. exercise the implementation against checked-in golden outputs. The models/ directory contains the vocab-only GGUF files (ggml-vocab-*.gguf) that serve as reference vocabularies for the golden and round-trip tests.
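
The two kinds of check described here, golden outputs and round trips, can be illustrated with a toy tokenizer. The Python sketch below is not taken from the test suite: the vocabulary, the greedy longest-match tokenizer, and run_checks are hypothetical. It does mirror one real convention, byte-fallback tokens written as <0xNN>, which lets any input survive a round trip.

```python
def build_vocab(pieces):
    """Ids 0..255 are byte-fallback tokens <0x00>..<0xFF>; pieces follow."""
    id_to_piece = [f"<0x{b:02X}>" for b in range(256)] + list(pieces)
    return id_to_piece, {piece: i for i, piece in enumerate(id_to_piece)}


def tokenize(text, piece_to_id, max_len=8):
    """Greedy longest match over the pieces, falling back to UTF-8 bytes."""
    ids, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            tok = piece_to_id.get(text[i:i + n])
            if tok is not None and tok >= 256:  # ignore literal "<0xNN>" text
                ids.append(tok)
                i += n
                break
        else:
            ids.extend(text[i].encode("utf-8"))  # one id per byte
            i += 1
    return ids


def detokenize(ids, id_to_piece):
    """Concatenate bytes; invalid UTF-8 becomes U+FFFD instead of raising."""
    out = bytearray()
    for tok in ids:
        if tok < 256:
            out.append(tok)
        else:
            out += id_to_piece[tok].encode("utf-8")
    return out.decode("utf-8", errors="replace")


ID_TO_PIECE, PIECE_TO_ID = build_vocab(["he", "llo", " ", "world"])

# Golden cases: input text -> expected ids, as a checked-in reference would record them.
GOLDEN = {
    "hello world": [256, 257, 258, 259],
    "héllo": [104, 195, 169, 257],
}
ROUND_TRIP_SAMPLES = ["hello world", "héllo", "", "日本語 world"]


def run_checks(golden, samples):
    """Return a list of (kind, text) failures; an empty list means all checks passed."""
    failures = []
    for text, expected in golden.items():
        if tokenize(text, PIECE_TO_ID) != expected:
            failures.append(("golden", text))
    for text in samples:
        if detokenize(tokenize(text, PIECE_TO_ID), ID_TO_PIECE) != text:
            failures.append(("round-trip", text))
    return failures


FAILURES = run_checks(GOLDEN, ROUND_TRIP_SAMPLES)
```

Golden checks catch divergence from the reference tokenizer, while round-trip checks catch lossy encoding even for inputs that have no recorded reference output; the two are complementary.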
Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.