ggml-org/llama.cpp

Vocab and tokenizer

Active contributors: Sigbjørn Skjæret, Georgi Gerganov

src/llama-vocab.cpp is the in-tree tokenizer. It implements every tokenizer family used by the supported models: Byte-Pair Encoding (BPE), SentencePiece (SPM), WordPiece (WPM), Unigram (UGM), and the byte-level RWKV tokenizer. Vocab data is read from GGUF metadata at load time.

Purpose

  • Decode the vocab section of a GGUF file into an in-memory llama_vocab.
  • Tokenize text → token ids and back, matching the reference HuggingFace tokenizer for that model.
  • Provide special-token bookkeeping (bos, eos, pad, cls, sep, plus per-model tokens like <|im_start|> or <|tool_call|>).

Directory layout

src/
├── llama-vocab.h          # llama_vocab + types
├── llama-vocab.cpp        # ~166 KB; all five tokenizer implementations
├── unicode.cpp / .h       # codepoint helpers, NFD normalization, byte-fallback
└── unicode-data.cpp / .h  # generated Unicode tables (categories, lowercasing, etc.)

Key abstractions

| Type / value | Role | File |
|---|---|---|
| enum llama_vocab_type | LLAMA_VOCAB_TYPE_NONE/SPM/BPE/WPM/UGM/RWKV | include/llama.h |
| enum llama_vocab_pre_type | Per-model pre-tokenizer regex variant (LLAMA_VOCAB_PRE_TYPE_LLAMA3, ..._GPT2, ..._QWEN2, ...) | include/llama.h |
| llama_vocab (private) | Holds vocab data, special tokens, BPE merges, regex pre-tokenizer state | src/llama-vocab.h |
| llama_token | int32_t typedef for token ids | include/llama.h |
| llm_tokenizer_* classes | Per-family tokenizer implementations | src/llama-vocab.cpp |

How it works

The loader reads three groups of GGUF metadata keys to populate the vocab:

  1. tokenizer.ggml.tokens and friends: the actual vocabulary list, scores, and token types.
  2. tokenizer.ggml.bos_token_id, ..._eos_token_id, etc.: special token ids.
  3. tokenizer.ggml.pre: a string naming the pre-tokenizer (e.g. "llama3", "gpt-2", "qwen2"). The loader matches this name against the names it knows to select the llama_vocab_pre_type variant. The name itself is chosen at conversion time, using the hash table maintained by convert_hf_to_gguf_update.py.

graph LR
    Text -->|llama_tokenize| Pre[Pre-tokenizer regex split]
    Pre --> Tokenizer{vocab_type}
    Tokenizer -->|BPE| BPE[BPE merges]
    Tokenizer -->|SPM| SPM[Sentencepiece + byte fallback]
    Tokenizer -->|WPM| WPM[WordPiece greedy match]
    Tokenizer -->|UGM| UGM[Unigram lattice viterbi]
    Tokenizer -->|RWKV| RWKV[Trie-based byte tokenizer]
    BPE --> Tokens[token ids]
    SPM --> Tokens
    WPM --> Tokens
    UGM --> Tokens
    RWKV --> Tokens
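
To make one branch of the graph concrete, here is a minimal sketch of WordPiece-style greedy longest-match over a single pre-split word. It is an illustration under stated assumptions, not the code in src/llama-vocab.cpp: the toy_vocab type, the wpm_tokenize_word function, the ASCII-only input, and the BERT-style "##" continuation prefix are all hypothetical choices made for the example.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical toy vocabulary: piece text -> token id.
using toy_vocab = std::unordered_map<std::string, int>;

// Greedy longest-match over one pre-split word (ASCII only, for brevity).
// Assumption: continuation pieces carry a "##" prefix, as in BERT-style vocabs.
// Returns {unk_id} if any part of the word cannot be matched.
std::vector<int> wpm_tokenize_word(const toy_vocab & vocab,
                                   const std::string & word,
                                   int unk_id) {
    assert(!word.empty());
    std::vector<int> out;
    size_t start = 0;
    while (start < word.size()) {
        size_t end = word.size();
        int id = -1;
        // Try the longest candidate first, then shrink from the right.
        while (end > start) {
            std::string piece = word.substr(start, end - start);
            if (start > 0) {
                piece = "##" + piece;
            }
            auto it = vocab.find(piece);
            if (it != vocab.end()) {
                id = it->second;
                break;
            }
            --end;
        }
        if (id < 0) {
            return {unk_id};  // no prefix matched: the whole word is unknown
        }
        out.push_back(id);
        start = end;          // continue after the matched piece
    }
    return out;
}
```

With a vocabulary containing "un", "##afford", and "##able", the word "unaffordable" is split into those three pieces in order; a word with no matching prefix collapses to the unknown token.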

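Detokenization is the inverse: llama_token_to_piece returns the bytes for a single token. Byte-fallback tokens are emitted as raw bytes, so a single piece is not guaranteed to be valid UTF-8 on its own; concatenate the pieces before decoding the result as text.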
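
The byte-fallback case can be sketched as follows. This is a simplified, self-contained illustration that assumes the SentencePiece convention of spelling byte tokens as "<0xNN>"; parse_byte_piece and detokenize are hypothetical helper names, and the sketch works on piece strings rather than on a real llama_vocab.

```cpp
#include <cassert>
#include <string>
#include <vector>

// If `piece` has the form "<0xNN>" (assumed byte-fallback spelling),
// store the byte value in `out` and return true.
bool parse_byte_piece(const std::string & piece, unsigned char & out) {
    if (piece.size() != 6 || piece.compare(0, 3, "<0x") != 0 || piece[5] != '>') {
        return false;
    }
    auto hex = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        return -1;
    };
    const int hi = hex(piece[3]);
    const int lo = hex(piece[4]);
    if (hi < 0 || lo < 0) {
        return false;
    }
    assert(hi * 16 + lo <= 0xFF);  // two hex digits always fit in one byte
    out = static_cast<unsigned char>(hi * 16 + lo);
    return true;
}

// Concatenate pieces into raw bytes. A byte-fallback piece contributes exactly
// one byte; any other piece is appended verbatim.
std::string detokenize(const std::vector<std::string> & pieces) {
    std::string bytes;
    for (const auto & p : pieces) {
        unsigned char b = 0;
        if (parse_byte_piece(p, b)) {
            bytes.push_back(static_cast<char>(b));
        } else {
            bytes += p;
        }
    }
    return bytes;
}
```

Here the two pieces "<0xC3>" and "<0xA9>" are each an incomplete UTF-8 sequence; only after concatenation do they form the two-byte encoding of "é". That is why the decoded text should be assembled from all pieces before it is treated as UTF-8.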

Pre-tokenizer registry

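Many BPE models share a tokenizer family but differ slightly in their pre-tokenizer regex. The regexes themselves live in src/llama-vocab.cpp, keyed by pre-type. What cannot be read reliably from the upstream tokenizer JSON is which variant a given model uses, so convert_hf_to_gguf_update.py tokenizes a fixed test string with each reference tokenizer, hashes the resulting token ids, and records each hash next to a pre-tokenizer name. At conversion time the same hash is recomputed and the matching name is written to tokenizer.ggml.pre. This is why adding a new pre-tokenizer variant typically requires touching that script as well as the C++ code.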
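
The registry idea can be pictured with a small sketch. In the upstream conversion scripts the fingerprint is taken over the token ids that a tokenizer produces for a fixed test string, and that logic lives in Python. The C++ below only illustrates the lookup pattern: the 64-bit FNV-1a hash stands in for the real digest, and fingerprint, pre_registry, lookup_pre, and the "family-a" name are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in fingerprint (assumption): 64-bit FNV-1a over the bytes of each id.
// Any stable digest would serve the same purpose in this illustration.
uint64_t fingerprint(const std::vector<int32_t> & ids) {
    uint64_t h = 0xcbf29ce484222325ULL;            // FNV offset basis
    for (int32_t id : ids) {
        const uint32_t u = static_cast<uint32_t>(id);
        for (int shift = 0; shift < 32; shift += 8) {
            h ^= static_cast<uint8_t>(u >> shift);
            h *= 0x100000001b3ULL;                 // FNV prime
        }
    }
    return h;
}

// Hypothetical registry: fingerprint of the probe tokenization -> pre name.
using pre_registry = std::unordered_map<uint64_t, std::string>;

// Returns the registered name, or an empty string when the tokenizer is
// unknown and a new registry entry would be needed.
std::string lookup_pre(const pre_registry & reg,
                       const std::vector<int32_t> & probe_ids) {
    assert(!probe_ids.empty());  // an empty probe result carries no signal
    auto it = reg.find(fingerprint(probe_ids));
    return it == reg.end() ? std::string() : it->second;
}
```

Two tokenizers that split the probe string differently produce different id sequences and therefore different fingerprints, which is what lets a short name in tokenizer.ggml.pre stand for a specific pre-tokenization behavior.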

Integration points

  • Loader. src/llama-model-loader.cpp reads vocab metadata and constructs the llama_vocab.
  • Chat templates. src/llama-chat.cpp and common/chat.cpp consult llama_vocab for the model's special tokens.
  • Sampler. Some samplers (e.g. infill) need vocab-specific token ids; they fetch them from llama_vocab.
  • Tools. tools/tokenize/ (now llama-tokenize) is a thin wrapper around llama_tokenize for shell debugging.

Entry points for modification

  • New tokenizer pre-type. Edit LLAMA_VOCAB_PRE_TYPE_* in include/llama.h, add the regex/handler in src/llama-vocab.cpp, and update convert_hf_to_gguf_update.py to emit the right tokenizer.ggml.pre value.
  • New special token. Most special-token plumbing is data-driven; add the GGUF metadata key, then read it in the loader.
  • Tokenizer correctness. Always add a case to tests/test-tokenizer-* with the reference tokenizer's output.

Tests

tests/test-tokenizer-0, test-tokenizer-1-bpe, test-tokenizer-1-spm, test-tokenizer-pre, etc. exercise the implementation against checked-in golden outputs. The models/ directory contains the vocab-only GGUF files (ggml-vocab-*.gguf) and the accompanying .inp/.out reference files used by these tests; models/templates/ holds chat templates rather than vocab data.
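
A golden-output comparison of this kind can be sketched as below. The format assumed here (one line of whitespace-separated token ids per test input) and the helper names parse_ids and first_mismatch are illustrative assumptions, not a description of the actual test sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Parse one line of whitespace-separated token ids (assumed golden format).
std::vector<int32_t> parse_ids(const std::string & line) {
    std::istringstream in(line);
    std::vector<int32_t> ids;
    int32_t id = 0;
    while (in >> id) {
        ids.push_back(id);
    }
    assert(in.eof() && "non-numeric data in golden line");
    return ids;
}

// Index of the first position where `got` and `want` differ, or -1 if the two
// sequences are identical. A length difference counts as a mismatch at the
// end of the shorter sequence.
int first_mismatch(const std::vector<int32_t> & got,
                   const std::vector<int32_t> & want) {
    const size_t n = std::min(got.size(), want.size());
    for (size_t i = 0; i < n; ++i) {
        if (got[i] != want[i]) {
            return static_cast<int>(i);
        }
    }
    return got.size() == want.size() ? -1 : static_cast<int>(n);
}
```

Reporting the first mismatching index, rather than only pass or fail, makes it easier to see whether a regression comes from pre-tokenization (an early split differs) or from the merge or lookup step that follows.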

Built by Factory AutoWiki from public repository content. It is a generated preview for codebase exploration, not source-maintained documentation.
