TokenTree Tokenization for LLMs
by Gyula Rabai
This TokenTree tokenization library implements the Byte Pair Encoding (BPE) and SentencePiece (SPM) tokenization algorithms for AI language models. Both algorithms are backed by a novel approach called TokenTree, in which a B-tree-like structure is built from the vocabulary for faster evaluation. This approach provides significant speed improvements over traditional regular-expression (RegExp) based evaluation. The library provides efficient tokenization for GPT-2 and LLaMA-style models, with optimized token lookup through the hierarchical token tree structure.
TokenTree Tokenization (with logs)
TokenTree Tokenization (with breakpoints)
File Structure Overview
Core Tokenizer Files:
- OzAITokenizer.cs: Abstract base class
- OzAITokenizer_Tokenize.cs: Tokenization methods
- OzAITokenizer_TokData.cs: Token data management
- OzAITokenizer_TokAttr.cs: Token attributes
- OzAITokenizer_SpecToks.cs: Special tokens handling
Algorithm Implementations:
- BPE/OzAITokenizer_BPE.cs: GPT-2 style BPE
- SPM/OzAITokenizer_SPM.cs: LLaMA style SentencePiece
Token Tree Structure:
- TokenTree/OzAITokenTree.cs: Main tree structure
- TokenTree/OzAITokenTree_Node.cs: Tree node implementation
- TokenTree/OzAIToken.cs: Token representation
Utility Classes:
- ByteArrayComp/OzByteArrayComparer.cs: Byte array comparison
- OzAIVocabType.cs: Vocabulary type definitions
- OzAIVocabPreType.cs: Preprocessing types
Architecture Overview
Tokenizer Architecture
┌────────────────────────────────────────────────────────┐
│ OzAITokenizer (Abstract Base) │
├────────────────────────────────────────────────────────┤
│ • Space escaping & unescaping │
│ • Factory method: CreateFromFile() │
│ • Token tree building & management │
│ • Special token handling │
└─────────────────┬──────────────────────────────────────┘
│
┌─────────┴─────────┐
│ │
┌───────▼────────┐ ┌───────▼────────┐
│ OzAITokenizer_ │ │ OzAITokenizer_ │
│ BPE │ │ SPM │
│ (GPT-2) │ │ (LLaMA) │
└────────────────┘ └────────────────┘
│ │
└─────────┬─────────┘
│
┌─────────────────▼──────────────────────────────────────┐
│ TokenTree Structure │
├────────────────────────────────────────────────────────┤
│ • OzAITokenTree (Root Node) │
│ • OzAITokenTree_Node (Hierarchical Lookup) │
│ • OzAIToken (Token Representation) │
└────────────────────────────────────────────────────────┘
Detailed File Documentation
OzAITokenizer.cs: Abstract Base Class
Purpose: Provides the foundation for all tokenizer implementations with common functionality.
Key Features:
- Space Escaping: Handles space character escaping/unescaping (BPE uses "Ġ", SPM uses "▁")
- Factory Pattern: CreateFromFile() method that determines the tokenizer type from GGUF file metadata
- Model Initialization: Abstract ModelInit() method for algorithm-specific initialization
- Token Tree Building: buildTokTree() constructs the hierarchical lookup structure
public abstract partial class OzAITokenizer
{
    // Space escaping functionality
    public string SpaceEscape { get; }
    public string UnescapeSpace(string text) { /* ... */ }

    // Factory method for creating tokenizers from GGUF files
    public static bool CreateFromFile(OzGGUFFile file, out OzAITokenizer res, out string error) { /* ... */ }

    // Token tree building
    public void buildTokTree() { /* ... */ }
}
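The two space-escaping conventions mentioned above can be illustrated with a minimal sketch (Python for brevity; this is not the library's code). GPT-2 style BPE represents a space as "Ġ" (U+0120), while SentencePiece uses "▁" (U+2581):

```python
# Illustrative sketch of the two space-escaping conventions.
BPE_SPACE = "\u0120"  # "Ġ" (GPT-2 style BPE)
SPM_SPACE = "\u2581"  # "▁" (SentencePiece)

def escape_space(text: str, space_escape: str) -> str:
    """Replace literal spaces with the tokenizer's escape character."""
    return text.replace(" ", space_escape)

def unescape_space(text: str, space_escape: str) -> str:
    """Restore literal spaces when converting tokens back to text."""
    return text.replace(space_escape, " ")

print(escape_space("hello world", BPE_SPACE))          # helloĠworld
print(unescape_space("hello\u2581world", SPM_SPACE))   # hello world
```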
OzAITokenizer_Tokenize.cs: Tokenization Logic
Purpose: Contains the core tokenization methods and token-to-text conversion utilities.
Key Methods:
- GetTokens(): Main tokenization method with BOS/EOS handling
- Tokenize(): Abstract method for algorithm-specific tokenization
- ByteArray2Tokens(): Converts byte arrays to token sequences
- Byte2TokenID(): Maps individual bytes to token IDs
- VisualizeTokens(): Debug utility for token visualization
- GetStrings() / GetStringsRaw(): Converts token sequences back to text
Important: The tokenization process automatically handles:
- Beginning of Sequence (BOS) token insertion
- Space prefix handling (SPM adds space, BPE doesn't)
- End of Sequence (EOS) token insertion
- Unknown token fallback mechanisms
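The BOS/EOS wrapping step can be sketched as follows (illustrative Python with hypothetical names; the default IDs mirror the SPM configuration documented below, BOS = 1 and EOS = 2):

```python
# Sketch of BOS/EOS insertion around a raw tokenization result.
# IDs and flag names are illustrative, not the library's API.

def wrap_tokens(token_ids, add_bos=True, add_eos=False, bos_id=1, eos_id=2):
    """Prepend/append special tokens around the raw token sequence."""
    result = list(token_ids)
    if add_bos:
        result.insert(0, bos_id)
    if add_eos:
        result.append(eos_id)
    return result

print(wrap_tokens([15043, 3186]))                # [1, 15043, 3186]
print(wrap_tokens([15043, 3186], add_eos=True))  # [1, 15043, 3186, 2]
```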
OzAITokenizer_TokData.cs: Token Data Management
Purpose: Manages token vocabulary data loading and preprocessing from GGUF files.
Key Components:
- Tokens: List of all tokens with ID, text, and score
- Len2ID: Priority queue organizing tokens by length for efficient lookup
- TokenTree: Hierarchical tree structure for fast token matching
- MaxTokenLen / AvgTokenLen: Statistics for optimization
- ByteArrayComparer: Custom equality comparer for byte arrays
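The idea behind organizing tokens by length — so that longer candidates are tried before shorter ones — can be sketched like this (illustrative Python, not the library's data structures):

```python
# Sketch of length-bucketed vocabulary lookup: group token byte strings
# by length, then try the longest bucket first when matching.

def build_len_buckets(vocab):
    """Group token byte strings by length, for longest-first matching."""
    buckets = {}
    for text, tok_id in vocab.items():
        buckets.setdefault(len(text), {})[text] = tok_id
    return buckets

def match_longest(buckets, data, pos):
    """Return (token_id, length) of the longest token matching at pos."""
    for length in sorted(buckets, reverse=True):  # longest first
        piece = data[pos:pos + length]
        if piece in buckets[length]:
            return buckets[length][piece], length
    return None, 0

vocab = {b"a": 10, b"ab": 11, b"abc": 12, b"b": 13}
buckets = build_len_buckets(vocab)
print(match_longest(buckets, b"abcb", 0))  # (12, 3)
```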
BPE/OzAITokenizer_BPE.cs: GPT-2 Style BPE
Algorithm: Byte Pair Encoding with GPT-2 style tokenization
Characteristics:
- Space Escape: Uses "Ġ" character for space representation
- BOS/EOS: Both set to token ID 11
- Unknown Token: Not used (-1)
- Space Prefix: Not added (false)
- BOS Addition: Enabled (true)
SPM/OzAITokenizer_SPM.cs: LLaMA Style SentencePiece
Algorithm: SentencePiece with LLaMA-style configuration
Characteristics:
- Space Escape: Uses "▁" character for space representation
- BOS Token: ID 1
- EOS Token: ID 2
- Unknown Token: ID 0
- Space Prefix: Added (true)
- BOS Addition: Enabled (true)
- EOS Addition: Disabled (false)
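The BPE and SPM defaults above can be collected side by side for comparison (the field names are illustrative; the values mirror the documented characteristics):

```python
# The documented BPE and SPM defaults, gathered into one place.
# Field names are illustrative, not the library's API.

BPE_CONFIG = {
    "space_escape": "\u0120",  # "Ġ"
    "bos_id": 11, "eos_id": 11,
    "unk_id": -1,              # unknown token not used
    "add_space_prefix": False,
    "add_bos": True,
}

SPM_CONFIG = {
    "space_escape": "\u2581",  # "▁"
    "bos_id": 1, "eos_id": 2,
    "unk_id": 0,
    "add_space_prefix": True,
    "add_bos": True, "add_eos": False,
}
```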
TokenTree/OzAITokenTree.cs: Hierarchical Token Lookup
Purpose: Provides O(n) worst-case, O(1) average-case token lookup using a trie-like structure.
Key Methods:
- Add(): Inserts tokens into the tree structure
- Get(): Retrieves the token ID for a given byte sequence
Performance: The tree structure allows for efficient longest-prefix matching, which is crucial for BPE tokenization where longer tokens should be matched before shorter ones.
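The longest-prefix matching the tree enables can be sketched as a byte-level trie (illustrative Python, not the library's implementation):

```python
# Minimal byte-level trie sketching the longest-prefix lookup that a
# token tree provides: walk the bytes, remember the last complete token.

class TrieNode:
    def __init__(self):
        self.children = {}     # byte value -> TrieNode
        self.token_id = None   # set if a token ends at this node

class TokenTrie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, text, token_id):
        node = self.root
        for b in text:
            node = node.children.setdefault(b, TrieNode())
        node.token_id = token_id

    def get_longest(self, data, pos):
        """Walk the trie from pos, keeping the longest full token seen."""
        node, best = self.root, (None, 0)
        for i in range(pos, len(data)):
            node = node.children.get(data[i])
            if node is None:
                break
            if node.token_id is not None:
                best = (node.token_id, i - pos + 1)  # longer match wins
        return best

trie = TokenTrie()
trie.add(b"a", 10); trie.add(b"ab", 11); trie.add(b"abc", 12)
print(trie.get_longest(b"abcd", 0))  # (12, 3)
```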
TokenTree/OzAITokenTree_Node.cs: Tree Node Implementation
Purpose: Implements individual nodes in the token tree with optimized child lookup.
Key Features:
- Length-Based Indexing: Maintains separate list of token lengths for optimization
- Byte Array Keys: Uses the custom OzByteArrayComparer for dictionary keys
- TryToken(): Efficient token matching with length-based optimization
TokenTree/OzAIToken.cs: Token Representation
Purpose: Data structure representing individual tokens with their metadata.
Properties:
- ID: Unique token identifier
- Text: Byte representation of the token text
- Score: Frequency score (token type flags are commented out)
ByteArrayComp/OzByteArrayComparer.cs: Byte Array Comparison
Purpose: Custom equality comparer for byte arrays used as dictionary keys in the token tree.
Features:
- Element-by-element comparison for accuracy
- Efficient hash code generation using first 4 bytes
- Null-safe comparison
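The behavior described — full element-wise equality, with a hash code derived from only the first four bytes — can be sketched as (illustrative Python; the real class is the C# OzByteArrayComparer, and this hash formula is an assumption):

```python
# Sketch of a byte-array comparer: equality checks every byte, while the
# hash combines only the first 4 bytes. The constants are illustrative.

def bytes_equal(a, b):
    """Null-safe, element-by-element comparison."""
    if a is None or b is None:
        return a is b
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))

def bytes_hash(a):
    """Combine up to the first 4 bytes into a 32-bit hash."""
    h = 17
    for b in a[:4]:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h

print(bytes_equal(b"abc", b"abc"))  # True
# Arrays sharing their first 4 bytes collide by design; equality resolves it:
print(bytes_hash(b"abcdef") == bytes_hash(b"abcdXY"))  # True
```

Hashing only a short prefix keeps dictionary insertion cheap for long tokens; collisions are resolved by the full element-wise equality check.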
OzAIVocabType.cs: Vocabulary Type Definitions
Purpose: Enumeration defining supported vocabulary types.
| Type | Value | Description | Support Status |
|---|---|---|---|
| NONE | 0 | Models without vocabulary | Not Supported |
| SPM | 1 | LLaMA tokenizer based on byte-level BPE with byte fallback | Fully Supported |
| BPE | 2 | GPT-2 tokenizer based on byte-level BPE | Fully Supported |
| WPM | 3 | BERT tokenizer based on WordPiece | Not Supported |
| UGM | 4 | T5 tokenizer based on Unigram | Not Supported |
Usage Examples
// Load tokenizer from GGUF file
if (OzAITokenizer.CreateFromFile(ggufFile, out var tokenizer, out string error))
{
    // Tokenize text
    tokenizer.GetTokens("Hello, world!", out var tokens, out var times);
}