TokenTree Tokenization for LLMs

by Gyula Rabai

This TokenTree tokenization library implements the Byte Pair Encoding (BPE) and SentencePiece (SPM) tokenization algorithms for AI language models. Both algorithms are backed by a novel structure called the TokenTree: a B-tree-like hierarchy built from the vocabulary for faster evaluation. This approach yields significant speed improvements over traditional Regular Expression (RegExp) based evaluation. The library provides efficient tokenization for GPT-2 and LLaMA-style models, with token lookup optimized by the hierarchical token tree.

TokenTree Tokenization (with logs)

TokenTree Tokenization (with breakpoints)

File Structure Overview

Core Tokenizer Files:
  • OzAITokenizer.cs: Abstract base class
  • OzAITokenizer_Tokenize.cs: Tokenization methods
  • OzAITokenizer_TokData.cs: Token data management
  • OzAITokenizer_TokAttr.cs: Token attributes
  • OzAITokenizer_SpecToks.cs: Special tokens handling
Algorithm Implementations:
  • BPE/OzAITokenizer_BPE.cs: GPT-2 style BPE
  • SPM/OzAITokenizer_SPM.cs: LLaMA style SentencePiece
Token Tree Structure:
  • TokenTree/OzAITokenTree.cs: Main tree structure
  • TokenTree/OzAITokenTree_Node.cs: Tree node implementation
  • TokenTree/OzAIToken.cs: Token representation
Utility Classes:
  • ByteArrayComp/OzByteArrayComparer.cs: Byte array comparison
  • OzAIVocabType.cs: Vocabulary type definitions
  • OzAIVocabPreType.cs: Preprocessing types

Architecture Overview

Tokenizer Architecture

┌────────────────────────────────────────────────────────┐
│                 OzAITokenizer (Abstract Base)          │
├────────────────────────────────────────────────────────┤
│  • Space escaping & unescaping                         │
│  • Factory method: CreateFromFile()                    │
│  • Token tree building & management                    │
│  • Special token handling                              │
└─────────────────┬──────────────────────────────────────┘
                  │
        ┌─────────┴─────────┐
        │                   │
┌───────▼────────┐  ┌───────▼────────┐
│ OzAITokenizer_ │  │ OzAITokenizer_ │
│      BPE       │  │      SPM       │
│  (GPT-2)       │  │   (LLaMA)      │
└────────────────┘  └────────────────┘
        │                   │
        └─────────┬─────────┘
                  │
┌─────────────────▼──────────────────────────────────────┐
│               TokenTree Structure                      │
├────────────────────────────────────────────────────────┤
│  • OzAITokenTree (Root Node)                           │
│  • OzAITokenTree_Node (Hierarchical Lookup)            │
│  • OzAIToken (Token Representation)                    │
└────────────────────────────────────────────────────────┘
    

Detailed File Documentation

OzAITokenizer.cs: Abstract Base Class

Purpose: Provides the foundation for all tokenizer implementations with common functionality.

Key Features:

  • Space Escaping: Handles space character escaping/unescaping (BPE uses "Ġ", SPM uses "▁")
  • Factory Pattern: CreateFromFile() method that determines tokenizer type from GGUF file metadata
  • Model Initialization: Abstract method ModelInit() for algorithm-specific initialization
  • Token Tree Building: buildTokTree() constructs the hierarchical lookup structure

OzAITokenizer.cs (excerpt):

public abstract partial class OzAITokenizer
{
    // Space escaping functionality
    public string SpaceEscape { get; }
    public string UnescapeSpace(string text);

    // Factory method for creating tokenizers from GGUF files
    public static bool CreateFromFile(OzGGUFFile file, out OzAITokenizer res, out string error);

    // Token tree building
    public void buildTokTree();
}
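
The space-escaping rule can be illustrated with a minimal Python sketch (the library itself is C#; these helper names are hypothetical, not the library's API):

```python
# Space markers used by the two vocabulary styles described above:
BPE_SPACE = "\u0120"  # "Ġ" (GPT-2 style BPE)
SPM_SPACE = "\u2581"  # "▁" (SentencePiece)

def escape_spaces(text: str, marker: str) -> str:
    """Replace literal spaces with the vocabulary's space marker."""
    return text.replace(" ", marker)

def unescape_spaces(text: str, marker: str) -> str:
    """Invert escape_spaces, restoring literal spaces."""
    return text.replace(marker, " ")
```

For example, escape_spaces("hello world", SPM_SPACE) yields "hello▁world", matching how SPM vocabularies store tokens.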

OzAITokenizer_Tokenize.cs: Tokenization Logic

Purpose: Contains the core tokenization methods and token-to-text conversion utilities.

Key Methods:

  • GetTokens(): Main tokenization method with BOS/EOS handling
  • Tokenize(): Abstract method for algorithm-specific tokenization
  • ByteArray2Tokens(): Converts byte arrays to token sequences
  • Byte2TokenID(): Maps individual bytes to token IDs
  • VisualizeTokens(): Debug utility for token visualization
  • GetStrings() / GetStringsRaw(): Convert token sequences back to text

Important: The tokenization process automatically handles:
  • Beginning of Sequence (BOS) token insertion
  • Space prefix handling (SPM adds space, BPE doesn't)
  • End of Sequence (EOS) token insertion
  • Unknown token fallback mechanisms
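
The four rules above can be sketched as a small wrapper around the algorithm-specific tokenizer. This is an illustrative Python sketch of the described behavior (the library is C#; the function name and parameters here are hypothetical):

```python
def get_tokens(text, tokenize, *, bos_id=None, eos_id=None, add_space_prefix=False):
    """Wrap an algorithm-specific `tokenize` callable with the BOS/EOS
    and space-prefix handling described above."""
    if add_space_prefix:      # SPM prepends a space; BPE does not
        text = " " + text
    ids = tokenize(text)      # algorithm-specific tokenization (BPE or SPM)
    if bos_id is not None:    # Beginning-of-Sequence insertion
        ids = [bos_id] + ids
    if eos_id is not None:    # End-of-Sequence insertion
        ids = ids + [eos_id]
    return ids
```

With a toy character-level tokenizer, get_tokens("ab", toy, bos_id=1, add_space_prefix=True) prepends the space and the BOS ID before the character IDs.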

OzAITokenizer_TokData.cs: Token Data Management

Purpose: Manages token vocabulary data loading and preprocessing from GGUF files.

Key Components:

  • Tokens: List of all tokens with ID, text, and score
  • Len2ID: Priority queue organizing tokens by length for efficient lookup
  • TokenTree: Hierarchical tree structure for fast token matching
  • MaxTokenLen / AvgTokenLen: Statistics for optimization
  • ByteArrayComparer: Custom equality comparer for byte arrays
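
The Len2ID idea, organizing tokens by length so longer candidates are tried first, can be sketched in Python as follows (structure and names are illustrative, not the C# implementation):

```python
from collections import defaultdict

def build_len2id(vocab):
    """Group a {token_bytes: token_id} vocabulary by token length,
    with buckets ordered longest-first so longest-match lookups
    try long tokens before short ones."""
    len2id = defaultdict(dict)
    for text, tid in vocab.items():
        len2id[len(text)][text] = tid
    # Sort buckets by length, descending (longest tokens first).
    return dict(sorted(len2id.items(), reverse=True))
```

Statistics such as a maximum token length fall out directly: the first bucket key is the longest token length in the vocabulary.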

BPE/OzAITokenizer_BPE.cs: GPT-2 Style BPE

Algorithm: Byte Pair Encoding with GPT-2 style tokenization

Characteristics:

  • Space Escape: Uses "Ġ" character for space representation
  • BOS/EOS: Both set to token ID 11
  • Unknown Token: Not used (-1)
  • Space Prefix: Not added (false)
  • BOS Addition: Enabled (true)

SPM/OzAITokenizer_SPM.cs: LLaMA Style SentencePiece

Algorithm: SentencePiece with LLaMA-style configuration

Characteristics:

  • Space Escape: Uses "▁" character for space representation
  • BOS Token: ID 1
  • EOS Token: ID 2
  • Unknown Token: ID 0
  • Space Prefix: Added (true)
  • BOS Addition: Enabled (true)
  • EOS Addition: Disabled (false)
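
The BPE and SPM settings above can be summarized side by side. A small Python sketch follows (the library is C#; the type and field names are illustrative, and since EOS addition is not stated for BPE it is assumed off here):

```python
from dataclasses import dataclass

@dataclass
class TokenizerConfig:
    """Illustrative summary of the per-algorithm settings listed above."""
    space_escape: str       # marker used for spaces in the vocabulary
    bos_id: int             # Beginning-of-Sequence token ID
    eos_id: int             # End-of-Sequence token ID
    unk_id: int             # unknown-token ID (-1 = not used)
    add_space_prefix: bool  # prepend a space before tokenizing?
    add_bos: bool           # insert BOS automatically?
    add_eos: bool           # insert EOS automatically?

BPE_CONFIG = TokenizerConfig("\u0120", 11, 11, -1, False, True, False)
SPM_CONFIG = TokenizerConfig("\u2581", 1, 2, 0, True, True, False)
```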

TokenTree/OzAITokenTree.cs: Hierarchical Token Lookup

Purpose: Provides fast token lookup using a trie-like structure: worst-case cost is linear in the length of the byte sequence being matched (O(n)), while average-case lookups are close to constant time.

Key Methods:

  • Add(): Inserts tokens into the tree structure
  • Get(): Retrieves token ID for given byte sequence

Performance: The tree structure allows for efficient longest-prefix matching, which is crucial for BPE tokenization where longer tokens should be matched before shorter ones.

TokenTree/OzAITokenTree_Node.cs: Tree Node Implementation

Purpose: Implements individual nodes in the token tree with optimized child lookup.

Key Features:

  • Length-Based Indexing: Maintains separate list of token lengths for optimization
  • Byte Array Keys: Uses custom OzByteArrayComparer for dictionary keys
  • TryToken(): Efficient token matching with length-based optimization
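
The length-based optimization can be sketched as follows: a node records which child-key lengths exist and tries them longest-first, so a TryToken-style match never probes lengths that cannot occur. (Python sketch with hypothetical names; the real node is C# and uses byte-array dictionary keys via OzByteArrayComparer.)

```python
class Node:
    def __init__(self):
        self.children = {}  # bytes key -> (token_id, child Node)
        self.lengths = []   # distinct key lengths, kept sorted descending

    def add_child(self, key: bytes, token_id: int):
        """Register a child keyed by a byte sequence, tracking its length."""
        if len(key) not in self.lengths:
            self.lengths.append(len(key))
            self.lengths.sort(reverse=True)
        self.children[key] = (token_id, Node())

    def try_token(self, data: bytes, start: int):
        """Try stored key lengths longest-first; return (token_id, length)
        for the first match, or (None, 0)."""
        for n in self.lengths:
            key = data[start:start + n]
            if key in self.children:
                return self.children[key][0], n
        return None, 0
```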

TokenTree/OzAIToken.cs: Token Representation

Purpose: Data structure representing individual tokens with their metadata.

Properties:

  • ID: Unique token identifier
  • Text: Byte representation of token text
  • Score: Frequency score (token-type flags exist in the source but are commented out)

ByteArrayComp/OzByteArrayComparer.cs: Byte Array Comparison

Purpose: Custom equality comparer for byte arrays used as dictionary keys in the token tree.

Features:

  • Element-by-element comparison for accuracy
  • Efficient hash code generation using first 4 bytes
  • Null-safe comparison
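
The comparer's two jobs, exact equality and a cheap hash built from the leading bytes, can be sketched in Python (illustrative only; the real comparer is the C# OzByteArrayComparer):

```python
def bytes_equal(a, b):
    """Null-safe, element-by-element equality for byte sequences."""
    if a is None or b is None:
        return a is b
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))

def bytes_hash(a):
    """Hash mixing only the first four bytes, as described above.
    Cheap to compute, at the cost of collisions between arrays
    that share a 4-byte prefix (equality still disambiguates)."""
    h = 17
    for b in a[:4]:
        h = h * 31 + b
    return h
```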

OzAIVocabType.cs: Vocabulary Type Definitions

Purpose: Enumeration defining supported vocabulary types.

Type   Value   Description                                                  Support Status
NONE   0       Models without vocabulary                                    Not Supported
SPM    1       LLaMA tokenizer based on byte-level BPE with byte fallback   Fully Supported
BPE    2       GPT-2 tokenizer based on byte-level BPE                      Fully Supported
WPM    3       BERT tokenizer based on WordPiece                            Not Supported
UGM    4       T5 tokenizer based on Unigram                                Not Supported

Usage Examples

// Load tokenizer from GGUF file
if (OzAITokenizer.CreateFromFile(ggufFile, out var tokenizer, out string error))
{
	// Tokenize text
	tokenizer.GetTokens("Hello, world!", out var tokens, out var times);
}
