TokenTree Tokenization for LLMs

by Gyula Rabai

This TokenTree tokenization library implements the Byte Pair Encoding (BPE) and SentencePiece (SPM) tokenization algorithms for AI language models. Both algorithms are backed by a novel structure called the TokenTree: a B-tree-like hierarchy built from the vocabulary for faster evaluation. This approach yields significant speed improvements over traditional Regular Expression (RegExp) based evaluation. The library provides efficient tokenization for GPT-2 and LLaMA-style models, with token lookup optimized by the hierarchical token tree.

TokenTree Tokenization (with logs)

TokenTree Tokenization (with breakpoints)

File Structure Overview

Core Tokenizer Files:
  • OzAITokenizer.cs: Abstract base class
  • OzAITokenizer_Tokenize.cs: Tokenization methods
  • OzAITokenizer_TokData.cs: Token data management
  • OzAITokenizer_TokAttr.cs: Token attributes
  • OzAITokenizer_SpecToks.cs: Special tokens handling
Algorithm Implementations:
  • BPE/OzAITokenizer_BPE.cs: GPT-2 style BPE
  • SPM/OzAITokenizer_SPM.cs: LLaMA style SentencePiece
Token Tree Structure:
  • TokenTree/OzAITokenTree.cs: Main tree structure
  • TokenTree/OzAITokenTree_Node.cs: Tree node implementation
  • TokenTree/OzAIToken.cs: Token representation
Utility Classes:
  • ByteArrayComp/OzByteArrayComparer.cs: Byte array comparison
  • OzAIVocabType.cs: Vocabulary type definitions
  • OzAIVocabPreType.cs: Preprocessing types

Architecture Overview

Tokenizer Architecture

┌────────────────────────────────────────────────────────┐
│                 OzAITokenizer (Abstract Base)          │
├────────────────────────────────────────────────────────┤
│  • Space escaping & unescaping                         │
│  • Factory method: CreateFromFile()                    │
│  • Token tree building & management                    │
│  • Special token handling                              │
└─────────────────┬──────────────────────────────────────┘
                  │
        ┌─────────┴─────────┐
        │                   │
┌───────▼────────┐  ┌───────▼────────┐
│ OzAITokenizer_ │  │ OzAITokenizer_ │
│      BPE       │  │      SPM       │
│  (GPT-2)       │  │   (LLaMA)      │
└────────────────┘  └────────────────┘
        │                   │
        └─────────┬─────────┘
                  │
┌─────────────────▼──────────────────────────────────────┐
│               TokenTree Structure                      │
├────────────────────────────────────────────────────────┤
│  • OzAITokenTree (Root Node)                           │
│  • OzAITokenTree_Node (Hierarchical Lookup)            │
│  • OzAIToken (Token Representation)                    │
└────────────────────────────────────────────────────────┘
    

Detailed File Documentation

OzAITokenizer.cs: Abstract Base Class

Purpose: Provides the foundation for all tokenizer implementations with common functionality.

Key Features:

  • Space Escaping: Handles space character escaping/unescaping (BPE uses "Ġ", SPM uses "▁")
  • Factory Pattern: CreateFromFile() method that determines tokenizer type from GGUF file metadata
  • Model Initialization: Abstract method ModelInit() for algorithm-specific initialization
  • Token Tree Building: buildTokTree() constructs the hierarchical lookup structure

OzAITokenizer.cs (excerpt):

public abstract partial class OzAITokenizer
{
    // Space escaping functionality
    public string SpaceEscape { get; }
    public string UnescapeSpace(string text);

    // Factory method for creating tokenizers from GGUF files
    public static bool CreateFromFile(OzGGUFFile file, out OzAITokenizer res, out string error);

    // Token tree building
    public void buildTokTree();
}
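
The space-escaping rule can be illustrated with a minimal Python sketch (the library itself is C#; these helper names are hypothetical, not the library's API):

```python
# Space markers used by the two vocabulary styles described above:
BPE_SPACE = "\u0120"  # "Ġ" (GPT-2 style BPE)
SPM_SPACE = "\u2581"  # "▁" (SentencePiece)

def escape_spaces(text: str, marker: str) -> str:
    """Replace literal spaces with the vocabulary's space marker."""
    return text.replace(" ", marker)

def unescape_spaces(text: str, marker: str) -> str:
    """Invert escape_spaces, restoring literal spaces."""
    return text.replace(marker, " ")
```

For example, escape_spaces("hello world", SPM_SPACE) yields "hello▁world", matching how SPM vocabularies store tokens.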

OzAITokenizer_Tokenize.cs: Tokenization Logic

Purpose: Contains the core tokenization methods and token-to-text conversion utilities.

Key Methods:

  • GetTokens(): Main tokenization method with BOS/EOS handling
  • Tokenize(): Abstract method for algorithm-specific tokenization
  • ByteArray2Tokens(): Converts byte arrays to token sequences
  • Byte2TokenID(): Maps individual bytes to token IDs
  • VisualizeTokens(): Debug utility for token visualization
  • GetStrings() / GetStringsRaw(): Convert token sequences back to text

Important: The tokenization process automatically handles:
  • Beginning of Sequence (BOS) token insertion
  • Space prefix handling (SPM adds space, BPE doesn't)
  • End of Sequence (EOS) token insertion
  • Unknown token fallback mechanisms
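
The four rules above can be sketched as a small wrapper around the algorithm-specific tokenizer. This is an illustrative Python sketch of the described behavior (the library is C#; the function name and parameters here are hypothetical):

```python
def get_tokens(text, tokenize, *, bos_id=None, eos_id=None, add_space_prefix=False):
    """Wrap an algorithm-specific `tokenize` callable with the BOS/EOS
    and space-prefix handling described above."""
    if add_space_prefix:      # SPM prepends a space; BPE does not
        text = " " + text
    ids = tokenize(text)      # algorithm-specific tokenization (BPE or SPM)
    if bos_id is not None:    # Beginning-of-Sequence insertion
        ids = [bos_id] + ids
    if eos_id is not None:    # End-of-Sequence insertion
        ids = ids + [eos_id]
    return ids
```

With a toy character-level tokenizer, get_tokens("ab", toy, bos_id=1, add_space_prefix=True) prepends the space and the BOS ID before the character IDs.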

OzAITokenizer_TokData.cs: Token Data Management

Purpose: Manages token vocabulary data loading and preprocessing from GGUF files.

Key Components:

  • Tokens: List of all tokens with ID, text, and score
  • Len2ID: Priority queue organizing tokens by length for efficient lookup
  • TokenTree: Hierarchical tree structure for fast token matching
  • MaxTokenLen / AvgTokenLen: Statistics for optimization
  • ByteArrayComparer: Custom equality comparer for byte arrays
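
The Len2ID idea, organizing tokens by length so longer candidates are tried first, can be sketched in Python as follows (structure and names are illustrative, not the C# implementation):

```python
from collections import defaultdict

def build_len2id(vocab):
    """Group a {token_bytes: token_id} vocabulary by token length,
    with buckets ordered longest-first so longest-match lookups
    try long tokens before short ones."""
    len2id = defaultdict(dict)
    for text, tid in vocab.items():
        len2id[len(text)][text] = tid
    # Sort buckets by length, descending (longest tokens first).
    return dict(sorted(len2id.items(), reverse=True))
```

Statistics such as a maximum token length fall out directly: the first bucket key is the longest token length in the vocabulary.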

BPE/OzAITokenizer_BPE.cs: GPT-2 Style BPE

Algorithm: Byte Pair Encoding with GPT-2 style tokenization

Characteristics:

  • Space Escape: Uses "Ġ" character for space representation
  • BOS/EOS: Both set to token ID 11
  • Unknown Token: Not used (-1)
  • Space Prefix: Not added (false)
  • BOS Addition: Enabled (true)

SPM/OzAITokenizer_SPM.cs: LLaMA Style SentencePiece

Algorithm: SentencePiece with LLaMA-style configuration

Characteristics:

  • Space Escape: Uses "▁" character for space representation
  • BOS Token: ID 1
  • EOS Token: ID 2
  • Unknown Token: ID 0
  • Space Prefix: Added (true)
  • BOS Addition: Enabled (true)
  • EOS Addition: Disabled (false)
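
The BPE and SPM settings above can be summarized side by side. A small Python sketch follows (the library is C#; the type and field names are illustrative, and since EOS addition is not stated for BPE it is assumed off here):

```python
from dataclasses import dataclass

@dataclass
class TokenizerConfig:
    """Illustrative summary of the per-algorithm settings listed above."""
    space_escape: str       # marker used for spaces in the vocabulary
    bos_id: int             # Beginning-of-Sequence token ID
    eos_id: int             # End-of-Sequence token ID
    unk_id: int             # unknown-token ID (-1 = not used)
    add_space_prefix: bool  # prepend a space before tokenizing?
    add_bos: bool           # insert BOS automatically?
    add_eos: bool           # insert EOS automatically?

BPE_CONFIG = TokenizerConfig("\u0120", 11, 11, -1, False, True, False)
SPM_CONFIG = TokenizerConfig("\u2581", 1, 2, 0, True, True, False)
```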

TokenTree/OzAITokenTree.cs: Hierarchical Token Lookup

Purpose: Provides fast token lookup using a trie-like structure: worst-case cost is linear in the length of the byte sequence being matched (O(n)), while average-case lookups are close to constant time.

Key Methods:

  • Add(): Inserts tokens into the tree structure
  • Get(): Retrieves token ID for given byte sequence

Performance: The tree structure allows for efficient longest-prefix matching, which is crucial for BPE tokenization where longer tokens should be matched before shorter ones.

TokenTree/OzAITokenTree_Node.cs: Tree Node Implementation

Purpose: Implements individual nodes in the token tree with optimized child lookup.

Key Features:

  • Length-Based Indexing: Maintains separate list of token lengths for optimization
  • Byte Array Keys: Uses custom OzByteArrayComparer for dictionary keys
  • TryToken(): Efficient token matching with length-based optimization
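
The length-based optimization can be sketched as follows: a node records which child-key lengths exist and tries them longest-first, so a TryToken-style match never probes lengths that cannot occur. (Python sketch with hypothetical names; the real node is C# and uses byte-array dictionary keys via OzByteArrayComparer.)

```python
class Node:
    def __init__(self):
        self.children = {}  # bytes key -> (token_id, child Node)
        self.lengths = []   # distinct key lengths, kept sorted descending

    def add_child(self, key: bytes, token_id: int):
        """Register a child keyed by a byte sequence, tracking its length."""
        if len(key) not in self.lengths:
            self.lengths.append(len(key))
            self.lengths.sort(reverse=True)
        self.children[key] = (token_id, Node())

    def try_token(self, data: bytes, start: int):
        """Try stored key lengths longest-first; return (token_id, length)
        for the first match, or (None, 0)."""
        for n in self.lengths:
            key = data[start:start + n]
            if key in self.children:
                return self.children[key][0], n
        return None, 0
```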

TokenTree/OzAIToken.cs: Token Representation

Purpose: Data structure representing individual tokens with their metadata.

Properties:

  • ID: Unique token identifier
  • Text: Byte representation of token text
  • Score: Frequency score (token-type flags exist in the source but are commented out)

ByteArrayComp/OzByteArrayComparer.cs: Byte Array Comparison

Purpose: Custom equality comparer for byte arrays used as dictionary keys in the token tree.

Features:

  • Element-by-element comparison for accuracy
  • Efficient hash code generation using first 4 bytes
  • Null-safe comparison
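
The comparer's two jobs, exact equality and a cheap hash built from the leading bytes, can be sketched in Python (illustrative only; the real comparer is the C# OzByteArrayComparer):

```python
def bytes_equal(a, b):
    """Null-safe, element-by-element equality for byte sequences."""
    if a is None or b is None:
        return a is b
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))

def bytes_hash(a):
    """Hash mixing only the first four bytes, as described above.
    Cheap to compute, at the cost of collisions between arrays
    that share a 4-byte prefix (equality still disambiguates)."""
    h = 17
    for b in a[:4]:
        h = h * 31 + b
    return h
```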

OzAIVocabType.cs: Vocabulary Type Definitions

Purpose: Enumeration defining supported vocabulary types.

Type   Value   Description                                                  Support Status
NONE   0       Models without vocabulary                                    Not Supported
SPM    1       LLaMA tokenizer based on byte-level BPE with byte fallback   Fully Supported
BPE    2       GPT-2 tokenizer based on byte-level BPE                      Fully Supported
WPM    3       BERT tokenizer based on WordPiece                            Not Supported
UGM    4       T5 tokenizer based on Unigram                                Not Supported

Usage Examples

// Load tokenizer from GGUF file
if (OzAITokenizer.CreateFromFile(ggufFile, out var tokenizer, out string error))
{
	// Tokenize text
	tokenizer.GetTokens("Hello, world!", out var tokens, out var times);
}
