Gyula Rabai

Efficient AI Inference Engine for Open-Weight LLMs in C#/.NET

A fully functional LLM inference engine implemented from scratch in C#, capable of loading GGUF models, applying chat templates, tokenizing efficiently, and generating text token-by-token with a modular execution layer designed for multi-device and multi-memory-tier execution.

C# / .NET · GGUF parser · Tokenizer (TokenTree) · Llama 3.2–style Transformer · Modular execution manager · CPU debug mode (token-by-token) · Multi-device & memory-tier design

TL;DR

I designed and implemented a complete LLM inference engine from scratch in C#. The system can load open-weight models stored in GGUF, apply chat templates, tokenize input text using a high-performance TokenTree tokenizer, execute a Llama 3.2–style Transformer forward pass, and decode output tokens into text.

The engine uses a flexible execution architecture (an Execution Manager + pluggable executors) intended to distribute compute and memory across hardware units (CPUs / GPUs) and memory tiers (system RAM / caches / VRAM / SSD offload). Stage 1 (correct CPU inference and architecture abstraction) is complete; Stage 2 (hardware acceleration and expanded architecture support) is in progress.

Stage 1

Working CPU inference + architecture abstraction (Completed)

Stage 2

Hardware acceleration + wider support (In progress)

Stage 3

Training + architecture experimentation (Planned)

Demos

Figure 1 — User interface (Debug mode)

Screenshot: UI showing token-by-token generation where each new token is appended to the context and used to predict the next token.

(Screenshot to be added.)

Video — Token-by-token inference on CPU

Demonstrates the autoregressive loop step-by-step: tokenize → forward pass → select next token → append → repeat.

(Video to be added.)

Video — Token-by-token inference in the Debugger

Shows how the engine operates internally using a debugger. Tokens are visible as integer IDs. In the Llama 3.2 1B configuration used during development, the model contains 16 Transformer layers.

(Debugger video to be added.)

What the engine does

  • Loads open-weight LLMs built on a Llama 3.2–style Transformer architecture.
  • Parses GGUF (metadata + tensor headers + weight storage) and loads tensors into memory.
  • Applies chat templates to align prompts with the model’s training format.
  • Tokenizes efficiently using a trie-like TokenTree optimized for common boundaries and fast lookup.
  • Runs inference end-to-end (Embedding → Norm → Attention + RoPE → FFN/GLU → Unembedding).
  • Decodes output text by mapping token IDs back to strings.
  • Provides an execution abstraction designed to distribute compute and memory across devices and memory tiers.

Why this contribution is unique

Most C# “LLM” solutions are wrappers around Python (PyTorch) or C++ engines (e.g., llama.cpp). While wrappers are useful, they constrain developers to the features exposed by the wrapped library, and make it difficult to modify architecture, execution strategy, precision behavior, or memory layout across the stack.

This project is a fully functional inference engine implemented in C#. It provides a unified codebase where model parsing, tokenization, execution, memory management, and architecture components are all accessible and modifiable in one place.

High-level architecture

The engine follows the architecture used by Llama 3.2–style Transformer models:

  • Embedding maps token IDs → vectors.
  • RMSNorm normalizes the residual stream.
  • Attention (Grouped multi-head) computes context mixing (with RoPE positional encoding).
  • Residual connections preserve and accumulate information.
  • GLU / FFN performs non-linear expansion and contraction.
  • Unembedding projects the final hidden vector to vocabulary logits.

(Figure 2 — Architecture diagram to be added.)

End-to-end inference pipeline

Inference is autoregressive: at each step the model produces a distribution over the next token given the full context so far. The engine currently supports greedy decoding (argmax), with sampling methods planned.

  1. Input processing: apply a chat template if needed.
  2. Tokenization: convert input text → token IDs.
  3. Embedding: map token IDs → embedding vectors.
  4. Transformer forward pass: run all layers (Norm → Attention(+RoPE) → residual → Norm → GLU → residual).
  5. Unembedding: compute vocabulary logits from the final hidden vector.
  6. Decoding: choose the next token (currently greedy argmax).
  7. Detokenization: token ID → string.
  8. Append & repeat: append the token to the context and continue generating.

Greedy decoding detail: selecting argmax(logits) is equivalent to selecting argmax(softmax(logits)) because softmax is monotonic, so softmax is unnecessary unless sampling or explicit probabilities are needed.
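
As a concrete illustration of step 6, greedy selection reduces to a plain argmax over the logits. The helper below is an illustrative sketch, not the engine's actual API:

// Greedy next-token selection (illustrative). Softmax is monotonic, so taking
// the max raw logit picks the same token as taking the max probability.
static int SelectNextTokenGreedy(float[] logits)
{
    int best = 0;
    for (int i = 1; i < logits.Length; i++)
        if (logits[i] > logits[best])
            best = i;
    return best;
}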

Implementation highlights

Core inference signature

The heart of the system is an inference entry point that takes a prompt and returns generated output. Below is the current structure (as implemented in my engine):

OzAIModel_Ozeki _model;

string Infer(string text)
{
    OzLogger.Log(OzAILogger.LogSourceGy, LogLevel.Information,
        "Infer called with text: " + text);

    if (_model == null)
    {
        _model = new OzAIModel_Ozeki();
        _model.modelPath = GGUF.FileName;
    }

    try
    {
        _model.Start(out var error);
        if (error != null)
            return error;

        // Tokenization
        if (!_model.Tokenizer.GetTokens(text, out var inputTokens, out var times, out error))
            return error;

        // Inference
        if (!_model.infer(inputTokens, out var outputTokens, out var errorInfer))
            return errorInfer;

        var res = _model.Tokenizer.GetStringsRaw(outputTokens);
        return res;
    }
    catch (Exception ex)
    {
        return "Error: " + ex.Message;
    }
}

In production usage, this is typically wrapped in a generation loop to repeatedly call inference (or a batched method) until stopping criteria are met.
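
A minimal version of such a loop, assuming hypothetical forward and detokenize delegates and reusing the SelectNextTokenGreedy sketch above (none of these names are the engine's real API), could look like this:

using System;
using System.Collections.Generic;
using System.Text;

// Illustrative greedy generation loop: forward pass, select, append, repeat
// until an end-of-sequence token or a length limit is reached.
static string Generate(
    List<int> context,                              // already-tokenized prompt
    Func<IReadOnlyList<int>, float[]> forward,      // full forward pass -> next-token logits
    Func<int, string> detokenize,                   // token ID -> text fragment
    int eosTokenId,
    int maxNewTokens)
{
    var output = new StringBuilder();
    for (int step = 0; step < maxNewTokens; step++)
    {
        float[] logits = forward(context);
        int next = SelectNextTokenGreedy(logits);   // greedy argmax (see earlier sketch)
        if (next == eosTokenId)                     // stopping criterion
            break;
        output.Append(detokenize(next));
        context.Add(next);                          // append & continue generating
    }
    return output.ToString();
}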

Componentized architecture

Each architecture component (Embedding, RMSNorm, RoPE, Attention, GLU, Unembedding) is implemented as a modular OzAIArchComp with its own:

  • IO memory contract (what it reads/writes)
  • Instance parameters (IParams) (weights, references, executor binding)
  • Hyperparameters (HParams) (epsilon, dimensions, scaling, etc.)
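
As a rough illustration only (this is not the engine's actual OzAIArchComp definition; every type and member name below is an assumption), such a component contract could be shaped as follows:

// Hypothetical component contract; names are illustrative, not the real API.
public interface IMemoryNode                      // a buffer a component reads or writes
{
    ReadOnlySpan<float> Read();
    Span<float> Write();
}

public abstract class ArchComponentSketch
{
    public IMemoryNode Input  { get; init; }      // IO contract: what it reads
    public IMemoryNode Output { get; init; }      // IO contract: what it writes
    public object InstanceParams { get; init; }   // weights, tensor references, executor binding
    public object HyperParams { get; init; }      // epsilon, dimensions, scaling, etc.

    public abstract void Forward();               // runs the component via its bound executor
}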

Numerical precision strategy

Certain operations require float32 for numerical stability and correct behavior (e.g., reciprocal sqrt in RMSNorm, trig functions in RoPE, exp paths in activations). The engine handles this by cloning inputs into float32 temporary buffers where required, and converting outputs back to the desired dtype.
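
A small sketch of the pattern, using .NET's Half type for float16 and an exp-based activation as the example (illustrative code, not the engine's kernels):

// Widen to float32, compute the numerically sensitive part, narrow back.
static void SiluInPlace(Half[] values)
{
    for (int i = 0; i < values.Length; i++)
    {
        float x = (float)values[i];              // clone into float32
        float y = x / (1f + MathF.Exp(-x));      // silu(x) = x * sigmoid(x)
        values[i] = (Half)y;                     // convert back to the model dtype
    }
}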

GGUF parsing

The GGUF loader reads model metadata and tensor headers, then loads tensor data into memory. At a high level:

  • Header: magic number (GGUF), file version, tensor count, metadata count
  • Metadata entries: name (string), type enum (uint32), value (scalars/arrays)
  • Tensor headers: name, dimension count, dimensions, dtype enum, file offset

The parser is designed to read architecture metadata (e.g., model family, embedding length, context length, RoPE parameters) in order to configure the inference graph correctly.
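
The fixed part of the header can be read with a plain BinaryReader; the sketch below follows the public GGUF layout (little-endian fields) and omits metadata and tensor-header parsing:

using System;
using System.IO;
using System.Text;

// Reads the GGUF magic, version, and the tensor/metadata counts.
static void ReadGgufHeader(string path)
{
    using var reader = new BinaryReader(File.OpenRead(path));

    string magic = Encoding.ASCII.GetString(reader.ReadBytes(4));   // "GGUF"
    if (magic != "GGUF")
        throw new InvalidDataException("Not a GGUF file.");

    uint version        = reader.ReadUInt32();   // file format version
    ulong tensorCount   = reader.ReadUInt64();   // number of tensor headers (uint64 since GGUF v2)
    ulong metadataCount = reader.ReadUInt64();   // number of metadata key/value pairs

    Console.WriteLine($"GGUF v{version}: {tensorCount} tensors, {metadataCount} metadata entries");
    // Metadata entries, tensor headers, and the aligned tensor data follow.
}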

TokenTree tokenization

Tokenization is often implemented with BPE merge rules, but I designed an efficient alternative that performs fast traversal over a token trie (“TokenTree”). The approach is guided by three practical assumptions:

  • Most token boundaries align with natural boundaries (spaces, punctuation).
  • Longer tokens generally occur less frequently than shorter tokens.
  • Minor tokenization errors are often tolerable for generating usable text.

TokenTree stores vocabulary tokens in a continuation structure (e.g., “N” → “Ne” → “New”) so that as we iterate over characters we can efficiently confirm whether a longer token is possible, or terminate when the path no longer matches. Branch decisions are accelerated using a hash-based lookup.
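
The core idea can be sketched as a greedy longest-match walk over a trie with hash-based child lookup (illustrative code; a real tokenizer also needs byte/unknown-token fallback and the boundary heuristics described above, which are omitted here):

using System.Collections.Generic;

// Trie node: a token may end here (TokenId >= 0) and may continue via Children.
sealed class TokenTreeNode
{
    public int TokenId = -1;
    public Dictionary<char, TokenTreeNode> Children = new();
}

// Greedy longest-match tokenization: keep walking while a longer token is
// still possible, emit the longest match found, then restart after it.
static List<int> Tokenize(TokenTreeNode root, string text)
{
    var tokens = new List<int>();
    int pos = 0;
    while (pos < text.Length)
    {
        var node = root;
        int bestId = -1, bestLen = 1;

        for (int i = pos; i < text.Length; i++)
        {
            if (!node.Children.TryGetValue(text[i], out var next))
                break;                                   // path no longer matches
            node = next;
            if (node.TokenId >= 0) { bestId = node.TokenId; bestLen = i - pos + 1; }
        }

        if (bestId >= 0)
            tokens.Add(bestId);                          // unmatched characters are skipped here
        pos += bestLen;
    }
    return tokens;
}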


Execution Manager

Primitive operations (vector ops, matrix multiplication, RMS, RoPE, Hadamard products, dot products, etc.) are delegated to an Execution Manager. The manager binds operations to specific executors based on:

  • Device: CPU, GPU, and future accelerators (TPU/NPU)
  • Data type: float16, float32, and others as needed
  • Memory tier: RAM/VRAM/SSD offload (design target for Stage 2)

This abstraction is intended to make the engine adaptable: adding a new executor (e.g., AVX2 CPU kernels, GPU compute, or mixed-precision paths) upgrades performance without rewriting the model architecture code.
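
A minimal sketch of this binding idea (type and member names are assumptions, not the engine's API):

using System;
using System.Collections.Generic;

enum Device { Cpu, Gpu }
enum DType  { Float16, Float32 }

// A small subset of the primitive operations components delegate.
interface IExecutor
{
    void MatVec(ReadOnlySpan<float> matrix, ReadOnlySpan<float> vector, Span<float> result, int rows, int cols);
    void Hadamard(ReadOnlySpan<float> a, ReadOnlySpan<float> b, Span<float> result);
}

// Resolves executors by (device, dtype); registering a new executor (AVX2,
// GPU, mixed precision) upgrades performance without touching the model code.
sealed class ExecutionManager
{
    private readonly Dictionary<(Device, DType), IExecutor> _executors = new();

    public void Register(Device device, DType dtype, IExecutor executor)
        => _executors[(device, dtype)] = executor;

    public IExecutor Resolve(Device device, DType dtype)
        => _executors[(device, dtype)];
}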

Architecture components (overview)

Embedding

Converts token IDs to vectors by indexing a stored embedding vector list and cloning into the output memory node. This avoids one-hot encoding and reduces wasted compute.
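
In sketch form (illustrative names), the lookup is just an indexed row copy:

using System;

// Copies the embedding row for a token ID into the output buffer; no one-hot
// vector or matrix multiplication is involved.
static void EmbedToken(float[][] embeddingTable, int tokenId, float[] output)
{
    Array.Copy(embeddingTable[tokenId], output, output.Length);
}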

RMSNorm

Computes root-mean-square normalization on vectors with an epsilon for stability, then applies a learned gain vector. The operation is performed in float32 to avoid precision issues with reciprocal sqrt.
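
A minimal float32 sketch of the operation (illustrative, not the engine's kernel):

// y[i] = x[i] / sqrt(mean(x^2) + eps) * gain[i]
static void RmsNorm(ReadOnlySpan<float> x, ReadOnlySpan<float> gain, Span<float> y, float eps)
{
    float sumSquares = 0f;
    for (int i = 0; i < x.Length; i++)
        sumSquares += x[i] * x[i];

    // Epsilon keeps the reciprocal sqrt finite for near-zero activations.
    float invRms = 1f / MathF.Sqrt(sumSquares / x.Length + eps);

    for (int i = 0; i < x.Length; i++)
        y[i] = x[i] * invRms * gain[i];
}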

Attention (Grouped multi-head) + RoPE

Implements grouped multi-head attention with RoPE positional encoding applied to queries and keys. RoPE is implemented as a separate component to support multiple variants (the literature contains multiple valid implementations and scaling strategies).
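
One common RoPE variant rotates consecutive dimension pairs of each query/key head by a position-dependent angle; the sketch below shows only that variant (illustrative code, not the engine's kernel):

// Rotates dimension pairs (2i, 2i+1) of a single head vector by position * base^(-2i/dim).
static void ApplyRope(Span<float> headVector, int position, float thetaBase = 10000f)
{
    int dim = headVector.Length;
    for (int i = 0; i < dim; i += 2)
    {
        float freq  = MathF.Pow(thetaBase, -(float)i / dim);   // lower-indexed pairs rotate faster
        float angle = position * freq;
        float cos = MathF.Cos(angle), sin = MathF.Sin(angle);

        float x0 = headVector[i], x1 = headVector[i + 1];
        headVector[i]     = x0 * cos - x1 * sin;               // 2-D rotation of the pair
        headVector[i + 1] = x0 * sin + x1 * cos;
    }
}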

GLU / FFN

Implements a gated feed-forward block using two projections (top + gate), an activation on the gate path, and a Hadamard product to apply gating, followed by a projection back to model dimension. Intermediate buffers are reused to reduce allocations.
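
A compact float32 sketch of the block, using SiLU as the gate activation (the usual choice in Llama-family models); the code is illustrative, while the engine's version reuses buffers and delegates the matrix work to executors:

// out = W_down * ( silu(W_gate * x) ⊙ (W_up * x) ), a SwiGLU-style gated FFN.
static void GatedFfn(
    ReadOnlySpan<float> x,
    float[,] wGate, float[,] wUp, float[,] wDown,
    float[] gateBuf, float[] upBuf,                 // reusable intermediate buffers
    float[] output)
{
    MatVec(wGate, x, gateBuf);                      // gate projection
    MatVec(wUp,   x, upBuf);                        // top/up projection

    for (int i = 0; i < gateBuf.Length; i++)
    {
        float g = gateBuf[i];
        gateBuf[i] = g / (1f + MathF.Exp(-g)) * upBuf[i];   // silu(gate) ⊙ up (Hadamard product)
    }

    MatVec(wDown, gateBuf, output);                 // project back to model dimension
}

// Plain matrix-vector product: y = W * x, with W stored as [rows, cols].
static void MatVec(float[,] w, ReadOnlySpan<float> x, float[] y)
{
    for (int r = 0; r < w.GetLength(0); r++)
    {
        float sum = 0f;
        for (int c = 0; c < w.GetLength(1); c++)
            sum += w[r, c] * x[c];
        y[r] = sum;
    }
}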

Unembedding

Computes logits for the next token by dotting the final hidden vector against the vocabulary projection. The engine currently performs greedy decoding by selecting the max logit. Sampling strategies (top-k, top-p, temperature) are planned.
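
In sketch form, the logits are dot products of the final hidden vector against each vocabulary row, and greedy decoding keeps only the running maximum (illustrative code):

// Returns the token ID with the highest logit; logits[v] = dot(hidden, vocabProjection[v]).
static int NextTokenGreedy(float[] hidden, float[][] vocabProjection)
{
    int best = 0;
    float bestLogit = float.NegativeInfinity;

    for (int v = 0; v < vocabProjection.Length; v++)
    {
        float logit = 0f;
        for (int i = 0; i < hidden.Length; i++)
            logit += hidden[i] * vocabProjection[v][i];

        if (logit > bestLogit) { bestLogit = logit; best = v; }
    }
    return best;
}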

Challenges & difficulties

Working “blindfolded”

A major challenge was that end-to-end feedback only appears once every part of the pipeline is correct. You can implement dozens of components and still have no meaningful output until everything is connected.

Debugging numerical issues

Small numerical details matter. Missing an epsilon in RMSNorm or performing unstable dtype conversions can lead to NaNs or subtle drift that completely changes generated tokens.

Scale and memory constraints

LLMs are large enough to stress “normal” programming assumptions: multi-gigabyte tensors, large object heaps, and allocator/GC behavior become a core engineering concern.

Lack of C# reference implementations

Most resources, components, and “known good” implementations are in Python or C/C++. Implementing everything from scratch required translating tensor operations (often described as einsum) into explicit vector/matrix operations and ensuring the shapes and semantics match.

Key findings

  • Many common implementations are not optimized for modern LLM workloads; better data models and algorithms can significantly improve performance.
  • Tokenization is a serious performance hotspot; TokenTree reduces overhead by using fast continuation traversal.
  • Hardware abstraction matters: designing for multiple devices and memory tiers early enables future acceleration.
  • The field contains significant “noise”: results can be difficult to reproduce, and explanations are often incomplete.

Roadmap

Stage 2 (In progress)

  • GPU / NPU execution paths via dedicated executors
  • CPU acceleration via AVX2 / AVX-512 kernels
  • Improved device-aware memory model (RAM/VRAM/SSD offload)
  • Expanded support for additional architectures

Stage 3 (Planned)

  • Training support and experimentation with architectural optimizations
  • Exploration of “dynamic-weight” or meta-model approaches to improve reasoning capabilities

Decoding roadmap: add temperature, top-k, and top-p sampling, plus repetition penalties and stopping criteria.

Resources

Lead-up project: Neural Network Simulator

Before building this engine, I wrote a neural network simulator implementing backpropagation for a small MLP, including real-time visualization of weight updates during stochastic gradient descent:
https://gyularabai.com/p_6410-neural-network-simulator.html

Lectures & presentations

https://gyularabai.com/p_9109-presentations.html

YouTube lecture series

https://gyularabai.com/p_8677-ai-lectures.html

Book project

https://gyularabai.com/p_8866-create-a-large-language-model-hands-on-guide.html
