Efficient AI Inference Engine for Open-Weight LLMs in C#/.NET
A fully functional LLM inference engine implemented from scratch in C#, capable of loading GGUF models, applying chat templates, tokenizing efficiently, and generating text token-by-token with a modular execution layer designed for multi-device and multi-memory-tier execution.
TL;DR
I designed and implemented a complete LLM inference engine from scratch in C#. The system can load open-weight models stored in GGUF, apply chat templates, tokenize input text using a high-performance TokenTree tokenizer, execute a Llama 3.2–style Transformer forward pass, and decode output tokens into text.
The engine uses a flexible execution architecture (an Execution Manager + pluggable executors) intended to distribute compute and memory across hardware units (CPUs / GPUs) and memory tiers (system RAM / caches / VRAM / SSD offload). Stage 1 (correct CPU inference and architecture abstraction) is complete; Stage 2 (hardware acceleration and expanded architecture support) is in progress.
- Stage 1: Working CPU inference + architecture abstraction (Completed)
- Stage 2: Hardware acceleration + wider support (In progress)
- Stage 3: Training + architecture experimentation (Planned)
Demos
Figure 1 — User interface (Debug mode)
Screenshot: UI showing token-by-token generation where each new token is appended to the context and used to predict the next token.
(Screenshot placeholder.)
Video — Token-by-token inference on CPU
Demonstrates the autoregressive loop step-by-step: tokenize → forward pass → select next token → append → repeat.
(Video placeholder.)
Video — Token-by-token inference in the Debugger
Shows how the engine operates internally using a debugger. Tokens are visible as integer IDs. In the Llama 3.2 1B configuration used during development, the model contains 16 Transformer layers.
(Debugger video placeholder.)
What the engine does
- Loads open-weight LLMs implemented in a Llama 3.2–style Transformer architecture.
- Parses GGUF (metadata + tensor headers + weight storage) and loads tensors into memory.
- Applies chat templates to align prompts with the model’s training format.
- Tokenizes efficiently using a trie-like TokenTree optimized for common boundaries and fast lookup.
- Runs inference end-to-end (Embedding → Norm → Attention + RoPE → FFN/GLU → Unembedding).
- Decodes output text by mapping token IDs back to strings.
- Provides an execution abstraction designed to distribute compute and memory across devices and memory tiers.
Why this contribution is unique
Most C# “LLM” solutions are wrappers around Python (PyTorch) or C++ engines (e.g., llama.cpp). While wrappers are useful, they constrain developers to the features exposed by the wrapped library, and make it difficult to modify architecture, execution strategy, precision behavior, or memory layout across the stack.
This project is a fully functional inference engine implemented in C#. It provides a unified codebase where model parsing, tokenization, execution, memory management, and architecture components are all accessible and modifiable in one place.
High-level architecture
The engine follows the architecture used by Llama 3.2–style Transformer models:
- Embedding maps token IDs → vectors.
- RMSNorm normalizes the residual stream.
- Attention (Grouped multi-head) computes context mixing (with RoPE positional encoding).
- Residual connections preserve and accumulate information.
- GLU / FFN performs non-linear expansion and contraction.
- Unembedding projects the final hidden vector to vocabulary logits.
(Insert your “Figure 2 — Architecture” diagram here.)
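The data flow through a single layer can be summarized in a short sketch. This is illustrative only: the component functions are passed in as plain delegates here, whereas in the engine they are OzAIArchComp instances bound to executors.

using System;

// Illustrative sketch of the residual-stream data flow in one Transformer layer:
// norm -> attention (+RoPE) -> residual add, then norm -> GLU -> residual add.
static class LayerFlow
{
    public static float[] ForwardLayer(
        float[] hidden,
        Func<float[], float[]> attnNorm,
        Func<float[], float[]> attention,   // grouped multi-head attention + RoPE
        Func<float[], float[]> ffnNorm,
        Func<float[], float[]> glu)
    {
        // Attention block with residual connection.
        float[] a = attention(attnNorm(hidden));
        for (int i = 0; i < hidden.Length; i++) hidden[i] += a[i];

        // Feed-forward (GLU) block with residual connection.
        float[] f = glu(ffnNorm(hidden));
        for (int i = 0; i < hidden.Length; i++) hidden[i] += f[i];

        return hidden;
    }
}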
End-to-end inference pipeline
Inference is autoregressive: at each step the model produces a distribution over the next token given the full context so far. The engine currently supports greedy decoding (argmax), with sampling methods planned.
- Input processing: apply a chat template if needed.
- Tokenization: convert input text → token IDs.
- Embedding: map token IDs → embedding vectors.
- Transformer forward pass: run all layers (Norm → Attention(+RoPE) → residual → Norm → GLU → residual).
- Unembedding: compute vocabulary logits from the final hidden vector.
- Decoding: choose the next token (currently greedy argmax).
- Detokenization: token ID → string.
- Append & repeat: append the token to the context and continue generating.
Note: argmax(logits) selects the same token as argmax(softmax(logits)) because softmax is strictly monotonic, so the softmax step can be skipped unless sampling or explicit probabilities are needed.
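Putting the steps together, the loop looks roughly like the sketch below. The forward callback and the eosTokenId/maxNewTokens parameters are placeholders standing in for the engine's actual model components and stopping criteria.

using System;
using System.Collections.Generic;

// Sketch of the autoregressive loop above. The forward callback stands in for
// the engine's full Transformer pass (context tokens -> next-token logits).
static class GreedyDecoder
{
    public static List<int> Generate(
        List<int> promptTokens,
        Func<IReadOnlyList<int>, float[]> forward,
        int eosTokenId,
        int maxNewTokens)
    {
        var context = new List<int>(promptTokens);
        for (int step = 0; step < maxNewTokens; step++)
        {
            float[] logits = forward(context);

            // Greedy decoding: the index of the largest logit is the next token.
            int next = 0;
            for (int i = 1; i < logits.Length; i++)
                if (logits[i] > logits[next]) next = i;

            if (next == eosTokenId) break;   // stopping criterion
            context.Add(next);               // append to the context and repeat
        }
        return context;
    }
}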
Implementation highlights
Core inference signature
The heart of the system is an inference entry point that takes a prompt and returns generated output. Below is the current structure (as implemented in my engine):
OzAIModel_Ozeki _model;

string Infer(string text)
{
    OzLogger.Log(OzAILogger.LogSourceGy, LogLevel.Information,
        "Infer called with text: " + text);

    // Lazily create the model the first time inference is requested.
    if (_model == null)
    {
        _model = new OzAIModel_Ozeki();
        _model.modelPath = GGUF.FileName;
    }

    try
    {
        _model.Start(out var error);
        if (error != null)
            return error;

        // Tokenization: text -> token IDs.
        if (!_model.Tokenizer.GetTokens(text, out var inputTokens, out var times, out error))
            return error;

        // Inference: input token IDs -> output token IDs.
        if (!_model.infer(inputTokens, out var outputTokens, out var errorInfer))
            return errorInfer;

        // Detokenization: output token IDs -> text.
        var res = _model.Tokenizer.GetStringsRaw(outputTokens);
        return res;
    }
    catch (Exception ex)
    {
        return "Error: " + ex.Message;
    }
}
In production usage, this is typically wrapped in a generation loop to repeatedly call inference (or a batched method) until stopping criteria are met.
Componentized architecture
Each architecture component (Embedding, RMSNorm, RoPE, Attention, GLU, Unembedding) is implemented as a modular OzAIArchComp with its own:
- IO memory contract (what it reads/writes)
- Instance parameters (IParams) (weights, references, executor binding)
- Hyperparameters (HParams) (epsilon, dimensions, scaling, etc.)
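As a rough illustration of that contract (the interface name and members below are hypothetical, not the engine's actual OzAIArchComp API):

using System.Collections.Generic;

// Hypothetical shape of an architecture component, mirroring the contract
// described above: an IO memory contract, instance parameters, and a forward step.
public interface IArchComponent
{
    // IO memory contract: which named buffers the component reads and writes.
    string[] Inputs { get; }
    string[] Outputs { get; }

    // Instance parameters: weights, tensor references, executor binding.
    void Bind(IDictionary<string, float[]> weights, object executor);

    // Forward step over the shared memory nodes; hyperparameters (epsilon,
    // dimensions, scaling, ...) are supplied at construction time.
    void Forward(IDictionary<string, float[]> memory);
}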
Numerical precision strategy
Certain operations require float32 for numerical stability and correct behavior (e.g., reciprocal sqrt in RMSNorm, trig functions in RoPE, exp paths in activations). The engine handles this by cloning inputs into float32 temporary buffers where required, and converting outputs back to the desired dtype.
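A minimal sketch of this pattern, assuming the float16 data is held in System.Half buffers (.NET 5+):

using System;

// Sketch of the precision strategy: clone a float16 buffer into float32,
// run the numerically sensitive operation, then convert the result back.
static class PrecisionHelper
{
    public static Half[] WithFloat32(Half[] input, Func<float[], float[]> op)
    {
        var f32 = new float[input.Length];
        for (int i = 0; i < input.Length; i++) f32[i] = (float)input[i];

        float[] result = op(f32);   // e.g. RMSNorm, RoPE, exp-based activations

        var back = new Half[result.Length];
        for (int i = 0; i < result.Length; i++) back[i] = (Half)result[i];
        return back;
    }
}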
GGUF parsing
The GGUF loader reads model metadata and tensor headers, then loads tensor data into memory. At a high level:
- Header: magic number ("GGUF"), file version, tensor count, metadata count
- Metadata entries: name (string), type enum (uint32), value (scalars/arrays)
- Tensor headers: name, dimension count, dimensions, dtype enum, file offset
The parser is designed to read architecture metadata (e.g., model family, embedding length, context length, RoPE parameters) in order to configure the inference graph correctly.
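As an illustration, reading the header fields listed above might look like the sketch below (assuming the current little-endian GGUF layout with 64-bit counts; the actual loader continues with the metadata entries and tensor headers).

using System;
using System.IO;

// Minimal sketch of reading the GGUF header fields listed above.
static class GgufHeader
{
    public static (uint Version, ulong TensorCount, ulong MetadataCount) Read(string path)
    {
        using var reader = new BinaryReader(File.OpenRead(path));

        uint magic = reader.ReadUInt32();            // ASCII "GGUF" = 0x46554747 little-endian
        if (magic != 0x46554747)
            throw new InvalidDataException("Not a GGUF file.");

        uint version = reader.ReadUInt32();
        ulong tensorCount = reader.ReadUInt64();
        ulong metadataCount = reader.ReadUInt64();
        return (version, tensorCount, metadataCount);
    }
}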
TokenTree tokenization
Tokenization is often implemented with BPE merge rules, but I designed an efficient alternative that performs fast traversal over a token trie (“TokenTree”). The approach is guided by three practical assumptions:
- Most token boundaries align with natural boundaries (spaces, punctuation).
- Longer tokens generally occur less frequently than shorter tokens.
- Minor tokenization errors are often tolerable for generating usable text.
TokenTree stores vocabulary tokens in a continuation structure (e.g., “N” → “Ne” → “New”) so that as we iterate over characters we can efficiently confirm whether a longer token is possible, or terminate when the path no longer matches. Branch decisions are accelerated using a hash-based lookup.
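A simplified sketch of the idea (greedy longest-match over a character-keyed trie; the real TokenTree adds the boundary heuristics and hash-accelerated branch selection described above):

using System.Collections.Generic;

// Trie node: a hash-based child lookup plus the token ID that ends here, if any.
class TokenTreeNode
{
    public int TokenId = -1;                                   // -1: no vocabulary token ends here
    public Dictionary<char, TokenTreeNode> Children = new();
}

static class TokenTree
{
    public static void Insert(TokenTreeNode root, string token, int id)
    {
        var node = root;
        foreach (char c in token)
        {
            if (!node.Children.TryGetValue(c, out var child))
                node.Children[c] = child = new TokenTreeNode();
            node = child;
        }
        node.TokenId = id;
    }

    public static List<int> Tokenize(string text, TokenTreeNode root)
    {
        var tokens = new List<int>();
        int pos = 0;
        while (pos < text.Length)
        {
            var node = root;
            int bestId = -1, bestLen = 0;

            // Walk as far as the trie allows, remembering the longest valid token.
            for (int i = pos; i < text.Length; i++)
            {
                if (!node.Children.TryGetValue(text[i], out var child)) break;
                node = child;
                if (node.TokenId >= 0) { bestId = node.TokenId; bestLen = i - pos + 1; }
            }

            if (bestId < 0) { pos++; continue; }               // unknown character: skip in this sketch
            tokens.Add(bestId);
            pos += bestLen;
        }
        return tokens;
    }
}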
Execution Manager
Primitive operations (vector ops, matrix multiplication, RMS, RoPE, Hadamard products, dot products, etc.) are delegated to an Execution Manager. The manager binds operations to specific executors based on:
- Device: CPU, GPU, and future accelerators (TPU/NPU)
- Data type: float16, float32, and others as needed
- Memory tier: RAM/VRAM/SSD offload (design target for Stage 2)
This abstraction is intended to make the engine adaptable: adding a new executor (e.g., AVX2 CPU kernels, GPU compute, or mixed-precision paths) upgrades performance without rewriting the model architecture code.
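A minimal sketch of the registration/resolution idea (the enum, interface, and method names are illustrative, not the engine's actual types):

using System;
using System.Collections.Generic;

enum Device { Cpu, Gpu }
enum DType { Float16, Float32 }

// Executors expose primitive operations; each implementation targets a device/dtype pair.
interface IExecutor
{
    float[] MatMul(float[,] matrix, float[] vector);
    float Dot(float[] a, float[] b);
}

class ExecutionManager
{
    private readonly Dictionary<(Device, DType), IExecutor> _executors = new();

    public void Register(Device device, DType dtype, IExecutor executor)
        => _executors[(device, dtype)] = executor;

    public IExecutor Resolve(Device device, DType dtype)
        => _executors.TryGetValue((device, dtype), out var ex)
            ? ex
            : throw new InvalidOperationException($"No executor registered for {device}/{dtype}");
}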
Architecture components (overview)
Embedding
Converts token IDs to vectors by indexing a stored embedding vector list and cloning into the output memory node. This avoids one-hot encoding and reduces wasted compute.
RMSNorm
Computes root-mean-square normalization on vectors with an epsilon for stability, then applies a learned gain vector. The operation is performed in float32 to avoid precision issues with reciprocal sqrt.
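A sketch of the computation (shown in plain float32; the epsilon value comes from the model's hyperparameters):

using System;

// RMSNorm: rms = sqrt(mean(x^2) + eps), output = (x / rms) * gain.
static class RmsNorm
{
    public static float[] Apply(float[] x, float[] gain, float epsilon)
    {
        double sumSquares = 0;
        for (int i = 0; i < x.Length; i++) sumSquares += (double)x[i] * x[i];

        float invRms = 1.0f / MathF.Sqrt((float)(sumSquares / x.Length) + epsilon);

        var y = new float[x.Length];
        for (int i = 0; i < x.Length; i++) y[i] = x[i] * invRms * gain[i];
        return y;
    }
}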
Attention (Grouped multi-head) + RoPE
Implements grouped multi-head attention with RoPE positional encoding applied to queries and keys. RoPE is implemented as a separate component to support multiple variants (the literature contains multiple valid implementations and scaling strategies).
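For reference, one common RoPE variant rotates consecutive pairs of query/key elements by position-dependent angles. The sketch below shows only this core rotation; it omits the base frequency and scaling choices that distinguish the variants supported by the engine.

using System;

// One common RoPE variant: rotate pairs (x[2i], x[2i+1]) by angle
// position * baseFreq^(-2i/d) within each attention head.
static class Rope
{
    public static void ApplyInPlace(float[] x, int position, float baseFreq = 10000f)
    {
        int d = x.Length;                        // head dimension (assumed even)
        for (int i = 0; i < d; i += 2)
        {
            float theta = position * MathF.Pow(baseFreq, -(float)i / d);
            float cos = MathF.Cos(theta), sin = MathF.Sin(theta);

            float x0 = x[i], x1 = x[i + 1];
            x[i]     = x0 * cos - x1 * sin;      // 2D rotation of the pair
            x[i + 1] = x0 * sin + x1 * cos;
        }
    }
}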
GLU / FFN
Implements a gated feed-forward block using two projections (top + gate), an activation on the gate path, and a Hadamard product to apply gating, followed by a projection back to model dimension. Intermediate buffers are reused to reduce allocations.
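A sketch of this block, assuming a SiLU activation on the gate path (common in Llama-family models) and a naive matrix-vector product standing in for the engine's executor-backed kernels:

using System;

// Gated FFN: down( silu(gate(x)) ⊙ up(x) ).
static class GatedFfn
{
    public static float[] Forward(float[] x, float[,] wGate, float[,] wUp, float[,] wDown)
    {
        float[] gate = MatVec(wGate, x);
        float[] up = MatVec(wUp, x);

        // Activation on the gate path, then Hadamard (element-wise) product.
        for (int i = 0; i < gate.Length; i++)
        {
            float silu = gate[i] / (1.0f + MathF.Exp(-gate[i]));
            gate[i] = silu * up[i];
        }

        return MatVec(wDown, gate);              // project back to model dimension
    }

    static float[] MatVec(float[,] w, float[] v)
    {
        int rows = w.GetLength(0), cols = w.GetLength(1);
        var y = new float[rows];
        for (int r = 0; r < rows; r++)
        {
            float sum = 0;
            for (int c = 0; c < cols; c++) sum += w[r, c] * v[c];
            y[r] = sum;
        }
        return y;
    }
}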
Unembedding
Computes logits for the next token by dotting the final hidden vector against the vocabulary projection. The engine currently performs greedy decoding by selecting the max logit. Sampling strategies (top-k, top-p, temperature) are planned.
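A sketch of the greedy selection step (a plain dot product per vocabulary row; in the engine these primitives are delegated to the Execution Manager):

// Greedy decoding at the unembedding step: dot the final hidden vector
// against each vocabulary row and return the index of the largest logit.
static class Unembedding
{
    public static int GreedyNextToken(float[] hidden, float[][] vocabProjection)
    {
        int best = 0;
        float bestLogit = float.NegativeInfinity;
        for (int t = 0; t < vocabProjection.Length; t++)
        {
            float logit = 0;
            float[] row = vocabProjection[t];
            for (int i = 0; i < hidden.Length; i++) logit += row[i] * hidden[i];
            if (logit > bestLogit) { bestLogit = logit; best = t; }
        }
        return best;
    }
}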
Challenges & difficulties
Working “blindfolded”
A major challenge was that end-to-end feedback only appears once every part of the pipeline is correct. You can implement dozens of components and still have no meaningful output until everything is connected.
Debugging numerical issues
Small numerical details matter. Missing an epsilon in RMSNorm or performing unstable dtype conversions can lead to NaNs or subtle drift that completely changes generated tokens.
Scale and memory constraints
LLMs are large enough to stress “normal” programming assumptions: multi-gigabyte tensors, large object heaps, and allocator/GC behavior become a core engineering concern.
Lack of C# reference implementations
Most resources, components, and "known good" implementations are in Python or C/C++. Implementing everything from scratch required translating tensor operations (often described as einsum expressions) into explicit vector/matrix operations and ensuring the shapes and semantics match.
Key findings
- Many common implementations are not optimized for modern LLM workloads; better data models and algorithms can significantly improve performance.
- Tokenization is a serious performance hotspot; TokenTree reduces overhead by using fast continuation traversal.
- Hardware abstraction matters: designing for multiple devices and memory tiers early enables future acceleration.
- The field contains significant “noise”: results can be difficult to reproduce, and explanations are often incomplete.
Roadmap
Stage 2 (In progress)
- GPU / NPU execution paths via dedicated executors
- CPU acceleration via AVX2 / AVX-512 kernels
- Improved device-aware memory model (RAM/VRAM/SSD offload)
- Expanded support for additional architectures
Stage 3 (Planned)
- Training support and experimentation with architectural optimizations
- Exploration of “dynamic-weight” or meta-model approaches to improve reasoning capabilities
Resources
Lead-up project: Neural Network Simulator
Before building this engine, I wrote a neural network simulator implementing backpropagation for a small MLP, including real-time visualization of weight updates during stochastic gradient descent:
https://gyularabai.com/p_6410-neural-network-simulator.html
Lectures & presentations
https://gyularabai.com/p_9109-presentations.html
YouTube lecture series
https://gyularabai.com/p_8677-ai-lectures.html
Book project
https://gyularabai.com/p_8866-create-a-large-language-model-hands-on-guide.html