Gyula Rabai

AI inference engine for LLMs

I have designed and implemented a fully working AI inference engine from scratch in C#. It can run open-weight Large Language Models (LLMs) such as Llama 3. It can read and parse GGUF files and can follow chat templates. The inference engine has a highly efficient execution model that is capable of distributing load and memory usage among multiple processing units (CPUs, GPUs) and multiple memory tiers (system RAM, VRAM, offloading to SSD).
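To illustrate the GGUF parsing step, here is a minimal C# sketch that reads the fixed-size GGUF header. The field layout follows the public GGUF specification; the GgufHeader and GgufReader names are illustrative, not the engine's actual types.

using System.IO;

struct GgufHeader
{
    public uint Magic;          // 0x46554747, the ASCII bytes "GGUF" read little-endian
    public uint Version;        // format version, currently 3
    public ulong TensorCount;   // number of tensors stored in the file
    public ulong MetadataCount; // number of metadata key/value pairs
}

static class GgufReader
{
    public static GgufHeader ReadHeader(string path)
    {
        using var reader = new BinaryReader(File.OpenRead(path));
        var header = new GgufHeader
        {
            Magic = reader.ReadUInt32(),
            Version = reader.ReadUInt32(),
            TensorCount = reader.ReadUInt64(),
            MetadataCount = reader.ReadUInt64()
        };
        if (header.Magic != 0x46554747)
            throw new InvalidDataException("Not a GGUF file.");
        return header;
    }
}

The metadata section that follows this header carries the tokenizer data and the chat template, which is how an engine can follow chat templates directly from the model file.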

On this page you can see the inference engine in DEBUG mode, which demonstrates how each LLM token is generated from the sequence of previous tokens. This simulation runs on a general-purpose CPU, without any hardware acceleration.

Figure 1 - User interface

Video: Token-by-token inference

AI inference is about predicting the next word based on the previous words. In the following video you can see inference performed step by step on the CPU. You will notice that after a word (token) is generated, it is appended to the end of the input to generate the next token. Check out the video.
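In code, this loop looks roughly like the following C# sketch. ILanguageModel, ITokenizer, and the greedy arg-max decoding are illustrative assumptions, not the engine's actual API.

using System;
using System.Collections.Generic;
using System.Linq;

interface ITokenizer
{
    List<int> Encode(string text);
    string Decode(IEnumerable<int> tokens);
}

interface ILanguageModel
{
    // One forward pass: returns a score (logit) per vocabulary entry
    // for the token that should follow the given context.
    float[] Forward(IReadOnlyList<int> context);
    int EosTokenId { get; }
}

static class Inference
{
    public static string Generate(ILanguageModel model, ITokenizer tokenizer,
                                  string prompt, int maxNewTokens)
    {
        var context = tokenizer.Encode(prompt);
        for (int i = 0; i < maxNewTokens; i++)
        {
            float[] logits = model.Forward(context);

            // Greedy decoding: take the highest-scoring token.
            // (Real engines often sample with temperature instead.)
            int next = Array.IndexOf(logits, logits.Max());
            if (next == model.EosTokenId) break;

            // The generated token is appended to the input, exactly as
            // shown in the video, and the loop repeats.
            context.Add(next);
        }
        return tokenizer.Decode(context);
    }
}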

Video: Token-by-token inference in the debugger

In the following video you can see how a debugger can be used to understand the procedure. You can see the tokens represented as integers. At the end of the video, when Step Into is used, you can see that this LLM (Llama 3.2 1B) has 16 layers.
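What the debugger steps into can be pictured with the sketch below: each token's embedding flows through the model's transformer layers one after another, 16 of them in Llama 3.2 1B. TransformerLayer and the delegate parameters are illustrative names, not the engine's real code.

using System;

sealed class TransformerLayer
{
    // Placeholder for the real attention + feed-forward computation,
    // which updates the hidden state using this layer's weights.
    public float[] Forward(float[] hidden) => hidden;
}

static class ForwardPass
{
    public static float[] NextTokenLogits(int tokenId,
                                          TransformerLayer[] layers,
                                          Func<int, float[]> embedToken,
                                          Func<float[], float[]> outputProjection)
    {
        // The integer token id is first mapped to an embedding vector.
        float[] hidden = embedToken(tokenId);

        // Stepping into the model in the debugger walks this loop;
        // layers.Length == 16 for Llama 3.2 1B.
        foreach (var layer in layers)
            hidden = layer.Forward(hidden);

        // The final hidden state is projected to one score per vocabulary token.
        return outputProjection(hidden);
    }
}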

More information


Projects | Books | Printouts | On-line lectures | Presentations