August 26, 2025 · 4 min read

Why I Built a Native ML Inference Engine in Rust

Adding ML to an app shouldn't require Python, ONNX, CUDA, or 6.8 GB of dependencies. Kjarni is a single native library that runs transformer models from Rust, C#, Go, or the command line.

Adding ML to Your App Shouldn't Require Python

I needed semantic search in a Rust application. Users would search for "doctor" and the system needed to find documents mentioning "physician" or "medical practitioner." The standard approach is sentence embeddings — turn text into vectors that capture meaning, then compare them.
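
To make the "compare them" step concrete, here is a minimal sketch of cosine similarity, a common way to compare embedding vectors. The names cosine_similarity and nearest are illustrative only; they are not taken from Kjarni or any other library mentioned in this post.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None if the lengths differ, the input is empty,
/// or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let mut dot = 0.0f32;
    let mut norm_a = 0.0f32;
    let mut norm_b = 0.0f32;
    for (x, y) in a.iter().zip(b) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Index of the document vector most similar to the query, if any.
/// A linear scan; fine for a sketch, not tuned for large collections.
pub fn nearest(query: &[f32], docs: &[Vec<f32>]) -> Option<usize> {
    docs.iter()
        .enumerate()
        .filter_map(|(i, d)| cosine_similarity(query, d).map(|s| (i, s)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

With real embeddings, the vectors for "doctor" and "physician" point in nearly the same direction, so their score lands close to 1 even though the words share no characters.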

The problem wasn't the algorithm. The problem was getting it to run.

The Python Tax

In Python, generating embeddings is straightforward:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
vector = model.encode("hello world")

Three lines. It works. But installing sentence-transformers pulls in PyTorch, tokenizers, NumPy, and everything they depend on. A fresh virtual environment balloons to 6.8 GB, most of it PyTorch.

For a Python application that already lives in that ecosystem, this is fine. For anything else — a C# backend, a Go microservice, a Rust CLI tool, a desktop application — it means shipping a Python runtime alongside your application, managing virtual environments in production, and accepting a multi-gigabyte deployment for what should be a single function call.

The ONNX Detour

ONNX Runtime seemed like the answer. Export the model to ONNX format, load it from Rust using the ort crate. No Python at runtime.

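pub fn new(model_path: &str, tokenizer_path: &str) -> Result<Self> {
    // Register the CUDA execution provider with the ort environment.
    let environment = ort::environment::init()
        .with_execution_providers([CUDAExecutionProvider::default().build()])
        .commit()?;
    // Session and tokenizer loading are omitted from this excerpt.
}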

It worked. But ort pulled in 80+ crates, expanded the release binary to 350 MB, and relied on system libraries: libstdc++, libpthread, libm, libc. Then came the OpenSSL problem:

error: OpenSSL 3.3 required
$ openssl version
OpenSSL 1.1.1k  # Can't upgrade — would break RHEL dependencies

The C++ dependencies wanted OpenSSL 3.3. The production server ran RHEL with OpenSSL 1.1. Upgrading would break half the system. The binary worked on my machine and nowhere else.

This is the dependency trap. You trade Python's runtime dependency for C++ shared-library dependencies, and end up with a different set of problems that are harder to debug.

Building It From Scratch

The core operations of a transformer model aren't complicated. Matrix multiplication, layer normalization, attention, softmax. The model architecture is well-documented. The hard part isn't the math — it's getting the infrastructure (tokenizer, weight loading, memory layout, SIMD dispatch) right without pulling in an existing runtime.
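
As a rough illustration of how small these operations are, here is a scalar sketch of two of them, softmax and layer normalization, over a single row of activations. This is a textbook formulation written for this post's argument, not Kjarni's actual kernels; the function signatures are assumptions.

```rust
/// Numerically stable softmax, in place.
/// Subtracting the row maximum first keeps exp() from overflowing.
pub fn softmax(x: &mut [f32]) {
    if x.is_empty() {
        return;
    }
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for v in x.iter_mut() {
        *v = (*v - max).exp();
        sum += *v;
    }
    for v in x.iter_mut() {
        *v /= sum;
    }
}

/// Layer normalization over one row, in place:
/// (x - mean) / sqrt(var + eps) * gamma + beta.
pub fn layer_norm(x: &mut [f32], gamma: &[f32], beta: &[f32], eps: f32) {
    assert_eq!(x.len(), gamma.len());
    assert_eq!(x.len(), beta.len());
    if x.is_empty() {
        return;
    }
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    // Population variance, as used by standard layer norm.
    let var = x.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    for ((v, g), b) in x.iter_mut().zip(gamma).zip(beta) {
        *v = (*v - mean) * inv_std * g + b;
    }
}
```

The math fits in a few dozen lines. What a production engine adds on top is the part that takes the time: vectorized versions of these loops, batching, and a memory layout that keeps them fed.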

So I built one. Pure Rust. No ONNX. No external C++ libraries. No system dependencies beyond libc.

The result is Kjarni — a native ML inference engine that compiles to a single shared library. It runs transformer models for embeddings, classification, reranking, search, and text generation.

What it looks like in practice

Rust:

use kjarni::Embedder;

let embedder = Embedder::new("minilm-l6-v2")?;
let vector = embedder.encode("hello world")?;
let similarity = embedder.similarity("doctor", "physician")?;
// 0.8598

C# / .NET:

using Kjarni;

using var embedder = new Embedder("minilm-l6-v2");
Console.WriteLine(embedder.Similarity("doctor", "physician"));
// 0.8598

CLI:

$ kjarni similarity "doctor" "physician"
0.8598

$ kjarni classify "I love this product!"
positive   98.50%  ██████████████████████████████████████

$ echo "Terrible quality" | kjarni classify --model toxic-bert
toxic        2.79%  █
insult       1.14%

Models download on first use and cache locally. No setup, no configuration, no API keys.

What's Under the Hood

Kjarni doesn't wrap ONNX, LibTorch, or any external inference engine. The runtime is written in Rust from the ground up:

Hand-tuned SIMD kernels: AVX2/FMA on x86, NEON on ARM. Matrix multiplication, layer norm, and softmax are the hot paths, so they are written to use the full width of the vector registers.
GPU compute via WebGPU: custom WGSL shaders for GPU inference, with no CUDA dependency. Works on Vulkan (Linux/Windows) today; Metal (macOS) support is planned for a future release.
Zero-copy model loading: models are memory-mapped (mmap), so the OS manages paging. Loading a 90 MB model doesn't allocate 90 MB of heap.
BF16 compute path: bfloat16 arithmetic where the hardware supports it, which halves memory traffic relative to FP32 on the attention layers.
Quantization: Q4, Q6, and Q8 for smaller models and faster CPU inference.
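
To give a feel for the quantization item, here is a sketch of symmetric 8-bit block quantization: each block of weights stores one f32 scale plus one signed byte per weight. The post doesn't describe the exact Q4/Q6/Q8 layouts Kjarni uses, so the BlockQ8 type, the block size of 32, and the function names below are all assumptions made for illustration.

```rust
/// One quantized block: a shared scale plus signed 8-bit values.
/// (Hypothetical layout, not Kjarni's on-disk format.)
pub struct BlockQ8 {
    pub scale: f32,
    pub values: Vec<i8>,
}

/// Assumed block size; 32 is a common choice in block-quantized formats.
pub const BLOCK_SIZE: usize = 32;

/// Quantize weights block by block, mapping the largest magnitude in
/// each block to 127 so every value fits in an i8.
pub fn quantize_q8(weights: &[f32]) -> Vec<BlockQ8> {
    weights
        .chunks(BLOCK_SIZE)
        .map(|chunk| {
            let max_abs = chunk.iter().fold(0.0f32, |m, w| m.max(w.abs()));
            // An all-zero block gets scale 1.0 to avoid dividing by zero.
            let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
            let values = chunk
                .iter()
                .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
                .collect();
            BlockQ8 { scale, values }
        })
        .collect()
}

/// Recover approximate f32 weights from the quantized blocks.
pub fn dequantize_q8(blocks: &[BlockQ8]) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.values.iter().map(move |&q| q as f32 * b.scale))
        .collect()
}
```

Under this scheme each weight costs one byte plus a small per-block overhead instead of four bytes, and the round-trip error per weight is at most half of that block's scale.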

The entire engine, including the tokenizer, compiles to a single .so (Linux) or .dll (Windows). On Linux, the only system dependency is glibc 2.17 or newer, which means it runs on anything from a modern Ubuntu to RHEL 7.

Supported Models

Kjarni currently supports:

Task	Models
Embeddings	MiniLM-L6-v2, MPNet-base-v2
Sentiment	RoBERTa, DistilBERT, multilingual BERT
Emotion	DistilRoBERTa (7-class), RoBERTa (28-class)
Toxicity	Toxic-BERT
Reranking	MiniLM cross-encoder
Generation	Llama, Qwen2, Mistral, Phi-3
Seq2seq	T5, BART
Speech	Whisper

Encoder, decoder, and seq2seq architectures are all supported. Generation model bindings are intentionally not exposed in the initial release and will ship in a future version.

Language Bindings

The Rust core exposes a C ABI through kjarni-ffi, which makes it straightforward to call from any language with a C FFI:

C# / .NET: available on NuGet (dotnet add package Kjarni).
Go: native bindings via cgo.
CLI: a standalone binary that reads from stdin, writes JSON, and composes in pipelines like any UNIX tool.

Adding a new language binding means writing a thin wrapper around the C API. The inference logic stays in Rust.
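
For readers who haven't exported a C ABI from Rust before, the general shape looks something like the sketch below. It is not the kjarni-ffi API: the function example_embed, its error codes, and the placeholder_embed stand-in are all hypothetical, chosen only to show the pattern of raw pointers in, status code out.

```rust
use std::ffi::CStr;
use std::os::raw::{c_char, c_int};
use std::slice;

/// Placeholder for real inference: folds the UTF-8 bytes of `text` into `out`.
/// It exists only so the sketch compiles; it is NOT an embedding model.
fn placeholder_embed(text: &str, out: &mut [f32]) {
    for v in out.iter_mut() {
        *v = 0.0;
    }
    let n = out.len();
    if n == 0 {
        return;
    }
    for (i, byte) in text.bytes().enumerate() {
        out[i % n] += byte as f32;
    }
}

/// Hypothetical C entry point.
/// Returns 0 on success, -1 for a null pointer, -2 for invalid UTF-8.
///
/// A real export would also carry the no_mangle attribute (written
/// #[unsafe(no_mangle)] from the 2024 edition on) so the symbol keeps
/// this exact name in the shared library.
///
/// # Safety
/// `text` must point to a NUL-terminated string and `out` must point to
/// `out_len` writable f32 values.
pub unsafe extern "C" fn example_embed(
    text: *const c_char,
    out: *mut f32,
    out_len: usize,
) -> c_int {
    if text.is_null() || out.is_null() {
        return -1;
    }
    // Borrow the caller's C string without copying it.
    let text = match unsafe { CStr::from_ptr(text) }.to_str() {
        Ok(s) => s,
        Err(_) => return -2,
    };
    // View the caller-owned buffer as a Rust slice.
    let out = unsafe { slice::from_raw_parts_mut(out, out_len) };
    placeholder_embed(text, out);
    0
}
```

On the C side this would correspond to a prototype like int example_embed(const char *text, float *out, size_t out_len). Having the caller own the output buffer keeps allocation on one side of the boundary, which is one common way to keep wrappers in other languages thin.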

The Dependency Comparison

	Python	ONNX Runtime	Kjarni
Runtime size	6.8 GB	350 MB	5 MB + model
System deps	Python, pip, venv	libstdc++, OpenSSL	glibc 2.17
GPU	CUDA required	CUDA required	WebGPU (Vulkan)
Deployment	virtualenv + requirements.txt	.so/.dll + ONNX model	single .so/.dll
Languages	Python only	C++, Python, C#, Java	Rust, C#, Go, CLI

Try It

C# / .NET:

dotnet add package Kjarni

CLI:

curl -fsSL https://kjarni.ai/install.sh | sh

Rust / Build from source:

git clone https://github.com/olafurjohannsson/kjarni
cd kjarni
cargo build --release -p kjarni-cli

The source is on GitHub. Licensed MIT or Apache-2.0.

