Why I Built a Native ML Inference Engine in Rust
Adding ML to an app shouldn't require Python, ONNX, CUDA, or 6.8 GB of dependencies. Kjarni is a single native library that runs transformer models anywhere: Rust, C#, Go, or the command line.
Adding ML to Your App Shouldn't Require Python
I needed semantic search in a Rust application. Users would search for "doctor" and the system needed to find documents mentioning "physician" or "medical practitioner." The standard approach is sentence embeddings — turn text into vectors that capture meaning, then compare them.
The problem wasn't the algorithm. The problem was getting it to run.
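To show how small the comparison step is once the vectors exist, here is a minimal sketch of cosine similarity in plain Rust. The function names `cosine_similarity` and `most_similar` are illustrative assumptions, not part of any library mentioned in this post, and the sketch says nothing about how the embeddings themselves are produced.

```rust
/// Cosine similarity between two embedding vectors.
/// Returns None if the lengths differ, the input is empty,
/// or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let mut dot = 0.0f32;
    let mut norm_a = 0.0f32;
    let mut norm_b = 0.0f32;
    for (x, y) in a.iter().zip(b.iter()) {
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Index of the candidate vector most similar to the query, if any.
/// Candidates that cannot be compared (wrong length, zero norm) are skipped.
pub fn most_similar(query: &[f32], candidates: &[Vec<f32>]) -> Option<usize> {
    candidates
        .iter()
        .enumerate()
        .filter_map(|(i, c)| cosine_similarity(query, c).map(|s| (i, s)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

In a search setting, the query would be the embedding of "doctor" and the candidates the embeddings of the stored documents. A document about a "physician" should score close to the query even though the words differ.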
The Python Tax
In Python, generating embeddings is straightforward:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
vector = model.encode("hello world")
Three lines. It works. But installing sentence-transformers pulls in PyTorch, tokenizers, NumPy, and everything they depend on. A fresh virtual environment balloons to 6.8 GB, most of it PyTorch.
For a Python application that already lives in that ecosystem, this is fine. For anything else — a C# backend, a Go microservice, a Rust CLI tool, a desktop application — it means shipping a Python runtime alongside your application, managing virtual environments in production, and accepting a multi-gigabyte deployment for what should be a single function call.
The ONNX Detour
ONNX Runtime seemed like the answer. Export the model to ONNX format, load it from Rust using the ort crate. No Python at runtime.
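pub fn new(model_path: &str, tokenizer_path: &str) -> Result<Self> {
    // Initialize the ort environment with the CUDA execution provider.
    let environment = ort::environment::init()
        .with_execution_providers([CUDAExecutionProvider::default().build()])
        .commit()?;
    // Session and tokenizer setup omitted.
}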
It worked. But ort pulled in 80+ crates, expanded the release binary to 350 MB, and relied on system libraries: libstdc++, libpthread, libm, libc. Then came the OpenSSL problem:
error: OpenSSL 3.3 required
$ openssl version
OpenSSL 1.1.1k # Can't upgrade — would break RHEL dependencies
The C++ dependencies wanted OpenSSL 3.3. The production server ran RHEL with OpenSSL 1.1. Upgrading would break half the system. The binary worked on my machine and nowhere else.
This is the dependency trap. You trade Python's runtime dependency for C++ build-time and system-library dependencies, and end up with a different set of problems that are harder to debug.
Building It From Scratch
The core operations of a transformer model aren't complicated. Matrix multiplication, layer normalization, attention, softmax. The model architecture is well-documented. The hard part isn't the math — it's getting the infrastructure (tokenizer, weight loading, memory layout, SIMD dispatch) right without pulling in an existing runtime.
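As a rough illustration of how compact the math is, here is a scalar sketch of two of the operations named above, softmax and layer normalization, over a single row of activations. This is a reference-style version written for this post under the usual textbook definitions. It is not Kjarni's actual kernel code, and the function signatures are assumptions.

```rust
/// Numerically stable softmax over one row, in place.
/// Subtracting the row maximum keeps exp() from overflowing.
pub fn softmax(row: &mut [f32]) {
    if row.is_empty() {
        return;
    }
    let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    for v in row.iter_mut() {
        *v = (*v - max).exp();
        sum += *v;
    }
    for v in row.iter_mut() {
        *v /= sum;
    }
}

/// Layer normalization over one row, in place:
/// y = (x - mean) / sqrt(var + eps) * gamma + beta
pub fn layer_norm(row: &mut [f32], gamma: &[f32], beta: &[f32], eps: f32) {
    assert_eq!(row.len(), gamma.len());
    assert_eq!(row.len(), beta.len());
    if row.is_empty() {
        return;
    }
    let n = row.len() as f32;
    let mean = row.iter().sum::<f32>() / n;
    // Population variance, as used by standard layer norm.
    let var = row.iter().map(|x| (x - mean) * (x - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    for ((x, g), b) in row.iter_mut().zip(gamma).zip(beta) {
        *x = (*x - mean) * inv_std * g + b;
    }
}
```

Both functions fit on one screen. The engineering effort goes into making loops like these fast and feeding them correctly laid-out weights.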
So I built one. Pure Rust. No ONNX. No external C++ libraries. No system dependencies beyond libc.
The result is Kjarni — a native ML inference engine that compiles to a single shared library. It runs transformer models for embeddings, classification, reranking, search, and text generation.
What it looks like in practice
Rust:
use kjarni::Embedder;
let embedder = Embedder::new("minilm-l6-v2")?;
let vector = embedder.encode("hello world")?;
let similarity = embedder.similarity("doctor", "physician")?;
// 0.8598
C# / .NET:
using Kjarni;
using var embedder = new Embedder("minilm-l6-v2");
Console.WriteLine(embedder.Similarity("doctor", "physician"));
// 0.8598
CLI:
$ kjarni similarity "doctor" "physician"
0.8598
$ kjarni classify "I love this product!"
positive 98.50% ██████████████████████████████████████
$ echo "Terrible quality" | kjarni classify --model toxic-bert
toxic 2.79% █
insult 1.14%
Models download on first use and cache locally. No setup, no configuration, no API keys.
What's Under the Hood
Kjarni doesn't wrap ONNX, LibTorch, or any external inference engine. The runtime is written in Rust from the ground up:
- Hand-tuned SIMD kernels — AVX2/FMA on x86, NEON on ARM. Matrix multiplication, layer norm, and softmax are the hot paths; these are written to use the full width of the vector registers.
- GPU compute via WebGPU — custom WGSL shaders for GPU inference. No CUDA dependency. Works on Vulkan (Linux/Windows) and will support Metal (macOS) in a future release.
- Zero-copy model loading — models are memory-mapped (mmap) so the OS manages paging. Loading a 90 MB model doesn't allocate 90 MB of heap.
- BF16 compute path — bfloat16 arithmetic where the hardware supports it, cutting memory bandwidth in half on the attention layers.
- Quantization — Q4, Q6, Q8 for reduced model size and faster inference on CPU.
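To make two of the items above concrete, here is a minimal sketch of reduced-precision storage: a symmetric 8-bit quantizer with one scale per block, and a bfloat16 conversion that keeps the upper 16 bits of an f32. The block layout, the names (`Q8Block`, `quantize_q8`, `f32_to_bf16_bits`), and the truncating conversion are assumptions for illustration. The post does not describe Kjarni's actual Q4/Q6/Q8 formats or its rounding mode.

```rust
/// One quantized block: 8-bit values sharing a single f32 scale.
/// (Hypothetical layout for illustration.)
pub struct Q8Block {
    pub scale: f32,
    pub values: Vec<i8>,
}

/// Symmetric 8-bit quantization: the largest magnitude maps to +/-127.
pub fn quantize_q8(weights: &[f32]) -> Q8Block {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    // An all-zero block gets scale 1.0 so the division below stays defined.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let values = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    Q8Block { scale, values }
}

/// Recover approximate f32 weights from a quantized block.
pub fn dequantize_q8(block: &Q8Block) -> Vec<f32> {
    block.values.iter().map(|&q| q as f32 * block.scale).collect()
}

/// f32 -> bfloat16 by truncation: keep the sign, the 8 exponent bits,
/// and the top 7 mantissa bits. Production code often rounds to
/// nearest-even instead of truncating.
pub fn f32_to_bf16_bits(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

/// bfloat16 -> f32: the stored bits become the upper half of the f32.
pub fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}
```

Q8 stores one byte per weight plus a shared scale, and bfloat16 stores two bytes per value, against four bytes for f32. That size difference is where the memory and bandwidth savings come from.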
The entire engine, including the tokenizer, compiles to a single .so (Linux) or .dll (Windows). The only system dependency is glibc 2.17 or newer, which means it runs on anything from a modern Ubuntu to RHEL 7.
Supported Models
Kjarni currently supports:
| Task | Models |
|---|---|
| Embeddings | MiniLM-L6-v2, MPNet-base-v2 |
| Sentiment | RoBERTa, DistilBERT, multilingual BERT |
| Emotion | DistilRoBERTa (7-class), RoBERTa (28-class) |
| Toxicity | Toxic-BERT |
| Reranking | MiniLM cross-encoder |
| Generation | Llama, Qwen2, Mistral, Phi-3 |
| Seq2seq | T5, BART |
| Speech | Whisper |
Encoder, decoder, and seq2seq architectures are all supported. Generation model bindings are intentionally not exposed in the initial release and will ship in a future version.
Language Bindings
The Rust core exposes a C ABI through kjarni-ffi, which makes it straightforward to call from any language with a C FFI:
- C# / .NET — available on NuGet: dotnet add package Kjarni.
- Go — native bindings via cgo.
- CLI — a standalone binary that reads from stdin, writes JSON, and pipes like any UNIX tool.
Adding a new language binding means writing a thin wrapper around the C API. The inference logic stays in Rust.
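For readers who have not exported a C ABI from Rust before, here is a minimal sketch of the general pattern: an `extern "C"` function that takes raw pointers, validates them, and reports failure through a status code. The function `example_dot` and its status constants are hypothetical and are not the real kjarni-ffi API.

```rust
/// Hypothetical status codes for this example only.
pub const EXAMPLE_OK: i32 = 0;
pub const EXAMPLE_NULL_POINTER: i32 = 1;

/// Dot product of two f32 arrays, exposed over the C ABI.
///
/// The attribute below keeps the symbol name unmangled so C callers can
/// link against it (spelled `#[no_mangle]` before Rust 1.82).
///
/// # Safety
/// `a` and `b` must each point to `len` readable f32 values,
/// and `out` must point to one writable f32.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn example_dot(
    a: *const f32,
    b: *const f32,
    len: usize,
    out: *mut f32,
) -> i32 {
    // Errors cross the boundary as status codes, never as panics.
    if a.is_null() || b.is_null() || out.is_null() {
        return EXAMPLE_NULL_POINTER;
    }
    // Borrow the caller's buffers as slices for the duration of the call.
    let a = unsafe { std::slice::from_raw_parts(a, len) };
    let b = unsafe { std::slice::from_raw_parts(b, len) };
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    unsafe { *out = dot };
    EXAMPLE_OK
}
```

On the C side this would be declared as `int32_t example_dot(const float *a, const float *b, size_t len, float *out);`. A wrapper in C#, Go, or any other language only has to marshal arrays into that shape and turn the status code into its own error type.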
The Dependency Comparison
| | Python | ONNX Runtime | Kjarni |
|---|---|---|---|
| Runtime size | 6.8 GB | 350 MB | 5 MB + model |
| System deps | Python, pip, venv | libstdc++, OpenSSL | glibc 2.17 |
| GPU | CUDA required | CUDA required | WebGPU (Vulkan) |
| Deployment | virtualenv + requirements.txt | .so/.dll + ONNX model | single .so/.dll |
| Languages | Python only | C++, Python, C#, Java | Rust, C#, Go, CLI |
Try It
C# / .NET:
dotnet add package Kjarni
CLI:
curl -fsSL https://kjarni.ai/install.sh | sh
Rust / Build from source:
git clone https://github.com/olafurjohannsson/kjarni
cd kjarni
cargo build --release -p kjarni-cli
The source is on GitHub. Licensed MIT or Apache-2.0.