C++ machine learning, compiled into your binary.

On-device inference for edge and WebAssembly — no inference server, no GPU, no Python.

Run it in your browser · See the C++/WASM path →

Three claims, in order — each one gets less hand-wavy the further you scroll.

uchen::Model kClassifier = Input<28*28> | Linear<128> | Relu | Linear<10> | Logits;

Example — an MNIST classifier. The Dots game above is this same framework, compiled to WebAssembly and running in your tab.

01 — Define model at C++ compile time

A library you link with, not an alien ecosystem.

Defined in C++ source
The model is C++, alongside the rest of your project. Change a constant, the architecture updates.
Native dependency
Link the library, include the headers. It's already part of your toolchain.
No heap on the inference path
Memory requirements are known at compile time. Predictable, no allocation surprises at runtime.
SIMD via Google Highway
Portable vectorization in the math core — linear, matrix, and convolution kernels. AVX, AVX2, ARM NEON, and WebAssembly SIMD.
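
To make the "no heap on the inference path" point concrete, here is a minimal sketch in plain C++ of a dense layer whose shapes are template parameters. It does not use the uchen API; `DenseSketch` and its members are hypothetical names invented for illustration, and the real library may lay out its storage differently.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical illustration, not the uchen API: a dense layer whose input and
// output sizes are template parameters, so all storage is sized at compile time.
template <std::size_t In, std::size_t Out>
struct DenseSketch {
  static constexpr std::size_t kParameterCount = In * Out + Out;

  std::array<float, In * Out> weights{};  // row-major: Out rows of In values
  std::array<float, Out> bias{};

  // Computes y = W * x + b. The result is a fixed-size array returned by
  // value, so the forward pass never calls new or malloc.
  std::array<float, Out> Forward(const std::array<float, In>& x) const {
    std::array<float, Out> y{};
    for (std::size_t o = 0; o < Out; ++o) {
      float acc = bias[o];
      for (std::size_t i = 0; i < In; ++i) {
        acc += weights[o * In + i] * x[i];
      }
      y[o] = acc;
    }
    return y;
  }
};

// The parameter count is a compile-time constant, checked by the compiler.
static_assert(DenseSketch<4, 2>::kParameterCount == 10,
              "4x2 weights plus 2 biases");
```

Because every buffer is a `std::array` with a size fixed by the type, the layer can live in static storage or on the stack, which is one way a fixed memory footprint can be achieved.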

Dots: one constant resizes the whole model (change kBoardSize, recompile)

// The dots game's board size is also the model's input shape.
// Change kBoardSize once; the conv input, the policy head, and
// the parameter layout all resize at compile time.
constexpr size_t kBoardSize = 8;  // 8×8 board

inline constexpr uchen::Model kPolicy =
    uchen::layers::Input<
        uchen::convolution::ConvolutionInput<
            kBoardChannels, kBoardSize, kBoardSize>> |
    uchen::layers::Conv2dWithFilter<32, 3, 3, 1, 1>(
        uchen::convolution::Flatten<
            uchen::convolution::ReluFilter>()) |
    uchen::layers::Linear<kBoardSize * kBoardSize> |  // one logit per cell
    uchen::layers::Logits;
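
The shape arithmetic behind "change kBoardSize once" can be sketched with ordinary `constexpr` code. Everything below is an assumption made for illustration: the names are hypothetical, the channel count of 2 is a placeholder, and reading `Conv2dWithFilter<32, 3, 3, 1, 1>` as 32 filters with a 3×3 kernel, stride 1, and padding 1 is a guess rather than documented behavior.

```cpp
#include <cassert>
#include <cstddef>

// Placeholder channel count; the real kBoardChannels is not shown in the text.
constexpr std::size_t kSketchChannels = 2;

// Standard convolution output-size formula for one spatial dimension.
constexpr std::size_t ConvOutputSide(std::size_t side, std::size_t kernel,
                                     std::size_t stride, std::size_t padding) {
  return (side + 2 * padding - kernel) / stride + 1;
}

// Hypothetical shape bookkeeping for a conv + linear policy network.
// Assumes 32 filters, 3x3 kernel, stride 1, padding 1.
template <std::size_t BoardSize>
struct PolicyShapeSketch {
  static constexpr std::size_t kFilters = 32;
  static constexpr std::size_t kConvSide = ConvOutputSide(BoardSize, 3, 1, 1);
  static constexpr std::size_t kFlattened = kFilters * kConvSide * kConvSide;
  static constexpr std::size_t kLogits = BoardSize * BoardSize;
  // Weights plus one bias per output.
  static constexpr std::size_t kConvParams =
      kFilters * kSketchChannels * 3 * 3 + kFilters;
  static constexpr std::size_t kLinearParams = kFlattened * kLogits + kLogits;
};

// Every derived quantity follows from the single board-size constant.
static_assert(PolicyShapeSketch<8>::kConvSide == 8, "padding keeps the side");
static_assert(PolicyShapeSketch<8>::kLogits == 64, "one logit per cell");
static_assert(PolicyShapeSketch<8>::kLinearParams == 2048 * 64 + 64,
              "flattened conv output feeding 64 logits");
```

Under these assumptions, moving from an 8×8 to a 10×10 board changes the flattened conv output from 2048 to 3200 values and the logit count from 64 to 100, with no edits beyond the one constant.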

02 — Training is a build step

Small models with the math you actually need. Don't pay for what you don't use, taken to heart.

Layers out of the box
Linear, Conv2d, RNN, activations (ReLU / sigmoid / tanh), Logits, Softmax, Embeddings.
Training core
Backpropagation, SGD, Adam, He (Kaiming) initialization, squared-error loss, cross-entropy loss, Deep Q loss.
Multi-head models
Shared trunk with split heads — policy/value workflows are a first-class composition, not glue code.
MCTS in the box
uchen/learning/mcts. Game-agnostic Oracle interface; deterministic given (oracle, root, config).
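
For readers who want the math behind two of the training-core items, here is a small sketch of cross-entropy loss on logits and a plain SGD update, written in standalone C++. The function names are hypothetical and this is not how the library exposes training; it only shows the textbook formulas on fixed-size arrays.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Numerically stable softmax: subtract the max logit before exponentiating.
template <std::size_t N>
std::array<float, N> Softmax(const std::array<float, N>& logits) {
  float max_logit = logits[0];
  for (float v : logits) max_logit = std::fmax(max_logit, v);
  std::array<float, N> probs{};
  float sum = 0.0f;
  for (std::size_t i = 0; i < N; ++i) {
    probs[i] = std::exp(logits[i] - max_logit);
    sum += probs[i];
  }
  for (float& p : probs) p /= sum;
  return probs;
}

// Cross-entropy loss for one example with an integer class label.
template <std::size_t N>
float CrossEntropy(const std::array<float, N>& logits, std::size_t label) {
  return -std::log(Softmax(logits)[label]);
}

// Gradient of that loss with respect to the logits:
// softmax(logits) - one_hot(label).
template <std::size_t N>
std::array<float, N> CrossEntropyGrad(const std::array<float, N>& logits,
                                      std::size_t label) {
  std::array<float, N> grad = Softmax(logits);
  grad[label] -= 1.0f;
  return grad;
}

// Plain SGD update: params <- params - learning_rate * grad.
template <std::size_t N>
void SgdStep(std::array<float, N>& params, const std::array<float, N>& grad,
             float learning_rate) {
  for (std::size_t i = 0; i < N; ++i) params[i] -= learning_rate * grad[i];
}
```

In a full training loop the logit gradient would be backpropagated through the layers and the SGD step applied to their weights; the sketch stops at the loss boundary to stay short.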

RNN: character model (layers::Rnn wraps any pipeline)

// layers::Rnn wraps any pipeline in a recurrent loop,
// threading a fixed-size hidden state across each step
// of an iterable input.
constexpr uchen::Model kNameRnn =
    uchen::layers::Rnn<internal::Input, 50>(
        uchen::layers::Linear<10> | uchen::layers::Relu |
        uchen::layers::Linear<10> | uchen::layers::Relu) |
    uchen::layers::Linear<'z' - 'a' + 2> | uchen::layers::Logits;
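
The idea of "threading a fixed-size hidden state across each step" is essentially a fold over the input sequence. The sketch below shows that loop in plain C++; `Hidden`, `Step`, and `RunRecurrent` are hypothetical names, the hidden size of 4 is arbitrary, and the step function is a toy stand-in for the wrapped pipeline, not what `layers::Rnn` actually computes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

constexpr std::size_t kHiddenSize = 4;  // arbitrary size for the sketch
using Hidden = std::array<float, kHiddenSize>;

// Toy step function: mixes the previous hidden state with one input
// character and applies ReLU. Stands in for the wrapped pipeline.
Hidden Step(const Hidden& h, char c) {
  Hidden next{};
  const float x = static_cast<float>(c - 'a' + 1) / 27.0f;
  for (std::size_t i = 0; i < kHiddenSize; ++i) {
    const float pre = 0.5f * h[i] + x;
    next[i] = pre > 0.0f ? pre : 0.0f;
  }
  return next;
}

// The recurrent loop: start from a zero state, apply the same step to every
// element of the input, and carry the fixed-size state forward.
template <typename StepFn>
Hidden RunRecurrent(std::string_view input, StepFn step) {
  Hidden h{};
  for (char c : input) h = step(h, c);
  return h;
}
```

The final state returned by `RunRecurrent` plays the role of the value that the trailing `Linear` and `Logits` layers would consume in the model above.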

03 — Inference ships with the product

One artifact. No model server, no Python on the target.

No model server
The model is part of the application binary. Nothing to host, route, or get paged about at 3 a.m.
No GPU pool, no Python
CPU-first execution with SIMD. Pure C++ runtime — nothing to pip-install on the target machine.
One artifact
Native binary or WebAssembly module — same C++ framework, same model definition, same training.
Privacy by deployment shape
Inference stays on-device. No data leaves the user's machine, because there's nothing to send it to.

One model, where you need it

Train your C++ model once, then run it wherever you need it: WebAssembly in the browser, native on iOS/Android. No rewrite, no conversion, no second model to keep in sync; the same source recompiles for each target.

WASM (browsers and edge): Compiled to WebAssembly SIMD; runs in the browser or in edge workers.
Native apps (C++): Linked into your binary — in-process, no IPC.
Game engines: Per-frame inference inside the engine loop.
Embedded targets: Fixed memory footprint, no runtime dependency.
Cloud instances: Deploy to any environment; only a CPU is required.
Privacy-local features: On-device inference; no data leaves the machine.
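
One common way to get the same source running natively and in WebAssembly is to expose a small C-linkage entry point and mark it as exported only when building with Emscripten. The sketch below assumes an Emscripten toolchain for the WASM build; `sketch_score`, `SKETCH_EXPORT`, and the baked-in weights are hypothetical stand-ins, not part of the framework.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// EMSCRIPTEN_KEEPALIVE keeps the function from being stripped from the WASM
// module. On native builds the macro expands to nothing.
#ifdef __EMSCRIPTEN__
#include <emscripten/emscripten.h>
#define SKETCH_EXPORT EMSCRIPTEN_KEEPALIVE
#else
#define SKETCH_EXPORT
#endif

namespace {
constexpr std::size_t kInputs = 4;
// Hypothetical stand-in for trained parameters compiled into the binary.
constexpr std::array<float, kInputs> kWeights = {0.5f, -1.0f, 2.0f, 0.25f};
}  // namespace

extern "C" {
// C linkage keeps the symbol name unmangled, so a JavaScript host or a native
// caller can look it up by name. Expects a pointer to kInputs floats.
SKETCH_EXPORT float sketch_score(const float* input) {
  float acc = 0.0f;
  for (std::size_t i = 0; i < kInputs; ++i) acc += kWeights[i] * input[i];
  return acc;
}
}
```

A native application calls `sketch_score` directly in-process, while a browser build would call the exported symbol from JavaScript; the model code itself does not change between the two.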

Project Status

Plain accounting — what's stable, what's in progress, and what's explicitly out. The framework is scoped on purpose.

In the box, stable

C++20 runtime + training core (Bazel build)
Layers: Linear, Conv2d, RNN, activations, Logits, Softmax, Embeddings
Optimizers: SGD, Adam · losses: squared, cross entropy, Deep Q
Multi-head models + MCTS
Optimized for AVX, AVX2, ARM NEON, and WebAssembly SIMD via Google Highway — covers x86, Apple Silicon, ARM Linux, and browsers

In progress

More demos, driving the requirements
Generalization of the training pipeline beyond current demos
More documentation, articles and examples

Not in scope

GPU backend — focus is on edge and embedded
Import from ONNX and other model formats
Opening the code: not possible yet, for personal reasons. A September 2025 snapshot is on GitHub.

My notes from building UchenML

C++ machine learning compiled into real products — plus the C++ and performance rabbit holes I fall into around it. Technical, irregular by design.
