C++ machine learning, compiled into your binary.

On-device inference for edge and WebAssembly — no inference server, no GPU, no Python.

Three claims, in order — each one gets less hand-wavy the further you scroll.

uchen::Model kClassifier = Input<28*28> | Linear<128> | Relu | Linear<10> | Logits;

Example — an MNIST classifier. The Dots game above is this same framework, compiled to WebAssembly and running in your tab.
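To make the one-liner less magical, here is a minimal sketch of how a `|` pipeline can be assembled and sized entirely at compile time. Everything in the `sketch` namespace (`InputT`, `LinearT`, `Describe`, and so on) is a hypothetical stand-in written for this illustration, not the uchen implementation.

```cpp
#include <cstddef>

namespace sketch {

// Hypothetical stand-ins: each layer is an empty tag type that carries its
// sizes as template parameters.
template <std::size_t N> struct InputT {};
template <std::size_t N> struct LinearT {};
struct ReluT {};
struct LogitsT {};

// A model is nothing more than the list of its layer types.
template <typename... Layers> struct Model {};

template <std::size_t N> inline constexpr Model<InputT<N>> Input{};
template <std::size_t N> inline constexpr LinearT<N> Linear{};
inline constexpr ReluT Relu{};
inline constexpr LogitsT Logits{};

// operator| appends a layer to the type list; no work happens at run time.
template <typename... Ls, typename L>
constexpr Model<Ls..., L> operator|(Model<Ls...>, L) { return {}; }

// Current output width and running parameter count.
struct Shape { std::size_t width; std::size_t params; };

template <std::size_t N>
constexpr Shape Apply(Shape, InputT<N>) { return {N, 0}; }
template <std::size_t N>
constexpr Shape Apply(Shape s, LinearT<N>) {
  return {N, s.params + s.width * N + N};  // weights + biases
}
constexpr Shape Apply(Shape s, ReluT) { return s; }
constexpr Shape Apply(Shape s, LogitsT) { return s; }

// Walks the layer list left to right, threading the shape through.
template <typename... Ls>
constexpr Shape Describe(Model<Ls...>) {
  Shape s{0, 0};
  ((s = Apply(s, Ls{})), ...);
  return s;
}

inline constexpr auto kClassifier =
    Input<28 * 28> | Linear<128> | Relu | Linear<10> | Logits;

// Checked by the compiler, before the program ever runs.
static_assert(Describe(kClassifier).width == 10);
static_assert(Describe(kClassifier).params == 101770);

}  // namespace sketch
```

The point of the design is that the architecture is a type: a mismatch or a resize shows up as a compile result, not as a runtime shape error.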

01 — Define model at C++ compile time

A library you link with, not an alien ecosystem.

  • Defined in C++ source

    The model is C++, alongside the rest of your project. Change a constant, the architecture updates.

  • Native dependency

    Link the library, include the headers. It's already part of your toolchain.

  • No heap on the inference path

    Buffer sizes are fixed at compile time, so the memory footprint is predictable and inference never allocates at runtime.

  • SIMD via Google Highway

    Portable vectorization in the math core — linear, matrix, and convolution kernels. AVX, AVX2, ARM NEON, and WebAssembly SIMD.
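As an illustration of the "no heap on the inference path" point, the sketch below keeps weights, biases, and outputs in fixed-size `std::array` buffers, so a forward pass touches no allocator. `Dense` and `kParameterCount` are hypothetical names for this example; the real library's layout and its Highway kernels are not shown here.

```cpp
#include <array>
#include <cstddef>

namespace sketch {

// A dense layer whose storage size is part of its type.
template <std::size_t In, std::size_t Out>
struct Dense {
  static constexpr std::size_t kParameterCount = In * Out + Out;

  std::array<float, In * Out> weights{};  // row-major, one row per output
  std::array<float, Out> biases{};

  // Writes into a caller-provided buffer, so the call never allocates.
  constexpr void Forward(const std::array<float, In>& x,
                         std::array<float, Out>& y) const {
    for (std::size_t o = 0; o < Out; ++o) {
      float acc = biases[o];
      for (std::size_t i = 0; i < In; ++i) {
        acc += weights[o * In + i] * x[i];
      }
      y[o] = acc;
    }
  }
};

// The footprint is a compile-time constant that a build can check.
static_assert(Dense<784, 128>::kParameterCount == 100480);

}  // namespace sketch
```

A caller would declare the layer and its input and output arrays once (on the stack or in static storage) and reuse them for every call.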

Dots: one constant, the whole model resizes (change kBoardSize, recompile)
// The dots game's board size is also the model's input shape.
// Change kBoardSize once; the conv input, the policy head, and
// the parameter layout all resize at compile time.
constexpr size_t kBoardSize = 8;  // 8×8 board

inline constexpr uchen::Model kPolicy =
    uchen::layers::Input<
        uchen::convolution::ConvolutionInput<
            kBoardChannels, kBoardSize, kBoardSize>> |
    uchen::layers::Conv2dWithFilter<32, 3, 3, 1, 1>(
        uchen::convolution::Flatten<
            uchen::convolution::ReluFilter>()) |
    uchen::layers::Linear<kBoardSize * kBoardSize> |  // one logit per cell
    uchen::layers::Logits;
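The resizing in the Dots model is ordinary shape arithmetic done by the compiler. The sketch below works it through under one loudly stated assumption: that `Conv2dWithFilter<32, 3, 3, 1, 1>` means 32 output channels, a 3x3 kernel, stride 1, and padding 1. That reading is a guess, and `ConvSpec`, `ConvOut`, and `PolicyShape` are hypothetical helpers, not library types.

```cpp
#include <cstddef>

namespace sketch {

// Assumed meaning of the convolution parameters (not confirmed by the source).
struct ConvSpec {
  std::size_t out_channels;
  std::size_t kernel;
  std::size_t stride;
  std::size_t padding;
};

// Standard convolution output size along one spatial dimension.
constexpr std::size_t ConvOut(std::size_t in, const ConvSpec& c) {
  return (in + 2 * c.padding - c.kernel) / c.stride + 1;
}

// Every size below is derived from the single board-size constant.
template <std::size_t BoardSize>
struct PolicyShape {
  static constexpr ConvSpec kConv{32, 3, 1, 1};
  static constexpr std::size_t kSide = ConvOut(BoardSize, kConv);
  static constexpr std::size_t kFlattened =
      kConv.out_channels * kSide * kSide;
  static constexpr std::size_t kLogits = BoardSize * BoardSize;
  // Weights plus biases of the final linear layer.
  static constexpr std::size_t kHeadParams = kFlattened * kLogits + kLogits;
};

static_assert(PolicyShape<8>::kFlattened == 2048);   // 32 * 8 * 8
static_assert(PolicyShape<8>::kLogits == 64);        // one logit per cell
static_assert(PolicyShape<10>::kFlattened == 3200);  // 32 * 10 * 10

}  // namespace sketch
```

Under these assumptions, going from an 8x8 to a 10x10 board changes the flattened width from 2048 to 3200 and the logit count from 64 to 100 without touching any other line.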

02 — Training is a build step

Small models with the math you actually need. "Don't pay for what you don't use," taken to heart.

  • Layers out of the box

    Linear, Conv2d, RNN, activations (ReLU / sigmoid / tanh), Logits, Softmax, Embeddings.

  • Training core

    Backpropagation, SGD, Adam, Kaiming (He) initialization, squared loss, cross entropy, Deep Q loss.

  • Multi-head models

    Shared trunk with split heads — policy/value workflows are a first-class composition, not glue code.

  • MCTS in the box

    uchen/learning/mcts. Game-agnostic Oracle interface; deterministic given (oracle, root, config).
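For readers who want the training-core items above in concrete terms, here is a minimal sketch of two of them: softmax cross entropy on raw logits (with its gradient) and a plain SGD update. `CrossEntropy`, `SgdStep`, and `LossAndGrad` are hypothetical names for this illustration; the library's own interfaces are not shown in this page and may look quite different.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

namespace sketch {

template <std::size_t N>
struct LossAndGrad {
  float loss;
  std::array<float, N> grad;  // d(loss) / d(logits)
};

// Softmax cross entropy on raw logits. The maximum logit is subtracted
// before exponentiation to keep the computation numerically stable.
template <std::size_t N>
LossAndGrad<N> CrossEntropy(const std::array<float, N>& logits,
                            std::size_t target) {
  assert(target < N);
  const float max_logit = *std::max_element(logits.begin(), logits.end());
  std::array<float, N> probs{};
  float sum = 0.0f;
  for (std::size_t i = 0; i < N; ++i) {
    probs[i] = std::exp(logits[i] - max_logit);
    sum += probs[i];
  }
  LossAndGrad<N> out{};
  for (std::size_t i = 0; i < N; ++i) {
    probs[i] /= sum;
    // Gradient with respect to logit i: softmax(i) - onehot(i).
    out.grad[i] = probs[i] - (i == target ? 1.0f : 0.0f);
  }
  out.loss = -std::log(probs[target]);
  return out;
}

// Plain SGD: move each parameter against its gradient.
template <std::size_t N>
void SgdStep(std::array<float, N>& params, const std::array<float, N>& grad,
             float learning_rate) {
  for (std::size_t i = 0; i < N; ++i) {
    params[i] -= learning_rate * grad[i];
  }
}

}  // namespace sketch
```

Because every array size is a template parameter, this style of training code stays on fixed buffers just like the inference path.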

RNN: character model (layers::Rnn wraps any pipeline)
// layers::Rnn wraps any pipeline in a recurrent loop,
// threading a fixed-size hidden state across each step
// of an iterable input.
constexpr uchen::Model kNameRnn =
    uchen::layers::Rnn<internal::Input, 50>(
        uchen::layers::Linear<10> | uchen::layers::Relu |
        uchen::layers::Linear<10> | uchen::layers::Relu) |
    uchen::layers::Linear<'z' - 'a' + 2> | uchen::layers::Logits;
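The recurrent loop described in the comment above can be sketched generically. `RunRecurrent` is a hypothetical helper written for this page, and the zero initial state is an assumption; it only illustrates the idea of threading a fixed-size hidden state across an iterable input.

```cpp
#include <array>
#include <cstddef>

namespace sketch {

// Runs `step` once per element of `input`, carrying a fixed-size hidden
// state from one call to the next. `step` maps (hidden, element) -> hidden.
template <std::size_t H, typename Range, typename Step>
std::array<float, H> RunRecurrent(const Range& input, Step step) {
  std::array<float, H> hidden{};  // zero initial state (an assumption)
  for (const auto& element : input) {
    hidden = step(hidden, element);
  }
  return hidden;
}

// The output width in the model above: 26 letters plus one extra class.
// What that extra class stands for (an end-of-name marker, perhaps) is a
// guess; the arithmetic assumes an ASCII-compatible character set.
static_assert('z' - 'a' + 2 == 27);

}  // namespace sketch
```

In the character model, the step would be the inner `Linear | Relu | Linear | Relu` pipeline and the hidden state would have 50 entries; the final `Linear | Logits` then reads the last hidden state.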

03 — Inference ships with the product

One artifact. No model server, no Python on the target.

  • No model server

    The model is part of the application binary. Nothing to host, nothing to route, nothing to get paged about at 3 a.m.

  • No GPU pool, no Python

    CPU-first execution with SIMD. Pure C++ runtime — nothing to pip-install on the target machine.

  • One artifact

    Native binary or WebAssembly module — same C++ framework, same model definition, same training.

  • Privacy by deployment shape

    Inference stays on-device. No data leaves the user's machine, because there's nothing to send it to.
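One way to get "one artifact" from one source file is a single C-callable entry point that builds natively and under Emscripten. The sketch below assumes an Emscripten toolchain for the WebAssembly build; `sketch_predict` and `ArgMax` are hypothetical names, and the argmax body is a placeholder where a compiled-in model would run.

```cpp
#ifdef __EMSCRIPTEN__
#include <emscripten/emscripten.h>
// Keeps the symbol alive so JavaScript can call it from the module.
#define SKETCH_EXPORT EMSCRIPTEN_KEEPALIVE
#else
#define SKETCH_EXPORT
#endif

namespace sketch {

// Placeholder "model": index of the largest input value. A real build
// would run the compiled-in network here instead.
inline int ArgMax(const float* values, int count) {
  int best = 0;
  for (int i = 1; i < count; ++i) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

}  // namespace sketch

// Same entry point for a native caller and for a browser host.
// Returns -1 on invalid input.
extern "C" SKETCH_EXPORT int sketch_predict(const float* input, int count) {
  if (input == nullptr || count <= 0) return -1;
  return sketch::ArgMax(input, count);
}
```

A native application links this function directly; a browser build exposes the same symbol from the WebAssembly module, so the model definition exists exactly once.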

One model, where you need it

Train your C++ model once, then run it on every target: as WebAssembly in the browser, or natively on iOS and Android. No rewrite, no conversion, no second model to keep in sync; the same source recompiles for each platform.

  • WASM: browsers and edge

    Compiled to WebAssembly SIMD; runs in the browser or in edge workers.

  • Native apps (C++)

    Linked into your binary: in-process, no IPC.

  • Game engines

    Per-frame inference inside the engine loop.

  • Embedded targets

    Fixed memory footprint, no runtime dependency.

  • Cloud instances

    Deploy to any environment; only a CPU is required.

  • Privacy-local features

    On-device inference; no data leaves the machine.

Project Status

Plain accounting — what's stable, what's in progress, and what's explicitly out. The framework is scoped on purpose.

In the box, stable

  • C++20 runtime + training core (Bazel build)
  • Layers: Linear, Conv2d, RNN, activations, Logits, Softmax, Embeddings
  • Optimizers: SGD, Adam · losses: squared, cross entropy, Deep Q
  • Multi-head models + MCTS
  • Optimized for AVX, AVX2, ARM NEON, and WebAssembly SIMD via Google Highway — covers x86, Apple Silicon, ARM Linux, and browsers

In progress

  • More demos, which drive the requirements
  • Generalization of the training pipeline beyond current demos
  • More documentation, articles and examples

Not in scope

  • GPU backend — focus is on edge and embedded
  • Import from ONNX and other model formats
  • Opening the code: I am unable to open-source it yet, for personal reasons. A September 2025 snapshot is on GitHub.

My notes from building UchenML

C++ machine learning compiled into real products — plus the C++ and performance rabbit holes I fall into around it. Technical, irregular by design.
