Run AI Models Entirely in the Browser Using WebAssembly + ONNX Runtime (No Backend Required)
Most devs assume running AI models requires Python, GPUs, or cloud APIs. But modern browsers can run full neural network inference using ONNX Runtime Web with WebAssembly: no backend, no cloud, no server.
In this tutorial, we’ll build a fully client-side inference engine that runs a real ONNX model (such as sentiment analysis or image classification) in the browser using WebAssembly. That makes it a good fit for privacy-focused tools, offline workflows, and local-first apps.
Step 1: Choose a Small ONNX Model
To keep things performant, pick a lightweight ONNX model. You can use:
TinyBERT
SqueezeNet
Let’s use a text model for simplicity — TinyBERT.
Download the ONNX model:
wget https://huggingface.co/onnx/tinybert-distilbert-base-uncased/resolve/main/model.onnx
Store this file in your public assets directory (e.g., public/models/model.onnx).
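Before wiring anything up, you can confirm the browser can actually fetch the file from that path. A minimal check, assuming the public/models/model.onnx layout above (adjust the path to your build setup):
const res = await fetch("/models/model.onnx", { method: "HEAD" });
console.log("model reachable:", res.ok);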
Step 2: Set Up ONNX Runtime Web
Install the ONNX Runtime Web package:
npm install onnxruntime-web
Then, initialize the inference session in your frontend code:
import * as ort from "onnxruntime-web";

let session;

async function initModel() {
  session = await ort.InferenceSession.create("/models/model.onnx", {
    executionProviders: ["wasm"],
  });
}
This loads the ONNX model into a WASM-based runtime, running entirely in-browser.
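Once the session exists, it is worth inspecting which input and output names the model actually expects; the names shown in the comments are typical for BERT-style exports, not guaranteed:
await initModel();
console.log("inputs:", session.inputNames);   // e.g. ["input_ids", "attention_mask"]
console.log("outputs:", session.outputNames); // e.g. ["logits"]
These names are what Steps 3 and 4 feed and read, so checking them here saves debugging later.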
Step 3: Tokenize Input Text (No HuggingFace Needed)
ONNX models expect pre-tokenized inputs. Instead of using HuggingFace or Python tokenizers, we’ll use a compact JavaScript tokenizer like bert-tokenizer (whatever tokenizer you choose, it must use the same vocabulary the model was trained with):
npm install bert-tokenizer
Then tokenize user input:
import BertTokenizer from "bert-tokenizer";
const tokenizer = new BertTokenizer();
const { input_ids, attention_mask } = tokenizer.encode("this is great!");
Prepare inputs for ONNX:
const input = {
  input_ids: new ort.Tensor("int64", BigInt64Array.from(input_ids, BigInt), [1, input_ids.length]),
  attention_mask: new ort.Tensor("int64", BigInt64Array.from(attention_mask, BigInt), [1, input_ids.length])
};
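Some BERT-family ONNX exports also expect a token_type_ids input (all zeros for a single sentence). A hedged sketch that adds it only when the model asks for it; verify the name against session.inputNames from Step 2:
if (session.inputNames.includes("token_type_ids")) {
  input.token_type_ids = new ort.Tensor(
    "int64",
    new BigInt64Array(input_ids.length), // zero-initialized
    [1, input_ids.length]
  );
}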
Step 4: Run Inference in the Browser
Now run the model, right in the user's browser:
const results = await session.run(input);
const logits = results.logits.data;
Interpret the logits for your task (e.g., choose the argmax index for classification).
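For a binary sentiment model, that interpretation can be a softmax followed by an argmax. A minimal sketch, assuming a two-class output where index 1 means positive (the label order is an assumption; check your model card):
function interpretLogits(logits) {
  const scores = Array.from(logits);
  const max = Math.max(...scores);
  const exps = scores.map((s) => Math.exp(s - max)); // numerically stable softmax
  const sum = exps.reduce((a, b) => a + b, 0);
  const probs = exps.map((e) => e / sum);
  const best = probs.indexOf(Math.max(...probs));
  return { label: best === 1 ? "positive" : "negative", confidence: probs[best] };
}

console.log(interpretLogits(logits));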
You’ve just run a transformer-based AI model with zero server calls.
Step 5: Add WebAssembly Optimizations (Optional)
ONNX Runtime Web also supports WebAssembly SIMD and multithreading when the browser allows them. Set these flags before creating the inference session:
ort.env.wasm.numThreads = 2;
ort.env.wasm.simd = true;
Enabling these can significantly improve inference speed.
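One caveat: WASM multithreading relies on SharedArrayBuffer, which browsers only expose on cross-origin isolated pages (served with the Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp headers). A defensive sketch that only turns threading on when the page qualifies; the thread count of 2 is illustrative:
if (typeof crossOriginIsolated !== "undefined" && crossOriginIsolated) {
  ort.env.wasm.numThreads = 2; // multithreaded WASM build
} else {
  ort.env.wasm.numThreads = 1; // single-threaded fallback
}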
✅ Pros: