fusevm — Documentation

ONE VM TO RUN THEM ALL.

FUSED SUPERINSTRUCTIONS. EXTENSION DISPATCH. 3-TIER CRANELIFT JIT.

224 opcodes · 17 op sections · 11 fused superinstructions · 29 shell ops · 61 AWK ops · 23,781 production lines · 7,636 #[test] fns · 3 JIT tiers · 14 language frontends

fusevm is the shared execution engine behind fourteen language frontends — zshrs, strykelang, awkrs, vimlrs, elisprs, rubylang, arb, pythonrs, phplang, and node-js (ten at full toolchain parity), plus the newer JVM-language slices javars, kotlinrs, scalars, and groovyrs. Any language frontend compiles to the same Op enum and gets fused hot-loop dispatch, extension opcode tables, stack-based execution with slot-indexed fast paths, and JIT eligibility analysis — for free. The VM doesn’t care which language produced the bytecodes.

stryke registers ~450 extended ops. zshrs registers ~20. awkrs registers ~95. elisprs registers 10. vimlrs takes the other route — ~510 builtin IDs through CallBuiltin rather than extended ops. They don’t conflict — each frontend owns its own ID space via Extended(u16, u8). Process control ops (pipes, redirects, globs, file tests) are first-class because multiple frontends need them.

[0x00] ARCHITECTURE

┌─────────────────────────────────────────────────────────────────┐
│                       LANGUAGE FRONTENDS                        │
│                                                                 │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────┐        │
│  │  strykelang   │  │     zshrs     │  │     awkrs     │        │
│  │ Perl 5 compat │  │ shell compiler│  │ awk compiler  │        │
│  │ ~450 ext ops  │  │ ~20 ext ops   │  │ ~95 ext ops   │        │
│  └───────┬───────┘  └───────┬───────┘  └───────┬───────┘        │
│          │     compile      │                  │                │
│          └──────────────────┼──────────────────┘                │
│                             ▼                                   │
│                   ┌─────────────────────┐                       │
│                   │    ChunkBuilder     │                       │
│                   │  .emit(Op, line)    │                       │
│                   │  .build() → Chunk   │                       │
│                   └─────────┬───────────┘                       │
│                             │                                   │
│               ┌─────────────┴─────────────┐                     │
│               ▼                           ▼                     │
│     ┌──────────────┐         ┌────────────────────────┐         │
│     │  VM::run()   │ ◀──────▶│  JitCompiler tiers     │         │
│     │  match-      │  trace  │  ┌──────────────────┐  │         │
│     │  dispatch    │  invoke │  │  Linear (instant)│  │         │
│     │  interpreter │ ◀──────▶│  │  Block (≥10x)    │  │         │
│     │              │  deopt  │  │  Tracing (≥50x)  │  │         │
│     └──────────────┘         │  └──────────────────┘  │         │
│                              │  Cranelift 0.130       │         │
│                              │  native x86-64/aarch64 │         │
│                              └────────────────────────┘         │
└─────────────────────────────────────────────────────────────────┘

The Chunk is the unit of compiled bytecodes: an op array, constant pool, name pool, line-number table, and slot count. The ChunkBuilder emits ops one at a time and resolves forward jumps on .build(). The VM executes via a match-dispatch loop over the op array. The JIT compiler analyzes chunks for eligibility and compiles hot paths to native code via Cranelift.
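How a builder can resolve forward jumps at build time is sketched below. This is a toy model under stated assumptions: `ToyOp`, `ToyBuilder`, `Label`, `new_label`, `bind`, and `emit_jump_if_false` are hypothetical names invented for illustration and are not fusevm's ChunkBuilder API. It only shows the general technique of emitting a placeholder target and patching it once the label position is known.

```rust
// Toy model only: none of these names are fusevm's real API.
#[derive(Debug, Clone, PartialEq)]
enum ToyOp {
    LoadInt(i64),
    JumpIfFalse(usize),
    Pop,
}

/// Hypothetical label handle handed out by `ToyBuilder::new_label`.
#[derive(Clone, Copy)]
struct Label(usize);

#[derive(Default)]
struct ToyBuilder {
    ops: Vec<ToyOp>,
    lines: Vec<u32>,
    /// label id -> resolved op index (None until `bind` is called)
    labels: Vec<Option<usize>>,
    /// (op index, label id) pairs whose jump target is still unknown
    pending: Vec<(usize, usize)>,
}

impl ToyBuilder {
    /// Append one op with its source line; returns the op's index.
    fn emit(&mut self, op: ToyOp, line: u32) -> usize {
        self.ops.push(op);
        self.lines.push(line);
        self.ops.len() - 1
    }

    fn new_label(&mut self) -> Label {
        self.labels.push(None);
        Label(self.labels.len() - 1)
    }

    /// Emit a jump with a placeholder target and remember it for patching.
    fn emit_jump_if_false(&mut self, to: Label, line: u32) {
        let at = self.emit(ToyOp::JumpIfFalse(usize::MAX), line);
        self.pending.push((at, to.0));
    }

    /// Bind a label to the index of the next op that will be emitted.
    fn bind(&mut self, label: Label) {
        self.labels[label.0] = Some(self.ops.len());
    }

    /// Patch every pending jump, then hand back the finished op array.
    fn build(mut self) -> Vec<ToyOp> {
        let pending = std::mem::take(&mut self.pending);
        for (at, label) in pending {
            let target = self.labels[label].expect("label was never bound");
            if let ToyOp::JumpIfFalse(t) = &mut self.ops[at] {
                *t = target;
            }
        }
        self.ops
    }
}
```

In this sketch the frontend never computes absolute indices itself; it binds the label wherever the branch should land and the builder fills in the number.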

Execution tiers — one semantic source of truth

fusevm can execute a chunk through the interpreter, the three JIT tiers, or the AOT path, and they all agree on what every op means because they route through (or fall back to) a single function, VM::exec_op (src/vm.rs), documented in-source as "the single source of truth for op semantics." The interpreter loop calls it per op; the JIT and AOT specialize performance for the hot scalar subset and bail or deopt back to exec_op for everything else. A new op is implemented once and every tier inherits it.

Tier	Entry	How it runs an op
Interpreter	`VM::run`	Dispatch loop calls `exec_op(ops, ip, …)` per op; the returned `ExecFlow` says continue or terminate
Linear / Block / Tracing JIT	`JitCompiler` (`src/jit.rs`)	Emits specialized Cranelift IR for the eligible integer/float/slot subset; anything ineligible bails or deopts back to the interpreter — back to `exec_op`
AOT	`aot::compile_object` (`src/aot.rs`)	Native driver, one Cranelift block per op; unspecialized ops call `exec_op` through the `extern "C"` `fusevm_aot_exec_op` shim (`VM::aot_exec_op`)

Because semantics never fork, the JIT/AOT specialize the scalar (int/float/bool/slot) ops that pay off and lean on exec_op for the string/array/hash/host tail. Deopt is just "resume exec_op at this ip": on a tracing-JIT guard miss, materialize_deopt_frames (src/vm.rs) rebuilds the value stack (stack_buf + per-entry stack_kinds, so floats bit-cast back through f64::from_bits) and the inlined call frames (return_ip + slot values), so the interpreter picks up mid-loop with byte-identical state.
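The "one semantic source of truth" idea can be sketched with a toy VM. Everything here is an assumption made for illustration: `ToyOp`, `Flow`, `ToyVm`, `run_from`, and `run_specialized` are hypothetical names, and the real `exec_op` / `ExecFlow` signatures are not reproduced. The point is only that a fast path may cover a known shape and then resume the shared per-op function at a given ip.

```rust
// Toy model only: ToyOp, Flow and ToyVm are illustrative, not fusevm's types.
#[derive(Clone, Copy)]
enum ToyOp {
    LoadInt(i64),
    Add,
    Mul,
}

enum Flow {
    Continue(usize), // index of the next op to run
    Halt,
}

struct ToyVm {
    stack: Vec<i64>,
}

impl ToyVm {
    /// The one place where op semantics live.
    fn exec_op(&mut self, ops: &[ToyOp], ip: usize) -> Flow {
        match ops[ip] {
            ToyOp::LoadInt(n) => self.stack.push(n),
            ToyOp::Add => {
                let b = self.stack.pop().expect("stack underflow");
                let a = self.stack.pop().expect("stack underflow");
                self.stack.push(a.wrapping_add(b));
            }
            ToyOp::Mul => {
                let b = self.stack.pop().expect("stack underflow");
                let a = self.stack.pop().expect("stack underflow");
                self.stack.push(a.wrapping_mul(b));
            }
        }
        if ip + 1 < ops.len() { Flow::Continue(ip + 1) } else { Flow::Halt }
    }

    /// Interpreter tier: a plain loop over `exec_op`, starting at `ip`.
    fn run_from(&mut self, ops: &[ToyOp], mut ip: usize) -> Option<i64> {
        while ip < ops.len() {
            match self.exec_op(ops, ip) {
                Flow::Continue(next) => ip = next,
                Flow::Halt => break,
            }
        }
        self.stack.last().copied()
    }

    /// Specialized tier: handles a leading `LoadInt, LoadInt, Add` directly,
    /// then resumes the interpreter at the first op it did not cover.
    fn run_specialized(&mut self, ops: &[ToyOp]) -> Option<i64> {
        let mut ip = 0;
        if let [ToyOp::LoadInt(a), ToyOp::LoadInt(b), ToyOp::Add, ..] = ops {
            self.stack.push(a.wrapping_add(*b));
            ip = 3; // "deopt": everything after this goes through exec_op
        }
        self.run_from(ops, ip)
    }
}
```

Because the fallback is the same `exec_op`, the two entry points in this sketch cannot disagree on ops the fast path does not cover.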

[0x01] QUICK START

# interpreter only
cargo add fusevm
# with Cranelift JIT (linear, block, tracing tiers)
cargo add fusevm --features jit
# JIT + persistent on-disk native-code cache
cargo add fusevm --features jit-disk-cache

Feature	Effect
`default`	None — pure-Rust interpreter, runtime deps `serde` / `tracing` / `glob` / `chrono`
`jit`	Cranelift-backed native JIT (linear, block, and tracing tiers). Pulls in Cranelift 0.130
`jit-disk-cache`	Persists compiled native code to `~/.cache/fusevm-jit` so codegen is skipped across process restarts. Implies `jit`; on by default once enabled. Pulls in `libc` for executable memory mapping

use fusevm::{Op, ChunkBuilder, VM, VMResult, Value};

let mut b = ChunkBuilder::new();
b.emit(Op::LoadInt(40), 1);
b.emit(Op::LoadInt(2), 1);
b.emit(Op::Add, 1);

let chunk = b.build();                 // keep the chunk so later snippets can reuse it
let mut vm = VM::new(chunk.clone());
match vm.run() {
    VMResult::Ok(val) => println!("result: {}", val.to_str()),  // "42"
    VMResult::Error(e) => eprintln!("error: {}", e),
    VMResult::Halted => {}
}

// Extension handler — register language-specific ops
let mut vm = VM::new(chunk);
vm.set_extension_handler(Box::new(|vm, id, arg| {
    match id {
        0 => { /* your custom op */ }
        1 => { /* another custom op */ }
        _ => {}
    }
}));

[0x02] FUSED SUPERINSTRUCTIONS

The performance secret. The compiler detects hot loop patterns and emits single ops instead of multi-op sequences. Each fused op eliminates N−1 dispatch cycles, stack pushes, and branch mispredictions from the hot path.

Fused Op	Replaces	Effect
`AccumSumLoop(sum, i, limit)`	`GetSlot + GetSlot + Add + SetSlot + PreInc + NumLt + JumpIfFalse`	Entire counted sum loop in one dispatch
`SlotIncLtIntJumpBack(slot, limit, target)`	`PreIncSlot + SlotLtIntJumpIfFalse`	Loop backedge in one dispatch
`ConcatConstLoop(const, s, i, limit)`	`LoadConst + ConcatAppendSlot + SlotIncLtIntJumpBack`	String append loop in one dispatch
`PushIntRangeLoop(arr, i, limit)`	`GetSlot + PushArray + ArrayLen + Pop + SlotIncLtIntJumpBack`	Array push loop in one dispatch
`AddAssignSlotVoid(a, b)`	`GetSlot + GetSlot + Add + SetSlot`	Void-context add-assign, no stack traffic
`PreIncSlotVoid(slot)`	`GetSlot + Inc + SetSlot`	Void-context increment, no stack traffic
`SlotLtIntJumpIfFalse(slot, int, target)`	`GetSlot + LoadInt + NumLt + JumpIfFalse`	Fused compare + branch, no stack traffic
`PreIncSlot(slot)`	`GetSlot + Inc + SetSlot + GetSlot`	Slot pre-increment with push
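To make the dispatch saving concrete, here is a minimal sketch that counts dispatches for an unfused counted-sum loop and for a single fused op. It is a toy four-op VM, not fusevm's Op enum: the variant names and field layout are assumptions chosen to mirror the table, and the unfused sequence is simplified to three ops.

```rust
// Toy model only: a four-op mini-VM, not fusevm's Op enum.
#[derive(Clone, Copy)]
enum ToyOp {
    /// slots[sum] += slots[i]
    AddAssign { sum: usize, i: usize },
    /// slots[i] += 1
    Inc { i: usize },
    /// if slots[i] < limit, jump back to `target`
    LtJump { i: usize, limit: i64, target: usize },
    /// Fused: the whole `while i < limit { sum += i; i += 1 }` loop.
    AccumSumLoop { sum: usize, i: usize, limit: i64 },
}

/// Runs `ops` and returns how many dispatches (match iterations) it took.
fn run(ops: &[ToyOp], slots: &mut [i64]) -> u64 {
    let mut ip = 0usize;
    let mut dispatches = 0u64;
    while ip < ops.len() {
        dispatches += 1;
        match ops[ip] {
            ToyOp::AddAssign { sum, i } => {
                slots[sum] += slots[i];
                ip += 1;
            }
            ToyOp::Inc { i } => {
                slots[i] += 1;
                ip += 1;
            }
            ToyOp::LtJump { i, limit, target } => {
                ip = if slots[i] < limit { target } else { ip + 1 };
            }
            ToyOp::AccumSumLoop { sum, i, limit } => {
                // The loop runs inside one dispatch; no per-iteration match.
                while slots[i] < limit {
                    slots[sum] += slots[i];
                    slots[i] += 1;
                }
                ip += 1;
            }
        }
    }
    dispatches
}
```

With a limit of 1,000 the unfused form in this sketch takes 3,000 dispatches and the fused form takes one, while both leave the same sum in the slot.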

[0x03] OP CATEGORIES

224 opcodes across 17 categories in src/op.rs. Every op is ≤24 bytes for cache-friendly dispatch.

Constants & Stack

~12 ops

LoadInt LoadFloat LoadConst LoadTrue LoadFalse LoadUndef Pop Dup Dup2 Swap Rot

Variables

~8 ops — name-indexed + slot-indexed fast paths

GetVar SetVar DeclareVar GetSlot SetSlot SlotArrayGet SlotArraySet

Arrays & Hashes

~25 ops — full collection primitives

ArrayPush ArrayPop ArrayShift ArrayLen MakeArray HashGet HashSet HashDelete HashKeys HashValues MakeHash

Arithmetic

9 ops

Add Sub Mul Div Mod Pow Negate Inc Dec

String

3 ops

Concat StringRepeat StringLen

Comparison

~14 ops — numeric + string + three-way

NumEq NumLt NumGe Spaceship StrEq StrLt StrCmp

Logical & Bitwise

9 ops

LogNot LogAnd LogOr BitAnd BitOr BitXor BitNot Shl Shr

Control Flow

5 ops — including short-circuit keep variants

Jump JumpIfTrue JumpIfFalse JumpIfTrueKeep JumpIfFalseKeep

Functions & Scope

5 ops

Call Return ReturnValue PushFrame PopFrame

Higher-Order

5 ops — block-based functional primitives

MapBlock GrepBlock SortBlock SortDefault ForEachBlock

I/O

3 ops

Print PrintLn ReadLine

Collections

2 ops — range generation

Range RangeStep

Fused

11 ops — hot-loop superinstructions (see [0x02] Fused Superinstructions)

AccumSumLoop SlotIncLtIntJumpBack ConcatConstLoop PushIntRangeLoop PreIncSlot PostIncSlot PreDecSlot PostDecSlot

Builtins

1 op — CallBuiltin(id, argc) dispatches 140 builtin IDs in shell_builtins.rs

CallBuiltin

Shell Ops

29 ops — first-class process control (see [0x04])

Exec PipelineBegin Redirect Glob TestFile CmdSubst RegexMatch

AWK Ops

61 ops — first-class Op::Awk* variants (see [0x05])

AwkFieldGet AwkPrint AwkStrtonum AwkDivJit AwkModJit AwkGensub AwkOrd AwkChr AwkMkbool AwkIntdiv

Extension

2 ops — frontend-specific dispatch (see [0x06])

Extended(u16, u8) ExtendedWide(u16, usize)

[0x04] SHELL OPS

Process control is universal enough that multiple frontends need it. These are first-class ops, not extensions — any frontend that targets fusevm gets pipes, redirects, globs, process substitution, and file tests for free.

Op	Description
`Exec(n)`	Spawn external command — pop N args, exec, push exit status
`ExecBg(n)`	Spawn background — like Exec but don’t wait
`PipelineBegin(n)` / `PipelineStage` / `PipelineEnd`	Set up, wire, and wait for N-stage pipeline
`Redirect(fd, op)`	Redirect fd — write, append, read, clobber, dup, both
`HereDoc(idx)` / `HereString`	Here-document from constant pool / here-string from stack
`CmdSubst(idx)`	Command substitution — capture stdout of subprogram
`SubshellBegin` / `SubshellEnd`	Isolate scope for subshell execution
`ProcessSubIn(idx)` / `ProcessSubOut(idx)`	Process substitution `<(cmd)` / `>(cmd)` — push FIFO path
`Glob` / `GlobRecursive`	Glob expand pattern from stack — recursive variant is parallel
`TestFile(test)`	File test: `-f` `-d` `-r` `-w` `-x` `-e` `-s` `-L` `-S` `-p` `-b` `-c`
`SetStatus` / `GetStatus`	Last exit status `$?`
`TrapSet(idx)` / `TrapCheck`	Signal trap handler registration + periodic trap check
`ExpandParam(mod)`	18 parameter expansion modifiers: `${:-}` `${:=}` `${:?}` `${:+}` `${#}` `${/}` `${^^}` etc.
`WordSplit` / `BraceExpand` / `TildeExpand`	IFS word split, brace expansion, tilde expansion

Shell Host trait

Shell-specific runtime ops (Glob, TildeExpand, BraceExpand, WordSplit, ExpandParam, CmdSubst, ProcessSubIn/Out, Redirect, HereDoc, HereString, PipelineBegin/Stage/End, SubshellBegin/End, TrapSet/TrapCheck, WithRedirectsBegin/End, CallFunction, StrMatch, RegexMatch) dispatch through the ShellHost trait. The frontend (zshrs) provides a real implementation; without one the VM uses minimal stubs that keep stack discipline correct. Sub-execution (command substitution, process substitution, trap handlers) is delivered to the host as &Chunk references taken from the parent's sub_chunks table — build them with ChunkBuilder::add_sub_chunk(sub) -> u16 and reference by index in Op::CmdSubst(idx), Op::ProcessSubIn(idx), Op::ProcessSubOut(idx), Op::TrapSet(idx).

use fusevm::{ShellHost, VM, Chunk, Value};

struct MyHost;
impl ShellHost for MyHost {
    fn glob(&mut self, pattern: &str, _recursive: bool) -> Vec<String> { vec![] }
    fn tilde_expand(&mut self, s: &str) -> String { s.into() }
    fn cmd_subst(&mut self, sub: &Chunk) -> String { String::new() }
    // … other methods have default impls
}

let mut vm = VM::new(chunk);
vm.set_shell_host(Box::new(MyHost));

[0x05] AWK OPS

61 first-class Op::Awk* variants dispatch through the AwkHost trait. AWK's data model (numeric-string duality, CONVFMT/OFMT coercion, $0/$n/NF field coupling, SUBSEP arrays, regex, getline/printf I/O) lives in the frontend (awkrs), so most AWK ops require a registered host; without one they stay inert but stack-balanced.

Twenty-nine builtins are the exception — they execute natively even with no host registered. Most are pure on fusevm::Value; rand/srand run against a VM-owned PRNG seed, and strftime/mktime read the system timezone but need no AWK runtime state.

Group	Host-free builtins
Strings	`substr` `index` `tolower` `toupper` scalar `length(s)`
Characters (gawk)	`ord` (first char → codepoint), `chr` (codepoint → char)
Math	`int` `sqrt` `sin` `cos` `exp` `log` `atan2` `intdiv` `intdiv0` `mkbool`
Bitwise (gawk)	`and` `or` `xor` `compl` `lshift` `rshift`
Conversion (gawk)	`strtonum` (`0x…` hex, `0…` octal, else longest decimal/float prefix)
Time (gawk)	`systime` `strftime` `mktime` (`chrono`-backed; local-tz + UTC)
PRNG (POSIX/gawk)	`rand` `srand` (glibc LCG over a VM-owned seed, deterministic without a host)

AwkDiv / AwkMod are POSIX awk float divide/modulo that raise a fatal "division by zero attempted" error on a zero divisor (vs the shell-arithmetic Op::Div/Op::Mod, which yield Undef/0); they are interpreter-only. AwkDivJit / AwkModJit are block-JIT-eligible variants with byte-identical interpreter semantics: the block JIT emits a guarded early-exit (compare divisor to 0.0, call the fusevm_jit_awk_div_trap libcall on equality and return a sentinel, else fdiv/fmod). Because the trap libcall is not a registered host-helper id, these chunks skip on-disk cache persistence (in-process JIT only) — frontends that emit only Op::Div/Op::Mod (zshrs/stryke) get byte-identical native code.
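The difference between the two zero-divisor policies can be sketched in a few lines. This is an illustration only: `ToyValue`, `shell_div`, `awk_div`, and `awk_div_guarded` are hypothetical names, and the NaN sentinel in the guarded variant is an assumption (the text above only says "a sentinel").

```rust
// Toy model only: these functions illustrate the two policies, not fusevm's code.
#[derive(Debug, PartialEq)]
enum ToyValue {
    Undef,
    Float(f64),
}

/// Shell-arithmetic flavour: a zero divisor yields Undef instead of failing.
fn shell_div(a: f64, b: f64) -> ToyValue {
    if b == 0.0 { ToyValue::Undef } else { ToyValue::Float(a / b) }
}

/// POSIX awk flavour: a zero divisor is a fatal error.
fn awk_div(a: f64, b: f64) -> Result<f64, String> {
    if b == 0.0 {
        Err("division by zero attempted".to_string())
    } else {
        Ok(a / b)
    }
}

/// Shape of a guarded early exit: compare the divisor to 0.0, record the trap
/// and return a sentinel on equality, otherwise divide.
/// The NaN sentinel is an assumption of this sketch.
fn awk_div_guarded(a: f64, b: f64, trapped: &mut bool) -> f64 {
    if b == 0.0 {
        *trapped = true;
        return f64::NAN;
    }
    a / b
}
```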

AWK control flow has no Value representation (next/nextfile/exit are statements, not expressions). Op::AwkSignal(code) carries it host-free: it halts the current chunk and stashes code in the VM, which the frontend driver reads via VM::awk_signal() after run(). zshrs/stryke never emit it, so awk_signal() stays None for them and Halted is byte-identical.

use fusevm::{VM, ChunkBuilder, Op, Value};

let mut b = ChunkBuilder::new();
let s = b.add_constant(Value::str("hello"));
b.emit(Op::LoadConst(s), 1);
b.emit(Op::LoadInt(2), 1);
b.emit(Op::LoadInt(3), 1);
b.emit(Op::AwkSubstr(3), 1);          // substr("hello", 2, 3)
let mut vm = VM::new(b.build());      // no set_awk_host needed
// vm.run() → "ell"

[0x06] EXTENSION MECHANISM

Language-specific opcodes use Extended(u16, u8), which dispatches through a handler table registered by the frontend. The u16 is the extension op ID (a u16 gives up to 65,536 distinct IDs per frontend). The u8 is an inline operand. ExtendedWide(u16, usize) carries a full usize payload for jump targets and large indices.

┌──────────────────────────────────────────────────────────────┐
│                   EXTENSION DISPATCH TABLE                   │
│                                                              │
│  strykelang frontend:                                        │
│    Extended(0, _)       → RegexMatch                         │
│    Extended(1, _)       → RegexSubst                         │
│    Extended(2, _)       → HashSlice                          │
│    Extended(3, _)       → ArraySlice                         │
│    ...                                                       │
│    Extended(449, _)     → PmapCollect                        │
│                                                              │
│  zshrs frontend:                                             │
│    Extended(0, _)       → HistoryExpand                      │
│    Extended(1, _)       → ZleWidget                          │
│    Extended(2, _)       → ZstyleLookup                       │
│    ...                                                       │
│    Extended(19, _)      → ModuleLoad                         │
│                                                              │
│  awkrs frontend:                                             │
│    Extended(0, _)       → FieldGet                           │
│    Extended(1, _)       → FieldSet                           │
│    Extended(2, _)       → PrintColumns                       │
│    ...                                                       │
│    Extended(94, _)      → GetlineCmd                         │
│                                                              │
│  vimlrs frontend (dispatches via CallBuiltin, not Extended): │
│    CallBuiltin(3000, _) → GetVar                             │
│    CallBuiltin(3001, _) → SetVar                             │
│    ...                                                       │
│    CallBuiltin(3520, _) → FoldTextResult (~510 IDs)          │
│                                                              │
│  elisprs frontend:                                           │
│    Extended(0, _)       → Truthy                             │
│    Extended(1, _)       → Call                               │
│    Extended(2, _)       → GetVar                             │
│    ...                                                       │
│    Extended(9, _)       → MakeClosure                        │
│                                                              │
│  No conflicts: each frontend owns its own ID space.          │
└──────────────────────────────────────────────────────────────┘
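Why two frontends can both use ID 0 without colliding is sketched below with a toy VM that owns one handler at a time. All names (`ToyVm`, `Handler`, `exec_extended`, `frontend_a`, `frontend_b`) and the handler signature are assumptions for illustration; fusevm's actual `set_extension_handler` closure type is not reproduced here.

```rust
// Toy model only: the handler signature is an assumption, not fusevm's.
type Handler = Box<dyn FnMut(&mut Vec<i64>, u16, u8)>;

struct ToyVm {
    stack: Vec<i64>,
    ext: Option<Handler>,
}

impl ToyVm {
    fn new() -> Self {
        ToyVm { stack: Vec::new(), ext: None }
    }

    fn set_extension_handler(&mut self, h: Handler) {
        self.ext = Some(h);
    }

    /// What an `Extended(id, arg)` op would do in this toy: forward to the handler.
    /// With no handler registered the op is a no-op.
    fn exec_extended(&mut self, id: u16, arg: u8) {
        if let Some(h) = self.ext.as_mut() {
            h(&mut self.stack, id, arg);
        }
    }
}

/// Hypothetical frontend A: ID 0 means "push arg doubled".
fn frontend_a() -> Handler {
    Box::new(|stack: &mut Vec<i64>, id: u16, arg: u8| match id {
        0 => stack.push(i64::from(arg) * 2),
        _ => {}
    })
}

/// Hypothetical frontend B: ID 0 means "push arg plus 100".
fn frontend_b() -> Handler {
    Box::new(|stack: &mut Vec<i64>, id: u16, arg: u8| match id {
        0 => stack.push(i64::from(arg) + 100),
        _ => {}
    })
}
```

The ID only has meaning relative to the handler installed on that VM, which is why the numbering can restart at zero for every frontend.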

[0x07] JIT COMPILATION

The JitCompiler runs three tiers in increasing order of optimization power and compile cost. Each tier covers a disjoint slice of the workload — they don't compete. Compile-time decisions and runtime invocation are mediated through one stateless JitCompiler handle (the actual cache is thread-local).

Tier	Trigger	Coverage	Speculation
`Linear`	first call	Straight-line expression chunks; returns `Value` (int or float)	None — IR matches bytecode exactly
`Block`	≥ 1 invocation (default)	Whole-chunk CFG (loops, branches, fused backedges)	None — slot ops assume i64
`Tracing`	≥ 50 backedges through any loop header	Hot path through anything; recorded loop body compiled with type-specialized IR	Slot-type entry guard + per-branch `brif` guards; deopts to interpreter on guard miss

The block (default 1) and tracing (default 50) warmup thresholds control how much activity a tier waits for before compiling: invocations of the chunk for the block tier, backedges through a loop header for the tracing tier. Both are tunable per process via the FUSEVM_JIT_BLOCK_THRESHOLD and FUSEVM_JIT_TRACE_THRESHOLD environment variables (read once per thread when the JIT is first touched), or per thread via TraceJitConfig + JitCompiler::set_config. For workloads that re-run the same scripts repeatedly, pair a low warmup with the jit-disk-cache feature (active by default once it is compiled in): the warmup picks when a tier engages and the disk cache makes the native code nearly free to reload on the next run, giving AOT-like speed without explicit AOT. Setting FUSEVM_JIT_BLOCK_THRESHOLD=0 is the most aggressive choice (block-compile every eligible chunk on its first run, then reload from cache); raise the thresholds again for scripts that genuinely run only once.
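A minimal sketch of reading such a threshold and applying it follows. The helper names (`parse_threshold`, `block_threshold_from_env`, `should_compile`) are hypothetical, and the fallback to the default on an unset or malformed value is an assumption; the text above does not say how fusevm treats bad input.

```rust
/// Parse a warmup threshold, falling back to `default` when the variable is
/// unset or malformed (the fallback behaviour is an assumption of this sketch).
fn parse_threshold(raw: Option<&str>, default: u32) -> u32 {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .unwrap_or(default)
}

/// Read the block-tier threshold from the environment, default 1.
fn block_threshold_from_env() -> u32 {
    let raw = std::env::var("FUSEVM_JIT_BLOCK_THRESHOLD").ok();
    parse_threshold(raw.as_deref(), 1)
}

/// A tier engages once the number of completed runs reaches the threshold,
/// so a threshold of 0 compiles before the very first run.
fn should_compile(completed_runs: u32, threshold: u32) -> bool {
    completed_runs >= threshold
}
```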

Tracing JIT is opt-in per VM via vm.enable_tracing_jit(). When enabled, VM::run() auto-dispatches to all three tiers in priority order (phase 10): block JIT first if the chunk is fully eligible (zero VM-side overhead, direct fn-ptr through the slot pointer); tracing JIT for hot loops in chunks block JIT can't take; interpreter for cold paths and edge cases. Block-eligible chunks short-circuit before tracing JIT records anything, so the two tiers never compete on the same chunk.

Tracing JIT capability matrix

Capability	Phase	Detail
Loop bodies, int slots, no calls	1	Loops with ≤`MAX_TRACE_SLOT` int slots, single backward closing branch
Cross-call inlining (branchless callees)	2	`Op::Call` inlines callee body into trace IR; per-frame slot-variable scope
Caller-frame internal branches with side-exits	3	`if`/`else` in caller frame compiles with `brif` guards + per-branch side-exit blocks
Callee-frame branches, frame materialization on deopt	4	Branches inside inlined callees; `DeoptInfo` out-param materializes synthetic `Frame`s on `vm.frames`
Value-stack reconstruction on deopt (Int + Float)	5 + 5b	Non-empty abstract stack at branch is OK; `stack_kinds` tag distinguishes Int from Float entries
Side-exit deopt counter + auto-blacklist	6	Per-trace `side_exit_count`; blacklist after `MAX_SIDE_EXITS` misses
Persistent metadata export/import	7	`TraceMetadata` (serde-serializable) round-trips through `trace_export` / `trace_import` (and `_all` bulk variants)
Bounded recursion inlining	8	Self-recursive calls inline up to `MAX_INLINE_RECURSION` levels deep before aborting
Side-trace stitching from hot deopt sites	9	Recorder splits `record_anchor_ip` (cache key) from `close_anchor_ip` (loop header); side traces compile via `trace_install_with_kind` and don't loop in their own IR; chained dispatch up to `MAX_TRACE_CHAIN` hops
Synergistic three-tier auto-dispatch	10	`VM::run()` consults block JIT first, then tracing JIT, then interpreter — block-eligible chunks short-circuit before any recording happens
Configurable thresholds + float slots + bulk persist	tune	`TraceJitConfig` for per-thread tuning; `SlotKind::Float` slots stored as i64 bit-patterns and bit-cast through; `trace_export_all` / `trace_import_all` for batch I/O
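The "float slots stored as i64 bit-patterns" row relies on a lossless reinterpretation that the Rust standard library exposes as `f64::to_bits` and `f64::from_bits`. The sketch below shows the round trip; `Kind`, `Boxed`, `to_raw`, and `from_raw` are hypothetical names standing in for the slot tag and value types.

```rust
/// Hypothetical slot tag mirroring the Int / Float split described above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Kind {
    Int,
    Float,
}

/// Hypothetical boxed scalar on the interpreter side.
#[derive(Debug, PartialEq)]
enum Boxed {
    Int(i64),
    Float(f64),
}

/// Store: every slot travels as a raw i64, with a tag saying how to read it.
fn to_raw(v: &Boxed) -> (i64, Kind) {
    match *v {
        Boxed::Int(n) => (n, Kind::Int),
        Boxed::Float(f) => (f.to_bits() as i64, Kind::Float),
    }
}

/// Load: the tag decides whether the bits are an integer or an f64 pattern.
fn from_raw(raw: i64, kind: Kind) -> Boxed {
    match kind {
        Kind::Int => Boxed::Int(raw),
        Kind::Float => Boxed::Float(f64::from_bits(raw as u64)),
    }
}
```

The casts between `u64` and `i64` only reinterpret the bits, so the float comes back exactly as stored.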

Persistent native-code disk cache

Enable the jit-disk-cache feature (cargo add fusevm --features jit-disk-cache) to cache compiled native code to disk, skipping Cranelift codegen across process restarts — a big win for workloads that re-launch the VM repeatedly (e.g. running a large test suite over and over). It covers all three tiers (linear, block, tracing) and is on by default once the feature is enabled, writing to ~/.cache/fusevm-jit. Override the directory with the FUSEVM_JIT_CACHE_DIR env var or JitCompiler::set_jit_cache_dir(Some(dir)); disable at runtime with FUSEVM_JIT_CACHE_DIR=off or set_jit_cache_dir(None). This is distinct from the caller-owned TraceMetadata export below: the disk cache persists the finished machine code, while TraceMetadata persists the recorder's decisions (and still pays codegen on restore).

Property	Detail
Tiers cached	Linear, block, and tracing — files tier-tagged `.lin.` / `.blk.` / `.trc.`
Keying	Chunk op-hash; tracing tier also keys on record-anchor IP + a content hash over recorded ops, IPs, slot types, and constants so divergent paths never collide
Loading	mmaps native code + re-patches a small relocation table; W^X handled via `pthread_jit_write_protect_np` + icache invalidation on Apple Silicon, `mprotect` elsewhere
Concurrency	Writes publish via unique temp file + atomic rename — safe for the many-processes-spawning-VMs workload
Size control	Total-size cap (default 256 MiB) with oldest-first (mtime) eviction down to 80% of the cap, applied opportunistically on write. Tune via `FUSEVM_JIT_CACHE_MAX_BYTES` (accepts `k`/`m`/`g` suffixes; `0`/`off`/`unlimited` disables eviction) or `set_jit_cache_max_bytes`; inspect with `jit_cache_size_bytes`, force a pass with `prune_jit_cache`, wipe with `clear_jit_cache`
Transparency	Only eliminates Cranelift codegen time — tier selection, warmup thresholds, and results are identical to an uncached run
Safety	Conservative: any chunk whose code carries a relocation other than a known host-helper call falls back to the in-memory JIT, so an untested target degrades to "no caching" rather than miscompiling
Benchmark	Cached block load ~35 µs vs ~152 µs cold codegen (`cargo bench --features jit-disk-cache --bench jit_disk_cache`)
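The size-control row can be sketched as two small functions: one that parses a cap with `k`/`m`/`g` suffixes, and one that picks files to evict oldest-first until the total drops to 80% of the cap. This is an illustration under assumptions: `Cap`, `parse_max_bytes`, and `plan_eviction` are hypothetical names, suffixes are treated as binary multiples (1024-based), and files are modelled as `(name, size, mtime)` tuples rather than real directory entries.

```rust
/// Hypothetical parsed form of a size cap.
#[derive(Debug, PartialEq)]
enum Cap {
    Unlimited,
    Bytes(u64),
}

/// Parse values like "256m", "1G", "4096", "off". Returns None on malformed input.
/// Treating k/m/g as powers of 1024 is an assumption of this sketch.
fn parse_max_bytes(raw: &str) -> Option<Cap> {
    let s = raw.trim().to_ascii_lowercase();
    if s == "0" || s == "off" || s == "unlimited" {
        return Some(Cap::Unlimited);
    }
    let (digits, mult) = match s.chars().last() {
        Some('k') => (&s[..s.len() - 1], 1u64 << 10),
        Some('m') => (&s[..s.len() - 1], 1u64 << 20),
        Some('g') => (&s[..s.len() - 1], 1u64 << 30),
        _ => (s.as_str(), 1u64),
    };
    digits
        .parse::<u64>()
        .ok()
        .map(|n| Cap::Bytes(n.saturating_mul(mult)))
}

/// Given (name, size in bytes, mtime) entries, return the names to delete:
/// nothing while the total is within `cap`, otherwise oldest-first until the
/// total is at or below 80% of `cap`.
fn plan_eviction(mut files: Vec<(String, u64, u64)>, cap: u64) -> Vec<String> {
    let mut total: u64 = files.iter().map(|f| f.1).sum();
    if total <= cap {
        return Vec::new();
    }
    let target = cap / 5 * 4; // 80% of the cap
    files.sort_by_key(|f| f.2); // oldest mtime first
    let mut evict = Vec::new();
    for (name, size, _mtime) in files {
        if total <= target {
            break;
        }
        total -= size;
        evict.push(name);
    }
    evict
}
```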

Speedup over interpreter

Apple M-series, criterion, cargo bench --features jit --bench jit_trace. "Block JIT (direct)" invokes JitCompiler::try_run_block with no VM around it — the floor through the JIT pipeline. "Tracing-JIT VM" measures vm.run() with enable_tracing_jit() set, which auto-dispatches block JIT for these block-eligible chunks (phase 10). The remaining gap between the two columns is purely VM construction + slot copy-in/out per vm.run() call.

Workload	Iterations	Interpreter	Block JIT (direct)	Tracing-JIT VM	VM vs Interp
`counter_loop`	1,000	23.4 µs	305 ns	506 ns	46x
`counter_loop`	10,000	235.5 µs	2.80 µs	2.96 µs	80x
`counter_loop`	100,000	2,474 µs	29.07 µs	27.88 µs	89x
`loop_with_branch`	1,000	39.8 µs	310 ns	487 ns	82x
`loop_with_branch`	10,000	410.7 µs	2.78 µs	2.97 µs	138x
`loop_with_branch`	100,000	4,058 µs	27.48 µs	27.75 µs	146x

Microbenchmarks measure tight integer counter loops in isolation — best case for any JIT. Real-world script speedup is bounded by Amdahl: most shell-script time goes to host calls (fork/exec, I/O, glob, builtins) which no JIT tier touches. Typical numeric inner loops see the kernel speedup; surrounding shell logic doesn't. Numbers above are representative of the JIT pipeline itself, not of any specific workload.
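The Amdahl bound mentioned above is easy to compute directly. In the sketch below, `amdahl` is a helper written for this illustration, and the 10% figure in the usage note is a hypothetical workload split, not a measurement from this project.

```rust
/// Amdahl's law: overall speedup when a fraction `p` of the run time
/// (0.0..=1.0) is accelerated by a factor `s`, and the rest is unchanged.
fn amdahl(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}
```

For example, if a script spent a hypothetical 10% of its time in a numeric loop that the JIT speeds up 89x, `amdahl(0.10, 89.0)` gives roughly 1.11x overall, because the remaining 90% (host calls) is untouched.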

Usage

use fusevm::{ChunkBuilder, JitCompiler, Op, TraceJitConfig, TraceMetadata, VM, VMResult, Value};

let mut b = ChunkBuilder::new();
// ... emit a counter loop ...
let chunk = b.build();

let mut vm = VM::new(chunk.clone());
// Phase 10: enabling tracing JIT also enables auto-dispatch through
// block JIT for fully-eligible chunks. Pick whichever tier fits.
vm.enable_tracing_jit();
match vm.run() {
    // Ran via block JIT, tracing JIT, or interpreter; the choice is automatic.
    VMResult::Ok(val) => println!("result: {}", val.to_str()),
    _ => {}
}

// Per-thread tuning of thresholds (optional).
let jit = JitCompiler::new();
jit.set_config(TraceJitConfig {
    trace_threshold: 20,        // arm earlier on hot loops
    block_threshold: 0,         // block-JIT the whole chunk on its first run (default is 1)
    max_side_exits: 100,        // be more patient with mismatches
    ..TraceJitConfig::defaults()
});

// Caller-owned persistence of recorder decisions (still re-runs codegen on
// restore). For zero-codegen restarts, prefer the jit-disk-cache feature above.
let metas = jit.trace_export_all(&chunk);
std::fs::write("traces.json", serde_json::to_vec(&metas)?)?;

// On next process start: re-import + re-compile in one pass.
let bytes = std::fs::read("traces.json")?;
let metas: Vec<TraceMetadata> = serde_json::from_slice(&bytes)?;
let n = jit.trace_import_all(&chunk, &metas);  // returns # restored

Type	Role
`JitCompiler`	Stateless handle over the thread-local trace + block caches; entry point for all tier APIs
`JitExtension`	Frontend-provided trait registering language-specific extended op JIT support
`TraceJitConfig`	Per-thread tunable thresholds (`trace_threshold`, `block_threshold`, `max_side_exits`, `max_inline_recursion`, `max_trace_chain`, `max_trace_len`)
`SlotKind`	Slot type tag (Int / Float) for the tracing JIT's entry guard. Float slots stored as i64 bit-patterns and bit-cast through
`TraceLookup`	Dispatch outcome at a backward branch (`NotHot` / `StartRecording` / `Ran` / `GuardMismatch` / `Skip`)
`DeoptInfo` / `DeoptFrame`	`#[repr(C)]` out-params trace fns populate to materialize inlined frames + value-stack on side-exit
`TraceMetadata`	Serde-serializable record for persistent trace cache (phase 7); bulk variants `trace_export_all` / `trace_import_all`

[0x08] AHEAD-OF-TIME COMPILATION

The aot feature (src/aot.rs) compiles a whole Chunk to a native object file via Cranelift's ObjectModule, then links it against the fusevm runtime into a standalone executable — with no interpreter dispatch loop at run time. It's a closed-world compiler shared by every frontend, so AOT lives here once and each frontend's --build calls into it. Enable with cargo add fusevm --features aot (implies jit).

Threaded-code baseline

The bytecode dispatch loop (VM::run) is replaced by a native function with one Cranelift block per op. Each op block calls the per-op runtime step (VM::aot_exec_op, reached through the extern "C" fusevm_aot_exec_op shim), which runs that op via the same VM::exec_op the interpreter uses, and returns the next instruction index (or -1 to terminate). The native code branches on that through a central dispatch block.

entry → dispatch(0)
dispatch(ip): br_table ip → [block_0, …, block_{n-1}] (default → ret)
block_i:      next = exec_op(vm, i); if next < 0 → ret else → dispatch(next)
ret:          finish(vm); return

Routing every op through dispatch (rather than static fall-through) keeps the lowering uniform for data-dependent targets — Op::Jump, the JumpIf* family, and intra-chunk Op::Call/Op::Return, whose target is only known at run time — without the native code ever reading the VM struct layout. The interpreter dispatch loop is gone; the work each op does is unchanged.
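The control shape of the threaded-code driver can be modelled in plain Rust. This is a toy model, not generated code: `ToyOp`, `ToyVm`, `step`, and `driver` are hypothetical names, with `step` standing in for the per-op runtime call and `driver` standing in for the native function that only routes on the returned index.

```rust
// Toy model only: a stand-in for the generated driver, not Cranelift output.
#[derive(Clone, Copy)]
enum ToyOp {
    Push(i64),
    Dec,
    JumpIfNonZero(usize),
    Halt,
}

struct ToyVm {
    stack: Vec<i64>,
    ops: Vec<ToyOp>,
}

/// Per-op runtime step: run op `i`, return the next index, or -1 to terminate.
fn step(vm: &mut ToyVm, i: usize) -> i64 {
    let op = vm.ops[i];
    match op {
        ToyOp::Push(n) => {
            vm.stack.push(n);
            (i + 1) as i64
        }
        ToyOp::Dec => {
            *vm.stack.last_mut().expect("stack underflow") -= 1;
            (i + 1) as i64
        }
        ToyOp::JumpIfNonZero(target) => {
            // The target is data-dependent, so only the step knows it.
            if *vm.stack.last().expect("stack underflow") != 0 {
                target as i64
            } else {
                (i + 1) as i64
            }
        }
        ToyOp::Halt => -1,
    }
}

/// Driver: never inspects the VM, only branches on what `step` returns.
/// An out-of-range index behaves like the default arm of the jump table.
fn driver(vm: &mut ToyVm) -> u64 {
    let n = vm.ops.len() as i64;
    let mut ip = 0i64;
    let mut steps = 0u64;
    while ip >= 0 && ip < n {
        ip = step(vm, ip as usize);
        steps += 1;
    }
    steps
}
```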

Native op specialization

Layered on top of the threaded path, build_entry lowers chunks that are scalar computations directly to native IR (no per-op shim call). analyze_native runs an abstract interpretation over the operand stack (tracking int-vs-float Kinds, finding basic-block leaders, checking join consistency) and, when a region qualifies, build_entry_native emits one Cranelift block per leader with the operand stack held in frontend Variables (an i64 and an f64 per stack position). A fully-scalar loop runs entirely in registers; only the final result is boxed back into the VM. This covers:

Group	Native-lowered ops
Arithmetic / comparison	Integer and float, with `int→float` promotion mirroring the interpreter, `Mod` (integer `srem` with trap-divisor guards, or an `fmod` libcall for floats), and `Pow`/`PowFloat` via a `powf` libcall
Math intrinsics	`Abs`/`Sqrt`/`Ceil`/`Floor`/`Trunc`/`Round` as single instructions; `Sin`/`Cos`/`Tan`/`Exp`/`Log`/`Atan2` via libcalls; `GcdInt`/`LcmInt` as internal Euclid loops; awk scalar ops (`AwkDiv`/`AwkMod` + JIT twins, `AwkSqrtJit`/`AwkLogJit`)
Bitwise / control / stack	Bitwise/shift, `Inc`/`Dec`, booleans (`LoadTrue`/`LoadFalse`/`LogNot`/`LogAnd`/`LogOr`), three-way `Spaceship`, stack shuffles, native control flow (`Jump`/`JumpIf`, incl. value-keeping `JumpIfKeep`)
Slots & globals	Integer slots and globals (`GetVar`/`SetVar`/`DeclareVar`) held in SSA registers under a definite-assignment analysis, plus the fused hot-loop slot super-ops (`PreIncSlot`/…, `AddAssignSlotVoid`, `SlotIncLtIntJumpBack`, `AccumSumLoop` — whose inner `while i < limit` becomes a real native loop)

Inline / shim boundary

Chunks that mix scalar work with heap ops don't fall back wholesale. For sink ops (Print/PrintLn) the native code spills the top register scalars onto the boxed vm.stack (per Kind), runs the op via the shim, and continues — so a hot numeric loop with embedded output stays native. For source ops whose result kind is statically known (AwkGetFieldNum, always Float) it runs the op via the shim then reloads the pushed value into a register with no type guard. Slots and globals are typed by a chunk-wide inferred kind, so a float accumulator (sum += 0.5) lowers to an f64 register.

Partial deopt — one-way exit to the interpreter

Anything the native path can't handle at a given op — a string/array/hash/heap op, a heap constant load, or an operand-type mismatch — becomes a deopt point: the analysis lowers everything around it, stops propagating past it, and codegen emits a deopt there. emit_deopt writes the definitely-assigned register-cached slots/globals back to the VM (a merely maybe-assigned slot at a deopt point forces a wholesale threaded fallback, since a register can't distinguish a real 0 from Undef), spills the live operand stack, and calls fusevm_aot_resume to hand the rest of the run to the interpreter at the deopt ip. Op::Div uses this for its rare divide-by-zero (native fdiv on the common path, deopt only on a zero divisor); GetStatus ($?) is lowered as a statically-typed Status source. A chunk falls back to threaded wholesale only for genuinely structural reasons (stack underflow, inconsistent kind join, mixed-kind slot, non-numeric final result).

build_entry is generic over Cranelift's Module, so the in-memory JIT path that validates the compiler (run_chunk_native) and the on-disk ObjectModule path share identical codegen.

API	Purpose
`aot::compile_object(&chunk, path)`	Emit a relocatable `.o` exporting `fusevm_aot_entry` plus the serialized chunk (`fusevm_aot_chunk_blob` / `…_len`)
`aot::run_chunk_native(&chunk, register)`	Compile in-process via Cranelift and run it — validates codegen end to end
`aot::fusevm_aot_run_embedded()`	Runtime entry for a linked binary: rebuilds the VM from the embedded chunk, calls the frontend's `fusevm_aot_register_builtins`, runs the native entry, maps the result to an exit code

Link the emitted object against a frontend runtime (which provides fusevm_aot_register_builtins) to produce the standalone binary. The staticlib crate-type in Cargo.toml builds libfusevm.a so the object can be linked against the runtime.

[0x09] VMPOOL

VMPool recycles VM instances so callers running many short-lived chunks (REPL, eval loops, batch evaluation) can skip the per-call VM::new() cost. acquire pops a recycled VM and resets its state via VM::reset; release returns it for reuse.

use fusevm::{ChunkBuilder, Op, VMPool, VMResult, Value};

let mut pool = VMPool::new();
for _ in 0..1000 {
    let mut b = ChunkBuilder::new();
    b.emit(Op::LoadInt(40), 1);
    b.emit(Op::LoadInt(2), 1);
    b.emit(Op::Add, 1);
    pool.with(b.build(), |vm| {
        assert!(matches!(vm.run(), VMResult::Ok(Value::Int(42))));
    });
}

The pool wins for chunks where VM::new() cost dominates the run: large globals/name pools (>16 entries, where reset's resize amortizes), many slots (frame Vec capacity is preserved across reuse), or multi-chunk evaluation loops with non-trivial chunk shapes. For uniform tight loops on tiny chunks the pool is actually slower (reset does more bookkeeping than the VM::new() work it saves), so pooling is offered as an option and callers choose. All fourteen sibling frontends (strykelang, awkrs, zshrs, vimlrs, elisprs, rubylang, phplang, pythonrs, node-js, arb, javars, kotlinrs, scalars, groovyrs) drive fusevm::VM through bridge layers backed by a frontend-owned VMPool.
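The acquire / reset / release cycle can be sketched with a toy pool. The types below (`ToyVm`, `ToyPool`) are hypothetical and far simpler than the real VMPool; the sketch only shows how a recycled instance keeps its allocations while its logical state is cleared.

```rust
// Toy model only: ToyVm / ToyPool are illustrative, not fusevm's types.
struct ToyVm {
    stack: Vec<i64>,
    program: Vec<i64>,
}

impl ToyVm {
    fn new(program: Vec<i64>) -> Self {
        ToyVm { stack: Vec::new(), program }
    }

    /// Clear logical state but keep the stack's allocation for reuse.
    fn reset(&mut self, program: Vec<i64>) {
        self.stack.clear();
        self.program = program;
    }

    /// "Run": push every program word, then return their sum.
    fn run(&mut self) -> i64 {
        for &n in &self.program {
            self.stack.push(n);
        }
        self.stack.iter().sum()
    }
}

#[derive(Default)]
struct ToyPool {
    free: Vec<ToyVm>,
}

impl ToyPool {
    /// Pop a recycled VM and reset it, or build a fresh one if the pool is empty.
    fn acquire(&mut self, program: Vec<i64>) -> ToyVm {
        match self.free.pop() {
            Some(mut vm) => {
                vm.reset(program);
                vm
            }
            None => ToyVm::new(program),
        }
    }

    /// Return a VM to the pool for later reuse.
    fn release(&mut self, vm: ToyVm) {
        self.free.push(vm);
    }

    /// Scoped helper: acquire, run the closure, release.
    fn with<R>(&mut self, program: Vec<i64>, f: impl FnOnce(&mut ToyVm) -> R) -> R {
        let mut vm = self.acquire(program);
        let out = f(&mut vm);
        self.release(vm);
        out
    }
}
```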

[0x0A] BENCHMARKS

criterion on Apple M-series. cargo bench for all; cargo bench --features jit --bench jit_vs_interp for JIT comparisons. HTML report at target/criterion/report/index.html. Microbenchmarks measure tight loops in isolation — best case for any JIT; real-world script speedup is Amdahl-bounded by host calls (fork/exec, I/O, glob) which no JIT tier touches.

Classic algorithms

Benchmark	Time	Ops/sec
`fib_iterative(35)`	2.7 µs	374k
`fib_recursive(20)` — 21,891 calls	1.28 ms	783
`ackermann(3,4)` — 10,547 calls	774 µs	1.3k
`sum(1..1M)` fused `AccumSumLoop`	142 ns	7.0M
`sum(1..1M)` unfused loop ops	31.0 ms	32
`nested_loop(100×100)`	352 µs	2.8k
`dispatch_nop_1M` — raw dispatch overhead	819 µs	1.22 Gops/sec
`string_build(10k)` via `ConcatConstLoop`	11.9 µs	84k

Interpreter vs Cranelift JIT vs native Rust

Slot-based inputs prevent constant folding — apples-to-apples. The linear JIT is consistently ~1.8x slower than LLVM -O3 on real computation and 13–51x faster than the interpreter.

Workload	Interpreter	JIT (cached)	Native Rust	JIT vs interp
`slot_mixed × 100`	2.2 µs	75 ns	42 ns	29x
`slot_bitwise × 200`	6.6 µs	130 ns	74 ns	51x
`slot_float × 200`	3.1 µs	246 ns	137 ns	13x

Block JIT — loops and branches

Benchmark	Interpreter	Block JIT	Speedup
`sum(1..1M)` unfused loop	30.0 ms	315 µs	95x
`nested_loop(100×100)`	340 µs	9.5 µs	36x

[0x0B] VALUE REPRESENTATION

Value is a tagged enum with fast-path immediates for numbers and booleans, and heap types for strings, arrays, and hashes. String coercion returns Cow<str> via as_str_cow() — borrows Str variants without allocation. Array and hash mutations operate in-place on globals, eliminating clone-modify-writeback.

Variant	Representation	Notes
`Undef`	Tag only	Perl/shell `undef` / unset
`Int(i64)`	Inline 8 bytes	Fast-path integer arithmetic
`Float(f64)`	Inline 8 bytes	IEEE 754 double
`Bool(bool)`	Inline 1 byte	Logical ops, conditionals
`Str(Arc<String>)`	Heap, Arc-shared	UTF-8, `Cow<str>` coercion borrows without alloc
`Array(Vec<Value>)`	Heap, in-place mutation	Ordered collection, direct `ref mut` access
`Hash(HashMap<String, Value>)`	Heap, in-place mutation	Key-value map, direct `ref mut` access
`Status(i32)`	Inline 4 bytes	Exit status (`$?`)
`Ref(Box<Value>)`	Heap	Pass-by-reference, nested structures
`NativeFn(u16)`	Inline 2 bytes	Builtin function pointer ID
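The allocation-free string coercion can be illustrated with a reduced enum. This is a sketch, not fusevm's definition: `ToyValue` carries only five of the variants in the table, and the text chosen for `Bool` and `Undef` is an assumption (Perl-style "1" and empty string).

```rust
use std::borrow::Cow;
use std::sync::Arc;

// Toy subset of the variants in the table; not fusevm's actual definition.
enum ToyValue {
    Undef,
    Int(i64),
    Float(f64),
    Bool(bool),
    Str(Arc<String>),
}

impl ToyValue {
    /// Borrow when the text already exists, allocate only when it must be formatted.
    fn as_str_cow(&self) -> Cow<'_, str> {
        match self {
            ToyValue::Str(s) => Cow::Borrowed(s.as_str()),
            ToyValue::Int(n) => Cow::Owned(n.to_string()),
            ToyValue::Float(f) => Cow::Owned(f.to_string()),
            // Assumption: Perl-style truthiness text.
            ToyValue::Bool(b) => Cow::Borrowed(if *b { "1" } else { "" }),
            ToyValue::Undef => Cow::Borrowed(""),
        }
    }
}
```

Callers that only read the text (comparison, printing) pay nothing for `Str` values, and numeric values allocate once at the point of coercion.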

[0x0C] CHUNK STRUCTURE

A Chunk is the unit of compiled bytecodes. It contains everything the VM needs to execute a compilation unit — function body, script, or REPL line.

┌─────────────────────────────────────────────────────────┐
│                          Chunk                          │
│                                                         │
│  ops:          Vec<Op>             bytecodes            │
│  constants:    Vec<Value>          constant pool        │
│  names:        Vec<String>         variable name pool   │
│  lines:        Vec<u32>            source line numbers  │
│  sub_entries:  Vec<(u16,usize)>    subroutine IPs       │
│  block_ranges: Vec<(usize,usize)>  block spans          │
│  source:       String              source file name     │
│                                                         │
│  Built by ChunkBuilder::emit() + build()                │
│  Serializable via serde (JSON, bincode, etc.)           │
└─────────────────────────────────────────────────────────┘

[0xFF] LICENSE & LINKS

crates.io · docs.rs · GitHub · strykelang · zshrs · awkrs