>_FUSEVM REFERENCE
Language-agnostic bytecode VM with fused superinstructions and a three-tier Cranelift JIT: linear (instant), block (CFG, default threshold 1), tracing (hot loop body with side-exits, frame materialization, side-trace stitching). Tiers are auto-dispatched from VM::run() when tracing is enabled. The shared execution engine behind strykelang, zshrs, and awkrs. For full numbers, the subsystem breakdown, and the file inventory, see the Engineering Report.
ONE VM TO RUN THEM ALL.
FUSED SUPERINSTRUCTIONS. EXTENSION DISPATCH. 3-TIER CRANELIFT JIT.
fusevm is the shared execution engine behind strykelang, zshrs, and awkrs.
Any language frontend compiles to the same Op enum and gets fused hot-loop dispatch,
extension opcode tables, stack-based execution with slot-indexed fast paths, and JIT eligibility
analysis for free. The VM doesn’t care which language produced the bytecode.
stryke registers ~450 extended ops. zshrs registers ~20. awkrs registers ~95. They don’t conflict —
each frontend owns its own ID space via Extended(u16, u8). Process control ops
(pipes, redirects, globs, file tests) are first-class because multiple frontends need them.
[0x00] ARCHITECTURE
The Chunk is the unit of compiled bytecode: an op array, constant pool, name pool,
line-number table, and slot count. The ChunkBuilder emits ops one at a time and
resolves forward jumps on .build(). The VM executes via a match-dispatch
loop over the op array. The JIT compiler analyzes chunks for eligibility and compiles hot paths
to native code via Cranelift.
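The builder-then-dispatch flow described above can be sketched with a toy model. This is not fusevm's API: the Op variants, the Builder type, and its label/fixup methods are hypothetical stand-ins chosen to show how forward jumps can be patched at build time and then executed by a match-dispatch loop.

```rust
// Toy model only: these types mirror the description above, not fusevm's real API.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    LoadInt(i64),
    Add,
    JumpIfFalse(usize), // absolute target, patched at build time
    Halt,
}

#[derive(Default)]
struct Builder {
    ops: Vec<Op>,
    labels: Vec<Option<usize>>,  // label id -> resolved op index
    fixups: Vec<(usize, usize)>, // (op index, label id) awaiting resolution
}

impl Builder {
    fn new_label(&mut self) -> usize {
        self.labels.push(None);
        self.labels.len() - 1
    }
    fn emit(&mut self, op: Op) -> usize {
        self.ops.push(op);
        self.ops.len() - 1
    }
    // Emit a jump with a placeholder target and remember it for build().
    fn emit_jump_if_false(&mut self, label: usize) {
        let at = self.emit(Op::JumpIfFalse(usize::MAX));
        self.fixups.push((at, label));
    }
    // Bind the label to the index of the next op to be emitted.
    fn bind(&mut self, label: usize) {
        self.labels[label] = Some(self.ops.len());
    }
    // Resolve every recorded forward jump, then hand back the op array.
    fn build(mut self) -> Vec<Op> {
        for (at, label) in std::mem::take(&mut self.fixups) {
            let target = self.labels[label].expect("unbound label");
            self.ops[at] = Op::JumpIfFalse(target);
        }
        self.ops
    }
}

// Match-dispatch loop over the op array; returns the top of stack at halt.
fn run(ops: &[Op]) -> Option<i64> {
    let mut stack: Vec<i64> = Vec::new();
    let mut ip = 0;
    while ip < ops.len() {
        match ops[ip] {
            Op::LoadInt(n) => stack.push(n),
            Op::Add => {
                let b = stack.pop()?;
                let a = stack.pop()?;
                stack.push(a + b);
            }
            Op::JumpIfFalse(target) => {
                if stack.pop()? == 0 {
                    ip = target;
                    continue;
                }
            }
            Op::Halt => break,
        }
        ip += 1;
    }
    stack.pop()
}

// Computes 1, plus 100 only when `cond` is non-zero.
fn demo(cond: i64) -> Option<i64> {
    let mut b = Builder::default();
    let skip = b.new_label();
    b.emit(Op::LoadInt(1));
    b.emit(Op::LoadInt(cond));
    b.emit_jump_if_false(skip);
    b.emit(Op::LoadInt(100));
    b.emit(Op::Add);
    b.bind(skip);
    b.emit(Op::Halt);
    run(&b.build())
}

fn main() {
    println!("{:?} {:?}", demo(0), demo(1));
}
```

The placeholder-plus-fixup approach is one common way to resolve forward jumps in a single pass; the real ChunkBuilder may do it differently.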
[0x01] QUICK START
| Feature | Effect |
|---|---|
| default | None: pure-Rust interpreter; runtime deps serde / tracing / glob / chrono |
| jit | Cranelift-backed native JIT (linear, block, and tracing tiers). Pulls in Cranelift 0.130 |
| jit-disk-cache | Persists compiled native code to ~/.cache/fusevm-jit so codegen is skipped across process restarts. Implies jit; on by default once enabled. Pulls in libc for executable memory mapping |
[0x02] FUSED SUPERINSTRUCTIONS
The performance secret. The compiler detects hot loop patterns and emits single ops instead of multi-op sequences. Each fused op eliminates N−1 dispatch cycles, stack pushes, and branch mispredictions from the hot path.
| Fused Op | Replaces | Effect |
|---|---|---|
| AccumSumLoop(sum, i, limit) | GetSlot + GetSlot + Add + SetSlot + PreInc + NumLt + JumpIfFalse | Entire counted sum loop in one dispatch |
| SlotIncLtIntJumpBack(slot, limit, target) | PreIncSlot + SlotLtIntJumpIfFalse | Loop backedge in one dispatch |
| ConcatConstLoop(const, s, i, limit) | LoadConst + ConcatAppendSlot + SlotIncLtIntJumpBack | String append loop in one dispatch |
| PushIntRangeLoop(arr, i, limit) | GetSlot + PushArray + ArrayLen + Pop + SlotIncLtIntJumpBack | Array push loop in one dispatch |
| AddAssignSlotVoid(a, b) | GetSlot + GetSlot + Add + SetSlot | Void-context add-assign, no stack traffic |
| PreIncSlotVoid(slot) | GetSlot + Inc + SetSlot | Void-context increment, no stack traffic |
| SlotLtIntJumpIfFalse(slot, int, target) | GetSlot + LoadInt + NumLt + JumpIfFalse | Fused compare + branch, no stack traffic |
| PreIncSlot(slot) | GetSlot + Inc + SetSlot + GetSlot | Slot pre-increment with push |
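To make the dispatch savings concrete, here is a minimal sketch of a toy interpreter that counts dispatches for a two-op loop versus a single fused op. The op names are borrowed from the table, but the operand layouts and exact semantics shown here (for example, AccumSumLoop meaning "while i < limit, sum += i, i += 1") are assumptions for illustration, not fusevm's definitions.

```rust
// Toy interpreter: semantics below are assumed for illustration only.
#[derive(Clone, Copy)]
enum Op {
    AddAssignSlotVoid(usize, usize),         // slots[a] += slots[b]
    SlotIncLtIntJumpBack(usize, i64, usize), // slots[s] += 1; jump to target while slots[s] < limit
    AccumSumLoop(usize, usize, i64),         // whole loop: while i < limit { sum += i; i += 1 }
}

// Runs the ops and returns how many times the dispatch loop iterated.
fn run(ops: &[Op], slots: &mut [i64]) -> u64 {
    let mut dispatches = 0u64;
    let mut ip = 0;
    while ip < ops.len() {
        dispatches += 1;
        match ops[ip] {
            Op::AddAssignSlotVoid(a, b) => {
                let v = slots[b];
                slots[a] += v;
            }
            Op::SlotIncLtIntJumpBack(s, limit, target) => {
                slots[s] += 1;
                if slots[s] < limit {
                    ip = target;
                    continue;
                }
            }
            Op::AccumSumLoop(sum, i, limit) => {
                while slots[i] < limit {
                    let v = slots[i];
                    slots[sum] += v;
                    slots[i] += 1;
                }
            }
        }
        ip += 1;
    }
    dispatches
}

// Two ops per iteration: slot 0 is the sum, slot 1 is the counter.
fn sum_unfused(limit: i64) -> (i64, u64) {
    let mut slots = [0i64, 0];
    let ops = [Op::AddAssignSlotVoid(0, 1), Op::SlotIncLtIntJumpBack(1, limit, 0)];
    let dispatches = run(&ops, &mut slots);
    (slots[0], dispatches)
}

// The same loop expressed as one fused op.
fn sum_fused(limit: i64) -> (i64, u64) {
    let mut slots = [0i64, 0];
    let dispatches = run(&[Op::AccumSumLoop(0, 1, limit)], &mut slots);
    (slots[0], dispatches)
}

fn main() {
    println!("unfused: {:?}", sum_unfused(1000));
    println!("fused:   {:?}", sum_fused(1000));
}
```

Both variants compute the same sum; only the number of trips through the dispatch loop differs, which is the cost the fused ops are designed to remove.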
[0x03] OP CATEGORIES
224 opcodes across 17 sections in src/op.rs. Every op is ≤24 bytes for cache-friendly dispatch.
| Category | Count | Notes | Ops |
|---|---|---|---|
| Constants & Stack | ~12 ops | | LoadInt LoadFloat LoadConst LoadTrue LoadFalse LoadUndef Pop Dup Dup2 Swap Rot |
| Variables | ~8 ops | name-indexed + slot-indexed fast paths | GetVar SetVar DeclareVar GetSlot SetSlot SlotArrayGet SlotArraySet |
| Arrays & Hashes | ~25 ops | full collection primitives | ArrayPush ArrayPop ArrayShift ArrayLen MakeArray HashGet HashSet HashDelete HashKeys HashValues MakeHash |
| Arithmetic | 9 ops | | Add Sub Mul Div Mod Pow Negate Inc Dec |
| String | 3 ops | | Concat StringRepeat StringLen |
| Comparison | ~14 ops | numeric + string + three-way | NumEq NumLt NumGe Spaceship StrEq StrLt StrCmp |
| Logical & Bitwise | 9 ops | | LogNot LogAnd LogOr BitAnd BitOr BitXor BitNot Shl Shr |
| Control Flow | 5 ops | including short-circuit keep variants | Jump JumpIfTrue JumpIfFalse JumpIfTrueKeep JumpIfFalseKeep |
| Functions & Scope | 5 ops | | Call Return ReturnValue PushFrame PopFrame |
| Higher-Order | 5 ops | block-based functional primitives | MapBlock GrepBlock SortBlock SortDefault ForEachBlock |
| I/O | 3 ops | | Print PrintLn ReadLine |
| Collections | 2 ops | range generation | Range RangeStep |
| Fused | 11 ops | hot-loop superinstructions (see [0x02] Fused Superinstructions) | AccumSumLoop SlotIncLtIntJumpBack ConcatConstLoop PushIntRangeLoop PreIncSlot PostIncSlot PreDecSlot PostDecSlot |
| Builtins | 1 op | CallBuiltin(id, argc) dispatches 140 builtin IDs in shell_builtins.rs | CallBuiltin |
| Shell Ops | 29 ops | first-class process control (see [0x04]) | Exec PipelineBegin Redirect Glob TestFile CmdSubst RegexMatch |
| AWK Ops | 61 ops | first-class Op::Awk* variants (see [0x05]) | AwkFieldGet AwkPrint AwkStrtonum AwkDivJit AwkModJit AwkGensub AwkOrd AwkChr AwkMkbool AwkIntdiv |
| Extension | 2 ops | frontend-specific dispatch (see [0x06]) | Extended(u16, u8) ExtendedWide(u16, usize) |
[0x04] SHELL OPS
Process control is universal enough that multiple frontends need it. These are first-class ops, not extensions — any frontend that targets fusevm gets pipes, redirects, globs, process substitution, and file tests for free.
| Op | Description |
|---|---|
| Exec(n) | Spawn external command: pop N args, exec, push exit status |
| ExecBg(n) | Spawn in background: like Exec but don’t wait |
| PipelineBegin(n) / PipelineStage / PipelineEnd | Set up, wire, and wait for N-stage pipeline |
| Redirect(fd, op) | Redirect fd: write, append, read, clobber, dup, both |
| HereDoc(idx) / HereString | Here-document from constant pool / here-string from stack |
| CmdSubst(idx) | Command substitution: capture stdout of subprogram |
| SubshellBegin / SubshellEnd | Isolate scope for subshell execution |
| ProcessSubIn(idx) / ProcessSubOut(idx) | Process substitution <(cmd) / >(cmd): push FIFO path |
| Glob / GlobRecursive | Glob-expand pattern from stack; recursive variant is parallel |
| TestFile(test) | File test: -f -d -r -w -x -e -s -L -S -p -b -c |
| SetStatus / GetStatus | Last exit status $? |
| TrapSet(idx) / TrapCheck | Signal trap handler registration + periodic trap check |
| ExpandParam(mod) | 18 parameter expansion modifiers: ${:-} ${:=} ${:?} ${:+} ${#} ${/} ${^^} etc. |
| WordSplit / BraceExpand / TildeExpand | IFS word split, brace expansion, tilde expansion |
Shell Host trait
Shell-specific runtime ops (Glob, TildeExpand, BraceExpand, WordSplit,
ExpandParam, CmdSubst, ProcessSubIn/Out, Redirect,
HereDoc, HereString, PipelineBegin/Stage/End,
SubshellBegin/End, TrapSet/TrapCheck,
WithRedirectsBegin/End, CallFunction, StrMatch, RegexMatch)
dispatch through the ShellHost trait. The frontend (zshrs) provides a real implementation;
without one the VM uses minimal stubs that keep stack discipline correct. Sub-execution (command
substitution, process substitution, trap handlers) is delivered to the host as &Chunk
references taken from the parent's sub_chunks table — build them with
ChunkBuilder::add_sub_chunk(sub) -> u16 and reference by index in
Op::CmdSubst(idx), Op::ProcessSubIn(idx), Op::ProcessSubOut(idx),
Op::TrapSet(idx).
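The "minimal stubs that keep stack discipline correct" idea can be illustrated with a trait whose default method consumes and produces the same number of stack entries as a real implementation would. Everything here is hypothetical: the Host trait, its glob signature, and exec_glob are illustrative stand-ins, not the actual ShellHost interface.

```rust
// Hypothetical host trait: method names and signatures are illustrative only.
trait Host {
    // Default stub: returns the pattern unexpanded, so the op still
    // pops one value and pushes one value.
    fn glob(&mut self, pattern: &str) -> Vec<String> {
        vec![pattern.to_string()]
    }
}

// No frontend registered: every method falls back to its stub.
struct NoHost;
impl Host for NoHost {}

// A frontend-provided host backed by an in-memory file list.
struct ListHost {
    files: Vec<String>,
}

impl Host for ListHost {
    fn glob(&mut self, pattern: &str) -> Vec<String> {
        // Toy matcher: only understands a trailing "*" as a prefix match.
        let prefix = pattern.trim_end_matches('*');
        self.files
            .iter()
            .filter(|f| f.starts_with(prefix))
            .cloned()
            .collect()
    }
}

// Toy Glob op: pop the pattern, push the space-joined matches.
fn exec_glob(host: &mut dyn Host, stack: &mut Vec<String>) {
    let pattern = stack.pop().unwrap_or_default();
    let matches = host.glob(&pattern);
    stack.push(matches.join(" "));
}

fn main() {
    let mut stack = vec!["src/*".to_string()];
    exec_glob(&mut NoHost, &mut stack);
    println!("stub host: {:?}", stack);

    let mut host = ListHost {
        files: vec!["src/a.rs".into(), "src/b.rs".into(), "doc/x.md".into()],
    };
    exec_glob(&mut host, &mut stack);
    println!("real host: {:?}", stack);
}
```

Because the stub and the real implementation have the same stack effect, bytecode compiled for a hosted run stays balanced when executed without a host.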
[0x05] AWK OPS
61 first-class Op::Awk* variants dispatch through the AwkHost trait. AWK's data
model (numeric-string duality, CONVFMT/OFMT coercion,
$0/$n/NF field coupling, SUBSEP arrays, regex,
getline/printf I/O) lives in the frontend (awkrs), so most AWK ops require a
registered host; without one they stay inert but stack-balanced.
Twenty-nine builtins are the exception — they execute natively even with no
host registered. Most are pure on fusevm::Value; rand/srand run
against a VM-owned PRNG seed, and strftime/mktime read the system timezone but
need no AWK runtime state.
| Group | Host-free builtins |
|---|---|
| Strings | substr index tolower toupper scalar length(s) |
| Characters (gawk) | ord (first char → codepoint), chr (codepoint → char) |
| Math | int sqrt sin cos exp log atan2 intdiv intdiv0 mkbool |
| Bitwise (gawk) | and or xor compl lshift rshift |
| Conversion (gawk) | strtonum (0x… hex, 0… octal, else longest decimal/float prefix) |
| Time (gawk) | systime strftime mktime (chrono-backed; local-tz + UTC) |
| PRNG (POSIX/gawk) | rand srand (glibc LCG over a VM-owned seed, deterministic without a host) |
AwkDiv / AwkMod are the POSIX awk float divide/modulo ops. They raise a fatal
"division by zero attempted" error on a zero divisor (unlike the shell-arithmetic
Op::Div/Op::Mod, which yield Undef/0) and are interpreter-only.
AwkDivJit / AwkModJit are block-JIT-eligible variants with the same interpreter
semantics as AwkDiv / AwkMod: the block JIT emits a guarded early-exit (compare the divisor to
0.0; on equality, call the fusevm_jit_awk_div_trap libcall and return a sentinel;
otherwise fdiv/fmod). Because the trap libcall is not a registered host-helper id,
these chunks skip on-disk cache persistence (in-process JIT only). Frontends that emit only
Op::Div/Op::Mod (zshrs/stryke) are unaffected and get byte-identical native code.
AWK control flow has no Value representation (next/nextfile/exit
are statements, not expressions). Op::AwkSignal(code) carries it host-free: it halts the current
chunk and stashes code in the VM, which the frontend driver reads via VM::awk_signal()
after run(). zshrs/stryke never emit it, so awk_signal() stays None for
them and Halted is byte-identical.
[0x06] EXTENSION MECHANISM
Language-specific opcodes use Extended(u16, u8), which dispatches through a handler
table registered by the frontend. The u16 is the extension op ID (up to 65,536 distinct IDs
per frontend). The u8 is an inline operand. ExtendedWide(u16, usize)
carries a full usize payload for jump targets and large indices.
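A handler table keyed by a u16 ID with a u8 inline operand might look like the sketch below. The ExtTable type, the Handler signature, and the register/dispatch methods are assumptions made for this example; fusevm's actual registration API is not shown in this document.

```rust
use std::collections::HashMap;

// Hypothetical handler signature: operate on the value stack with the inline operand.
type Handler = fn(&mut Vec<i64>, u8);

// Hypothetical per-frontend table: extension op ID -> handler.
struct ExtTable {
    handlers: HashMap<u16, Handler>,
}

impl ExtTable {
    fn register(&mut self, id: u16, handler: Handler) {
        self.handlers.insert(id, handler);
    }

    // Returns false when no handler is registered for the ID.
    fn dispatch(&self, id: u16, operand: u8, stack: &mut Vec<i64>) -> bool {
        match self.handlers.get(&id) {
            Some(handler) => {
                handler(stack, operand);
                true
            }
            None => false,
        }
    }
}

// Example extension op: push the inline operand.
fn push_operand(stack: &mut Vec<i64>, operand: u8) {
    stack.push(i64::from(operand));
}

// Example extension op: add the inline operand to the top of stack.
fn add_operand(stack: &mut Vec<i64>, operand: u8) {
    if let Some(top) = stack.last_mut() {
        *top += i64::from(operand);
    }
}

// Registers two ops, runs them, then probes an unregistered ID.
fn demo() -> (Vec<i64>, bool) {
    let mut table = ExtTable { handlers: HashMap::new() };
    table.register(0, push_operand);
    table.register(1, add_operand);
    let mut stack = Vec::new();
    table.dispatch(0, 40, &mut stack);
    table.dispatch(1, 2, &mut stack);
    let unknown_handled = table.dispatch(9, 0, &mut stack);
    (stack, unknown_handled)
}

fn main() {
    println!("{:?}", demo());
}
```

Since each frontend owns its own table, ID 0 in one frontend and ID 0 in another never collide. A dense Vec indexed by ID would avoid hashing on the hot path; a HashMap is used here only to keep the sketch short.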
[0x07] JIT COMPILATION
The JitCompiler runs three tiers in increasing order of optimization power and
compile cost. Each tier covers a disjoint slice of the workload — they don't compete.
Compile-time decisions and runtime invocation are mediated through one stateless
JitCompiler handle (the actual cache is thread-local).
| Tier | Trigger | Coverage | Speculation |
|---|---|---|---|
| Linear | first call | Straight-line expression chunks; returns Value (int or float) | None; IR matches bytecode exactly |
| Block | ≥ 1 invocation (default) | Whole-chunk CFG (loops, branches, fused backedges) | None; slot ops assume i64 |
| Tracing | ≥ 50 backedges through any loop header | Hot path through anything; recorded loop body compiled with type-specialized IR | Slot-type entry guard + per-branch brif guards; deopts to interpreter on guard miss |
The block (default 1) and tracing (default 50) warmup thresholds control how much a chunk must run before a tier compiles it. Both are tunable per process via the FUSEVM_JIT_BLOCK_THRESHOLD and FUSEVM_JIT_TRACE_THRESHOLD environment variables (read once per thread when the JIT is first touched), or per thread via TraceJitConfig + JitCompiler::set_config. For workloads that re-run the same scripts repeatedly, pair a low warmup with the jit-disk-cache feature (caching is on by default once the feature is enabled): the warmup picks when a tier engages, and the disk cache lets the next run reload the native code instead of regenerating it, giving AOT-like speed without explicit AOT. Setting FUSEVM_JIT_BLOCK_THRESHOLD=0 is the most aggressive setting (block-compile every eligible chunk on its first run, then reload from cache); raise the thresholds again for scripts that genuinely run only once.
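As a rough sketch of how an embedding program could inspect the same two environment variables, the helper below parses a threshold with a fallback default. The parse_threshold function is hypothetical, and falling back to the default on an unparsable value is an assumption; the document does not say how fusevm itself treats malformed input.

```rust
use std::env;

// Hypothetical helper: parse a warmup threshold, falling back to `default`
// when the variable is unset or not a valid unsigned integer (assumed behavior).
fn parse_threshold(raw: Option<&str>, default: u32) -> u32 {
    raw.and_then(|s| s.trim().parse().ok()).unwrap_or(default)
}

fn main() {
    // Variable names and defaults (1 and 50) come from the text above.
    let block = parse_threshold(
        env::var("FUSEVM_JIT_BLOCK_THRESHOLD").ok().as_deref(),
        1,
    );
    let trace = parse_threshold(
        env::var("FUSEVM_JIT_TRACE_THRESHOLD").ok().as_deref(),
        50,
    );
    println!("block threshold = {block}, trace threshold = {trace}");
}
```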
Tracing JIT is opt-in per VM via vm.enable_tracing_jit(). When enabled,
VM::run() auto-dispatches to all three tiers in priority order (phase 10): block JIT
first if the chunk is fully eligible (zero VM-side overhead, direct fn-ptr through the slot pointer);
tracing JIT for hot loops in chunks block JIT can't take; interpreter for cold paths and edge cases.
Block-eligible chunks short-circuit before tracing JIT records anything, so the two tiers never
compete on the same chunk.
Tracing JIT capability matrix
| Capability | Phase | Detail |
|---|---|---|
| Loop bodies, int slots, no calls | 1 | Loops with ≤MAX_TRACE_SLOT int slots, single backward closing branch |
| Cross-call inlining (branchless callees) | 2 | Op::Call inlines callee body into trace IR; per-frame slot-variable scope |
| Caller-frame internal branches with side-exits | 3 | if/else in caller frame compiles with brif guards + per-branch side-exit blocks |
| Callee-frame branches, frame materialization on deopt | 4 | Branches inside inlined callees; DeoptInfo out-param materializes synthetic Frames on vm.frames |
| Value-stack reconstruction on deopt (Int + Float) | 5 + 5b | Non-empty abstract stack at branch is OK; stack_kinds tag distinguishes Int from Float entries |
| Side-exit deopt counter + auto-blacklist | 6 | Per-trace side_exit_count; blacklist after MAX_SIDE_EXITS misses |
| Persistent metadata export/import | 7 | TraceMetadata (serde-serializable) round-trips through trace_export / trace_import (and _all bulk variants) |
| Bounded recursion inlining | 8 | Self-recursive calls inline up to MAX_INLINE_RECURSION levels deep before aborting |
| Side-trace stitching from hot deopt sites | 9 | Recorder splits record_anchor_ip (cache key) from close_anchor_ip (loop header); side traces compile via trace_install_with_kind and don't loop in their own IR; chained dispatch up to MAX_TRACE_CHAIN hops |
| Synergistic three-tier auto-dispatch | 10 | VM::run() consults block JIT first, then tracing JIT, then interpreter — block-eligible chunks short-circuit before any recording happens |
| Configurable thresholds + float slots + bulk persist | tune | TraceJitConfig for per-thread tuning; SlotKind::Float slots stored as i64 bit-patterns and bit-cast through; trace_export_all / trace_import_all for batch I/O |
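The "Float slots stored as i64 bit-patterns" technique from the last row relies on a lossless reinterpretation between f64 and a 64-bit integer. A minimal sketch using only the standard library follows; the store_float and load_float helper names are hypothetical.

```rust
// Store an f64 in an i64 slot by reinterpreting its bits (no numeric conversion).
fn store_float(slots: &mut [i64], idx: usize, value: f64) {
    slots[idx] = value.to_bits() as i64;
}

// Recover the f64 from the same slot by reversing the bit-cast.
fn load_float(slots: &[i64], idx: usize) -> f64 {
    f64::from_bits(slots[idx] as u64)
}

fn main() {
    let mut slots = [0i64; 2];
    store_float(&mut slots, 0, 1.5);
    store_float(&mut slots, 1, -0.0);
    println!("{} {}", load_float(&slots, 0), load_float(&slots, 1));
}
```

Because the cast preserves every bit, values such as negative zero and NaN survive the round trip, which lets one uniform i64 slot array serve both slot kinds.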
Persistent native-code disk cache
Enable the jit-disk-cache feature (cargo add fusevm --features jit-disk-cache)
to cache compiled native code to disk, skipping Cranelift codegen across process
restarts — a big win for workloads that re-launch the VM repeatedly (e.g. running a large test
suite over and over). It covers all three tiers (linear, block, tracing) and is
on by default once the feature is enabled, writing to ~/.cache/fusevm-jit.
Override the directory with the FUSEVM_JIT_CACHE_DIR env var or
JitCompiler::set_jit_cache_dir(Some(dir)); disable at runtime with
FUSEVM_JIT_CACHE_DIR=off or set_jit_cache_dir(None). This is distinct from
the caller-owned TraceMetadata export below: the disk cache persists the finished
machine code, while TraceMetadata persists the recorder's decisions (and still pays
codegen on restore).
| Property | Detail |
|---|---|
| Tiers cached | Linear, block, and tracing — files tier-tagged .lin. / .blk. / .trc. |
| Keying | Chunk op-hash; tracing tier also keys on record-anchor IP + a content hash over recorded ops, IPs, slot types, and constants so divergent paths never collide |
| Loading | mmaps native code + re-patches a small relocation table; W^X handled via pthread_jit_write_protect_np + icache invalidation on Apple Silicon, mprotect elsewhere |
| Concurrency | Writes publish via unique temp file + atomic rename — safe for the many-processes-spawning-VMs workload |
| Size control | Total-size cap (default 256 MiB) with oldest-first (mtime) eviction down to 80% of the cap, applied opportunistically on write. Tune via FUSEVM_JIT_CACHE_MAX_BYTES (accepts k/m/g suffixes; 0/off/unlimited disables eviction) or set_jit_cache_max_bytes; inspect with jit_cache_size_bytes, force a pass with prune_jit_cache, wipe with clear_jit_cache |
| Transparency | Only eliminates Cranelift codegen time — tier selection, warmup thresholds, and results are identical to an uncached run |
| Safety | Conservative: any chunk whose code carries a relocation other than a known host-helper call falls back to the in-memory JIT, so an untested target degrades to "no caching" rather than miscompiling |
| Benchmark | Cached block load ~35 µs vs ~152 µs cold codegen (cargo bench --features jit-disk-cache --bench jit_disk_cache) |
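The size-control policy in the table (oldest-first eviction by mtime, down to 80% of the cap) can be modeled on an in-memory list. This is a sketch of the policy only: the Entry struct and prune function are hypothetical, and the real cache works on files rather than a Vec.

```rust
// Hypothetical in-memory stand-in for one cache file.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    name: String,
    mtime: u64, // modification time, larger is newer
    size: u64,  // bytes
}

fn entry(name: &str, mtime: u64, size: u64) -> Entry {
    Entry { name: name.to_string(), mtime, size }
}

// If the total exceeds `cap`, evict oldest entries until the total is at
// most 80% of `cap`. Returns the names of evicted entries, oldest first.
fn prune(entries: &mut Vec<Entry>, cap: u64) -> Vec<String> {
    let mut total: u64 = entries.iter().map(|e| e.size).sum();
    let mut evicted = Vec::new();
    if total <= cap {
        return evicted;
    }
    let target = cap / 5 * 4; // 80% of the cap in integer arithmetic
    entries.sort_by_key(|e| e.mtime); // oldest first
    while total > target && !entries.is_empty() {
        let oldest = entries.remove(0);
        total -= oldest.size;
        evicted.push(oldest.name);
    }
    evicted
}

fn main() {
    let mut entries = vec![entry("a", 3, 40), entry("b", 1, 40), entry("c", 2, 40)];
    let evicted = prune(&mut entries, 100);
    println!("evicted {:?}, kept {:?}", evicted, entries);
}
```

Pruning below the cap rather than exactly to it leaves headroom, so a burst of writes does not trigger an eviction pass on every write.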
Speedup over interpreter
Apple M-series, criterion, cargo bench --features jit --bench jit_trace. "Block JIT (direct)"
invokes JitCompiler::try_run_block with no VM around it — the floor through the JIT
pipeline. "Tracing-JIT VM" measures vm.run() with enable_tracing_jit() set,
which auto-dispatches block JIT for these block-eligible chunks (phase 10). The remaining gap
between the two columns is purely VM construction + slot copy-in/out per vm.run() call.
| Workload | Iterations | Interpreter | Block JIT (direct) | Tracing-JIT VM | VM vs Interp |
|---|---|---|---|---|---|
| counter_loop | 1,000 | 23.4 µs | 305 ns | 506 ns | 46x |
| counter_loop | 10,000 | 235.5 µs | 2.80 µs | 2.96 µs | 80x |
| counter_loop | 100,000 | 2,474 µs | 29.07 µs | 27.88 µs | 89x |
| loop_with_branch | 1,000 | 39.8 µs | 310 ns | 487 ns | 82x |
| loop_with_branch | 10,000 | 410.7 µs | 2.78 µs | 2.97 µs | 138x |
| loop_with_branch | 100,000 | 4,058 µs | 27.48 µs | 27.75 µs | 146x |
Microbenchmarks measure tight integer counter loops in isolation, the best case for any JIT. Real-world script speedup is bounded by Amdahl's law: most shell-script time goes to host calls (fork/exec, I/O, glob, builtins), which no JIT tier touches. Numeric inner loops typically see the kernel speedup; the surrounding shell logic doesn't. The numbers above are representative of the JIT pipeline itself, not of any specific workload.
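The Amdahl bound can be made concrete with a small calculation. The function below is the standard formula, overall = 1 / ((1 - p) + p / s), where p is the fraction of run time spent in JIT-compiled code and s is the speedup of that fraction. The 10% fraction used in main is an illustrative assumption, not a measurement from any fusevm workload.

```rust
// Amdahl's law: overall speedup when a fraction `p` of run time is sped up by `s`.
fn overall_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // Hypothetical script: 10% of time in a numeric loop that the JIT makes 100x faster.
    println!("10% hot, 100x kernel: {:.2}x overall", overall_speedup(0.1, 100.0));
    // Pure numeric kernel: the whole run benefits.
    println!("100% hot, 89x kernel: {:.2}x overall", overall_speedup(1.0, 89.0));
}
```

With only a tenth of the run time in the hot loop, even a 100x kernel speedup yields roughly an 11% end-to-end gain, which is why the table above should not be read as a whole-script prediction.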
Usage
| Type | Role |
|---|---|
| JitCompiler | Stateless handle over the thread-local trace + block caches; entry point for all tier APIs |
| JitExtension | Frontend-provided trait registering language-specific extended op JIT support |
| TraceJitConfig | Per-thread tunable thresholds (trace_threshold, block_threshold, max_side_exits, max_inline_recursion, max_trace_chain, max_trace_len) |
| SlotKind | Slot type tag (Int / Float) for the tracing JIT's entry guard. Float slots stored as i64 bit-patterns and bit-cast through |
| TraceLookup | Dispatch outcome at a backward branch (NotHot / StartRecording / Ran / GuardMismatch / Skip) |
| DeoptInfo / DeoptFrame | #[repr(C)] out-params trace fns populate to materialize inlined frames + value-stack on side-exit |
| TraceMetadata | Serde-serializable record for persistent trace cache (phase 7); bulk variants trace_export_all / trace_import_all |
[0x08] VMPOOL
VMPool recycles VM instances so callers running many short-lived chunks
(REPL, eval loops, batch evaluation) can skip the per-call VM::new() cost.
acquire pops a recycled VM and resets its state via VM::reset;
release returns it for reuse.
The pool wins for chunks where VM::new() cost dominates the run: large globals/name
pools (>16 entries, where reset's resize amortizes), many slots (frame Vec capacity is preserved
across reuse), or multi-chunk evaluation loops with non-trivial chunk shapes. For uniform tight loops
on tiny chunks the pool is actually slower, because reset does more bookkeeping than the
VM::new work it avoids, so the pool is an opt-in API and callers choose. All three sibling frontends
(strykelang, awkrs, zshrs) drive fusevm::VM through bridge layers backed by a frontend-owned
VMPool.
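The acquire/reset/release cycle, and the capacity-preservation effect that makes it pay off, can be shown with a toy pool. The Vm and Pool types here are simplified stand-ins written for this example, not fusevm's VM or VMPool.

```rust
// Toy VM: just two buffers whose allocations are worth keeping.
struct Vm {
    slots: Vec<i64>,
    stack: Vec<i64>,
}

impl Vm {
    fn new() -> Self {
        Vm { slots: Vec::new(), stack: Vec::new() }
    }

    // Clear state but keep the allocated capacity for the next run.
    fn reset(&mut self) {
        self.slots.clear();
        self.stack.clear();
    }
}

// Toy pool: a free list of recycled VMs.
struct Pool {
    free: Vec<Vm>,
}

impl Pool {
    // Pop a recycled VM and reset it, or build a fresh one if the pool is empty.
    fn acquire(&mut self) -> Vm {
        match self.free.pop() {
            Some(mut vm) => {
                vm.reset();
                vm
            }
            None => Vm::new(),
        }
    }

    // Return a VM for later reuse.
    fn release(&mut self, vm: Vm) {
        self.free.push(vm);
    }
}

fn main() {
    let mut pool = Pool { free: Vec::new() };
    let mut vm = pool.acquire();
    vm.stack.extend(0..1000);
    vm.slots.push(7);
    pool.release(vm);

    let reused = pool.acquire();
    println!(
        "stack len = {}, stack capacity >= 1000: {}",
        reused.stack.len(),
        reused.stack.capacity() >= 1000
    );
}
```

The reused VM starts empty but keeps its grown buffers, so later runs skip reallocation. For tiny chunks that never grow those buffers there is nothing to save, which matches the caveat above.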
[0x09] BENCHMARKS
Measured with criterion on Apple M-series. Run cargo bench for all benchmarks, or cargo bench --features jit --bench jit_vs_interp for the JIT comparisons. The HTML report is written to target/criterion/report/index.html.
Microbenchmarks measure tight loops in isolation — best case for any JIT; real-world script
speedup is Amdahl-bounded by host calls (fork/exec, I/O, glob) which no JIT tier touches.
Classic algorithms
| Benchmark | Time | Ops/sec |
|---|---|---|
| fib_iterative(35) | 2.7 µs | 374k |
| fib_recursive(20), 21,891 calls | 1.28 ms | 783 |
| ackermann(3,4), 10,547 calls | 774 µs | 1.3k |
| sum(1..1M) fused AccumSumLoop | 142 ns | 7.0M |
| sum(1..1M) unfused loop ops | 31.0 ms | 32 |
| nested_loop(100×100) | 352 µs | 2.8k |
| dispatch_nop_1M (raw dispatch overhead) | 819 µs | 1.22 Gops/sec |
| string_build(10k) via ConcatConstLoop | 11.9 µs | 84k |
Interpreter vs Cranelift JIT vs native Rust
Slot-based inputs prevent constant folding — apples-to-apples. The linear JIT is consistently
~1.8x slower than LLVM -O3 on real computation and 13–51x faster than the interpreter.
| Workload | Interpreter | JIT (cached) | Native Rust | JIT vs interp |
|---|---|---|---|---|
| slot_mixed × 100 | 2.2 µs | 75 ns | 42 ns | 29x |
| slot_bitwise × 200 | 6.6 µs | 130 ns | 74 ns | 51x |
| slot_float × 200 | 3.1 µs | 246 ns | 137 ns | 13x |
Block JIT — loops and branches
| Benchmark | Interpreter | Block JIT | Speedup |
|---|---|---|---|
| sum(1..1M) unfused loop | 30.0 ms | 315 µs | 95x |
| nested_loop(100×100) | 340 µs | 9.5 µs | 36x |
[0x0A] VALUE REPRESENTATION
Value is a tagged enum with fast-path immediates for numbers and booleans,
and heap types for strings, arrays, and hashes. String coercion returns Cow<str>
via as_str_cow() — borrows Str variants without allocation.
Array and hash mutations operate in-place on globals, eliminating clone-modify-writeback.
| Variant | Representation | Notes |
|---|---|---|
| Undef | Tag only | Perl/shell undef / unset |
| Int(i64) | Inline 8 bytes | Fast-path integer arithmetic |
| Float(f64) | Inline 8 bytes | IEEE 754 double |
| Bool(bool) | Inline 1 byte | Logical ops, conditionals |
| Str(Arc<String>) | Heap, Arc-shared | UTF-8, Cow<str> coercion borrows without alloc |
| Array(Vec<Value>) | Heap, in-place mutation | Ordered collection, direct ref mut access |
| Hash(HashMap<String, Value>) | Heap, in-place mutation | Key-value map, direct ref mut access |
| Status(i32) | Inline 4 bytes | Exit status ($?) |
| Ref(Box<Value>) | Heap | Pass-by-reference, nested structures |
| NativeFn(u16) | Inline 2 bytes | Builtin function pointer ID |
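The borrow-without-allocation behavior of a Cow<str> coercion can be sketched on a cut-down enum. This Value has only three of the variants from the table, and the formatting choices (Undef as the empty string, Int via to_string) are assumptions for the example rather than fusevm's documented coercion rules.

```rust
use std::borrow::Cow;
use std::sync::Arc;

// Cut-down stand-in for the Value enum described above.
enum Value {
    Undef,
    Int(i64),
    Str(Arc<String>),
}

impl Value {
    // Borrow when a string already exists; allocate only when one must be built.
    fn as_str_cow(&self) -> Cow<'_, str> {
        match self {
            Value::Str(s) => Cow::Borrowed(s.as_str()),
            Value::Int(n) => Cow::Owned(n.to_string()),
            Value::Undef => Cow::Borrowed(""), // assumed coercion for this sketch
        }
    }
}

fn main() {
    let s = Value::Str(Arc::new("hello".to_string()));
    let n = Value::Int(42);
    let u = Value::Undef;
    for v in [&s, &n, &u] {
        let text = v.as_str_cow();
        let kind = match text {
            Cow::Borrowed(_) => "borrowed",
            Cow::Owned(_) => "owned",
        };
        println!("{:?} ({})", text, kind);
    }
}
```

Callers that only read the string (printing, comparing, concatenating into an existing buffer) pay nothing for Str values, and an allocation happens only on the numeric path.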
[0x0B] CHUNK STRUCTURE
A Chunk is the unit of compiled bytecode. It contains everything the VM needs
to execute a compilation unit: a function body, script, or REPL line.
[0xFF] LICENSE & LINKS
MIT — Copyright © 2026 MenkeTechnologies
crates.io · docs.rs · GitHub · strykelang · zshrs · awkrs