>_FUSEVM REFERENCE
Language-agnostic bytecode VM with fused superinstructions and a three-tier Cranelift JIT: linear (instant), block (CFG, default warmup threshold 1), and tracing (hot loop body with side-exits, frame materialization, side-trace stitching). The tiers are auto-dispatched from VM::run() when tracing is enabled. The shared execution engine behind strykelang, zshrs, and awkrs. For full numbers, the subsystem breakdown, and the file inventory, see the Engineering Report.
ONE VM TO RUN THEM ALL.
FUSED SUPERINSTRUCTIONS. EXTENSION DISPATCH. 3-TIER CRANELIFT JIT.
fusevm is the shared execution engine behind strykelang, zshrs, and awkrs.
Any language frontend compiles to the same Op enum and gets fused hot-loop dispatch,
extension opcode tables, stack-based execution with slot-indexed fast paths, and JIT eligibility
analysis — for free. The VM doesn’t care which language produced the bytecodes.
stryke registers ~450 extended ops. zshrs registers ~20. awkrs registers ~95. They don’t conflict —
each frontend owns its own ID space via Extended(u16, u8). Process control ops
(pipes, redirects, globs, file tests) are first-class because multiple frontends need them.
[0x00] ARCHITECTURE
The Chunk is the unit of compiled bytecodes: an op array, constant pool, name pool,
line-number table, and slot count. The ChunkBuilder emits ops one at a time and
resolves forward jumps on .build(). The VM executes via a match-dispatch
loop over the op array. The JIT compiler analyzes chunks for eligibility and compiles hot paths
to native code via Cranelift.
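To make the match-dispatch loop concrete, here is a minimal toy sketch. It is not the fusevm API: `ToyOp`, `ToyChunk`, and `run` are hypothetical names, and the op set is cut down to a handful of integer ops. It only shows the shape of the technique, one `match` per op with an instruction pointer that jumps can overwrite.

```rust
// Toy model of a match-dispatch interpreter loop.
// All names here are illustrative assumptions, not fusevm types.
#[derive(Debug, Clone, Copy)]
enum ToyOp {
    LoadInt(i64),
    Add,
    JumpIfFalse(usize), // pops a value; jumps when it is 0
    Halt,
}

struct ToyChunk {
    ops: Vec<ToyOp>,
}

/// Execute the chunk and return the value left on top of the stack, if any.
/// Returns None on stack underflow.
fn run(chunk: &ToyChunk) -> Option<i64> {
    let mut stack: Vec<i64> = Vec::new();
    let mut ip = 0;
    while ip < chunk.ops.len() {
        match chunk.ops[ip] {
            ToyOp::LoadInt(n) => stack.push(n),
            ToyOp::Add => {
                let b = stack.pop()?;
                let a = stack.pop()?;
                stack.push(a.wrapping_add(b));
            }
            ToyOp::JumpIfFalse(target) => {
                if stack.pop()? == 0 {
                    ip = target;
                    continue; // skip the ip += 1 below
                }
            }
            ToyOp::Halt => break,
        }
        ip += 1;
    }
    stack.pop()
}
```

Every op costs one trip through the `match`, which is the per-op overhead that the fused superinstructions and the JIT tiers aim to remove.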
[0x01] QUICK START
[0x02] FUSED SUPERINSTRUCTIONS
The performance secret. The compiler detects hot loop patterns and emits single ops instead of multi-op sequences. A fused op that replaces an N-op sequence removes N−1 dispatches from the hot path, along with the intermediate stack pushes and the branch mispredictions those dispatches incur.
| Fused Op | Replaces | Effect |
|---|---|---|
| AccumSumLoop(sum, i, limit) | GetSlot + GetSlot + Add + SetSlot + PreInc + NumLt + JumpIfFalse | Entire counted sum loop in one dispatch |
| SlotIncLtIntJumpBack(slot, limit, target) | PreIncSlot + SlotLtIntJumpIfFalse | Loop backedge in one dispatch |
| ConcatConstLoop(const, s, i, limit) | LoadConst + ConcatAppendSlot + SlotIncLtIntJumpBack | String append loop in one dispatch |
| PushIntRangeLoop(arr, i, limit) | GetSlot + PushArray + ArrayLen + Pop + SlotIncLtIntJumpBack | Array push loop in one dispatch |
| AddAssignSlotVoid(a, b) | GetSlot + GetSlot + Add + SetSlot | Void-context add-assign, no stack traffic |
| PreIncSlotVoid(slot) | GetSlot + Inc + SetSlot | Void-context increment, no stack traffic |
| SlotLtIntJumpIfFalse(slot, int, target) | GetSlot + LoadInt + NumLt + JumpIfFalse | Fused compare + branch, no stack traffic |
| PreIncSlot(slot) | GetSlot + Inc + SetSlot + GetSlot | Slot pre-increment with push |
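The dispatch savings can be seen in a toy model. The sketch below is an assumption-laden illustration, not fusevm code: the `ToyOp` variants and the exact loop semantics of the fused op (sum slot `i` into slot `sum` while `i < limit`) are guesses made for the example. It counts how many times the dispatch loop runs for the same sum computed unfused and fused.

```rust
// Toy comparison of unfused vs fused loop dispatch.
// Variant names and semantics are illustrative assumptions.
#[derive(Clone, Copy)]
enum ToyOp {
    // Unfused building blocks operating directly on slots.
    AddAssignSlot { dst: usize, src: usize },
    IncSlot(usize),
    JumpIfSlotLt { slot: usize, limit: i64, target: usize },
    // Fused: the whole counted sum loop as one op.
    AccumSumLoop { sum: usize, i: usize, limit: i64 },
}

/// Runs the ops against the slot array and returns the number of dispatches.
fn run(ops: &[ToyOp], slots: &mut [i64]) -> u64 {
    let mut dispatches = 0u64;
    let mut ip = 0;
    while ip < ops.len() {
        dispatches += 1;
        match ops[ip] {
            ToyOp::AddAssignSlot { dst, src } => {
                slots[dst] = slots[dst].wrapping_add(slots[src]);
            }
            ToyOp::IncSlot(s) => slots[s] = slots[s].wrapping_add(1),
            ToyOp::JumpIfSlotLt { slot, limit, target } => {
                if slots[slot] < limit {
                    ip = target;
                    continue;
                }
            }
            ToyOp::AccumSumLoop { sum, i, limit } => {
                // The loop runs inside a single dispatch.
                while slots[i] < limit {
                    slots[sum] = slots[sum].wrapping_add(slots[i]);
                    slots[i] += 1;
                }
            }
        }
        ip += 1;
    }
    dispatches
}

/// Three-op loop body: sum += i; i += 1; if i < limit goto 0.
fn unfused_program(limit: i64) -> Vec<ToyOp> {
    vec![
        ToyOp::AddAssignSlot { dst: 0, src: 1 },
        ToyOp::IncSlot(1),
        ToyOp::JumpIfSlotLt { slot: 1, limit, target: 0 },
    ]
}

/// The same loop as one fused op.
fn fused_program(limit: i64) -> Vec<ToyOp> {
    vec![ToyOp::AccumSumLoop { sum: 0, i: 1, limit }]
}
```

With a limit of 100 and both slots starting at zero, the unfused program goes through the dispatch loop three times per iteration while the fused program goes through it once in total, and both leave the same sum in slot 0.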
[0x03] OP CATEGORIES
224 opcodes across 20 sections in src/op.rs. Every op is ≤24 bytes for cache-friendly dispatch.
| Category | Count | Ops | Notes |
|---|---|---|---|
| Constants & Stack | ~12 | LoadInt LoadFloat LoadConst LoadTrue LoadFalse LoadUndef Pop Dup Dup2 Swap Rot | |
| Variables | ~8 | GetVar SetVar DeclareVar GetSlot SetSlot SlotArrayGet SlotArraySet | Name-indexed + slot-indexed fast paths |
| Arrays & Hashes | ~25 | ArrayPush ArrayPop ArrayShift ArrayLen MakeArray HashGet HashSet HashDelete HashKeys HashValues MakeHash | Full collection primitives |
| Arithmetic | 9 | Add Sub Mul Div Mod Pow Negate Inc Dec | |
| String | 3 | Concat StringRepeat StringLen | |
| Comparison | ~14 | NumEq NumLt NumGe Spaceship StrEq StrLt StrCmp | Numeric + string + three-way |
| Logical & Bitwise | 9 | LogNot LogAnd LogOr BitAnd BitOr BitXor BitNot Shl Shr | |
| Control Flow | 5 | Jump JumpIfTrue JumpIfFalse JumpIfTrueKeep JumpIfFalseKeep | Including short-circuit keep variants |
| Functions & Scope | 5 | Call Return ReturnValue PushFrame PopFrame | |
| Higher-Order | 5 | MapBlock GrepBlock SortBlock SortDefault ForEachBlock | Block-based functional primitives |
| I/O | 3 | Print PrintLn ReadLine | |
| Collections | 2 | Range RangeStep | Range generation |
[0x04] SHELL OPS
Process control is universal enough that multiple frontends need it. These are first-class ops, not extensions — any frontend that targets fusevm gets pipes, redirects, globs, process substitution, and file tests for free.
| Op | Description |
|---|---|
| Exec(n) | Spawn external command: pop N args, exec, push exit status |
| ExecBg(n) | Spawn background: like Exec but don’t wait |
| PipelineBegin(n) / PipelineStage / PipelineEnd | Set up, wire, and wait for N-stage pipeline |
| Redirect(fd, op) | Redirect fd: write, append, read, clobber, dup, both |
| HereDoc(idx) / HereString | Here-document from constant pool / here-string from stack |
| CmdSubst(idx) | Command substitution: capture stdout of subprogram |
| SubshellBegin / SubshellEnd | Isolate scope for subshell execution |
| ProcessSubIn(idx) / ProcessSubOut(idx) | Process substitution <(cmd) / >(cmd): push FIFO path |
| Glob / GlobRecursive | Glob expand pattern from stack; the recursive variant is parallel |
| TestFile(test) | File test: -f -d -r -w -x -e -s -L -S -p -b -c |
| SetStatus / GetStatus | Last exit status $? |
| TrapSet(idx) / TrapCheck | Signal trap handler registration + periodic trap check |
| ExpandParam(mod) | 18 parameter expansion modifiers: ${:-} ${:=} ${:?} ${:+} ${#} ${/} ${^^} etc. |
| WordSplit / BraceExpand / TildeExpand | IFS word split, brace expansion, tilde expansion |
[0x05] EXTENSION MECHANISM
Language-specific opcodes use Extended(u16, u8) which dispatches through a handler
table registered by the frontend. The u16 is the extension op ID (up to 65,536 ops
per frontend). The u8 is an inline operand. ExtendedWide(u16, usize)
carries a full usize payload for jump targets and large indices.
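A handler table keyed by the u16 op ID might look roughly like the sketch below. This is a toy under stated assumptions: `ToyVm`, `ToyHandler`, `register`, and `dispatch_extended` are hypothetical names, and the real registration API and handler signature are not shown in this document.

```rust
use std::collections::HashMap;

// Hypothetical handler signature: mutate the value stack, given the
// inline u8 operand. The real fusevm signature is not known here.
type ToyHandler = fn(&mut Vec<i64>, u8);

#[derive(Default)]
struct ToyVm {
    stack: Vec<i64>,
    handlers: HashMap<u16, ToyHandler>,
}

impl ToyVm {
    /// A frontend registers one handler per extension op ID.
    fn register(&mut self, id: u16, handler: ToyHandler) {
        self.handlers.insert(id, handler);
    }

    /// What executing a toy `Extended(id, operand)` op would do.
    fn dispatch_extended(&mut self, id: u16, operand: u8) -> Result<(), String> {
        let handler = *self
            .handlers
            .get(&id)
            .ok_or_else(|| format!("no handler for extended op {}", id))?;
        handler(&mut self.stack, operand);
        Ok(())
    }
}

// Two example handlers a frontend might supply.
fn push_operand(stack: &mut Vec<i64>, operand: u8) {
    stack.push(i64::from(operand));
}

fn scale_top(stack: &mut Vec<i64>, operand: u8) {
    if let Some(top) = stack.last_mut() {
        *top = top.wrapping_mul(i64::from(operand));
    }
}
```

Because each VM instance carries its own table, two frontends can both use ID 0 for unrelated ops without conflict, which is the property the ID-space design relies on.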
[0x06] JIT COMPILATION
The JitCompiler runs three tiers in increasing order of optimization power and
compile cost. Each tier covers a disjoint slice of the workload — they don't compete.
Compile-time decisions and runtime invocation are mediated through one stateless
JitCompiler handle (the actual cache is thread-local).
| Tier | Trigger | Coverage | Speculation |
|---|---|---|---|
| Linear | first call | Straight-line expression chunks; returns Value (int or float) | None; IR matches bytecode exactly |
| Block | ≥ 1 invocation (default) | Whole-chunk CFG (loops, branches, fused backedges) | None; slot ops assume i64 |
| Tracing | ≥ 50 backedges through any loop header | Hot path through anything; recorded loop body compiled with type-specialized IR | Slot-type entry guard + per-branch brif guards; deopts to interpreter on guard miss |
The block (default 1) and tracing (default 50) warmup thresholds — how many runs before a tier compiles a chunk — are tunable per process via the
FUSEVM_JIT_BLOCK_THRESHOLD and FUSEVM_JIT_TRACE_THRESHOLD environment variables (read once per thread when the JIT is first touched), or per thread via
TraceJitConfig + JitCompiler::set_config. For workloads that re-run the same scripts repeatedly, pair a low warmup with the on-by-default jit-disk-cache feature: the warmup picks when a tier engages and the disk cache makes the native code free to reload next run — AOT-like speed without explicit AOT. Setting FUSEVM_JIT_BLOCK_THRESHOLD=0 is the most aggressive (block-compile every eligible chunk on its first run, then reload from cache); raise the thresholds again for scripts that genuinely run only once.
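Reading a warmup threshold from the environment can be sketched as follows. The variable name and the default of 1 come from the text above; the helper names and the choice to fall back to the default on an unparsable value are assumptions, since the actual parsing rules are not documented here.

```rust
/// Parse a threshold string, falling back to `default` when the value is
/// missing or not a non-negative integer. (Fallback behaviour is an
/// assumption of this sketch.)
fn parse_threshold(raw: Option<&str>, default: u32) -> u32 {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .unwrap_or(default)
}

/// Read the block-tier warmup threshold, using the documented default of 1.
fn block_threshold_from_env() -> u32 {
    let raw = std::env::var("FUSEVM_JIT_BLOCK_THRESHOLD").ok();
    parse_threshold(raw.as_deref(), 1)
}
```

Keeping the parsing in a pure function makes it easy to check that `0` is accepted as a legitimate "compile on first run" setting rather than being treated as unset.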
Tracing JIT is opt-in per VM via vm.enable_tracing_jit(). When enabled,
VM::run() auto-dispatches to all three tiers in priority order (phase 10): block JIT
first if the chunk is fully eligible (zero VM-side overhead, direct fn-ptr through the slot pointer);
tracing JIT for hot loops in chunks block JIT can't take; interpreter for cold paths and edge cases.
Block-eligible chunks short-circuit before tracing JIT records anything, so the two tiers never
compete on the same chunk.
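The priority order described above can be summarized as a small decision function. This is a simplified sketch, not the VM::run() implementation: `Tier`, `ChunkState`, `Thresholds`, and `select_tier` are hypothetical names, and the real dispatcher works per loop header rather than on a single per-chunk counter.

```rust
#[derive(Debug, PartialEq)]
enum Tier {
    Block,
    Tracing,
    Interpreter,
}

// Hypothetical per-chunk bookkeeping for the sketch.
struct ChunkState {
    block_eligible: bool,
    runs: u32,
    hottest_backedge_count: u32,
}

struct Thresholds {
    block: u32,
    trace: u32,
}

/// Pick a tier in priority order: block first, then tracing, then interpreter.
fn select_tier(chunk: &ChunkState, t: &Thresholds, tracing_enabled: bool) -> Tier {
    if !tracing_enabled {
        // Auto-dispatch is opt-in; without it this sketch stays interpreted.
        return Tier::Interpreter;
    }
    if chunk.block_eligible {
        // Block-eligible chunks never reach the tracing recorder.
        return if chunk.runs >= t.block { Tier::Block } else { Tier::Interpreter };
    }
    if chunk.hottest_backedge_count >= t.trace {
        return Tier::Tracing;
    }
    Tier::Interpreter
}
```

The early return for block-eligible chunks is what keeps the two compiled tiers from competing: a chunk is either handled by the block tier or left to tracing, never both.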
Tracing JIT capability matrix
| Capability | Phase | Detail |
|---|---|---|
| Loop bodies, int slots, no calls | 1 | Loops with ≤MAX_TRACE_SLOT int slots, single backward closing branch |
| Cross-call inlining (branchless callees) | 2 | Op::Call inlines callee body into trace IR; per-frame slot-variable scope |
| Caller-frame internal branches with side-exits | 3 | if/else in caller frame compiles with brif guards + per-branch side-exit blocks |
| Callee-frame branches, frame materialization on deopt | 4 | Branches inside inlined callees; DeoptInfo out-param materializes synthetic Frames on vm.frames |
| Value-stack reconstruction on deopt (Int + Float) | 5 + 5b | Non-empty abstract stack at branch is OK; stack_kinds tag distinguishes Int from Float entries |
| Side-exit deopt counter + auto-blacklist | 6 | Per-trace side_exit_count; blacklist after MAX_SIDE_EXITS misses |
| Persistent metadata export/import | 7 | TraceMetadata (serde-serializable) round-trips through trace_export / trace_import (and _all bulk variants) |
| Bounded recursion inlining | 8 | Self-recursive calls inline up to MAX_INLINE_RECURSION levels deep before aborting |
| Side-trace stitching from hot deopt sites | 9 | Recorder splits record_anchor_ip (cache key) from close_anchor_ip (loop header); side traces compile via trace_install_with_kind and don't loop in their own IR; chained dispatch up to MAX_TRACE_CHAIN hops |
| Synergistic three-tier auto-dispatch | 10 | VM::run() consults block JIT first, then tracing JIT, then interpreter — block-eligible chunks short-circuit before any recording happens |
| Configurable thresholds + float slots + bulk persist | tune | TraceJitConfig for per-thread tuning; SlotKind::Float slots stored as i64 bit-patterns and bit-cast through; trace_export_all / trace_import_all for batch I/O |
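The phase 6 side-exit counter can be illustrated with a toy trace record. The field name `side_exit_count` comes from the table; `ToyTrace`, `record_side_exit`, and the choice to blacklist when the count reaches the limit (rather than when it exceeds it) are assumptions of this sketch.

```rust
#[derive(Debug, Default)]
struct ToyTrace {
    side_exit_count: u32,
    blacklisted: bool,
}

impl ToyTrace {
    /// Record one guard miss. Returns true while the trace is still usable,
    /// false once it has been blacklisted.
    fn record_side_exit(&mut self, max_side_exits: u32) -> bool {
        self.side_exit_count = self.side_exit_count.saturating_add(1);
        if self.side_exit_count >= max_side_exits {
            // Too many deopts: stop dispatching to this trace.
            self.blacklisted = true;
        }
        !self.blacklisted
    }
}
```

The point of the counter is to stop paying deopt cost for a trace whose speculation keeps failing; once blacklisted, the loop simply stays in the interpreter.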
Persistent native-code disk cache
Enable the jit-disk-cache feature (cargo add fusevm --features jit-disk-cache)
to cache compiled native code to disk, skipping Cranelift codegen across process
restarts — a big win for workloads that re-launch the VM repeatedly (e.g. running a large test
suite over and over). It covers all three tiers (linear, block, tracing) and is
on by default once the feature is enabled, writing to ~/.cache/fusevm-jit.
Override the directory with the FUSEVM_JIT_CACHE_DIR env var or
JitCompiler::set_jit_cache_dir(Some(dir)); disable at runtime with
FUSEVM_JIT_CACHE_DIR=off or set_jit_cache_dir(None). This is distinct from
the caller-owned TraceMetadata export below: the disk cache persists the finished
machine code, while TraceMetadata persists the recorder's decisions (and still pays
codegen on restore).
| Property | Detail |
|---|---|
| Tiers cached | Linear, block, and tracing — files tier-tagged .lin. / .blk. / .trc. |
| Keying | Chunk op-hash; tracing tier also keys on record-anchor IP + a content hash over recorded ops, IPs, slot types, and constants so divergent paths never collide |
| Loading | mmaps native code + re-patches a small relocation table; W^X handled via pthread_jit_write_protect_np + icache invalidation on Apple Silicon, mprotect elsewhere |
| Concurrency | Writes publish via unique temp file + atomic rename — safe for the many-processes-spawning-VMs workload |
| Size control | Total-size cap (default 256 MiB) with oldest-first (mtime) eviction down to 80% of the cap, applied opportunistically on write. Tune via FUSEVM_JIT_CACHE_MAX_BYTES (accepts k/m/g suffixes; 0/off/unlimited disables eviction) or set_jit_cache_max_bytes; inspect with jit_cache_size_bytes, force a pass with prune_jit_cache, wipe with clear_jit_cache |
| Transparency | Only eliminates Cranelift codegen time — tier selection, warmup thresholds, and results are identical to an uncached run |
| Safety | Conservative: any chunk whose code carries a relocation other than a known host-helper call falls back to the in-memory JIT, so an untested target degrades to "no caching" rather than miscompiling |
| Benchmark | Cached block load ~35 µs vs ~152 µs cold codegen (cargo bench --features jit-disk-cache --bench jit_disk_cache) |
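The size-control row can be sketched as two pure functions: one that parses a cap string and one that plans an oldest-first eviction down to 80% of the cap. Names (`parse_cap`, `Entry`, `plan_eviction`) are hypothetical, and treating k/m/g as binary multiples (1024-based) is an assumption; the document only says the suffixes are accepted.

```rust
/// Parse a cap such as "256m". Ok(None) means eviction is disabled.
/// Assumes k/m/g are 1024-based multiples.
fn parse_cap(raw: &str) -> Result<Option<u64>, String> {
    let s = raw.trim().to_ascii_lowercase();
    if s == "0" || s == "off" || s == "unlimited" {
        return Ok(None);
    }
    let (digits, mult) = match s.chars().last() {
        Some('k') => (&s[..s.len() - 1], 1u64 << 10),
        Some('m') => (&s[..s.len() - 1], 1u64 << 20),
        Some('g') => (&s[..s.len() - 1], 1u64 << 30),
        _ => (s.as_str(), 1u64),
    };
    let n: u64 = digits
        .parse()
        .map_err(|_| format!("bad size: {}", raw))?;
    n.checked_mul(mult)
        .map(Some)
        .ok_or_else(|| format!("size overflows u64: {}", raw))
}

/// One cache file: name, size in bytes, modification time (any monotone unit).
struct Entry {
    name: &'static str,
    size: u64,
    mtime: u64,
}

/// If the total exceeds `cap`, return the files to delete, oldest first,
/// until the total is at or below 80% of the cap.
fn plan_eviction(entries: &[Entry], cap: u64) -> Vec<&'static str> {
    let mut total: u64 = entries.iter().map(|e| e.size).sum();
    if total <= cap {
        return Vec::new();
    }
    let target = cap / 5 * 4; // 80% of the cap
    let mut by_age: Vec<&Entry> = entries.iter().collect();
    by_age.sort_by_key(|e| e.mtime);
    let mut evict = Vec::new();
    for e in by_age {
        if total <= target {
            break;
        }
        total -= e.size;
        evict.push(e.name);
    }
    evict
}
```

Evicting below the cap rather than exactly to it means a burst of writes does not trigger a prune pass on every single write.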
Speedup over interpreter
Apple M-series, criterion, cargo bench --features jit --bench jit_trace. "Block JIT (direct)"
invokes JitCompiler::try_run_block with no VM around it — the floor through the JIT
pipeline. "Tracing-JIT VM" measures vm.run() with enable_tracing_jit() set,
which auto-dispatches block JIT for these block-eligible chunks (phase 10). The remaining gap
between the two columns is purely VM construction + slot copy-in/out per vm.run() call.
| Workload | Iterations | Interpreter | Block JIT (direct) | Tracing-JIT VM | VM vs Interp |
|---|---|---|---|---|---|
| counter_loop | 1,000 | 23.4 µs | 305 ns | 506 ns | 46x |
| counter_loop | 10,000 | 235.5 µs | 2.80 µs | 2.96 µs | 80x |
| counter_loop | 100,000 | 2,474 µs | 29.07 µs | 27.88 µs | 89x |
| loop_with_branch | 1,000 | 39.8 µs | 310 ns | 487 ns | 82x |
| loop_with_branch | 10,000 | 410.7 µs | 2.78 µs | 2.97 µs | 138x |
| loop_with_branch | 100,000 | 4,058 µs | 27.48 µs | 27.75 µs | 146x |
Microbenchmarks measure tight integer counter loops in isolation — best case for any JIT. Real-world script speedup is bounded by Amdahl: most shell-script time goes to host calls (fork/exec, I/O, glob, builtins) which no JIT tier touches. Typical numeric inner loops see the kernel speedup; surrounding shell logic doesn't. Numbers above are representative of the JIT pipeline itself, not of any specific workload.
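The Amdahl bound is easy to compute directly. In the sketch below, the 20% figure for time spent in JIT-eligible loops is a hypothetical input chosen for illustration, not a measurement; the 80x kernel speedup is taken from the table above.

```rust
/// Amdahl's law: overall speedup when a fraction `jit_fraction` of the
/// runtime is accelerated by `kernel_speedup` and the rest is unchanged.
fn amdahl_speedup(jit_fraction: f64, kernel_speedup: f64) -> f64 {
    1.0 / ((1.0 - jit_fraction) + jit_fraction / kernel_speedup)
}
```

For example, `amdahl_speedup(0.2, 80.0)` is about 1.25: if only a fifth of a script's time is spent in loops the JIT can compile, an 80x kernel speedup makes the whole script roughly 25% faster, and no kernel speedup can push it past 1.25x.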
Usage
| Type | Role |
|---|---|
| JitCompiler | Stateless handle over the thread-local trace + block caches; entry point for all tier APIs |
| JitExtension | Frontend-provided trait registering language-specific extended op JIT support |
| TraceJitConfig | Per-thread tunable thresholds (trace_threshold, block_threshold, max_side_exits, max_inline_recursion, max_trace_chain, max_trace_len) |
| SlotKind | Slot type tag (Int / Float) for the tracing JIT's entry guard. Float slots stored as i64 bit-patterns and bit-cast through |
| TraceLookup | Dispatch outcome at a backward branch (NotHot / StartRecording / Ran / GuardMismatch / Skip) |
| DeoptInfo / DeoptFrame | #[repr(C)] out-params trace fns populate to materialize inlined frames + value-stack on side-exit |
| TraceMetadata | Serde-serializable record for persistent trace cache (phase 7); bulk variants trace_export_all / trace_import_all |
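Storing a float slot as an i64 bit-pattern amounts to a lossless bit-cast in both directions. A minimal sketch using the standard library follows; the helper names `float_to_slot` and `slot_to_float` are hypothetical.

```rust
/// Store an f64 in an i64-typed slot without changing any bits.
fn float_to_slot(x: f64) -> i64 {
    x.to_bits() as i64
}

/// Recover the f64 from the slot's raw bits.
fn slot_to_float(raw: i64) -> f64 {
    f64::from_bits(raw as u64)
}
```

Because this is a reinterpretation and not a numeric conversion, values such as -0.0 and NaN survive the round trip, and the slot array can keep a single i64 layout for both slot kinds.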
[0x07] VALUE REPRESENTATION
Value is a tagged enum with fast-path immediates for numbers and booleans,
and heap types for strings, arrays, and hashes. String coercion returns Cow<str>
via as_str_cow() — borrows Str variants without allocation.
Array and hash mutations operate in-place on globals, eliminating clone-modify-writeback.
| Variant | Representation | Notes |
|---|---|---|
| Undef | Tag only | Perl/shell undef / unset |
| Int(i64) | Inline 8 bytes | Fast-path integer arithmetic |
| Float(f64) | Inline 8 bytes | IEEE 754 double |
| Bool(bool) | Inline 1 byte | Logical ops, conditionals |
| Str(Arc<String>) | Heap, Arc-shared | UTF-8, Cow<str> coercion borrows without alloc |
| Array(Vec<Value>) | Heap, in-place mutation | Ordered collection, direct ref mut access |
| Hash(HashMap<String, Value>) | Heap, in-place mutation | Key-value map, direct ref mut access |
| Status(i32) | Inline 4 bytes | Exit status ($?) |
| Ref(Box<Value>) | Heap | Pass-by-reference, nested structures |
| NativeFn(u16) | Inline 2 bytes | Builtin function pointer ID |
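The borrow-without-allocation behaviour of a Cow<str> coercion can be shown on a cut-down value type. `ToyValue` below is a hypothetical stand-in with only three variants, and the choice to render Undef as the empty string is an assumption; only the Str and Int cases follow directly from the table.

```rust
use std::borrow::Cow;
use std::sync::Arc;

// Cut-down stand-in for the VM's value enum; illustrative only.
enum ToyValue {
    Undef,
    Int(i64),
    Str(Arc<String>),
}

impl ToyValue {
    /// String coercion: borrow when the value already holds a string,
    /// allocate only when a number has to be formatted.
    fn as_str_cow(&self) -> Cow<'_, str> {
        match self {
            ToyValue::Str(s) => Cow::Borrowed(s.as_str()),
            ToyValue::Int(n) => Cow::Owned(n.to_string()),
            ToyValue::Undef => Cow::Borrowed(""), // assumed rendering
        }
    }
}
```

Callers that only read the string, such as comparisons or printing, can use the returned Cow as a &str and never pay for a copy in the common Str case.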
[0x08] CHUNK STRUCTURE
A Chunk is the unit of compiled bytecodes. It contains everything the VM needs
to execute a compilation unit — function body, script, or REPL line.
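The chunk layout and the builder's forward-jump resolution can be sketched with a toy model. The five fields mirror the list in the architecture section (op array, constant pool, name pool, line-number table, slot count); everything else, including `ToyChunkBuilder`, its label API, and the use of `usize::MAX` as a placeholder target, is a hypothetical illustration and not the fusevm API.

```rust
// Toy model only: names are illustrative, not fusevm types.
#[derive(Debug, Clone, PartialEq)]
enum ToyOp {
    LoadInt(i64),
    Jump(usize),
    Print,
}

#[derive(Debug)]
struct ToyChunk {
    ops: Vec<ToyOp>,
    constants: Vec<String>,
    names: Vec<String>,
    lines: Vec<u32>, // lines[i] is the source line of ops[i]
    slot_count: usize,
}

#[derive(Default)]
struct ToyChunkBuilder {
    ops: Vec<ToyOp>,
    lines: Vec<u32>,
    labels: Vec<Option<usize>>,   // label id -> bound op index
    pending: Vec<(usize, usize)>, // (jump op index, label id)
}

impl ToyChunkBuilder {
    /// Append one op and return its index.
    fn emit(&mut self, op: ToyOp, line: u32) -> usize {
        self.ops.push(op);
        self.lines.push(line);
        self.ops.len() - 1
    }

    /// Create a label that will be bound to a position later.
    fn new_label(&mut self) -> usize {
        self.labels.push(None);
        self.labels.len() - 1
    }

    /// Emit a jump whose target is not known yet.
    fn emit_jump(&mut self, label: usize, line: u32) {
        let at = self.emit(ToyOp::Jump(usize::MAX), line); // placeholder target
        self.pending.push((at, label));
    }

    /// Bind a label to the index of the next op to be emitted.
    fn bind(&mut self, label: usize) {
        self.labels[label] = Some(self.ops.len());
    }

    /// Patch every pending jump, then hand back the finished chunk.
    fn build(mut self) -> Result<ToyChunk, String> {
        let pending = std::mem::take(&mut self.pending);
        for (at, label) in pending {
            let target = self.labels[label]
                .ok_or_else(|| format!("label {} never bound", label))?;
            self.ops[at] = ToyOp::Jump(target);
        }
        Ok(ToyChunk {
            ops: self.ops,
            constants: Vec::new(),
            names: Vec::new(),
            lines: self.lines,
            slot_count: 0,
        })
    }
}
```

Deferring the patch to `build` lets a frontend emit a jump before it knows where the target lands, which is the usual situation when compiling if/else and loop exits in a single pass.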
[0xFF] LICENSE & LINKS
MIT — Copyright © 2026 MenkeTechnologies
crates.io · docs.rs · GitHub · strykelang · zshrs · awkrs