// FUSEVM — BYTECODE VM

fusevm v0.14.1 · 224 opcodes · 8 fused superinstructions · 29 shell ops · 3-tier Cranelift JIT auto-dispatch (linear · block · tracing) · 17,304 lines Rust · 7,514 #[test] fns


>_FUSEVM REFERENCE

Language-agnostic bytecode VM with fused superinstructions and a three-tier Cranelift JIT: linear (instant), block (whole-chunk CFG, default warmup threshold 1), and tracing (hot loop bodies with side-exits, frame materialization, and side-trace stitching). All tiers are auto-dispatched from VM::run() when tracing is enabled. fusevm is the shared execution engine behind strykelang, zshrs, and awkrs. For full numbers, subsystem breakdown, and file inventory, see the Engineering Report.

ONE VM TO RUN THEM ALL.

FUSED SUPERINSTRUCTIONS. EXTENSION DISPATCH. 3-TIER CRANELIFT JIT.

224 opcodes · 20 op sections · 8 fused superinstructions · 29 shell ops · 17,304 production lines · 7,514 #[test] fns · 3 JIT tiers · 3 language frontends

fusevm is the shared execution engine behind strykelang, zshrs, and awkrs. Any language frontend compiles to the same Op enum and gets fused hot-loop dispatch, extension opcode tables, stack-based execution with slot-indexed fast paths, and JIT eligibility analysis — for free. The VM doesn’t care which language produced the bytecodes.

stryke registers ~450 extended ops. zshrs registers ~20. awkrs registers ~95. They don’t conflict — each frontend owns its own ID space via Extended(u16, u8). Process control ops (pipes, redirects, globs, file tests) are first-class because multiple frontends need them.

[0x00] ARCHITECTURE

┌─────────────────────────────────────────────────────────────────┐
│                       LANGUAGE FRONTENDS                        │
│                                                                 │
│   ┌───────────────┐   ┌───────────────┐   ┌───────────────┐     │
│   │  strykelang   │   │     zshrs     │   │     awkrs     │     │
│   │ Perl 5 compat │   │ shell compiler│   │ awk compiler  │     │
│   │ ~450 ext ops  │   │  ~20 ext ops  │   │  ~95 ext ops  │     │
│   └───────┬───────┘   └───────┬───────┘   └───────┬───────┘     │
│           │     compile       │                   │             │
│           └───────────────────┼───────────────────┘             │
│                               ▼                                 │
│                    ┌─────────────────────┐                      │
│                    │    ChunkBuilder     │                      │
│                    │   .emit(Op, line)   │                      │
│                    │   .build() → Chunk  │                      │
│                    └──────────┬──────────┘                      │
│                               │                                 │
│                 ┌─────────────┴─────────────┐                   │
│                 ▼                           ▼                   │
│        ┌──────────────┐         ┌────────────────────────┐      │
│        │  VM::run()   │ ◀──────▶│   JitCompiler tiers    │      │
│        │  match-      │  trace  │ ┌──────────────────┐   │      │
│        │  dispatch    │ invoke  │ │ Linear (instant) │   │      │
│        │  interpreter │ ◀──────▶│ │ Block (≥1x)      │   │      │
│        │              │  deopt  │ │ Tracing (≥50x)   │   │      │
│        └──────────────┘         │ └──────────────────┘   │      │
│                                 │    Cranelift 0.130     │      │
│                                 │ native x86-64/aarch64  │      │
│                                 └────────────────────────┘      │
└─────────────────────────────────────────────────────────────────┘

The Chunk is the unit of compiled bytecodes: an op array, constant pool, name pool, line-number table, and slot count. The ChunkBuilder emits ops one at a time and resolves forward jumps on .build(). The VM executes via a match-dispatch loop over the op array. The JIT compiler analyzes chunks for eligibility and compiles hot paths to native code via Cranelift.
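To make the builder and dispatch-loop description concrete, here is a minimal, self-contained sketch. It is a toy model and not fusevm's source: the four-variant `Op`, the label API (`new_label`, `emit_jump`, `bind`), and the `run` and `demo` functions are assumptions introduced for illustration. It shows forward jumps being emitted with a placeholder target, patched during `build()`, and then executed by a `match`-based loop.

```rust
// Toy model: names echo the prose above, but none of this is fusevm's real code.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    LoadInt(i64),
    Add,
    Jump(usize),        // absolute op index
    JumpIfFalse(usize), // pops one value, jumps when it is 0
}

#[allow(dead_code)]
struct Chunk {
    ops: Vec<Op>,
    lines: Vec<u32>, // lines[i] = source line of ops[i]
}

#[derive(Default)]
struct ChunkBuilder {
    ops: Vec<Op>,
    lines: Vec<u32>,
    labels: Vec<Option<usize>>,   // label id -> bound op index
    pending: Vec<(usize, usize)>, // (op index, label id) awaiting a target
}

impl ChunkBuilder {
    fn emit(&mut self, op: Op, line: u32) -> usize {
        self.ops.push(op);
        self.lines.push(line);
        self.ops.len() - 1
    }

    fn new_label(&mut self) -> usize {
        self.labels.push(None);
        self.labels.len() - 1
    }

    /// Emit a jump whose target is not known yet; `build` patches it.
    fn emit_jump(&mut self, op: Op, label: usize, line: u32) {
        let at = self.emit(op, line);
        self.pending.push((at, label));
    }

    /// Bind `label` to the index of the next op to be emitted.
    fn bind(&mut self, label: usize) {
        self.labels[label] = Some(self.ops.len());
    }

    fn build(mut self) -> Chunk {
        for (at, label) in std::mem::take(&mut self.pending) {
            let target = self.labels[label].expect("label never bound");
            self.ops[at] = match self.ops[at] {
                Op::Jump(_) => Op::Jump(target),
                Op::JumpIfFalse(_) => Op::JumpIfFalse(target),
                other => other,
            };
        }
        Chunk { ops: self.ops, lines: self.lines }
    }
}

/// Match-dispatch loop: one `match` per executed op.
fn run(chunk: &Chunk) -> Option<i64> {
    let mut stack: Vec<i64> = Vec::new();
    let mut ip = 0;
    while ip < chunk.ops.len() {
        match chunk.ops[ip] {
            Op::LoadInt(n) => stack.push(n),
            Op::Add => {
                let b = stack.pop()?;
                let a = stack.pop()?;
                stack.push(a + b);
            }
            Op::Jump(t) => { ip = t; continue; }
            Op::JumpIfFalse(t) => {
                if stack.pop()? == 0 { ip = t; continue; }
            }
        }
        ip += 1;
    }
    stack.pop()
}

/// Builds `if cond { 40 + 2 } else { 7 }` and runs it.
fn demo(cond: i64) -> Option<i64> {
    let mut b = ChunkBuilder::default();
    let (else_l, end_l) = (b.new_label(), b.new_label());
    b.emit(Op::LoadInt(cond), 1);
    b.emit_jump(Op::JumpIfFalse(usize::MAX), else_l, 1);
    b.emit(Op::LoadInt(40), 2);
    b.emit(Op::LoadInt(2), 2);
    b.emit(Op::Add, 2);
    b.emit_jump(Op::Jump(usize::MAX), end_l, 2);
    b.bind(else_l);
    b.emit(Op::LoadInt(7), 4);
    b.bind(end_l);
    run(&b.build())
}
```

Calling `demo(1)` takes the "then" branch and `demo(0)` the "else" branch; both jump targets were unknown when the jumps were emitted.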

[0x01] QUICK START

use fusevm::{Op, ChunkBuilder, VM, VMResult, Value};

let mut b = ChunkBuilder::new();
b.emit(Op::LoadInt(40), 1);
b.emit(Op::LoadInt(2), 1);
b.emit(Op::Add, 1);

let mut vm = VM::new(b.build());
match vm.run() {
    VMResult::Ok(val) => println!("result: {}", val.to_str()), // "42"
    VMResult::Error(e) => eprintln!("error: {}", e),
    VMResult::Halted => {}
}

// Extension handler: register language-specific ops
let mut vm = VM::new(chunk);
vm.set_extension_handler(Box::new(|vm, id, arg| {
    match id {
        0 => { /* your custom op */ }
        1 => { /* another custom op */ }
        _ => {}
    }
}));

[0x02] FUSED SUPERINSTRUCTIONS

The performance secret. The compiler detects hot loop patterns and emits single ops instead of multi-op sequences. Each fused op eliminates N−1 dispatch cycles, stack pushes, and branch mispredictions from the hot path.

Fused Op | Replaces | Effect
--- | --- | ---
AccumSumLoop(sum, i, limit) | GetSlot + GetSlot + Add + SetSlot + PreInc + NumLt + JumpIfFalse | Entire counted sum loop in one dispatch
SlotIncLtIntJumpBack(slot, limit, target) | PreIncSlot + SlotLtIntJumpIfFalse | Loop backedge in one dispatch
ConcatConstLoop(const, s, i, limit) | LoadConst + ConcatAppendSlot + SlotIncLtIntJumpBack | String append loop in one dispatch
PushIntRangeLoop(arr, i, limit) | GetSlot + PushArray + ArrayLen + Pop + SlotIncLtIntJumpBack | Array push loop in one dispatch
AddAssignSlotVoid(a, b) | GetSlot + GetSlot + Add + SetSlot | Void-context add-assign, no stack traffic
PreIncSlotVoid(slot) | GetSlot + Inc + SetSlot | Void-context increment, no stack traffic
SlotLtIntJumpIfFalse(slot, int, target) | GetSlot + LoadInt + NumLt + JumpIfFalse | Fused compare + branch, no stack traffic
PreIncSlot(slot) | GetSlot + Inc + SetSlot + GetSlot | Slot pre-increment with push
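The dispatch savings can be illustrated with a toy interpreter that counts how often its `match` runs. This is a sketch under stated assumptions: the op set is reduced, the exact semantics given to `AccumSumLoop` here (`while i < limit { sum += i; i += 1 }`) are a guess at the intent rather than fusevm's definition, and `run`, `unfused`, and `fused` are hypothetical helpers.

```rust
// Toy op set; variant names follow the table above, semantics are assumed.
#[derive(Clone, Copy)]
enum Op {
    GetSlot(usize),
    SetSlot(usize),
    Add,
    PreIncSlotVoid(usize),
    SlotLtIntJumpIfFalse(usize, i64, usize), // jump to target unless slot < int
    Jump(usize),
    AccumSumLoop(usize, usize, i64), // fused: (sum slot, counter slot, limit)
}

/// Runs `ops` over `slots` and returns the number of dispatches performed.
fn run(ops: &[Op], slots: &mut [i64]) -> usize {
    let mut stack: Vec<i64> = Vec::new();
    let mut ip = 0;
    let mut dispatches = 0;
    while ip < ops.len() {
        dispatches += 1;
        match ops[ip] {
            Op::GetSlot(s) => stack.push(slots[s]),
            Op::SetSlot(s) => slots[s] = stack.pop().expect("stack underflow"),
            Op::Add => {
                let b = stack.pop().expect("stack underflow");
                let a = stack.pop().expect("stack underflow");
                stack.push(a + b);
            }
            Op::PreIncSlotVoid(s) => slots[s] += 1,
            Op::SlotLtIntJumpIfFalse(s, int, target) => {
                if slots[s] >= int { ip = target; continue; }
            }
            Op::Jump(target) => { ip = target; continue; }
            Op::AccumSumLoop(sum, i, limit) => {
                // The whole loop runs inside a single dispatch.
                while slots[i] < limit {
                    let v = slots[i];
                    slots[sum] += v;
                    slots[i] += 1;
                }
            }
        }
        ip += 1;
    }
    dispatches
}

/// Seven ops per iteration: slot 0 is `sum`, slot 1 is `i`.
fn unfused(limit: i64) -> Vec<Op> {
    vec![
        Op::SlotLtIntJumpIfFalse(1, limit, 7), // 0: exit when i >= limit
        Op::GetSlot(0),                        // 1
        Op::GetSlot(1),                        // 2
        Op::Add,                               // 3
        Op::SetSlot(0),                        // 4: sum += i
        Op::PreIncSlotVoid(1),                 // 5: i += 1
        Op::Jump(0),                           // 6: backedge
    ]
}

/// The same loop as one superinstruction.
fn fused(limit: i64) -> Vec<Op> {
    vec![Op::AccumSumLoop(0, 1, limit)]
}
```

With `limit = 1000`, both programs leave the same slot values, but the unfused version goes through the dispatch loop 7,001 times while the fused version goes through it once.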

[0x03] OP CATEGORIES

224 opcodes across 20 sections in src/op.rs. Every op is ≤24 bytes for cache-friendly dispatch.

Constants & Stack

~12 ops

LoadInt LoadFloat LoadConst LoadTrue LoadFalse LoadUndef Pop Dup Dup2 Swap Rot

Variables

~8 ops — name-indexed + slot-indexed fast paths

GetVar SetVar DeclareVar GetSlot SetSlot SlotArrayGet SlotArraySet

Arrays & Hashes

~25 ops — full collection primitives

ArrayPush ArrayPop ArrayShift ArrayLen MakeArray HashGet HashSet HashDelete HashKeys HashValues MakeHash

Arithmetic

9 ops

Add Sub Mul Div Mod Pow Negate Inc Dec

String

3 ops

Concat StringRepeat StringLen

Comparison

~14 ops — numeric + string + three-way

NumEq NumLt NumGe Spaceship StrEq StrLt StrCmp

Logical & Bitwise

9 ops

LogNot LogAnd LogOr BitAnd BitOr BitXor BitNot Shl Shr

Control Flow

5 ops — including short-circuit keep variants

Jump JumpIfTrue JumpIfFalse JumpIfTrueKeep JumpIfFalseKeep

Functions & Scope

5 ops

Call Return ReturnValue PushFrame PopFrame

Higher-Order

5 ops — block-based functional primitives

MapBlock GrepBlock SortBlock SortDefault ForEachBlock

I/O

3 ops

Print PrintLn ReadLine

Collections

2 ops — range generation

Range RangeStep

[0x04] SHELL OPS

Process control is universal enough that multiple frontends need it. These are first-class ops, not extensions — any frontend that targets fusevm gets pipes, redirects, globs, process substitution, and file tests for free.

Op | Description
--- | ---
Exec(n) | Spawn external command: pop N args, exec, push exit status
ExecBg(n) | Spawn background: like Exec but don't wait
PipelineBegin(n) / PipelineStage / PipelineEnd | Set up, wire, and wait for N-stage pipeline
Redirect(fd, op) | Redirect fd: write, append, read, clobber, dup, both
HereDoc(idx) / HereString | Here-document from constant pool / here-string from stack
CmdSubst(idx) | Command substitution: capture stdout of subprogram
SubshellBegin / SubshellEnd | Isolate scope for subshell execution
ProcessSubIn(idx) / ProcessSubOut(idx) | Process substitution <(cmd) / >(cmd): push FIFO path
Glob / GlobRecursive | Glob expand pattern from stack; recursive variant is parallel
TestFile(test) | File test: -f -d -r -w -x -e -s -L -S -p -b -c
SetStatus / GetStatus | Last exit status $?
TrapSet(idx) / TrapCheck | Signal trap handler registration + periodic trap check
ExpandParam(mod) | 18 parameter expansion modifiers: ${:-} ${:=} ${:?} ${:+} ${#} ${/} ${^^} etc.
WordSplit / BraceExpand / TildeExpand | IFS word split, brace expansion, tilde expansion
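As an illustration of what an op like ExpandParam(mod) has to compute, the sketch below implements four of the listed modifiers as a pure function, following standard POSIX/bash behaviour. The `ParamMod` enum and `expand_param` function are hypothetical names for this example; fusevm's actual modifier encoding is not shown in this document.

```rust
/// Hypothetical subset of parameter-expansion modifiers.
enum ParamMod<'a> {
    Default(&'a str),   // ${var:-word}
    Alternate(&'a str), // ${var:+word}
    Length,             // ${#var}
    Upper,              // ${var^^}
}

/// `value` is None when the variable is unset.
fn expand_param(value: Option<&str>, modifier: &ParamMod) -> String {
    // The ":" forms treat an empty value the same as an unset one.
    let non_empty = value.filter(|v| !v.is_empty());
    match modifier {
        ParamMod::Default(word) => non_empty.unwrap_or(*word).to_string(),
        ParamMod::Alternate(word) => {
            if non_empty.is_some() { (*word).to_string() } else { String::new() }
        }
        // Length counts characters, not bytes.
        ParamMod::Length => value.unwrap_or("").chars().count().to_string(),
        ParamMod::Upper => value.unwrap_or("").to_uppercase(),
    }
}
```

For example, `expand_param(None, &ParamMod::Default("x"))` yields `"x"`, while the same call with `Some("v")` yields `"v"`.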

[0x05] EXTENSION MECHANISM

Language-specific opcodes use Extended(u16, u8), which dispatches through a handler table registered by the frontend. The u16 is the extension op ID (65,536 possible IDs per frontend). The u8 is an inline operand. ExtendedWide(u16, usize) carries a full usize payload for jump targets and large indices.

┌──────────────────────────────────────────────────────────────┐
│                   EXTENSION DISPATCH TABLE                   │
│                                                              │
│  strykelang frontend:                                        │
│    Extended(0, _)   → RegexMatch                             │
│    Extended(1, _)   → RegexSubst                             │
│    Extended(2, _)   → HashSlice                              │
│    Extended(3, _)   → ArraySlice                             │
│    ...                                                       │
│    Extended(449, _) → PmapCollect                            │
│                                                              │
│  zshrs frontend:                                             │
│    Extended(0, _)   → HistoryExpand                          │
│    Extended(1, _)   → ZleWidget                              │
│    Extended(2, _)   → ZstyleLookup                           │
│    ...                                                       │
│    Extended(19, _)  → ModuleLoad                             │
│                                                              │
│  awkrs frontend:                                             │
│    Extended(0, _)   → FieldGet                               │
│    Extended(1, _)   → FieldSet                               │
│    Extended(2, _)   → PrintColumns                           │
│    ...                                                       │
│    Extended(94, _)  → GetlineCmd                             │
│                                                              │
│  No conflicts: each frontend owns its own ID space.          │
└──────────────────────────────────────────────────────────────┘
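The reason IDs cannot collide is that the ID is interpreted only by whichever handler the VM instance has registered. The following sketch shows that idea with a toy VM; the `Vm` struct, the `ExtHandler` alias (which receives the value stack rather than the whole VM), and the two example frontends are assumptions made for illustration, not fusevm's API.

```rust
// Toy model of per-VM extension dispatch.
type ExtHandler = Box<dyn FnMut(&mut Vec<i64>, u16, u8)>;

enum Op {
    LoadInt(i64),
    Extended(u16, u8), // (extension op ID, inline operand)
}

struct Vm {
    stack: Vec<i64>,
    handler: Option<ExtHandler>,
}

impl Vm {
    fn new() -> Self {
        Vm { stack: Vec::new(), handler: None }
    }

    fn set_extension_handler(&mut self, handler: ExtHandler) {
        self.handler = Some(handler);
    }

    fn run(&mut self, ops: &[Op]) -> Result<Option<i64>, String> {
        for op in ops {
            match *op {
                Op::LoadInt(n) => self.stack.push(n),
                Op::Extended(id, arg) => match self.handler.as_mut() {
                    Some(handler) => handler(&mut self.stack, id, arg),
                    None => return Err(format!("no handler for Extended({id}, {arg})")),
                },
            }
        }
        Ok(self.stack.last().copied())
    }
}

/// Hypothetical frontend A: ID 0 doubles the top of the stack.
fn doubling_frontend() -> ExtHandler {
    Box::new(|stack: &mut Vec<i64>, id: u16, _arg: u8| {
        if id == 0 {
            if let Some(top) = stack.last_mut() { *top *= 2; }
        }
    })
}

/// Hypothetical frontend B: the same ID 0 negates the top of the stack.
fn negating_frontend() -> ExtHandler {
    Box::new(|stack: &mut Vec<i64>, id: u16, _arg: u8| {
        if id == 0 {
            if let Some(top) = stack.last_mut() { *top = -*top; }
        }
    })
}
```

Running the same program `[LoadInt(21), Extended(0, 0)]` on two VMs with different handlers gives 42 on one and -21 on the other, which is the "own ID space" property in miniature.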

[0x06] JIT COMPILATION

The JitCompiler runs three tiers in increasing order of optimization power and compile cost. Each tier covers a disjoint slice of the workload — they don't compete. Compile-time decisions and runtime invocation are mediated through one stateless JitCompiler handle (the actual cache is thread-local).

Tier | Trigger | Coverage | Speculation
--- | --- | --- | ---
Linear | first call | Straight-line expression chunks; returns Value (int or float) | None; IR matches bytecode exactly
Block | ≥ 1 invocation (default) | Whole-chunk CFG (loops, branches, fused backedges) | None; slot ops assume i64
Tracing | ≥ 50 backedges through any loop header | Hot path through anything; recorded loop body compiled with type-specialized IR | Slot-type entry guard + per-branch brif guards; deopts to interpreter on guard miss

The block (default 1) and tracing (default 50) warmup thresholds control how many runs (block) or loop backedges (tracing) a chunk must accumulate before that tier compiles it. Both are tunable per process via the FUSEVM_JIT_BLOCK_THRESHOLD and FUSEVM_JIT_TRACE_THRESHOLD environment variables (read once per thread when the JIT is first touched), or per thread via TraceJitConfig + JitCompiler::set_config. For workloads that re-run the same scripts repeatedly, pair a low warmup with the jit-disk-cache feature (active by default once compiled in): the warmup decides when a tier engages, and the disk cache lets later runs reload the native code instead of regenerating it, giving AOT-like speed without explicit AOT. Setting FUSEVM_JIT_BLOCK_THRESHOLD=0 is the most aggressive option (block-compile every eligible chunk on its first run, then reload from cache); raise the thresholds again for scripts that genuinely run only once.
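A minimal sketch of how such thresholds could be read and applied follows. Only the two environment variable names and the defaults (1 and 50) come from the text; the helper names, the fallback to the default on malformed input, and the "prior runs ≥ threshold" comparison are assumptions.

```rust
/// Hypothetical helper: parse a threshold, falling back to `default`
/// when the variable is unset or not a valid number (assumed behaviour).
fn threshold_from(raw: Option<&str>, default: u32) -> u32 {
    raw.and_then(|s| s.trim().parse::<u32>().ok()).unwrap_or(default)
}

fn read_env_threshold(name: &str, default: u32) -> u32 {
    let raw = std::env::var(name).ok();
    threshold_from(raw.as_deref(), default)
}

thread_local! {
    // Initialized lazily on first access in each thread, mirroring
    // "read once per thread when the JIT is first touched".
    static BLOCK_THRESHOLD: u32 = read_env_threshold("FUSEVM_JIT_BLOCK_THRESHOLD", 1);
    static TRACE_THRESHOLD: u32 = read_env_threshold("FUSEVM_JIT_TRACE_THRESHOLD", 50);
}

#[allow(dead_code)]
fn block_threshold() -> u32 {
    BLOCK_THRESHOLD.with(|t| *t)
}

#[allow(dead_code)]
fn trace_threshold() -> u32 {
    TRACE_THRESHOLD.with(|t| *t)
}

/// With threshold 0 a chunk compiles on its first run (no prior runs needed);
/// with the default of 1 it compiles once one run has completed.
fn should_compile(prior_runs: u32, threshold: u32) -> bool {
    prior_runs >= threshold
}
```

Under this model, `should_compile(0, 0)` is true (compile immediately) while `should_compile(0, 1)` is false (interpret the first run, compile on the second).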

Tracing JIT is opt-in per VM via vm.enable_tracing_jit(). When enabled, VM::run() auto-dispatches to all three tiers in priority order (phase 10): block JIT first if the chunk is fully eligible (zero VM-side overhead, direct fn-ptr through the slot pointer); tracing JIT for hot loops in chunks block JIT can't take; interpreter for cold paths and edge cases. Block-eligible chunks short-circuit before tracing JIT records anything, so the two tiers never compete on the same chunk.
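The priority order described above can be written as a small decision function. This is a simplified sketch, not fusevm's dispatcher: `ChunkState`, `Thresholds`, and `pick_tier` are hypothetical, the linear tier is left out, and it assumes a VM without tracing enabled simply interprets.

```rust
#[derive(Debug, PartialEq)]
enum Tier {
    Block,
    Tracing,
    Interpreter,
}

/// Hypothetical per-chunk bookkeeping.
struct ChunkState {
    block_eligible: bool,   // the whole chunk passes block-JIT eligibility
    prior_runs: u32,        // completed runs of this chunk
    hottest_backedge: u32,  // backedge count at the hottest loop header
}

struct Thresholds {
    block: u32,
    trace: u32,
}

fn pick_tier(tracing_enabled: bool, chunk: &ChunkState, t: &Thresholds) -> Tier {
    if !tracing_enabled {
        return Tier::Interpreter; // assumption: auto-dispatch is opt-in
    }
    if chunk.block_eligible {
        // Block-eligible chunks short-circuit here, so the trace recorder
        // never sees them and the two tiers cannot compete.
        return if chunk.prior_runs >= t.block { Tier::Block } else { Tier::Interpreter };
    }
    if chunk.hottest_backedge >= t.trace {
        Tier::Tracing
    } else {
        Tier::Interpreter
    }
}
```

Note that a block-eligible chunk with a very hot loop still resolves to `Tier::Block`, never `Tier::Tracing`, which is the disjointness property the paragraph describes.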

Tracing JIT capability matrix

Capability | Phase | Detail
--- | --- | ---
Loop bodies, int slots, no calls | 1 | Loops with ≤MAX_TRACE_SLOT int slots, single backward closing branch
Cross-call inlining (branchless callees) | 2 | Op::Call inlines callee body into trace IR; per-frame slot-variable scope
Caller-frame internal branches with side-exits | 3 | if/else in caller frame compiles with brif guards + per-branch side-exit blocks
Callee-frame branches, frame materialization on deopt | 4 | Branches inside inlined callees; DeoptInfo out-param materializes synthetic Frames on vm.frames
Value-stack reconstruction on deopt (Int + Float) | 5 + 5b | Non-empty abstract stack at branch is OK; stack_kinds tag distinguishes Int from Float entries
Side-exit deopt counter + auto-blacklist | 6 | Per-trace side_exit_count; blacklist after MAX_SIDE_EXITS misses
Persistent metadata export/import | 7 | TraceMetadata (serde-serializable) round-trips through trace_export / trace_import (and _all bulk variants)
Bounded recursion inlining | 8 | Self-recursive calls inline up to MAX_INLINE_RECURSION levels deep before aborting
Side-trace stitching from hot deopt sites | 9 | Recorder splits record_anchor_ip (cache key) from close_anchor_ip (loop header); side traces compile via trace_install_with_kind and don't loop in their own IR; chained dispatch up to MAX_TRACE_CHAIN hops
Synergistic three-tier auto-dispatch | 10 | VM::run() consults block JIT first, then tracing JIT, then interpreter; block-eligible chunks short-circuit before any recording happens
Configurable thresholds + float slots + bulk persist | tune | TraceJitConfig for per-thread tuning; SlotKind::Float slots stored as i64 bit-patterns and bit-cast through; trace_export_all / trace_import_all for batch I/O
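The phase 6 row (side-exit counter plus auto-blacklist) describes a simple state machine, sketched below. The `Trace` struct and `dispatch` method are hypothetical; the outcome names are borrowed from the TraceLookup variants listed later in this document, and the `MAX_SIDE_EXITS` value of 3 is purely illustrative.

```rust
const MAX_SIDE_EXITS: u32 = 3; // illustrative value, not fusevm's default

#[derive(Debug, PartialEq)]
enum Outcome {
    Ran,           // guards held, native trace executed
    GuardMismatch, // guard failed, deopt to the interpreter
    Skip,          // trace is blacklisted, do not even try
}

struct Trace {
    side_exit_count: u32,
    blacklisted: bool,
}

impl Trace {
    fn new() -> Self {
        Trace { side_exit_count: 0, blacklisted: false }
    }

    /// `guard_holds` stands in for the real entry/branch guard checks.
    fn dispatch(&mut self, guard_holds: bool) -> Outcome {
        if self.blacklisted {
            return Outcome::Skip;
        }
        if guard_holds {
            return Outcome::Ran;
        }
        self.side_exit_count += 1;
        if self.side_exit_count >= MAX_SIDE_EXITS {
            // Too many misses: stop paying the guard-check + deopt cost.
            self.blacklisted = true;
        }
        Outcome::GuardMismatch
    }
}
```

After three misses the trace reports `Skip` even for inputs whose guards would hold, so a trace that keeps deopting stops costing anything beyond the blacklist check.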

Persistent native-code disk cache

Enable the jit-disk-cache feature (cargo add fusevm --features jit-disk-cache) to cache compiled native code to disk, skipping Cranelift codegen across process restarts — a big win for workloads that re-launch the VM repeatedly (e.g. running a large test suite over and over). It covers all three tiers (linear, block, tracing) and is on by default once the feature is enabled, writing to ~/.cache/fusevm-jit. Override the directory with the FUSEVM_JIT_CACHE_DIR env var or JitCompiler::set_jit_cache_dir(Some(dir)); disable at runtime with FUSEVM_JIT_CACHE_DIR=off or set_jit_cache_dir(None). This is distinct from the caller-owned TraceMetadata export below: the disk cache persists the finished machine code, while TraceMetadata persists the recorder's decisions (and still pays codegen on restore).

Property | Detail
--- | ---
Tiers cached | Linear, block, and tracing; files tier-tagged .lin. / .blk. / .trc.
Keying | Chunk op-hash; tracing tier also keys on record-anchor IP + a content hash over recorded ops, IPs, slot types, and constants so divergent paths never collide
Loading | mmaps native code + re-patches a small relocation table; W^X handled via pthread_jit_write_protect_np + icache invalidation on Apple Silicon, mprotect elsewhere
Concurrency | Writes publish via unique temp file + atomic rename, which is safe for the many-processes-spawning-VMs workload
Size control | Total-size cap (default 256 MiB) with oldest-first (mtime) eviction down to 80% of the cap, applied opportunistically on write. Tune via FUSEVM_JIT_CACHE_MAX_BYTES (accepts k/m/g suffixes; 0/off/unlimited disables eviction) or set_jit_cache_max_bytes; inspect with jit_cache_size_bytes, force a pass with prune_jit_cache, wipe with clear_jit_cache
Transparency | Only eliminates Cranelift codegen time; tier selection, warmup thresholds, and results are identical to an uncached run
Safety | Conservative: any chunk whose code carries a relocation other than a known host-helper call falls back to the in-memory JIT, so an untested target degrades to "no caching" rather than miscompiling
Benchmark | Cached block load ~35 µs vs ~152 µs cold codegen (cargo bench --features jit-disk-cache --bench jit_disk_cache)
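The size-control row combines two small algorithms: parsing a human-friendly byte count and choosing eviction victims. The sketch below models both as pure functions. `parse_max_bytes`, `Entry`, and `plan_eviction` are hypothetical names; treating k/m/g as binary multiples (1024-based) is an assumption suggested by the "256 MiB" default, and error handling for malformed values is invented for the example.

```rust
/// Parses values such as "4096", "512k", "256m", "1g".
/// Returns Ok(None) when eviction is disabled ("0", "off", "unlimited").
fn parse_max_bytes(raw: &str) -> Result<Option<u64>, String> {
    let s = raw.trim().to_ascii_lowercase();
    if s == "0" || s == "off" || s == "unlimited" {
        return Ok(None);
    }
    // Assumption: suffixes are binary multiples.
    let (digits, mult) = match s.as_bytes().last() {
        Some(b'k') => (&s[..s.len() - 1], 1u64 << 10),
        Some(b'm') => (&s[..s.len() - 1], 1u64 << 20),
        Some(b'g') => (&s[..s.len() - 1], 1u64 << 30),
        _ => (s.as_str(), 1),
    };
    let n: u64 = digits.parse().map_err(|_| format!("bad size: {raw}"))?;
    n.checked_mul(mult)
        .map(Some)
        .ok_or_else(|| format!("size overflows u64: {raw}"))
}

/// Hypothetical view of one cache file.
struct Entry {
    name: &'static str,
    mtime: u64, // seconds since epoch; smaller = older
    size: u64,  // bytes
}

/// If the total exceeds `cap`, returns the files to delete, oldest first,
/// until the total is at or below 80% of the cap.
fn plan_eviction(mut entries: Vec<Entry>, cap: u64) -> Vec<&'static str> {
    let mut total: u64 = entries.iter().map(|e| e.size).sum();
    if total <= cap {
        return Vec::new();
    }
    let target = cap / 5 * 4; // 80% of the cap
    entries.sort_by_key(|e| e.mtime);
    let mut victims = Vec::new();
    for e in entries {
        if total <= target {
            break;
        }
        total -= e.size;
        victims.push(e.name);
    }
    victims
}
```

Evicting down to 80% rather than exactly to the cap gives hysteresis: the next few writes do not immediately trigger another pruning pass.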

Speedup over interpreter

Apple M-series, criterion, cargo bench --features jit --bench jit_trace. "Block JIT (direct)" invokes JitCompiler::try_run_block with no VM around it — the floor through the JIT pipeline. "Tracing-JIT VM" measures vm.run() with enable_tracing_jit() set, which auto-dispatches block JIT for these block-eligible chunks (phase 10). The remaining gap between the two columns is purely VM construction + slot copy-in/out per vm.run() call.

Workload | Iterations | Interpreter | Block JIT (direct) | Tracing-JIT VM | VM vs Interp
--- | --- | --- | --- | --- | ---
counter_loop | 1,000 | 23.4 µs | 305 ns | 506 ns | 46x
counter_loop | 10,000 | 235.5 µs | 2.80 µs | 2.96 µs | 80x
counter_loop | 100,000 | 2,474 µs | 29.07 µs | 27.88 µs | 89x
loop_with_branch | 1,000 | 39.8 µs | 310 ns | 487 ns | 82x
loop_with_branch | 10,000 | 410.7 µs | 2.78 µs | 2.97 µs | 138x
loop_with_branch | 100,000 | 4,058 µs | 27.48 µs | 27.75 µs | 146x

Microbenchmarks measure tight integer counter loops in isolation — best case for any JIT. Real-world script speedup is bounded by Amdahl: most shell-script time goes to host calls (fork/exec, I/O, glob, builtins) which no JIT tier touches. Typical numeric inner loops see the kernel speedup; surrounding shell logic doesn't. Numbers above are representative of the JIT pipeline itself, not of any specific workload.
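The Amdahl bound mentioned above is easy to compute directly. In the sketch below, `amdahl` is the standard formula 1 / ((1 - p) + p / s); the example fractions are hypothetical and do not describe any measured fusevm workload.

```rust
/// Overall speedup when a fraction `p` of the original runtime
/// is accelerated by a factor `s` and the rest is unchanged.
fn amdahl(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}
```

For a script that spends a hypothetical 10% of its time in JIT-able loops, even a 100x kernel speedup gives `amdahl(0.10, 100.0) ≈ 1.11`, about 11% faster overall; only when nearly all time is in the loop does the whole-program number approach the kernel number.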

Usage

use fusevm::{ChunkBuilder, JitCompiler, Op, TraceJitConfig, TraceMetadata, VM, VMResult, Value};

let mut b = ChunkBuilder::new();
// ... emit a counter loop ...
let chunk = b.build();

let mut vm = VM::new(chunk.clone());
// Phase 10: enabling tracing JIT also enables auto-dispatch through
// block JIT for fully-eligible chunks. Pick whichever tier fits.
vm.enable_tracing_jit();
match vm.run() {
    VMResult::Ok(_val) => { /* ran via block JIT, tracing JIT, or interpreter: automatic */ }
    _ => {}
}

// Per-thread tuning of thresholds (optional).
let jit = JitCompiler::new();
jit.set_config(TraceJitConfig {
    trace_threshold: 20, // arm earlier on hot loops
    block_threshold: 0,  // block-JIT the whole chunk on its first run (default is 1)
    max_side_exits: 100, // be more patient with mismatches
    ..TraceJitConfig::defaults()
});

// Caller-owned persistence of recorder decisions (still re-runs codegen on
// restore). For zero-codegen restarts, prefer the jit-disk-cache feature above.
let metas = jit.trace_export_all(&chunk);
std::fs::write("traces.json", serde_json::to_vec(&metas)?)?;

// On next process start: re-import + re-compile in one pass.
let bytes = std::fs::read("traces.json")?;
let metas: Vec<TraceMetadata> = serde_json::from_slice(&bytes)?;
let n = jit.trace_import_all(&chunk, &metas); // returns # restored

Type | Role
--- | ---
JitCompiler | Stateless handle over the thread-local trace + block caches; entry point for all tier APIs
JitExtension | Frontend-provided trait registering language-specific extended op JIT support
TraceJitConfig | Per-thread tunable thresholds (trace_threshold, block_threshold, max_side_exits, max_inline_recursion, max_trace_chain, max_trace_len)
SlotKind | Slot type tag (Int / Float) for the tracing JIT's entry guard. Float slots stored as i64 bit-patterns and bit-cast through
TraceLookup | Dispatch outcome at a backward branch (NotHot / StartRecording / Ran / GuardMismatch / Skip)
DeoptInfo / DeoptFrame | #[repr(C)] out-params trace fns populate to materialize inlined frames + value-stack on side-exit
TraceMetadata | Serde-serializable record for persistent trace cache (phase 7); bulk variants trace_export_all / trace_import_all
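The SlotKind row says float slots travel through the JIT as i64 bit-patterns. A minimal sketch of that round-trip, using only standard-library bit casts, is shown below; the helper functions and the slice-equality `entry_guard` are assumptions for illustration, not fusevm's implementation.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum SlotKind {
    Int,
    Float,
}

/// Store an f64 in an i64 slot without changing any bits.
fn store_float(slots: &mut [i64], idx: usize, v: f64) {
    slots[idx] = v.to_bits() as i64;
}

/// Reinterpret the slot's bits as an f64 again.
fn load_float(slots: &[i64], idx: usize) -> f64 {
    f64::from_bits(slots[idx] as u64)
}

/// Entry guard: a trace may run only if every slot still has the kind
/// it had when the trace was recorded.
fn entry_guard(recorded: &[SlotKind], current: &[SlotKind]) -> bool {
    recorded == current
}
```

Because the cast is bit-exact, values such as -0.0 survive the round trip, and the kind tag is what tells the trace whether to apply integer or floating-point instructions to a given slot.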

[0x07] VALUE REPRESENTATION

Value is a tagged enum with fast-path immediates for numbers and booleans, and heap types for strings, arrays, and hashes. String coercion returns Cow<str> via as_str_cow() — borrows Str variants without allocation. Array and hash mutations operate in-place on globals, eliminating clone-modify-writeback.

Variant | Representation | Notes
--- | --- | ---
Undef | Tag only | Perl/shell undef / unset
Int(i64) | Inline 8 bytes | Fast-path integer arithmetic
Float(f64) | Inline 8 bytes | IEEE 754 double
Bool(bool) | Inline 1 byte | Logical ops, conditionals
Str(Arc<String>) | Heap, Arc-shared | UTF-8, Cow<str> coercion borrows without alloc
Array(Vec<Value>) | Heap, in-place mutation | Ordered collection, direct ref mut access
Hash(HashMap<String, Value>) | Heap, in-place mutation | Key-value map, direct ref mut access
Status(i32) | Inline 4 bytes | Exit status ($?)
Ref(Box<Value>) | Heap | Pass-by-reference, nested structures
NativeFn(u16) | Inline 2 bytes | Builtin function pointer ID
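The Cow<str> coercion can be sketched with a reduced three-variant `Value`. The method name `as_str_cow` comes from the text above; the variant subset and the choice to render Undef as the empty string are assumptions for this example.

```rust
use std::borrow::Cow;
use std::sync::Arc;

// Reduced model of the enum in the table above.
enum Value {
    Undef,
    Int(i64),
    Str(Arc<String>),
}

impl Value {
    /// String coercion that allocates only when it has to.
    fn as_str_cow(&self) -> Cow<'_, str> {
        match self {
            // Already a string: borrow it, no allocation.
            Value::Str(s) => Cow::Borrowed(s.as_str()),
            // Assumed rendering of undef as the empty string.
            Value::Undef => Cow::Borrowed(""),
            // Numbers must be formatted, so this arm allocates.
            Value::Int(n) => Cow::Owned(n.to_string()),
        }
    }
}
```

Callers can treat the result as `&str` in either case; only the `Int` arm pays for a `String`, which is the point of returning `Cow` instead of `String`.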

[0x08] CHUNK STRUCTURE

A Chunk is the unit of compiled bytecodes. It contains everything the VM needs to execute a compilation unit — function body, script, or REPL line.

┌──────────────────────────────────────────────────────────┐
│                          Chunk                           │
│                                                          │
│  ops: Vec<Op>                      bytecodes             │
│  constants: Vec<Value>             constant pool         │
│  names: Vec<String>                variable name pool    │
│  lines: Vec<u32>                   source line numbers   │
│  sub_entries: Vec<(u16,usize)>     subroutine IPs        │
│  block_ranges: Vec<(usize,usize)>  block spans           │
│  source: String                    source file name      │
│                                                          │
│  Built by ChunkBuilder::emit() + build()                 │
│  Serializable via serde (JSON, bincode, etc.)            │
└──────────────────────────────────────────────────────────┘

[0xFF] LICENSE & LINKS

MIT — Copyright © 2026 MenkeTechnologies

crates.io · docs.rs · GitHub · strykelang · zshrs · awkrs
