>_AWKRS REFERENCE
A fast AWK implementation written in Rust. Bytecode VM with optional Cranelift JIT, parallel record processing with rayon, and broad CLI compatibility with gawk, mawk, and nawk. Drop-in replacement for text processing pipelines.
Quickstart
Install from crates.io or build from source, then use aw (short) or awkrs:
# install
cargo install awkrs
# from source
git clone https://github.com/MenkeTechnologies/awkrs
cd awkrs && cargo build
# one-liners
aw 'BEGIN { print "hello, world" }'
aw -F: '{ print $1 }' /etc/passwd
aw '{ sum += $1 } END { print sum }' numbers.txt
echo "1 2 3" | aw '{ print $1 + $2 + $3 }'
# field processing
ls -l | aw 'NR > 1 { total += $5 } END { print total }'
# pattern matching
aw '/error/i { print FILENAME ":" NR ":" $0 }' *.log
Full install + usage live in the README.
Why awkrs — Feature Comparison
| Feature | awkrs | gawk | mawk | nawk |
|---|---|---|---|---|
| Parallel records | ✓ | ✗ | ✗ | ✗ |
| JIT compilation | Cranelift | ✗ | ✗ | ✗ |
| Bytecode VM | ✓ | ✓ | ✓ | ✗ |
| Persistent bytecode cache | ✓ | ✗ | ✗ | ✗ |
| Unicode support | ✓ | ✓ | partial | ✗ |
| CSV mode | ✓ | ✓ | ✗ | ✗ |
| Regex backrefs | ✓ | ✓ | ✗ | ✗ |
| Time functions | ✓ | ✓ | ✗ | ✗ |
| I18N (gettext) | ✓ | ✓ | ✗ | ✗ |
| Network I/O | ✓ | ✓ | ✗ | ✗ |
| Single binary | ~8MB | pkg | ~200KB | pkg |
| Memory safety | Rust | C | C | C |
Overview
- Parser & compiler — recursive-descent parser producing an AST, compiled to bytecode for the VM. Hot paths can be JIT-compiled via Cranelift.
- Values — AWK values (string/number/uninitialized) with automatic coercion. Arrays are associative (hash maps).
- Regex — three-tier engine: Rust regex → fancy-regex (backrefs) → pcre2 (advanced).
- Parallelism — -P flag enables parallel record processing via rayon work-stealing.
- Bytecode cache — -f script.awk runs memoize compiled bytecode to ~/.awkrs/scripts.rkyv; repeat invocations skip lex/parse/compile.
- Binary size — ~8MB stripped with LTO.
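The value model described above (automatic string/number coercion, uninitialized values, associative arrays) can be illustrated with a short sketch. It uses only POSIX AWK behavior; the AWK shell variable is an assumption introduced here so the snippet runs with any system awk, and can be set to aw to try it under awkrs.

```shell
AWK="${AWK:-awk}"   # assumption: falls back to the system awk; use AWK=aw for awkrs

# "+" coerces the string "10" to a number; concatenation coerces 3 to a string.
coerced=$("$AWK" 'BEGIN { s = "10"; n = 3; print s + n, s n }')
echo "$coerced"     # 13 103

# An uninitialized variable is 0 in numeric context and "" in string context.
uninit=$("$AWK" 'BEGIN { print x + 0, "[" x "]" }')
echo "$uninit"      # 0 []

# Arrays are keyed by strings: count how often each word appears.
counts=$(printf 'a b a\n' |
    "$AWK" '{ for (i = 1; i <= NF; i++) c[$i]++ } END { print c["a"], c["b"] }')
echo "$counts"      # 2 1
```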
Built-in Variables
| Variable | Description |
|---|---|
| $0 | Current input record (entire line) |
| $1, $2, ... | Fields of the current record |
| NF | Number of fields in current record |
| NR | Total number of records read so far |
| FNR | Record number in current file |
| FILENAME | Name of current input file |
| FS | Input field separator (default: space) |
| RS | Input record separator (default: newline) |
| OFS | Output field separator |
| ORS | Output record separator |
| OFMT | Output format for numbers |
| CONVFMT | Conversion format for numbers |
| SUBSEP | Subscript separator for arrays |
| RSTART | Start of match from match() |
| RLENGTH | Length of match from match() |
| ARGC, ARGV | Command-line argument count and array |
| ENVIRON | Environment variables array |
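A few of the variables in the table are easiest to understand by watching them change. The sketch below sticks to POSIX behavior; the AWK shell variable is an assumption added for illustration and defaults to the system awk.

```shell
AWK="${AWK:-awk}"   # assumption: set AWK=aw to run under awkrs

# NR is the record number, NF the field count, $NF the last field.
vars_out=$(printf 'a b c\nd e\n' | "$AWK" '{ print NR, NF, $NF }')
echo "$vars_out"

# FS controls how input is split; assigning to a field rebuilds $0 with OFS.
ofs_out=$(printf 'root:x:0\n' |
    "$AWK" 'BEGIN { FS = ":"; OFS = "-" } { $1 = $1; print }')
echo "$ofs_out"     # root-x-0
```

The `$1 = $1` assignment looks like a no-op, but it forces the record to be rejoined with OFS, which is the usual way to change delimiters.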
Built-in Functions
String
length gsub sub match split substr index sprintf tolower toupper
Math
sin cos atan2 exp log sqrt int rand srand
I/O
print printf getline close fflush system
Time (gawk)
systime mktime strftime
Bit ops (gawk)
and or xor compl lshift rshift
Type (gawk)
typeof isarray
Array (gawk)
asort asorti delete
Regex (gawk)
gensub patsplit
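As a rough illustration of the string functions listed above, the following sketch combines several of the POSIX ones in a single program. The gawk-only functions are left out so the snippet stays portable; the AWK shell variable is an assumption that defaults to the system awk.

```shell
AWK="${AWK:-awk}"   # assumption: set AWK=aw to run under awkrs

str_out=$("$AWK" 'BEGIN {
    s = "hello, world"
    n = split(s, parts, ", ")      # parts[1] = "hello", parts[2] = "world"
    print n, toupper(parts[1]), substr(s, 8, 5), index(s, "world"), length(s)
}')
echo "$str_out"     # 2 HELLO world 8 12

# gsub returns the number of replacements and edits $0 in place.
gsub_out=$(echo "a-b-c" | "$AWK" '{ n = gsub(/-/, "+"); print n, $0 }')
echo "$gsub_out"    # 2 a+b+c
```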
Examples
Field extraction
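A minimal field-extraction sketch, assuming passwd-style input supplied inline so the result does not depend on the host system. The AWK shell variable is an assumption that defaults to the system awk.

```shell
AWK="${AWK:-awk}"   # assumption: set AWK=aw to run under awkrs

# Print the login name (field 1) and shell (field 7) of each record.
fields_out=$(printf 'root:x:0:0:root:/root:/bin/sh\nbin:x:1:1:bin:/bin:/sbin/nologin\n' |
    "$AWK" -F: '{ print $1, $7 }')
echo "$fields_out"
```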
Aggregation
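A small aggregation sketch: per-key totals accumulated in an associative array and printed in END. The sample data and the AWK shell variable (defaulting to the system awk) are assumptions for illustration.

```shell
AWK="${AWK:-awk}"   # assumption: set AWK=aw to run under awkrs

# Sum column 2 grouped by column 1. "for (k in total)" has no defined
# order, so the output is piped through sort to make it stable.
agg_out=$(printf 'web 10\ndb 5\nweb 7\n' |
    "$AWK" '{ total[$1] += $2 } END { for (k in total) print k, total[k] }' |
    sort)
echo "$agg_out"
```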
Pattern matching
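A pattern-matching sketch using a regex pattern as the rule guard. The log lines are made up for the example, and the AWK shell variable is an assumption that defaults to the system awk.

```shell
AWK="${AWK:-awk}"   # assumption: set AWK=aw to run under awkrs

# The action runs only for records matching /^ERROR/; END reports the count.
match_out=$(printf 'INFO start\nERROR disk full\nWARN slow\nERROR timeout\n' |
    "$AWK" '/^ERROR/ { n++; print NR ": " $0 } END { print n " errors" }')
echo "$match_out"
```

For a case-insensitive match that works in any POSIX awk, `tolower($0) ~ /error/` is a portable alternative.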
Text transformation
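A text-transformation sketch that rewrites one field and then applies a substitution to the whole record. Input data and the AWK shell variable (defaulting to the system awk) are assumptions for illustration.

```shell
AWK="${AWK:-awk}"   # assumption: set AWK=aw to run under awkrs

# Uppercase the first field, then replace every run of digits with "#".
xform_out=$(printf 'alice 42 ok\nbob 7 fail\n' |
    "$AWK" '{ $1 = toupper($1); gsub(/[0-9]+/, "#"); print }')
echo "$xform_out"
```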
Multi-file processing
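A multi-file sketch showing how FILENAME, FNR, and NR behave across inputs. The two temporary files are created only for the example; the AWK shell variable is an assumption that defaults to the system awk.

```shell
AWK="${AWK:-awk}"   # assumption: set AWK=aw to run under awkrs

dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one.txt"
printf 'c\n'    > "$dir/two.txt"

# FNR restarts at 1 for each file; NR keeps counting across all files.
multi_out=$(cd "$dir" && "$AWK" '{ print FILENAME, FNR, NR }' one.txt two.txt)
echo "$multi_out"

rm -rf "$dir"
```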
CLI Flags
-f FILE      # read program from file
-F FS        # set field separator
-v VAR=VAL   # set variable before execution
-b           # binary mode (no UTF-8)
-c           # CSV mode
-d           # debug: dump variables
-e PROG      # program text (multiple allowed)
-E FILE      # like -f, but different variable handling
-g           # GNU regex mode
-i FILE      # include file (library)
-k           # CSV mode with header
-l LIB       # load extension library
-M           # arbitrary precision math
-n           # no implicit input loop
-N           # decimal context for -M
-o FILE      # pretty-print to file
-O           # optimize (enable JIT)
-p FILE      # profile output
-P           # POSIX mode
-r           # extended regex (ERE)
-s           # sandbox mode
-S           # sandbox + safe mode
-t           # lint-old compatibility warnings
-V           # version
-W OPT       # gawk-style option
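The three flags shared by every POSIX awk (-F, -v, -f) cover most scripted use. The sketch below combines them; the program file, the min variable, and the AWK shell variable (defaulting to the system awk) are assumptions made up for the example.

```shell
AWK="${AWK:-awk}"   # assumption: set AWK=aw to run under awkrs

prog=$(mktemp)
# Program file for -f: print the first field of rows whose second field exceeds min.
printf '%s\n' '$2 > min { print $1 }' > "$prog"

# -F, sets the field separator, -v sets min before the program starts.
flags_out=$(printf 'a,1\nb,5\nc,9\n' | "$AWK" -F, -v min=4 -f "$prog")
echo "$flags_out"

rm -f "$prog"
```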
Parallel Processing
Use -P or --parallel to enable parallel record processing. Each record is processed independently using rayon work-stealing across all CPU cores.
# process large file in parallel
aw -P '{ complex_computation($0) }' huge_file.txt
# parallel aggregation (thread-safe)
aw -P '{ sum += $1 } END { print sum }' data.txt
Note: Parallel mode may reorder output. Use -P -s for sorted output by record number.
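Why an aggregation such as a sum tolerates reordering while printed output does not can be shown without any parallel runtime. The sketch below is only the general partial-sum idea: it splits the records into two groups, sums each independently, and merges the results. How awkrs actually partitions and merges work is not described here, so this is not a model of its internals. The AWK shell variable is an assumption that defaults to the system awk.

```shell
AWK="${AWK:-awk}"   # assumption: set AWK=aw to run under awkrs

data=$(printf '1\n2\n3\n4\n5\n6\n')

# Sequential baseline.
seq_sum=$(printf '%s\n' "$data" | "$AWK" '{ s += $1 } END { print s }')

# Two independent partial sums (odd and even records), merged afterwards.
p1=$(printf '%s\n' "$data" | "$AWK" 'NR % 2 == 1 { s += $1 } END { print s }')
p2=$(printf '%s\n' "$data" | "$AWK" 'NR % 2 == 0 { s += $1 } END { print s }')
par_sum=$((p1 + p2))

echo "$seq_sum $par_sum"   # 21 21
```

Addition is commutative and associative, so the merged total matches the sequential one regardless of which group finishes first. A per-record print has no such property, which is why output order can change.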
Bytecode Cache
Based on a survey of the major public awk implementations (BWK awk, gawk, mawk, goawk, frawk, zawk), awkrs appears to be the first awk implementation to combine a bytecode VM, a JIT compiler, and a persistent on-disk bytecode cache. frawk is the closest prior art — it has VM + Cranelift/LLVM JIT — but re-compiles on every invocation. gawk's pm-gawk persists script variables, not compiled bytecode.
| Implementation | Bytecode VM | JIT | Persistent bytecode cache |
|---|---|---|---|
| BWK awk (one-true-awk) | ✗ | ✗ | ✗ |
| gawk | ✓ | ✗ | ✗ |
| mawk | ✓ | ✗ | ✗ |
| goawk | ✓ | ✗ | ✗ |
| frawk | ✓ | ✓ (Cranelift + LLVM) | ✗ |
| zawk (frawk fork) | ✓ | ✓ (Cranelift + LLVM) | ✗ |
| awkrs | ✓ | ✓ (Cranelift) | ✓ |
Invocations of the form awkrs -f script.awk ... memoize the CompiledProgram to an rkyv-archived shard at ~/.awkrs/scripts.rkyv. The cache-hit path is an mmap, a zero-copy ArchivedHashMap lookup, and a bincode decode of the matched entry's inner blob. This is the same outer-rkyv / inner-bincode pattern used by zshrs and stryke.
# 1st run: parse + compile + populate ~/.awkrs/scripts.rkyv
awkrs -f script.awk input.txt
# 2nd run: cache hit, skips parse/compile
awkrs -f script.awk input.txt
# disable the cache for one run
AWKRS_CACHE=0 awkrs -f script.awk input.txt
# wipe the cache
rm ~/.awkrs/scripts.rkyv
Invalidation is automatic and silent:
- Source mtime change — editing script.awk causes the next run to miss and recompile.
- Binary mtime newer than entry — rebuilding awkrs (any cargo install / cargo build) invalidates every entry so old bytecode never runs against new code.
- Schema / version drift — package version, format-version byte, and host pointer width are all validated; a fresh shard is written on mismatch.
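The mtime-based rules above amount to a staleness check between a source file and its cache entry. The shell sketch below illustrates only that idea with two stand-in files; the file names and the is_stale helper are hypothetical, and awkrs may well record and compare timestamps differently (for example by storing the mtime inside the entry).

```shell
dir=$(mktemp -d)
src="$dir/script.awk"      # hypothetical stand-in for the script
entry="$dir/entry.bin"     # hypothetical stand-in for its cache entry

touch -t 202401010000 "$src"
touch -t 202401020000 "$entry"     # entry written after the source: fresh

# True when the first file was modified more recently than the second.
is_stale() { [ -n "$(find "$1" -newer "$2")" ]; }

is_stale "$src" "$entry" && first=stale || first=fresh

touch -t 202401030000 "$src"       # simulate editing the script
is_stale "$src" "$entry" && second=stale || second=fresh

echo "$first $second"              # fresh stale
rm -rf "$dir"
```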
Engagement criteria — the cache only kicks in for the simple -f script.awk form. The following skip the cache because they need the AST or are short-lived modes:
- inline programs (-e / --source, -E, or bare-arg form like awkrs '{print $1}' file)
- multi-source assembly (-i / --include, -l / --load, or multiple -f)
- AST-only flags: --debug, --lint, --lint-old, --pretty-print, --gen-pot
Storage — single rkyv archive, atomic-rename writes, flock(LOCK_EX) on a sibling lockfile so concurrent awkrs processes serialize their writes. Reads are unlocked, mmap'd, and rely on rkyv's check_archived_root byte-validation plus the magic/version header.
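The atomic-rename part of the storage scheme can be sketched in shell: write the new contents to a temporary file in the same directory, then rename it over the target so readers see either the old file or the new one, never a partial write. The write_atomic helper and the temporary directory are hypothetical, and the flock(LOCK_EX) step that serializes writers is deliberately left out.

```shell
dir=$(mktemp -d)
cache="$dir/scripts.rkyv"   # stand-in path; the real shard lives under ~/.awkrs/

write_atomic() {
    # Temp file in the same directory, so mv is a same-filesystem rename.
    tmp=$(mktemp "$dir/.scripts.XXXXXX")
    printf '%s' "$1" > "$tmp"
    mv -f "$tmp" "$cache"
}

write_atomic "v1"
write_atomic "v2"

content=$(cat "$cache")
leftover=$(ls -A "$dir" | wc -l | tr -d ' ')   # temp files must not linger
echo "$content $leftover"                      # v2 1
rm -rf "$dir"
```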
gawk Extensions
awkrs implements many gawk extensions for compatibility:
- BEGINFILE / ENDFILE — run before/after each input file
- nextfile — skip to next input file
- @include — include another awk file
- @namespace — namespace support
- Typed regex — @/regex/ strongly typed regex constants
- Indirect function calls — @func_name()
- Two-way pipes — |& for coprocess communication
- Network I/O — /inet/tcp/... special files
- Time functions — systime(), mktime(), strftime()
- Bit operations — and(), or(), xor(), etc.
Repository & Links
- Engineering report — report.html (architecture, perf stack, test coverage, competitive landscape)
- Source — github.com/MenkeTechnologies/awkrs
- Crate — crates.io/crates/awkrs (cargo install awkrs)
- Rust API docs — docs.rs/awkrs
- Issues — github.com/MenkeTechnologies/awkrs/issues
- Parity tests — parity/ holds 2,116 cases run byte-for-byte against gawk, mawk, and BSD awk on every push (2,054 POSIX-portable across cases/ + cases_portable/, 62 gawk-only in cases_gawk/ for typeof, FPAT, gensub, **=, strftime, bit ops, arity messages).