// AWKRS — AWK IN RUST

awkrs v0.4.14 · Rust-powered · Parallel records · Cranelift JIT · rkyv bytecode cache · 2,116 parity tests vs gawk/mawk/BSD awk


>_AWKRS REFERENCE

A fast AWK implementation written in Rust. Bytecode VM with optional Cranelift JIT, parallel record processing with rayon, and broad CLI compatibility with gawk, mawk, and nawk. Drop-in replacement for text processing pipelines.

Quickstart

Install from crates.io or build from source, then use aw (short) or awkrs:

# install
cargo install awkrs

# from source
git clone https://github.com/MenkeTechnologies/awkrs
cd awkrs && cargo build

# one-liners
aw 'BEGIN { print "hello, world" }'
aw -F: '{ print $1 }' /etc/passwd
aw '{ sum += $1 } END { print sum }' numbers.txt
echo "1 2 3" | aw '{ print $1 + $2 + $3 }'

# field processing
ls -l | aw 'NR > 1 { total += $5 } END { print total }'

# pattern matching
aw 'tolower($0) ~ /error/ { print FILENAME ":" FNR ":" $0 }' *.log

Full install + usage live in the README.

Why awkrs — Feature Comparison

Feature  awkrs  gawk  mawk  nawk
Parallel records
JIT compilation  Cranelift
Bytecode VM
Persistent bytecode cache
Unicode support  partial
CSV mode
Regex backrefs
Time functions
I18N (gettext)
Network I/O
Single binary  ~8MB  pkg  ~200KB  pkg
Memory safety  Rust  C  C  C

Overview

  • Parser & compiler — recursive-descent parser producing an AST, compiled to bytecode for the VM. Hot paths can be JIT-compiled via Cranelift.
  • Values — AWK values (string/number/uninitialized) with automatic coercion. Arrays are associative (hash maps).
  • Regex — three-tier engine: Rust regex → fancy-regex (backrefs) → pcre2 (advanced).
  • Parallelism — -P flag enables parallel record processing via rayon work-stealing.
  • Bytecode cache — -f script.awk runs memoize compiled bytecode to ~/.awkrs/scripts.rkyv; repeat invocations skip lex/parse/compile.
  • Binary size — ~8MB stripped with LTO.
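
The value model above can be seen in a short sketch. It uses only POSIX awk behavior; the AWK shell variable is an assumption of this example (point it at aw or awkrs, or leave it unset to fall back to the system awk).

```shell
# AWK is a hypothetical override for this sketch; defaults to the system awk.
AWK="${AWK:-awk}"

# "+" coerces the string "10" to a number, concatenation coerces 3 to a string,
# and an uninitialized variable reads as 0 in numeric context and "" in string context.
coerce_out=$("$AWK" 'BEGIN {
    s = "10"; n = 3
    print s + n
    print s n
    print nothing + 0
    print "[" nothing "]"
}')
echo "$coerce_out"

# Associative array keyed by the first field of each record.
count_out=$(printf 'a\nb\na\n' | "$AWK" '{ c[$1]++ } END { print c["a"], c["b"] }')
echo "$count_out"
```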

Built-in Variables

Variable      Description
$0            Current input record (entire line)
$1, $2, ...   Fields of the current record
NF            Number of fields in current record
NR            Total number of records read so far
FNR           Record number in current file
FILENAME      Name of current input file
FS            Input field separator (default: space)
RS            Input record separator (default: newline)
OFS           Output field separator
ORS           Output record separator
OFMT          Output format for numbers
CONVFMT       Conversion format for numbers
SUBSEP        Subscript separator for arrays
RSTART        Start of match from match()
RLENGTH       Length of match from match()
ARGC, ARGV    Command-line argument count and array
ENVIRON       Environment variables array
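
A few of these variables in action, as a minimal sketch restricted to POSIX awk behavior. The AWK shell variable and the temporary file names are assumptions of this example, not part of awkrs.

```shell
# AWK is a hypothetical override for this sketch; defaults to the system awk.
AWK="${AWK:-awk}"

# NR, NF, $NF and OFS on two records.
vars_out=$(printf 'a b c\nd e\n' | "$AWK" 'BEGIN { OFS = "-" } { print NR, NF, $NF }')
echo "$vars_out"

# NR keeps counting across files, FNR restarts for each file.
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/one.txt"
printf 'c\n' > "$tmp/two.txt"
fnr_out=$("$AWK" '{ print NR, FNR }' "$tmp/one.txt" "$tmp/two.txt")
rm -r "$tmp"
echo "$fnr_out"

# SUBSEP joins the parts of a multi-dimensional subscript into one key.
subsep_out=$("$AWK" 'BEGIN { m["x", "y"] = 1; for (k in m) { split(k, p, SUBSEP); print p[1], p[2] } }')
echo "$subsep_out"
```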

Built-in Functions

String

length gsub sub match split substr index sprintf tolower toupper

Math

sin cos atan2 exp log sqrt int rand srand

I/O

print printf getline close fflush system

Time (gawk)

systime mktime strftime

Bit ops (gawk)

and or xor compl lshift rshift

Type (gawk)

typeof isarray

Array (gawk)

asort asorti delete

Regex (gawk)

gensub patsplit
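
The string functions compose as in the following sketch. It sticks to the POSIX subset (split, match, substr, gsub, toupper, index, printf) so it should behave the same under any awk; the AWK shell variable is an assumption of this example.

```shell
# AWK is a hypothetical override for this sketch; defaults to the system awk.
AWK="${AWK:-awk}"

func_out=$("$AWK" 'BEGIN {
    s = "key=value;other=42"
    n = split(s, parts, ";")              # parts[1] = "key=value", parts[2] = "other=42"
    if (match(parts[2], /[0-9]+/)) {      # sets RSTART and RLENGTH for the digit run
        num = substr(parts[2], RSTART, RLENGTH)
    }
    gsub(/=/, ": ", parts[1])             # in-place replacement on an array element
    printf "%d|%s|%s|%d\n", n, toupper(parts[1]), num, index(s, "other")
}')
echo "$func_out"
```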

Examples

Field extraction

aw -F: '{ print $1, $3 }' /etc/passwd # username and UID
aw '{ print $NF }' file.txt # last field of each line

Aggregation

aw '{ sum += $1 } END { print sum }' numbers.txt
aw '{ count[$1]++ } END { for (k in count) print k, count[k] }' data.txt

Pattern matching

aw '/^#/ { next } { print }' config.txt # skip comments
aw 'NR == 1 || /error/' log.txt # header + error lines

Text transformation

aw '{ gsub(/foo/, "bar"); print }' file.txt
aw 'BEGIN { OFS="," } { $1=$1; print }' file.txt # to CSV

Multi-file processing

aw 'FNR == 1 { print "--- " FILENAME " ---" } { print }' *.txt

CLI Flags

-f FILE            # read program from file
-F FS              # set field separator
-v VAR=VAL         # set variable before execution
-b                 # binary mode (no UTF-8)
-c                 # CSV mode
-d                 # debug: dump variables
-e PROG            # program text (multiple allowed)
-E FILE            # like -f, but different variable handling
-g                 # GNU regex mode
-i FILE            # include file (library)
-k                 # CSV mode with header
-l LIB             # load extension library
-M                 # arbitrary precision math
-n                 # no implicit input loop
-N                 # decimal context for -M
-o FILE            # pretty-print to file
-O                 # optimize (enable JIT)
-p FILE            # profile output
-P                 # POSIX mode
-r                 # extended regex (ERE)
-s                 # sandbox mode
-S                 # sandbox + safe mode
-t                 # lint-old compatibility warnings
-V                 # version
-W OPT             # gawk-style option
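
As a sketch of how the most common flags combine, the following runs a program file with -f, a field separator with -F, and a preset variable with -v. The script name over.awk, the variable limit, and the AWK shell variable are hypothetical; only flags shared with POSIX awk are used.

```shell
# AWK is a hypothetical override for this sketch; defaults to the system awk.
AWK="${AWK:-awk}"

tmp=$(mktemp -d)

# Hypothetical program file: print the name when the second field exceeds "limit".
cat > "$tmp/over.awk" <<'EOF'
$2 + 0 > limit + 0 { print $1 }
EOF

flags_out=$(printf 'alice:30\nbob:5\ncarol:12\n' \
  | "$AWK" -F: -v limit=10 -f "$tmp/over.awk")
rm -r "$tmp"
echo "$flags_out"
```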

Parallel Processing

Use -P or --parallel to enable parallel record processing. Each record is processed independently using rayon work-stealing across all CPU cores.

# process large file in parallel
aw -P '{ complex_computation($0) }' huge_file.txt

# parallel aggregation (thread-safe)
aw -P '{ sum += $1 } END { print sum }' data.txt

Note: Parallel mode may reorder output. Use -P -s for sorted output by record number.
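
If output order matters, one generic shell-level workaround is to tag each output line with NR and restore the order afterwards. This is only a sketch: it assumes NR still reflects the input position of each record in parallel mode, and it is shown without -P so it runs under any awk (the AWK shell variable is hypothetical).

```shell
# AWK is a hypothetical override for this sketch; defaults to the system awk.
AWK="${AWK:-awk}"

# Prefix each result with its record number, sort numerically, then drop the prefix.
ordered_out=$(printf 'x\ny\nz\n' \
  | "$AWK" '{ print NR "\t" toupper($0) }' \
  | sort -n \
  | cut -f2-)
echo "$ordered_out"
```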

Bytecode Cache

Based on a survey of the major public awk implementations (BWK awk, gawk, mawk, goawk, frawk, zawk), awkrs appears to be the first awk implementation to combine a bytecode VM, a JIT compiler, and a persistent on-disk bytecode cache. frawk is the closest prior art — it has VM + Cranelift/LLVM JIT — but re-compiles on every invocation. gawk's pm-gawk persists script variables, not compiled bytecode.

Implementation  Bytecode VM  JIT  Persistent bytecode cache
BWK awk (one-true-awk)
gawk
mawk
goawk
frawk  ✓ (Cranelift + LLVM)
zawk (frawk fork)  ✓ (Cranelift + LLVM)
awkrs  ✓ (Cranelift)

Invocations of the form awkrs -f script.awk ... memoize the compiled CompiledProgram to an rkyv-archived shard at ~/.awkrs/scripts.rkyv. The cache hit path is mmap + zero-copy ArchivedHashMap lookup + bincode-decode of the matched entry's inner blob. This is the same outer-rkyv / inner-bincode pattern used by zshrs and stryke.

# 1st run: parse + compile + populate ~/.awkrs/scripts.rkyv
awkrs -f script.awk input.txt

# 2nd run: cache hit, skips parse/compile
awkrs -f script.awk input.txt

# disable the cache for one run
AWKRS_CACHE=0 awkrs -f script.awk input.txt

# wipe the cache
rm ~/.awkrs/scripts.rkyv

Invalidation is automatic and silent:

  • Source mtime change — editing script.awk causes the next run to miss and recompile.
  • Binary mtime newer than entry — rebuilding awkrs (any cargo install / cargo build) invalidates every entry so old bytecode never runs against new code.
  • Schema / version drift — package version, format-version byte, and host pointer width are all validated; a fresh shard is written on mismatch.
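
The mtime rule can be illustrated with a small shell sketch. This is not awkrs's code: the file names and the needs_recompile helper are hypothetical, and a plain "source newer than entry" test stands in for whatever comparison awkrs actually performs.

```shell
# Hypothetical stand-ins: a script file and a marker file playing the role of its cache entry.
tmp=$(mktemp -d)
script="$tmp/script.awk"
entry="$tmp/entry"
printf '{ print }\n' > "$script"

# A miss when there is no entry yet, or when the source is newer than the entry.
needs_recompile() {
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}

if needs_recompile "$script" "$entry"; then first=miss; else first=hit; fi

# "Compile": create the entry, with explicit timestamps so the comparison is deterministic.
touch -t 202001010000 "$script"
touch -t 202001010100 "$entry"
if needs_recompile "$script" "$entry"; then second=miss; else second=hit; fi

# Editing the script moves its mtime past the entry, forcing a recompile.
touch -t 202001010200 "$script"
if needs_recompile "$script" "$entry"; then third=miss; else third=hit; fi

rm -r "$tmp"
echo "$first $second $third"
```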

Engagement criteria — the cache only kicks in for the simple -f script.awk form. The following skip the cache because they need the AST or are short-lived modes:

  • inline programs (-e/--source, -E, or bare-arg form like awkrs '{print $1}' file)
  • multi-source assembly (-i/--include, -l/--load, or multiple -f)
  • AST-only flags: --debug, --lint, --lint-old, --pretty-print, --gen-pot

Storage — single rkyv archive, atomic-rename writes, flock(LOCK_EX) on a sibling lockfile so concurrent awkrs processes serialize their writes. Reads are unlocked, mmap'd, and rely on rkyv's check_archived_root byte-validation plus the magic/version header.

gawk Extensions

awkrs implements many gawk extensions for compatibility:

  • BEGINFILE / ENDFILE — run before/after each input file
  • nextfile — skip to next input file
  • @include — include another awk file
  • @namespace — namespace support
  • Typed regex — @/regex/ strongly typed regex constants
  • Indirect function calls — @func_name()
  • Two-way pipes — |& for coprocess communication
  • Network I/O — /inet/tcp/... special files
  • Time functions — systime(), mktime(), strftime()
  • Bit operations — and(), or(), xor(), etc.

Repository & Links