awkrs — Documentation

>_AWKRS REFERENCE

A fast AWK implementation written in Rust. Bytecode VM with optional Cranelift JIT, parallel record processing with rayon, and broad CLI compatibility with gawk, mawk, and nawk. Drop-in replacement for text processing pipelines.

Quickstart

Install from crates.io or build from source, then use aw (short) or awkrs:

# Homebrew tap
brew tap MenkeTechnologies/menketech
brew install awkrs

# crates.io
cargo install awkrs

# from source
git clone https://github.com/MenkeTechnologies/awkrs
cd awkrs && cargo build --release

# zsh completion
fpath=(/path/to/awkrs/completions $fpath)
autoload -Uz compinit && compinit

# one-liners
aw 'BEGIN { print "hello, world" }'
aw -F: '{ print $1 }' /etc/passwd
aw '{ sum += $1 } END { print sum }' numbers.txt
echo "1 2 3" | aw '{ print $1 + $2 + $3 }'

# field processing
ls -l | aw 'NR > 1 { total += $5 } END { print total }'

# pattern matching
aw '/error/i { print FILENAME ":" NR ":" $0 }' *.log

Full install + usage live in the README.

Why awkrs — Feature Comparison

Feature	awkrs	gawk	mawk	nawk
Parallel records	✓	✗	✗	✗
JIT compilation	Cranelift	✗	✗	✗
Bytecode VM	✓	✓	✓	✗
Persistent bytecode cache	✓	✗	✗	✗
Unicode support	✓	✓	partial	✗
CSV mode	✓	✓	✗	✗
Regex backrefs	✓	✓	✗	✗
Time functions	✓	✓	✗	✗
I18N (gettext)	✓	✓	✗	✗
Network I/O	✓	✓	✗	✗
Single binary	~8MB	pkg	~200KB	pkg
Memory safety	Rust	C	C	C

Overview

Parser & compiler — recursive-descent parser producing an AST, compiled to bytecode for the VM. Hot paths can be JIT-compiled via Cranelift.
Values — AWK values (string/number/uninitialized) with automatic coercion. Arrays are associative (hash maps).
Regex — three-tier engine: Rust regex → fancy-regex (backrefs) → pcre2 (advanced).
fusevm offload — eligible numeric bytecode chunks lower to the shared fusevm VM (also used by zshrs and stryke) and JIT-compile via Cranelift; AWKRS_FUSEVM=0 forces the bytecode interpreter for every chunk.
Parallelism — -j N / --threads N (default 1) processes records in parallel via rayon when the program is parallel-safe; output is reordered to input order.
Bytecode cache — -f script.awk runs memoize the compiled program to ~/.awkrs/scripts.rkyv; repeat invocations skip lex/parse/compile.
Binary size — single binary built with LTO; see the comparison table below.

Built-in Variables

Variable	Description
`$0`	Current input record (entire line)
`$1, $2, ...`	Fields of the current record
`NF`	Number of fields in current record
`NR`	Total number of records read so far
`FNR`	Record number in current file
`FILENAME`	Name of current input file
`FS`	Input field separator (default: space)
`RS`	Input record separator (default: newline)
`OFS`	Output field separator
`ORS`	Output record separator
`OFMT`	Output format for numbers
`CONVFMT`	Conversion format for numbers
`SUBSEP`	Subscript separator for arrays
`RSTART`	Start of match from `match()`
`RLENGTH`	Length of match from `match()`
`ARGC, ARGV`	Argument count and array; set before `BEGIN`. `ARGV[0]` is the executable, `ARGV[1..]` are file paths
`ENVIRON`	Environment variables array
`RT`	Text matched by `RS` for the current record (gawk)
`FPAT`	Field-pattern regex: non-empty matches define fields (gawk)
`FIELDWIDTHS`	Fixed-width field layout (gawk)
`IGNORECASE`	Case-insensitive regex and comparisons (gawk)
`ARGIND`	Index in `ARGV` of the current file (gawk)
`ERRNO`	System error string after a failed I/O / `getline` (gawk)
`PROCINFO`	Process / runtime info array (gawk)
`SYMTAB`	Array view of global variables (gawk)
`FUNCTAB`	Array of defined function names (gawk)
`BINMODE`	Binary I/O mode (gawk)
`LINT`	Toggle lint diagnostics at runtime (gawk)
`TEXTDOMAIN`	gettext text domain for translatable strings (gawk)

All special variable names are pinned in src/namespace.rs (SPECIAL_GLOBAL_NAMES, 28 entries) and stay in the global namespace under @namespace.

Built-in Functions

The 61 builtin names are pinned in src/namespace.rs (BUILTIN_NAMES) and are exempt from @namespace prefixing.

String

length substr index split sprintf tolower toupper

Regex / substitution

gsub sub match gensub patsplit — gensub expands \N capture-group backrefs

Math

sin cos atan2 exp log sqrt int rand srand intdiv intdiv0

I/O

print printf getline close fflush system

Time (gawk)

systime mktime strftime gettimeofday getlocaltime

Bit ops (gawk)

and or xor compl lshift rshift — and/or/xor are variadic (≥2 args)

Type / convert (gawk)

typeof isarray strtonum mkbool ord chr

Array (gawk)

asort asorti — plus the delete statement

Filesystem (gawk ext)

stat statvfs readdir readfile fts chdir rename sleep fflush

Persistence (gawk ext)

reada writea inplace_tmpfile inplace_commit

I18N (gawk)

bindtextdomain dcgettext dcngettext — .mo catalogs

Two-way I/O (gawk)

revoutput revtwoway

Patterns, Rules & Control Flow

Rules — BEGIN, END, BEGINFILE/ENDFILE, empty pattern, /regex/, expression patterns, range patterns (/a/,/b/ or NR==1,NR==5). Like gawk, the four special patterns must use { … }; record rules may omit braces for the default { print $0 }.
Statements — if/while/do…while/for (C-style and for (i in arr)), switch/case/default (gawk-style: no fall-through, case /re/), print/printf (with >, >>, |, |& redirection), break, continue, next, nextfile, exit, delete, return.
getline forms — primary, getline < file, cmd | getline [var], coprocess cmd |& getline. As an expression it returns 1 (read), 0 (EOF), -1 (error), or -2 (gawk retryable I/O when PROCINFO[input,"RETRY"] is set).
Operators — arithmetic, comparison, string concat, ternary, in, ~/!~, ++/-- (prefix/postfix on vars, $n, a[k]), ^/** (right-associative). Division by zero (/, /=) is a fatal error (gawk-style division by zero attempted), not infinity.

CSV Mode

-k / --csv enables CSV parsing aligned with gawk --csv: RFC-style quoting with the "" escape, comma field separator, and a dedicated parser (not a plain FS=","). Leading commas count as empty fields (,,, → NF=4).

# count rows by a quoted CSV column
aw --csv '{ c[$2]++ } END { for (k in c) print k, c[k] }' data.csv

# re-emit selected columns
aw --csv 'BEGIN { OFS="," } { print $1, $3 }' in.csv

Examples

Field extraction

aw -F: '{ print $1, $3 }' /etc/passwd # username and UID

aw '{ print $NF }' file.txt # last field of each line

Aggregation

aw '{ sum += $1 } END { print sum }' numbers.txt

aw '{ count[$1]++ } END { for (k in count) print k, count[k] }' data.txt

Pattern matching

aw '/^#/ { next } { print }' config.txt # skip comments

aw 'NR == 1 || /error/' log.txt # header + error lines

Text transformation

aw '{ gsub(/foo/, "bar"); print }' file.txt

aw 'BEGIN { OFS="," } { $1=$1; print }' file.txt # to CSV

Multi-file processing

aw 'FNR == 1 { print "--- " FILENAME " ---" } { print }' *.txt

CLI Flags

Flags mirror POSIX awk, GNU gawk, and mawk-style -W. Each short flag has the long form shown. Some gawk flags are accepted for script compatibility and only affect diagnostics; those are noted.

# POSIX
-f, --file PROGFILE        # read program from file (repeatable)
-F, --field-separator FS   # set the input field separator
-v, --assign var=val       # set a variable before execution (repeatable)

# GNU: program sources
-e, --source PROGRAM       # program text (repeatable; combine with -f)
-i, --include FILE         # include an awk library file (AWKPATH; repeatable)
-l, --load LIB             # load extension library by name (repeatable)

# gawk extensions
-b, --characters-as-bytes  # treat characters as bytes (no UTF-8)
-c, --traditional          # traditional (no gawk extensions) compatibility
-C, --copyright            # print copyright / license
-d, --dump-variables[FILE] # dump variable state after the run (stdout / FILE)
-D, --debug[FILE]          # rule / function listing to stderr or FILE
-E, --exec FILE            # run program FILE (gawk -E)
-g, --gen-pot              # extract gettext strings to a .pot template
-I, --trace                # trace mode
-k, --csv                  # CSV mode: comma FS + FPAT quoting ("" escape)
-L, --lint[fatal|invalid|no-ext]  # enable lint diagnostics
-M, --bignum               # arbitrary-precision math via MPFR (rug)
-N, --use-lc-numeric       # apply LC_NUMERIC to printf/print and %' grouping
-n, --non-decimal-data     # recognize 0x / octal in input data
-o, --pretty-print[FILE]   # awk-like AST listing (awkrs format)
-O, --optimize             # accepted; JIT is on unless -s is set
-p, --profile[FILE]        # wall-clock + per-rule summary (awkrs format)
-P, --posix                # POSIX mode (accepted; reserved on Runtime)
-r, --re-interval          # accepted no-op ({m,n} intervals always enabled)
-s, --no-optimize          # disable peephole/JIT optimization
-S, --sandbox              # sandbox mode (no file/command I/O)
-t, --lint-old             # warn on constructs not in old awk

# mawk / BusyBox
-W OPT                     # mawk-style options (help, version, dump, exec=…, …)

# awkrs-specific
-j, --threads N            # worker threads for parallel records (default 1)
    --read-ahead N         # stdin lines per parallel batch (default 1024)
-h, --help                 # cyberpunk HUD help
-V, --version              # print version

Parallel Processing

The default thread count is 1. Pass -j N / --threads N to process records in parallel via rayon. Workers run the same bytecode VM; the compiled program is shared as Arc<CompiledProgram> with per-worker runtime state. Output is reordered to input order within each batch so pipelines stay deterministic.

# process a large file with 8 workers
aw -j 8 '{ s += $1 } END { print s }' data.txt

# parallel field extraction
aw -j 4 -F: '{ print $1 }' big.csv

# tune the stdin batch size (lines per parallel batch)
seq 1 1000000 | aw -j 8 --read-ahead 4096 '{ s += $1 } END { print s }'

Parallel mode engages only when the program is parallel-safe by static check: no range patterns, no exit/nextfile/delete, no primary getline, no pipe/coproc getline, no asort/asorti, no indirect calls, no print/printf redirection, no cross-record assignments. Non-parallel-safe programs (and programs using primary getline on file input) run sequentially, with a warning when -j > 1.

Regular files are memory-mapped (memmap2) and scanned with the same RS rules as the sequential path — no full-file read() copy. Stdin chunks up to --read-ahead lines (default 1024) per batch, dispatches to workers, emits in order, then refills. END only sees post-BEGIN global state; record-rule mutations from parallel workers are not merged back.

Bytecode Cache

Based on a survey of the major public awk implementations (BWK awk, gawk, mawk, goawk, frawk, zawk), awkrs appears to be the first awk implementation to pair a bytecode VM with a persistent on-disk bytecode cache. frawk is the closest prior art on JIT — it has VM + Cranelift/LLVM JIT — but re-compiles on every invocation. gawk's pm-gawk persists script variables, not compiled bytecode. (awkrs's own fusevm/Cranelift JIT offload additionally keeps a separate machine-code cache at ~/.cache/fusevm-jit.)

Implementation	Bytecode VM	JIT	Persistent bytecode cache
BWK awk (one-true-awk)	✗	✗	✗
gawk	✓	✗	✗
mawk	✓	✗	✗
goawk	✓	✗	✗
frawk	✓	✓ (Cranelift + LLVM)	✗
zawk (frawk fork)	✓	✓ (Cranelift + LLVM)	✗
awkrs	✓	✓ (fusevm/Cranelift)	✓

Invocations of the form awkrs -f script.awk ... memoize the compiled CompiledProgram to an rkyv-archived shard at ~/.awkrs/scripts.rkyv. The cache hit path is mmap + zero-copy ArchivedHashMap lookup + bincode-decode of the matched entry's inner blob. Same outer-rkyv / inner-bincode pattern used by zshrs and stryke.

# 1st run: parse + compile + populate ~/.awkrs/scripts.rkyv
awkrs -f script.awk input.txt

# 2nd run: cache hit, skips parse/compile
awkrs -f script.awk input.txt

# disable the cache for one run
AWKRS_CACHE=0 awkrs -f script.awk input.txt

# wipe the cache
rm ~/.awkrs/scripts.rkyv

Invalidation is automatic and silent:

Source mtime change — editing script.awk causes the next run to miss and recompile.
Binary mtime newer than entry — rebuilding awkrs (any cargo install / cargo build) invalidates every entry so old bytecode never runs against new code.
Schema / version drift — package version, format-version byte, and host pointer width are all validated; a fresh shard is written on mismatch.

Engagement criteria — the cache only kicks in for the simple -f script.awk form. The following skip the cache because they need the AST or are short-lived modes:

inline programs (-e/--source, -E, or bare-arg form like awkrs '{print $1}' file)
multi-source assembly (-i/--include, -l/--load, or multiple -f)
AST-only flags: --debug, --lint, --lint-old, --pretty-print, --gen-pot

Storage — single rkyv archive, atomic-rename writes, flock(LOCK_EX) on a sibling lockfile so concurrent awkrs processes serialize their writes. Reads are unlocked, mmap'd, and rely on rkyv's check_archived_root byte-validation plus the magic/version header.

Invention — First AWK with a Language Server + Debug Adapter

Based on a survey of the major public awk implementations (BWK awk, gawk, mawk, goawk, frawk, zawk), awkrs appears to be the first awk to ship a Language Server (LSP) and a Debug Adapter (DAP) as first-class subsystems of the awk binary itself. No other awk ships either: gawk has only the unrelated --debug command-line debugger (no DAP, no editor protocol), and there is no awk language server in any of the others.

awkrs --lsp — starts a Language Server on stdio: completion (keywords, built-in variables, built-in functions), hover, diagnostics from the parser/linter, document symbols, and formatting — consumed by any LSP editor.
awkrs --dap — speaks the Debug Adapter Protocol on stdio: breakpoints, step over/into/out, call stack, scopes, variables, and watch/evaluate, so any DAP client can debug an AWK program.
Each is invoked with only its flag (no appended --stdio); the wiring is guarded so editors degrade cleanly if the binary is absent.

Three first-party editor plugins drive them — vscode-awk (LSP client + DAP debug config), vim-awk (vim-lsp / coc + nvim-dap), and emacs-awk (eglot + lsp-mode) — so AWK gets the modern IDE surface (live diagnostics, completion, breakpoint debugging) that no other awk has had.

gawk Extensions

awkrs implements many gawk extensions for compatibility:

BEGINFILE / ENDFILE — run before/after each input file
nextfile — skip to next input file
@include — include another awk file
@namespace — namespace support
Typed regex — @/regex/ strongly typed regex constants
Indirect function calls — @func_name()
Two-way pipes — |& for coprocess communication
Network I/O — /inet/tcp/... special files
Time functions — systime(), mktime(), strftime()
Bit operations — and(), or(), xor(), etc.

Environment Variables

Variable	Effect
`AWKRS_CACHE=0\|false\|no`	Disable the `~/.awkrs/scripts.rkyv` bytecode cache for the run
`AWKRS_FUSEVM=0`	Force the bytecode interpreter for every chunk (disable fusevm offload)
`AWKRS_JIT=0`	Disable JIT/peephole optimization (same effect as `-s`/`--no-optimize`)
`FUSEVM_JIT_CACHE_DIR`	Override the fusevm machine-code cache dir (`~/.cache/fusevm-jit`); `off` disables it
`FUSEVM_JIT_CACHE_MAX_BYTES`	Cap the fusevm machine-code cache size
`NO_COLOR`	Force plain-text help (no ANSI colors)
`LC_NUMERIC, LC_COLLATE, LC_ALL`	Locale for `-N` number formatting / `%'` grouping and `strcoll` ordering

Troubleshooting

Output reordered or program runs single-threaded under -j — the program isn't parallel-safe (range pattern, exit, primary getline, redirection, etc.). awkrs falls back to sequential execution and warns; remove the offending construct or accept sequential mode.
Stale results after editing a script — the cache invalidates on source mtime change and on a newer awkrs binary, but you can force a clean run with AWKRS_CACHE=0 awkrs -f script.awk … or wipe ~/.awkrs/scripts.rkyv.
division by zero attempted — division and modulo by zero are fatal (gawk semantics), not inf/NaN. Guard the divisor.
Whole-array used in scalar context fatals — print a (where a is an array) is a fatal runtime error, matching gawk; index the array instead.
Build needs a C compiler — -M/--bignum pulls in gmp-mpfr-sys via rug, which needs a C compiler and make.

Repository & Links

Engineering report — report.html (architecture, perf stack, test coverage, competitive landscape)
Source — github.com/MenkeTechnologies/awkrs
Crate — crates.io/crates/awkrs (cargo install awkrs)
Rust API docs — docs.rs/awkrs
Issues — github.com/MenkeTechnologies/awkrs/issues
Parity tests — parity/ holds 2,116 cases run byte-for-byte against gawk, mawk, and BSD awk on every push (2,054 POSIX-portable across cases/ + cases_portable/, 62 gawk-only in cases_gawk/ for typeof, FPAT, gensub, **=, strftime, bit ops, arity messages).