stryke-parquet — Documentation

>_STRYKE-PARQUET

See into parquet without loading it. Parquet file inspector for stryke — schema, footer stats, row-group breakdown, head/tail, recompression. Diagnostic counterpart to stryke-arrow.

Install

# from a release (no rustc on the consumer machine)
s pkg install -g github.com/MenkeTechnologies/stryke-parquet

# from a local checkout
cd ~/projects/stryke-parquet
cargo build --release            # produces target/release/libstryke_parquet.{dylib,so}
s pkg install -g .               # cdylib lands in ~/.stryke/store/parquet@<version>/

# one-liner
make install

The cdylib is dlopened in-process on first use Parquet — no helper-binary fork per call. Any stryke script that declares use Parquet resolves the package automatically.

Quick start: `use Parquet`

Declare use Parquet once; every Parquet::* helper below resolves the cdylib automatically. Footer-only calls (inspect, schema, count, rowgroups, stats, metadata, dtypes, size_report, null_summary, encoding_summary, bloom_filter_summary, sorting_columns_summary, features, schema_diff) never scan rows.

inspect footer (no row scan)	`p to_json Parquet::inspect "events.parquet"`
row count	`p Parquet::count "events.parquet"`
schema	`p to_json Parquet::schema "events.parquet"`
column logical types	`p to_json Parquet::dtypes "events.parquet"`
per-row-group breakdown	`val @rgs = Parquet::rowgroups "events.parquet"`
per-column stats	`Parquet::stats "events.parquet", column => "user_id"`
first 20 rows	`val @first = Parquet::head "events.parquet", n => 20`
last 5 rows	`val @last = Parquet::tail "events.parquet", n => 5`
filter rows (in memory)	`val @hits = Parquet::filter "events.parquet", "score", "gt", value => 90`
one column as a flat list	`val @ids = Parquet::column "events.parquet", "user_id"`
numeric aggregate	`p Parquet::mean "events.parquet", "score"`
group + aggregate	`val @g = Parquet::group_by "events.parquet", "level", func => "sum", agg => "score"`
recompress with zstd	`Parquet::compress "events.parquet", "events.zst.parquet", codec => "zstd"`
validate a file end-to-end	`val $v = Parquet::validate "events.parquet" # { ok, rows, row_groups }`
per-row-group column chunk detail	`val @rg = Parquet::column_chunk_stats "events.parquet"`
arbitrary row window	`val @win = Parquet::sample "events.parquet", offset => 1000, n => 20`
index / bloom-filter presence	`val $f = Parquet::features "events.parquet"`

A fuller runnable walk-through (synthesize via stryke-arrow, then inspect) lives in examples/discover.stk; diagnostics in examples/diagnostics.stk.

Inspection & schema

Footer-only reads — cheap regardless of file size; no row data is decoded.

`Parquet::version`	package version string — `{ version }`
`Parquet::inspect $path`	footer summary — rows, row-group count, columns, dominant compression, writer version
`Parquet::schema $path`	`{ fields, num_fields }`
`Parquet::dtypes $path`	each column's Arrow logical type — `{ num_fields, columns:[{name, dtype, nullable}] }`
`Parquet::count $path`	row count `$n` from footer metadata
`Parquet::rowgroups $path`	`@rgs` — one record per row group
`Parquet::row_group_summary $path`	footer-only sizing rollup — rows/group + compressed bytes/group min/max/mean
`Parquet::stats $path, %opts`	per-column footer stats; `%opts`: `column`
`Parquet::metadata $path`	writer kv metadata + `created_by` + version
`Parquet::schema_diff $base, $other`	`{ equal, added, removed, type_changed, base_only, other_only }`
`Parquet::column_names $path`	`\@names` — schema field names in file order (composite over `schema`)
`Parquet::column_count $path`	number of columns `$n` (composite)
`Parquet::is_empty $path`	`1 \| 0` — `count == 0` (composite)

Reading rows

Materialize rows as stryke hashrefs. head/tail/slice/sample/reverse all accept a columns projection list.

`Parquet::head $path, %opts`	first `n` rows; `%opts`: `n`, `columns`
`Parquet::tail $path, %opts`	last `n` rows; `%opts`: `n`, `columns`
`Parquet::slice $path, %opts`	offset window; `%opts`: `offset`, `length` (to end if omitted), `columns`
`Parquet::sample $path, %opts`	arbitrary window; `%opts`: `offset`, `n`, `columns`
`Parquet::reverse $path, %opts`	every row in reverse file order (newest-first); `%opts`: `columns`
`Parquet::gather $path, \@indices, %opts`	rows by explicit 0-based index list (any order, repeats OK, out-of-range dies); `%opts`: `columns`
`Parquet::random_sample $path, %opts`	n random rows (reservoir, reproducible per `seed`); `%opts`: `n`, `seed`, `columns`; keeps file order
`Parquet::to_json $path, %opts`	all rows as records
`Parquet::stream $path, %opts`	invoke a `callback` per row (no full-result buffering); returns the count processed
`Parquet::column $path, $column`	a single column's cells as a flat list (polars Series)

Row predicates & reshaping (polars-style)

Materialize the file once, then transform the rows. Comparison is numeric for numbers, lexicographic for strings; nulls always sort last. Filter operators: eq ne gt ge lt le (and the symbolic = != > >= < <=), plus is_null / is_not_null.

`Parquet::filter $path, $column, $op, %opts`	rows where `$column OP $value`; `%opts`: `value`, `columns`
`Parquet::where_count $path, $column, $op, %opts`	count-only companion to `filter` (same grammar); `%opts`: `value`
`Parquet::filter_to $path, $dst, $column, $op, %opts`	write only matching rows to a new file (empty result errors); `%opts`: `value`, `codec`
`Parquet::distinct $path, %opts`	unique rows (first occurrence wins); `%opts`: `columns` (key + projection)
`Parquet::sort $path, $column, %opts`	ORDER BY `$column`; `%opts`: `descending => 1`, `columns`; nulls last, stable
`Parquet::top_k $path, $column, $k, %opts`	the `k` rows with the largest `$column` values; `%opts`: `descending => 0` for the smallest (bottom_k)
`Parquet::value_counts $path, $column`	`{ value, count }` per distinct value, sorted by count desc then value asc
`Parquet::group_by $path, $by, %opts`	GROUP BY `$by` — `{ key, count, value }`; `%opts`: `agg`, `func` (count/sum/min/max/mean)

Aggregates

Numeric reductions over a single column — nulls and non-numeric cells are skipped; an all-null column yields undef (never NaN).

`Parquet::sum $path, $column`	numeric sum `$sum`
`Parquet::min $path, $column`	numeric MIN `$min`
`Parquet::max $path, $column`	numeric MAX `$max`
`Parquet::mean $path, $column`	numeric AVG `$mean`
`Parquet::n_unique $path, $column, %opts`	COUNT(DISTINCT col); `%opts`: `include_nulls => 0` to exclude the null group (default 1)
`Parquet::quantile $path, $column, %opts`	linear-interpolated quantile; `%opts`: `q` in [0,1] (default 0.5)
`Parquet::median $path, $column`	sugar for `quantile(q => 0.5)`
`Parquet::describe $path`	per-column `{ count, null_count, min, max, mean, sum }`

Conversion & writing

Round-trip parquet against CSV / NDJSON / in-memory rows, recompress, and rewrite layout. Writers default to the zstd codec; pass codec => "<name>" to override.

`Parquet::to_csv $path, %opts`	parquet → CSV (scalar, or file via `output`); `%opts`: `output`, `columns`
`Parquet::to_ndjson $path, $dst`	parquet → NDJSON file (inverse of `from_json`)
`Parquet::from_csv $src, $dst, %opts`	CSV → parquet; `%opts`: `header`, `delimiter`, `codec`
`Parquet::from_json $src, $dst, %opts`	NDJSON → parquet; `%opts`: `codec`
`Parquet::write \@rows, $dst, %opts`	in-memory rows (hashrefs) → parquet; `%opts`: `codec`
`Parquet::write_partitioned \@rows, $dst, $column, %opts`	Hive `col=val/` dirs; `%opts`: `codec`
`Parquet::compress $src, $dst, %opts`	recompress; `%opts`: `codec`, `row_group`
`Parquet::repartition $src, $dst, %opts`	rewrite with a target max row-group row count; `%opts`: `row_group` (default 65536), `codec`
`Parquet::merge \@srcs, $dst, %opts`	concat same-schema files vertically; `%opts`: `codec`
`Parquet::hstack $path, $other, $dst, %opts`	horizontal append of `$other`'s columns at matching row count; name collision errors
`Parquet::select $path, $dst, \@cols, %opts`	project a column subset (column pruning); unknown column errors
`Parquet::drop $path, $dst, \@cols, %opts`	complement of `select`: keep all but `\@cols`; unknown / drop-all errors
`Parquet::rename $path, $dst, \%map, %opts`	relabel columns `{ old => new }`; preserves types/order/rows
`Parquet::with_row_index $path, $dst, %opts`	write a copy with a leading 0-based index column; `%opts`: `name` (default "index"), `offset` (default 0), `codec`

Diagnostics

validate reads every row group and reports failure as data (it never dies on a corrupt file — check ok). The footer rollups (size_report, null_summary, encoding_summary, bloom_filter_summary, sorting_columns_summary, column_chunk_stats, features) read metadata only.

`Parquet::validate $path`	`{ ok, rows, row_groups }` or `{ ok:false, stage, detail }`
`Parquet::column_chunk_stats $path`	per-row-group per-column chunk detail (compression, encodings, sizes, min/max, null_count)
`Parquet::size_report $path`	file + per-column compressed/uncompressed bytes, ratio, bytes_per_row; columns sorted by size desc
`Parquet::null_summary $path`	`{ num_rows, total_nulls, columns:[{column, null_count, null_fraction}] }`
`Parquet::encoding_summary $path`	footer-only physical-encoding rollup per column
`Parquet::bloom_filter_summary $path`	footer-only bloom-filter presence per column + totals
`Parquet::sorting_columns_summary $path`	footer-only declared sort order per row group
`Parquet::features $path`	`{ has_bloom_filter, has_column_index, has_offset_index, columns:[...] }`

The authoritative export list is [ffi].exports in stryke.toml; the full annotated reference (return shapes, JSON examples) is the README "API reference" section.

Compression codecs

Accepted by compress, repartition, merge, the from_* / write* writers, and the filter_to / select / drop / rename / with_row_index rewrite paths via codec => "<name>". Default is zstd.

`snappy`	parquet-rs default; fast, modest ratio (`snap`)
`zstd`	best ratio per CPU — the writer default (`zstd`)
`gzip`	broad compatibility (`flate2`)
`lz4`	LZ4_RAW frame (`lz4`)
`brotli`	high ratio, slow (`brotli`)
`uncompressed`	(alias `none`) fastest write, biggest file

Tests & dev workflow

# Rust unit tests (compile, no live parquet I/O)
cargo test

# self-contained round-trip suite — generates a fixture via mkdemo in /tmp/
s test t/

# dev shortcuts
make             # release build (cargo build --release)
make test
make install     # build + s pkg install -g .
make clean

The t/ suite generates a fixture parquet with mkdemo, then exercises the diagnostic surface against it — no external services required.

Why a package, not a builtin

Schema + footer / row-group introspection is a different workload from reading parquet into a typed DataFrame. Lightweight tool for ops + debugging.

The stryke side is a thin .stk wrapper that calls parquet__* FFI symbols on the cdylib; the heavy code lives in libstryke_parquet.{dylib,so}, dlopened in-process on first use Parquet. Core stryke is never linked against this package's deps.

FFI layer

Each Parquet::* wrapper builds a JSON args dict and calls a sibling parquet__* symbol resolved out of libstryke_parquet.{dylib,so}. The cdylib is dlopened in-process on first use Parquet (via stryke's pkg::commands::try_load_ffi_for resolver hook). Responses are JSON; errors come back as {"error": "<msg>"} and the wrapper dies with the message. Stateless package — parquet operations are file transforms; no process-level cache.

The exports span inspection (version, inspect, schema, dtypes, count, rowgroups, row_group_summary, stats, metadata, schema_diff), row read (head, tail, slice, sample, reverse, gather, random_sample, to_json, to_csv, to_ndjson, column), row predicates / reshaping (filter, where_count, filter_to, distinct, sort, top_k, value_counts, group_by), aggregates (sum, min, max, mean, n_unique, quantile, describe), conversion / write (from_csv, from_json, write, write_partitioned, compress, repartition, merge, hstack, select, drop, rename, with_row_index), and diagnostics (validate, column_chunk_stats, size_report, null_summary, encoding_summary, bloom_filter_summary, sorting_columns_summary, features). The authoritative list is [ffi].exports in stryke.toml.

Layout

stryke-parquet/
├── Cargo.toml             # cdylib crate manifest (crate-type = ["cdylib"], publish = false)
├── src/
│   └── lib.rs            # cdylib — parquet__* extern "C" exports
├── lib/                   # stryke .stk wrapper — `use Parquet`
├── stryke.toml            # stryke package manifest ([ffi] exports table)
├── t/                     # zunit-style tests
├── examples/              # runnable .stk examples
├── Makefile               # `make install` builds + installs
└── docs/                  # this site (GitHub Pages)

Sibling packages

Part of the stryke package family. Browse the others via the MenkeTechnologiesMeta umbrella repo:

stryke-arrow — Apache Arrow / Parquet / Feather
stryke-aws — S3, DynamoDB, SQS, Lambda, STS
stryke-azure — Blob, Queues, Cosmos DB, Key Vault, Entra
stryke-clickhouse — ClickHouse
stryke-demo — live demos for every connector
stryke-docker — Docker daemon API
stryke-duckdb — embedded DuckDB SQL engine
stryke-email — email + campaign client
stryke-fleet — parallel expect/PTY automation
stryke-gcp — Cloud Storage, Pub/Sub, Secret Manager, BigQuery, Firestore
stryke-grpc — reflection-based gRPC client
stryke-gui — mouse / keyboard / screen / clipboard automation
stryke-k8s — Kubernetes
stryke-kafka — Apache Kafka
stryke-mcpd — MCP servers without a runtime
stryke-mongo — MongoDB
stryke-mssql — Microsoft SQL Server
stryke-mysql — MySQL / MariaDB
stryke-neo4j — Neo4j graph (Bolt)
stryke-office — Office docs / PDF / images
stryke-polars — Polars + ndarray + linalg + FFT
stryke-postgres — PostgreSQL
stryke-redis — Redis / Valkey
stryke-scrape — web scraping / crawling
stryke-scylla — ScyllaDB / Cassandra
stryke-search — Elasticsearch / OpenSearch
stryke-selenium — browser automation (WebDriver)
stryke-spark — Spark Connect (no JVM)
stryke-utils — pure-stryke utility belt
stryke-zmq — ZeroMQ messaging