stryke-spark — Documentation


Distributed compute from a stryke one-liner. Apache Spark client for stryke. Opt-in package, kept out of the stryke core binary so the daily-driver install stays slim. Ships as a Rust cdylib that stryke dlopens in-process on the first `use Spark`; the cdylib drives Spark through spark-submit with an embedded PySpark driver (src/driver.py, compiled in via include_str!).

Install

# from a release (no rustc on the consumer machine)
s pkg install -g github.com/MenkeTechnologies/stryke-spark

# from a local checkout
cd ~/projects/stryke-spark
cargo build --release            # produces target/release/libstryke_spark.{dylib,so}
s pkg install -g .               # cdylib lands in ~/.stryke/store/spark@<version>/

# one-liner
make install

The cdylib is dlopened in-process on the first `use Spark`, so there is no helper-binary fork per call. You also need spark-submit reachable (e.g. brew install apache-spark, your distro's package, or a tarball with $SPARK_HOME set). Any stryke script that declares `use Spark` resolves the package automatically.

Honest scope note. Each call still pays SparkSession init cost (seconds, dominated by JVM warmup). A long-running JVM driver daemon that persists SparkSession across calls is deferred — it needs a sidecar process design larger than the helper-binary → cdylib refactor. What the cdylib model eliminates is the helper-binary fork+exec overhead on top of spark-submit.

JDK compatibility

Spark 4.x officially supports JDK 17 — JDK 21+ trips a getSubject is not supported error in the Hive catalog code path even under local[*]. The cdylib defaults to --conf spark.sql.catalogImplementation=in-memory to dodge Hive, but a JDK 17 environment is still the smoothest. Set JAVA_HOME before running:

export JAVA_HOME=/path/to/jdk-17     # e.g. corretto-17, temurin-17

Quick start: `use Spark`

plain query (local[*])	`my @rows = Spark::query "SELECT id, id * 2 AS doubled FROM range(5)"`
remote cluster	`Spark::query "SELECT * FROM events", master => "spark://cluster:7077"`
scalar shortcut	`p Spark::query_scalar "SELECT COUNT(*) FROM range(1000000)"`
DDL	`Spark::execute "CREATE TABLE IF NOT EXISTS logs (ts TIMESTAMP, msg STRING)"`
schema	`p to_json Spark::schema "logs"`
table / database listings	`p Spark::tables |> ep; p Spark::databases |> ep`
spark-submit pass-through	`Spark::submit "jobs/etl_pipeline.py", args => ["--date", "2026-01-01"]`

Each Spark call spins up a fresh JVM (~5–10s warmup). For multi-statement work, prefer one SQL with CTEs / subqueries over many separate calls. The complete API also lives in the README "API reference" section.

API reference

Every public Spark::* function in lib/Spark.stk, grouped by role. The read paths, DDL/DML, external read/write, metadata, catalog introspection, temp views / session, caching + config, and submit groups call the JVM through spark-submit; the pure helpers group runs entirely in-process (no Spark, no JVM) and is safe to use anywhere.

Read paths

Spark::query        $sql, %opts → @rows                     # SELECT → rows as hashrefs
Spark::query_stream $sql, %opts → $count                   # callback => sub fires per row, returns count
Spark::query_one    $sql, %opts → \%row | undef            # SELECT … LIMIT 1
Spark::query_col    $sql, %opts → @values                  # first column as a flat list
Spark::query_scalar $sql, %opts → $value | undef          # first cell of LIMIT 1
Spark::dump         $table, %opts → @rows                  # SELECT * FROM $table (columns/where/order_by/limit opts)
Spark::count        $table, $where?, %opts → $row_count    # SELECT count(*) [WHERE $where]

DDL / DML

Spark::execute $sql, %opts → { ok: true }                  # CREATE / INSERT / DROP / MERGE … (one ack on success)
Spark::explain $sql, %opts → $plan_text                    # opts: mode (simple|extended|codegen|cost|formatted)

External read / write

Spark::read  $path, %opts → @rows                          # opts: format, options, view, sql, limit
Spark::write $sql,  %opts → { ok, … }                      # opts: path|table, format, mode, options

read loads a parquet/csv/json/orc source (default parquet); pass view => "v", sql => "SELECT … FROM v" to query it in the same call. write runs $sql and saves the result to a path or table, with mode ∈ overwrite|append|ignore|errorifexists.

Metadata

Spark::ping            %opts → 1 | 0                        # connectivity probe
Spark::tables          %opts → @rows                       # catalog table rows
Spark::databases       %opts → @rows
Spark::schema          $table, %opts → @rows               # DESCRIBE TABLE column rows
Spark::columns         $table, %opts → @rows               # name/type/nullable/is_partition/is_bucket
Spark::functions       %opts → @rows                       # catalog functions
Spark::version()       → $version                          # package version string

Catalog introspection + session

Spark::views            %opts → @rows                      # { name, database, is_temp, type }
Spark::catalogs         %opts → @rows                      # { name, description }
Spark::current_database %opts → $name
Spark::create_temp_view $name, $sql, %opts → \%resp        # createOrReplaceTempView
Spark::drop_temp_view   $name, %opts → \%resp             # { ok, dropped }
Spark::set_database     $database, %opts → \%resp          # catalog.setCurrentDatabase
Spark::refresh_table    $table, %opts → \%resp            # catalog.refreshTable

Caching + runtime config

Spark::cache   $table, %opts → { ok, cached }
Spark::uncache $table, %opts → { ok, uncached }
Spark::config  $key, %opts → $value | { ok }              # set with value => …

Submit pass-through

Spark::submit $script_path, args => [...], %opts → { exit_code, output }

Runs a .py / .jar workload through spark-submit — the escape hatch for jobs outside the SQL surface.

Pure helpers (no Spark, no JVM)

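Parsers and builders for the Spark string formats that show up in configs, URLs, identifiers, and DDL. Most parse_* functions have an inverse (build_*, or quote_* for qualified identifiers) that round-trips the parsed form; the listing shows no inverse for parse_maven_coordinate or parse_app_id.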

# master URLs
Spark::parse_master_url($url)   → { scheme, threads?, hosts?, master? }   # local[N], spark://… HA, k8s://…, yarn
Spark::build_master_url(%opts)  → $url                                   # inverse

# table identifiers
Spark::parse_table_name($name)  → { catalog, database, table, parts }     # backtick-aware catalog.db.table
Spark::build_table_name(%opts)  → $name                                   # inverse
Spark::quote_qualified_ident($name) → $quoted                            # cat.db.my table → `cat`.`db`.`my table`
Spark::parse_qualified_ident($name) → { name, parts, count }             # inverse; dot inside backticks stays literal
Spark::quote_ident($name)       → $quoted                                # always wraps; embedded backticks doubled
Spark::unquote_ident($quoted)   → $name                                  # inverse of quote_ident
Spark::quote_ident_if_needed($name) → $quoted                            # quoteIfNeeded: users → users, 1col → `1col`

# string literals + LIKE
Spark::quote_literal($value)    → $quoted                                # it's → 'it\'s' (Spark string-literal escapes)
Spark::unquote_literal($quoted) → $value                                 # inverse, single left-to-right escape pass
Spark::escape_like($value)      → $escaped                               # 100% → 100\%, a_b → a\_b, \ → \\
Spark::unescape_like($pattern)  → $literal                               # inverse; rejects an unescaped wildcard

# sizes + durations
Spark::parse_memory($memory)    → { value, suffix, bytes, mib }          # 512m/2g/1kb → bytes (binary: 1kb=1024)
Spark::build_memory($bytes)     → { value, suffix, string, bytes }       # inverse; largest binary unit dividing evenly
Spark::convert_memory($memory, $to) → { string, value, suffix, bytes }   # re-express in unit $to (b/k/m/g/t/p)
Spark::parse_duration($duration) → { value, suffix, ms, unit }           # 30s/5min/100ms/1h/2d → ms
Spark::build_duration($ms)      → { value, suffix, string, ms }          # inverse; largest unit dividing evenly

# partitions
Spark::build_partition_spec(\@pairs) → $clause                           # [col,val] pairs → PARTITION (col=val, …) DDL
Spark::parse_partition_spec($spec)   → { partitions:[{column,value}] }   # inverse; quote-aware split
Spark::parse_partition_path($path)   → { partitions:[{column,value}] }   # …/year=2024/month=01/ → segments
Spark::build_partition_path(\@pairs) → $path                            # inverse

# confs
Spark::parse_conf($conf)        → { key, value }                         # split "key=value" on the first "="
Spark::build_conf(%opts)        → $line                                  # inverse
Spark::merge_confs(\@confs)     → { map, confs }                         # last-wins dedup, first-seen key order

# data types
Spark::parse_data_type($type)   → { base, args, params }                 # decimal(10,2) → ["10","2"]; depth-aware
Spark::build_data_type(%opts)   → $type                                  # inverse; (…) scalars, <…> generics

# storage levels
Spark::parse_storage_level($name) → { name, use_disk, use_memory, use_off_heap, deserialized, replication }
Spark::build_storage_level(%opts) → { name, … }                          # inverse of the flag tuple

# packages, JDBC, app IDs, SQL splitting
Spark::parse_maven_coordinate($coord) → { group, artifact, version, coordinate }   # one g:a:v coord
Spark::split_packages($packages) → { packages:[{group,artifact,version,coordinate}], count }
Spark::parse_jdbc_url($url)      → { subprotocol, subname, host, port, database, params }
Spark::build_jdbc_url(%opts)     → $url                                  # inverse
Spark::parse_app_id($id)        → { kind, id, … }                        # standalone / yarn / local app IDs
Spark::split_sql_statements($sql) → { statements:[…], count }            # split on top-level ";" (quote/comment aware)
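
To make the behaviour described in the comments above concrete, here is a minimal Python sketch of three of the pure helpers. It is not the package's implementation (that lives in lib/Spark.stk and the cdylib); the function names simply mirror the listing, and the details are inferred from the inline notes: binary units with 1kb=1024, backslash-escaping of LIKE wildcards, and last-wins merging in first-seen key order. Rejecting a memory string with no unit suffix is an assumption of this sketch.

```python
import re

# Binary multipliers, following the "1kb=1024" note in the listing.
_UNIT_BYTES = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3,
               "t": 1024**4, "p": 1024**5}


def parse_memory(text):
    """Parse '512m', '2g' or '1kb' into a dict with value, suffix, bytes, mib."""
    match = re.fullmatch(r"(\d+)(b|[kmgtp]b?)", text.strip().lower())
    if match is None:
        # Assumption: a bare number is rejected rather than given a default unit.
        raise ValueError(f"unrecognised memory string: {text!r}")
    value, suffix = int(match.group(1)), match.group(2)
    size = value * _UNIT_BYTES[suffix[0]]
    return {"value": value, "suffix": suffix, "bytes": size, "mib": size / 1024**2}


def escape_like(value):
    """Backslash-escape the LIKE wildcards % and _ and the backslash itself."""
    out = []
    for ch in value:
        if ch in "\\%_":
            out.append("\\")
        out.append(ch)
    return "".join(out)


def merge_confs(confs):
    """Merge 'key=value' lines: later values win, keys keep first-seen order."""
    merged = {}
    for line in confs:
        key, sep, value = line.partition("=")  # split on the first "=" only
        if not sep or not key:
            raise ValueError(f"not a key=value conf: {line!r}")
        merged[key] = value  # reassigning keeps the key's original position
    return {"map": merged, "confs": [f"{k}={v}" for k, v in merged.items()]}
```

With these definitions, `merge_confs(["a=1", "b=2", "a=3"])["confs"]` evaluates to `["a=3", "b=2"]`: the later value for `a` wins, but `a` stays ahead of `b` because it was seen first.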

Connection options (`%opts`)

Every Spark-backed function takes a trailing %opts splat. These connection keys are merged into the FFI request by Spark::_req and forwarded to the driver:

master	Spark master URL, e.g. `local[*]` (default), `spark://cluster:7077`
spark_home	override `$SPARK_HOME` for this call
spark_submit	path to the `spark-submit` binary
app_name	Spark application name
deploy_mode	`client` / `cluster`
packages	Maven coordinates for `--packages`
jars	extra JARs for `--jars`
database	database to use for the session
confs	hashref of `spark.*` config → `--conf`
limit	row cap on read paths
callback	`query_stream` only — `sub { … }` fired per row
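
As a rough illustration of how these keys could translate into a spark-submit command line, the Python sketch below assembles an argv list. The flags themselves (--master, --deploy-mode, --name, --packages, --jars, --conf) are standard spark-submit options, but the function name `build_submit_argv`, the flag order, and the precedence of spark_submit over spark_home are assumptions of this sketch, not the cdylib's actual logic. The in-memory catalog default comes from the JDK compatibility section.

```python
import os


def _csv(value):
    """Accept either a comma-separated string or a list of entries."""
    return value if isinstance(value, str) else ",".join(value)


def build_submit_argv(script, opts=None):
    """Hypothetical mapping from connection options to a spark-submit argv."""
    opts = dict(opts or {})

    # Assumption: an explicit binary path wins over spark_home, then PATH lookup.
    if "spark_submit" in opts:
        binary = opts["spark_submit"]
    elif "spark_home" in opts:
        binary = os.path.join(opts["spark_home"], "bin", "spark-submit")
    else:
        binary = "spark-submit"

    argv = [binary, "--master", opts.get("master", "local[*]")]
    if "deploy_mode" in opts:
        argv += ["--deploy-mode", opts["deploy_mode"]]
    if "app_name" in opts:
        argv += ["--name", opts["app_name"]]
    if "packages" in opts:
        argv += ["--packages", _csv(opts["packages"])]
    if "jars" in opts:
        argv += ["--jars", _csv(opts["jars"])]

    # Default documented above: avoid the Hive catalog unless the caller overrides it.
    confs = {"spark.sql.catalogImplementation": "in-memory"}
    confs.update(opts.get("confs", {}))
    for key, value in confs.items():
        argv += ["--conf", f"{key}={value}"]

    argv.append(script)
    return argv
```

Building a list rather than a shell string keeps values such as `local[*]` away from shell globbing and quoting.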

Type encoding

Spark df.toJSON() does the heavy lifting; types map to JSON as Spark's JSON serializer dictates.

boolean	`bool`
byte, short, int, long	`number`
float, double	`number`
decimal(p,s)	`number` (precision permitting)
string, varchar, char	`string`
binary	base64 `string`
date	`"yyyy-MM-dd"`
timestamp	`"yyyy-MM-dd HH:mm:ss"`
array<T>	JSON array
struct<…>	JSON object
map<K,V>	JSON object
NULL	`null`

The columnar path also coerces Python date/datetime/Decimal to strings if Spark's serializer leaves them as native Python objects.
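
The table implies that a consumer which needs native types has to decode some string fields itself. The Python sketch below does that for one hand-written JSON row; the row and its column names are illustrative, not captured driver output. Passing `parse_float=Decimal` is one way to keep decimal values exact where a plain float would round them.

```python
import base64
import json
from datetime import date, datetime
from decimal import Decimal

# Illustrative row only: column names and values are made up for this sketch.
line = (
    '{"id": 7, "ok": true, "price": 12.50, "payload": "aGVsbG8=", '
    '"day": "2026-01-01", "seen": "2026-01-01 09:30:00", '
    '"tags": ["a", "b"], "note": null}'
)

# Parse JSON numbers with a fractional part as Decimal instead of float.
row = json.loads(line, parse_float=Decimal)

payload = base64.b64decode(row["payload"])                   # binary column
day = date.fromisoformat(row["day"])                         # date column
seen = datetime.strptime(row["seen"], "%Y-%m-%d %H:%M:%S")   # timestamp column
```

Arrays, structs, and maps need no extra step: they arrive as JSON arrays and objects, and NULL arrives as `null` (`None` in Python).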

Bind parameters

Spark SQL doesn't accept positional binds the way Postgres / MySQL do (the 3.5+ args= keyword on SparkSession.sql is gated on Connect for some deployments). For v1, inline values into the SQL string — use Spark::quote_literal for strings, and Spark SQL literal forms for the rest (numeric, DATE '2026-01-01', etc.). Bind support via the cdylib's request JSON can be added once a clean cross-version path exists.
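
A minimal Python sketch of the inlining approach, assuming the escaping shown for Spark::quote_literal in the API reference (it's becomes 'it\'s') and Spark's default parser setting, under which backslash escapes are honoured inside string literals. The helper `sql_literal` is hypothetical and exists only for this illustration; it is not part of the package.

```python
import math
from datetime import date, datetime


def quote_literal(value):
    """Wrap a string in single quotes, backslash-escaping \\ and '."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def sql_literal(value):
    """Hypothetical helper: render a Python value as a Spark SQL literal."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check before int: bool is a subclass of int
        return "TRUE" if value else "FALSE"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        if not math.isfinite(value):
            raise ValueError("NaN and infinity are not handled in this sketch")
        return repr(value)
    if isinstance(value, datetime):  # check before date: datetime subclasses date
        return f"TIMESTAMP '{value.strftime('%Y-%m-%d %H:%M:%S')}'"
    if isinstance(value, date):
        return f"DATE '{value.isoformat()}'"
    if isinstance(value, str):
        return quote_literal(value)
    raise TypeError(f"no literal form for {type(value).__name__}")


sql = (
    "SELECT * FROM logs "
    f"WHERE msg = {sql_literal('it' + chr(39) + 's')} "
    f"AND ts >= {sql_literal(date(2026, 1, 1))}"
)
```

Routing every value through a single function keeps quoting in one place, which matters because inlined values are the only barrier against SQL injection until bind support exists.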

Examples

Runnable scripts under examples/ — each is read-only / eval-guarded so it exits cleanly when no Spark driver is reachable.

discover.stk	minimum-viable tour: version, ping, `query_scalar`, `query_col`
quick_query.stk	a CTE query with `ORDER BY`, iterating returned rows
range_stats.stk	`query_one` aggregate (COUNT/SUM/AVG/MIN/MAX) over `range(N)`
sql_explain.stk	`EXPLAIN` a query, then run it and print rows
parquet_pipeline.stk	pairs with stryke-arrow: Arrow writes parquet, Spark reads it via `read_files()`

# range_stats.stk — aggregate over a generated range
val $r = Spark::query_one "
    SELECT COUNT(*) AS rows, SUM(id) AS total, AVG(id) AS mean,
           MIN(id) AS min_id, MAX(id) AS max_id
    FROM range($n)
"
p "n=#{$r->{rows}} sum=#{$r->{total}} mean=#{$r->{mean}}"

Performance notes

Each call boots a fresh JVM via spark-submit. Plan for ~5–10s startup per call.
Batch work into one query with CTEs / subqueries / temp views when possible — that's a single submit, one JVM.
Local Spark warehouse files land under ./spark-warehouse/ and a metastore_db/ directory in the cwd. Both are in .gitignore.
For interactive work against a remote cluster, point the master option (spark-submit's --master) at a long-running standalone / YARN / k8s Spark cluster: the submit time is the same, but the actual compute runs on warm executors.

Tests

cargo test                                  # Rust unit tests, no live JVM
JAVA_HOME=/path/to/jdk-17 s test t/         # end-to-end against local[*]

The end-to-end suite skips cleanly when spark-submit isn't on PATH or the JVM can't start. The repo also carries shell lint gates under tests/*.sh (HTML structure, README and man-page hygiene, workflow formatting).

Why a package, not a builtin

Spark integration requires the JVM, spark-submit, and PySpark on the host. Most stryke one-liners never touch Spark; for the ones that do, opt in with this package.

The stryke side is a thin .stk wrapper that calls spark__* FFI symbols on the cdylib; the heavy code lives in libstryke_spark.{dylib,so}, dlopened in-process on the first `use Spark`. Core stryke is never linked against this package's deps.

FFI layer

Each Spark::* wrapper builds a JSON args dict and calls a sibling spark__* symbol resolved out of libstryke_spark.{dylib,so}. The cdylib is dlopened in-process on the first `use Spark`; per call it writes the embedded PySpark driver (src/driver.py, included via include_str!) to a temp file and runs it through spark-submit, passing the JSON request envelope and reading JSON rows back. Each call boots a fresh JVM / SparkSession (~5–10s warmup); there is no persistent cross-call session daemon in this release. Responses are JSON; errors come back as {"error": "<msg>"} and the wrapper dies with the message.
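
The error convention described above can be sketched in a few lines of Python. Only the {"error": "<msg>"} shape comes from this page; the names `SparkCallError` and `unwrap_response` are hypothetical, and the real wrapper is written in stryke, so treat this as an illustration of the contract rather than the actual code.

```python
import json


class SparkCallError(RuntimeError):
    """Raised when the response envelope carries an error message."""


def unwrap_response(text):
    """Decode a JSON response, raising if it is an {"error": ...} envelope."""
    payload = json.loads(text)
    # Assumption: an object with an "error" key always signals failure.
    # A result row that has a column named "error" would need another marker.
    if isinstance(payload, dict) and "error" in payload:
        raise SparkCallError(payload["error"])
    return payload
```

A successful call returns the decoded payload unchanged, so `unwrap_response('[{"id": 1}]')` gives `[{"id": 1}]`, while an error envelope surfaces as an exception carrying the driver's message.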

Layout

stryke-spark/
├── Cargo.toml             # cdylib crate manifest (crate-type = ["cdylib"], publish = false)
├── src/
│   ├── lib.rs            # cdylib — spark__* extern "C" exports
│   └── driver.py         # embedded PySpark driver (include_str!)
├── lib/                   # stryke .stk wrapper(s) — `use Spark`
├── stryke.toml            # stryke package manifest ([ffi] exports table)
├── t/                     # zunit-style tests
├── examples/              # runnable .stk examples
├── Makefile               # `make install` builds + installs
└── docs/                  # this site (GitHub Pages)

Sibling packages

Part of the stryke package family. Browse the others via the MenkeTechnologiesMeta umbrella repo:

stryke-arrow — Apache Arrow / Parquet / Feather
stryke-aws — S3, DynamoDB, SQS, Lambda, STS
stryke-azure — Blob, Queues, Cosmos DB, Key Vault, Entra
stryke-clickhouse — ClickHouse
stryke-demo — live demos for every connector
stryke-docker — Docker daemon API
stryke-duckdb — embedded DuckDB SQL engine
stryke-email — email + campaign client
stryke-fleet — parallel expect/PTY automation
stryke-gcp — Cloud Storage, Pub/Sub, Secret Manager, BigQuery, Firestore
stryke-grpc — reflection-based gRPC client
stryke-gui — mouse / keyboard / screen / clipboard automation
stryke-k8s — Kubernetes
stryke-kafka — Apache Kafka
stryke-mcpd — MCP servers without a runtime
stryke-mongo — MongoDB
stryke-mssql — Microsoft SQL Server
stryke-mysql — MySQL / MariaDB
stryke-neo4j — Neo4j graph (Bolt)
stryke-office — Office docs / PDF / images
stryke-parquet — Parquet file inspector
stryke-polars — Polars + ndarray + linalg + FFT
stryke-postgres — PostgreSQL
stryke-redis — Redis / Valkey
stryke-scrape — web scraping / crawling
stryke-scylla — ScyllaDB / Cassandra
stryke-search — Elasticsearch / OpenSearch
stryke-selenium — browser automation (WebDriver)
stryke-utils — pure-stryke utility belt
stryke-zmq — ZeroMQ messaging