stryke-spark — Engineering Report

>_EXECUTIVE SUMMARY

stryke-spark is one of the opt-in connector packages in the stryke ecosystem. Apache Spark client for stryke. Opt-in package, kept out of the stryke core binary so the daily-driver install stays slim — ships as a Rust cdylib that stryke dlopens in-process on first use Spark.

Spark integration requires the JVM, spark-submit, and PySpark on the host. The cdylib drives Spark through spark-submit with an embedded PySpark driver (src/driver.py, compiled in via include_str!); universal across Spark 3.x and 4.x.

cdylib

Crate type

opt-in

Tier

dlopen

Load model

JSON-over-FFI

Envelope

Public Spark::* fns

spark__* FFI exports

Pure helper fns (no JVM)

Runnable examples

Counts are from lib/Spark.stk (public Spark::* functions, of which 35 are pure in-process helpers needing no JVM), the [ffi] exports table in stryke.toml, and examples/. The cdylib links only serde, serde_json, anyhow, tempfile, and once_cell (Cargo.toml); the JVM, spark-submit, and PySpark are host requirements, not crate dependencies.

~ARCHITECTURE

In-process cdylib design: the stryke side is a thin .stk wrapper that calls FFI symbols on the dlopen'd library. No subprocess between stryke and the cdylib, no pipe — the FFI call returns on the same thread that made it. The cdylib in turn drives Spark by invoking spark-submit with an embedded PySpark driver.

Layer	Implementation
stryke wrappers (`lib/*.stk`)	Thin `Spark.stk` wrapper; each public fn serializes its args to JSON and parses the response
cdylib (`libstryke_spark.{dylib,so}`)	Single Rust `cdylib` crate; every public surface fn is an `extern "C" fn spark__<verb>(const c_char) -> mut c_char`; drives Spark via `spark-submit` + embedded `src/driver.py` (`include_str!`)
Process model	Library is `dlopen`'d once per stryke session on first `use Spark`; each call boots a fresh JVM / `SparkSession` via `spark-submit` (~5–10s warmup) — no persistent cross-call session daemon in this release
Build	Cargo with `[lib] crate-type = ["cdylib"]`; `publish = false`; the package itself ships via `s pkg install -g .`
Install path	`~/.stryke/store/spark@<version>/` after `make install`; `use Spark` from any stryke script resolves it
Tests	zunit-style under `t/` with live-service variants when applicable
CI	GitHub Actions `.github/workflows/ci.yml` — cargo fmt + clippy + test, plus stryke pkg install verification

$WHY OPT-IN (NOT BUILTIN)

Spark integration requires the JVM, spark-submit, and PySpark on the host. Most stryke one-liners never touch Spark; for the ones that do, opt in with this package.

The trade-off is intentional. The core stryke binary stays under ~40 MB precisely because each connector ships separately. Daily-driver work (one-liners, awk replacement, data scripting) doesn't need MongoDB drivers, AWS SDKs, or Spark Connect bindings linked in.

&FFI ENVELOPE

JSON in / JSON out across the FFI boundary. Each Spark::* wrapper serializes its args to a JSON dict and calls the matching spark__* symbol on the dlopen'd cdylib; the cdylib runs the embedded PySpark driver through spark-submit and returns a JSON CString the wrapper parses. catch_unwind on every call converts panics to JSON errors. The returned CString is freed via the cdylib-exported stryke_free_cstring, wired automatically by rust_ffi::load_cdylib.

# request shape (built by the .stk wrapper)
{"verb": "<op>", "...": ...}

# response shape
{"rows": [...]} | {"value": ...} | {"ok": true}   # success — per-fn shape
{"error": "<msg>"}                                # failure — wrapper die()s with the message

=PUBLIC SURFACE INVENTORY

The 64 public Spark::* functions split into JVM-backed groups (each maps to a spark__* FFI export driven through spark-submit) and a pure-helper group that runs entirely in-process.

Group	Functions	Response shape
Read paths	`query`, `query_stream`, `query_one`, `query_col`, `query_scalar`, `dump`, `count`	`{rows:[…]}` → rows; scalar/col/one derived stryke-side; `count` composes `query_scalar`
DDL / DML	`execute`, `explain`	`{ok:true}` ack; `explain` returns `{plan}`
External I/O	`read`, `write`	`{rows:[…]}`; `{ok, …}`
Metadata	`ping`, `tables`, `databases`, `schema`, `columns`, `functions`, `version`	`{ok}` / `{rows:[…]}` / `{version}`
Catalog + session	`views`, `catalogs`, `current_database`, `create_temp_view`, `drop_temp_view`, `set_database`, `refresh_table`	`{rows:[…]}` / `{database}` / `{ok, …}`
Caching + config	`cache`, `uncache`, `config`	`{ok, cached\|uncached}`; `{value}` or `{ok}`
Submit	`submit`	`{exit_code, output}`
Pure helpers (no JVM)	35 fns: `parse_` / `build_` for master URLs, table/qualified idents, string literals, LIKE patterns, memory sizes, durations, partitions, confs, data types, storage levels, Maven coords, JDBC URLs, app IDs, SQL statement splitting	parser hashref ↔ builder string; every `parse_` has an inverse `build_`

Five of the pure helpers (quote_literal, build_partition_spec, build_partition_path, build_storage_level, build_data_type) are implemented pure-stk — their backslash-bearing output would be mangled crossing the JSON FFI, so they compose other helpers in lib/Spark.stk rather than calling a spark__* op. That is why the public count (64) exceeds the FFI export count (57).

@JDK + RUNTIME REQUIREMENTS

Spark 4.x officially supports JDK 17; JDK 21+ trips a getSubject is not supported error in the Hive catalog path even under local[*]. The cdylib defaults to --conf spark.sql.catalogImplementation=in-memory to dodge Hive, but a JDK 17 environment is smoothest — set JAVA_HOME before running. spark-submit must be reachable (via brew install apache-spark, a distro package, or a tarball with $SPARK_HOME set). Each call boots a fresh JVM / SparkSession (~5–10s warmup); local runs write a ./spark-warehouse/ and metastore_db/ in the cwd (both gitignored). Batch into one query with CTEs / temp views to amortize the single submit.

/SCOPE

See the README's "Why this is a package" + "API reference" sections for the authoritative scope. The package is intentionally narrower than its underlying SDK / driver — the goal is "useful from a shell pipeline", not "complete API coverage".

#PROJECT METADATA

Item	Value
License	MIT
Author	MenkeTechnologies
Repository	github.com/MenkeTechnologies/stryke-spark
Parent language	strykelang
Meta umbrella	MenkeTechnologiesMeta
Issues	github.com/MenkeTechnologies/stryke-spark/issues