What leaves my environment?
Anonymized code structure, baseline Spark metrics, and your cluster shape (executors, memory, runtime versions). Identifiers, string literals, file paths, and credentials stay local, replaced with tokens before upload and restored when candidates come back.
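To make that concrete, the token substitution works roughly like this (a minimal sketch; the patterns, token format, and function names are illustrative, not spark-optimize's actual implementation):

```python
# Illustrative sketch of pre-upload token substitution; not spark-optimize's real code.
import re

# Assumed examples of what gets redacted: quoted storage paths and long opaque literals.
SENSITIVE_PATTERNS = [r'"(?:s3|gs|abfss)://[^"]*"', r'"[A-Za-z0-9+/=]{32,}"']

def anonymize(source: str) -> tuple[str, dict[str, str]]:
    """Replace sensitive literals with opaque tokens before anything leaves the machine."""
    mapping: dict[str, str] = {}

    def swap(match: re.Match) -> str:
        token = f'"__TOKEN_{len(mapping)}__"'
        mapping[token] = match.group(0)
        return token

    redacted = source
    for pattern in SENSITIVE_PATTERNS:
        redacted = re.sub(pattern, swap, redacted)
    return redacted, mapping

def restore(candidate: str, mapping: dict[str, str]) -> str:
    """Reinstate the original literals in a candidate that comes back from the service."""
    for token, original in mapping.items():
        candidate = candidate.replace(token, original)
    return candidate
```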
How is this different from a general coding agent?
Three ways. First, the surface area is small: spark-optimize can only propose code and Spark config changes inside the one job you point it at, never shell commands, dependency installs, or other files. Second, every candidate is AST-validated locally before it runs to ensure it stays inside that surface. Third, unlike general agents, spark-optimize runs an autonomous evaluation loop on the same job (generating, scoring, and discarding candidates) until it converges on a measurable win or exhausts its candidate budget, a process that can take hours, without drifting off task.
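The AST gate is a local, mechanical check. A rough sketch of the idea (the specific denylist below is an assumption for illustration, not the shipped rule set):

```python
# Illustrative AST gate: reject candidates that reach outside pure PySpark code changes.
import ast

FORBIDDEN_CALLS = {"eval", "exec", "__import__"}             # assumed denylist
FORBIDDEN_MODULES = {"os", "subprocess", "shutil", "socket"}  # assumed denylist

def candidate_is_allowed(source: str) -> bool:
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            roots = [alias.name.split(".")[0] for alias in node.names]
            if isinstance(node, ast.ImportFrom) and node.module:
                roots.append(node.module.split(".")[0])
            if any(root in FORBIDDEN_MODULES for root in roots):
                return False
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in FORBIDDEN_CALLS:
                return False
    return True
```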
Could the agent ever make a job slower?
No. Each candidate runs under a timeout derived from your baseline plus a small margin, so a candidate that runs longer than the baseline is killed before it can be considered a winner. Worst case, a run exhausts its budget without finding a faster candidate and exits cleanly. Your existing job is untouched.
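The timing logic amounts to something like the following (a simplified sketch; the 20% margin and function shape are illustrative assumptions, not documented defaults):

```python
# Illustrative: a candidate that can't beat baseline-plus-margin is killed, never promoted.
import subprocess
import time

def time_candidate(cmd: list[str], baseline_seconds: float, margin: float = 0.2) -> float | None:
    """Return the candidate's wall-clock seconds, or None if it timed out or failed."""
    deadline = baseline_seconds * (1 + margin)
    start = time.monotonic()
    try:
        subprocess.run(cmd, check=True, timeout=deadline)
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        return None  # slower than the allowance, or crashed: can never be the winner
    return time.monotonic() - start
```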
How do you prevent "faster but wrong" changes?
The CLI records the baseline run's terminal actions (writes, collect, show, count) and reruns each candidate against the same inputs. Row counts, schema, and write payloads are compared action by action. A candidate is only promoted when every action's output matches the baseline.
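In PySpark terms, the per-action comparison boils down to something like this (a simplified sketch for one DataFrame-producing action; the real check also covers write payloads):

```python
# Illustrative equivalence check for one terminal action's output.
from pyspark.sql import DataFrame

def outputs_match(baseline: DataFrame, candidate: DataFrame) -> bool:
    if baseline.schema != candidate.schema:    # same columns, types, nullability
        return False
    if baseline.count() != candidate.count():  # same row count
        return False
    # Same rows regardless of ordering: the symmetric difference must be empty.
    return (baseline.exceptAll(candidate).isEmpty()
            and candidate.exceptAll(baseline).isEmpty())
```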
Do you optimize code, configuration, or both?
Both. spark-optimize proposes PySpark code rewrites and Spark configuration changes together for the same job.
Is this a sandbox?
No, and we don't pretend it is. spark-optimize is a constrained agent by design: no shell, no tool calls, no filesystem access beyond the job it's optimizing. On top of that, three layered checks apply: anonymization before upload, AST validation before execution, output equivalence before acceptance. Layered safety beats claimed sandbox isolation.
Can it validate write-heavy ETL jobs?
Yes. The CLI redirects terminal writes (Delta, Parquet, and similar) to scratch storage and compares baseline and candidate outputs action by action across row counts, schema, and write payloads. A candidate is only promoted when equivalence is verified end-to-end.
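The redirect itself is just a path rewrite applied before the write executes; roughly like this (the scratch location and helper name are illustrative assumptions):

```python
# Illustrative: map a production output path to an isolated per-run scratch path.
SCRATCH_ROOT = "s3://your-scratch-bucket/spark-optimize"  # assumed location, not a real default

def redirect_write_path(original_path: str, run_id: str) -> str:
    table_name = original_path.rstrip("/").split("/")[-1]
    return f"{SCRATCH_ROOT}/{run_id}/{table_name}"

# e.g. a baseline df.write.format("delta").save(...) lands here instead of production:
redirect_write_path("s3://prod/gold/orders", "baseline")
# -> "s3://your-scratch-bucket/spark-optimize/baseline/orders"
```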
What do I get at the end of a run?
Two reviewable artifacts: `job.optimized.py` (the rewritten PySpark) and `job.optimized.conf.json` (the Spark configuration). They drop straight into your existing code-review and release process. Nothing is auto-promoted.
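If you want to apply the config artifact without touching cluster-level defaults, one option is to load it when building the job's SparkSession (a sketch assuming the file is a flat key-to-value map; the key shown is an example, not guaranteed output):

```python
# Illustrative: apply job.optimized.conf.json when constructing the SparkSession.
import json
from pyspark.sql import SparkSession

with open("job.optimized.conf.json") as f:
    optimized_conf = json.load(f)  # e.g. {"spark.sql.shuffle.partitions": "64", ...}

builder = SparkSession.builder.appName("my_job")
for key, value in optimized_conf.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()
```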
Can I run this against local, staging, or production?
Yes. Candidate evaluation runs in whatever environment you point the CLI at (local, staging, or production). The winning candidate reflects real runtime conditions because it was measured in that same environment.
What runtimes and platforms are supported today?
PySpark 3.4+ on Python 3.10+. Today: Databricks (DBR 13+ with Unity Catalog), AWS EMR 6.x, and open-source Spark on YARN or Kubernetes, with Delta Lake read/write validation. On request: Databricks/EMR Serverless, Google Dataproc, Azure Synapse/Fabric, Spark Connect, and Iceberg/Hudi write validation. Not in scope today: Structured Streaming, Delta Live Tables, and Scala/Java Spark jobs.
If I run spark-optimize twice on the same job, do I get the same winner?
Not exactly. LLM-driven candidate generation has natural variance, and runtime measurements vary with cluster noise, so close candidates can rank differently across runs. The validation checks are the same regardless of which run produces the winner, so any winner you get is held to the same correctness bar.
How much compute does an optimization run use?
You set a budget for the number of candidates the agent will try. Within that budget, the agent explores the search space looking for bigger wins: a tighter budget runs faster and cheaper; a wider budget gives the agent more room to find structurally better optimizations. The agent never runs forever, and a run that exhausts its budget without finding a winner exits cleanly.
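The shape of that loop, sketched in Python (illustrative only; generate_candidate and evaluate stand in for the LLM call and the timed, validated run described above):

```python
# Illustrative: the candidate budget strictly bounds how much compute a run can use.
def optimize(generate_candidate, evaluate, baseline_seconds: float, budget: int):
    best = None
    for _ in range(budget):              # never more candidate runs than the budget
        candidate = generate_candidate()
        runtime = evaluate(candidate)    # None if it failed validation or the timeout
        if runtime is not None and runtime < baseline_seconds:
            if best is None or runtime < best[1]:
                best = (candidate, runtime)
    return best                          # None means the run exits cleanly with no winner
```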
My jobs take date ranges and partition filters. Does the winner generalize?
The winner is specific to the baseline you measured. For production jobs with variable inputs, point the CLI at a representative date range or partition and re-run periodically as data volume shifts. The artifacts are plain Python and JSON, so nothing stops you from promoting the winner through your normal code-review and release process.
Does this work on Structured Streaming or long-running jobs?
Today it targets batch PySpark jobs with bounded inputs. Structured Streaming, Delta Live Tables, and continuously running pipelines are out of scope for now. The validation approach depends on a baseline run that terminates.