PySpark optimization

Automatically optimize your PySpark jobs to cut compute costs.

Our tool benchmarks the current job, tracks candidate PySpark code and Spark configurations, and returns the best performing combination that preserves existing behavior, all without sensitive data leaving your environment.

Optimization result

Illustrative PySpark job

The first question is whether a few expensive jobs hide meaningful compute savings. spark-optimize answers that with measured runs against your own baseline.

baseline 42m 18s
optimized 29m 37s

1.43× faster · 30% reduction

Code change summary

Added .persist() before the branching aggregation so the filtered frame is reused instead of recomputed, and switched the dimension join to a broadcast.

Spark config summary

AQE skew-join hint enabled; optimized writes disabled on the write-skewed Delta sink.
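A hedged sketch of what the returned config artifact might contain. The keys shown are standard Spark and Delta settings; the exact file shape is an assumption:

```json
{
  "spark.sql.adaptive.enabled": "true",
  "spark.sql.adaptive.skewJoin.enabled": "true",
  "spark.databricks.delta.optimizeWrite.enabled": "false"
}
```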

Validation checks
  • semantic equivalence
  • AST safe
  • write equivalence

Artifacts ready for review

  • job.optimized.py
  • job.optimized.conf.json

How it works

A drop-in replacement for spark-submit.

Run the same job you already run, but through spark-optimize. It baselines the current run, explores code and Spark configuration candidates, and returns the fastest validated result.

↻ iterates within your budget

Your environment

Execute & measure

baseline + candidate runs · Spark + environment metrics

Anonymize & restore

identifiers + literals ↔ tokens

Safety & semantic check

AST validation · output equivalence

Optimization agent

Candidate exploration

LLM-driven code + configuration generation

Multi-signal scoring

weighted: runtime · spill · shuffle · GC · skew

Cross-run tracking

history · mismatches · winners

Stays local: raw code · real identifiers · string literals · row data · credentials.

Maximum impact, minimum risk.

Variables and literals are anonymized. Code is AST-validated before execution. Outputs must match baseline before acceptance. The agent searches aggressively inside that envelope. The result is the biggest win that passes every check.

Efficient frontier exploration.

The agent scores each candidate across every signal that matters: runtime, spill, shuffle, GC, skew. A candidate stays alive only if no other beats it on all of them at once; the rest are pruned. The remaining frontier evolves toward your winner.
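The pruning rule above is Pareto dominance. An illustrative sketch, assuming every signal is lower-is-better:

```python
# A candidate survives only if no other candidate beats it on every
# signal at once; dominated candidates are pruned from the frontier.
SIGNALS = ("runtime", "spill", "shuffle", "gc", "skew")

def dominates(a: dict, b: dict) -> bool:
    """True if a is at least as good as b everywhere, strictly better somewhere."""
    at_least = all(a[s] <= b[s] for s in SIGNALS)
    strictly = any(a[s] < b[s] for s in SIGNALS)
    return at_least and strictly

def frontier(candidates: list[dict]) -> list[dict]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates if other is not c)]

runs = [
    {"runtime": 29, "spill": 0, "shuffle": 4, "gc": 2, "skew": 1},  # fast, clean
    {"runtime": 35, "spill": 0, "shuffle": 2, "gc": 2, "skew": 1},  # slower, less shuffle
    {"runtime": 36, "spill": 3, "shuffle": 5, "gc": 4, "skew": 2},  # dominated by the first
]
survivors = frontier(runs)
```

The first two runs trade runtime against shuffle, so both stay alive; the third is worse everywhere than the first, so it is pruned.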

Job output correctness.

Each candidate's outputs are matched against your baseline action by action: row counts, schema, payloads. Non-deterministic operations fail closed rather than masking drift. Faster is worthless if the output shifted.
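A minimal sketch of that per-action comparison. The record shape and field names here are assumptions, not the product's actual format:

```python
# One recorded terminal action per dict: row count, schema, and a hash of
# the written payload. Non-deterministic actions fail closed.
def outputs_match(baseline: dict, candidate: dict) -> bool:
    if baseline.get("deterministic") is False:
        return False  # fail closed rather than mask drift
    return (
        candidate["row_count"] == baseline["row_count"]
        and candidate["schema"] == baseline["schema"]
        and candidate["payload_hash"] == baseline["payload_hash"]
    )

base = {"row_count": 1_243_891, "schema": ["id", "amount"],
        "payload_hash": "ab12", "deterministic": True}
good = {"row_count": 1_243_891, "schema": ["id", "amount"], "payload_hash": "ab12"}
drift = {"row_count": 1_243_887, "schema": ["id", "amount"], "payload_hash": "9f00"}
```

A four-row drift like `drift` above is exactly the "28% faster, fails validation" case: the candidate is rejected no matter how fast it ran.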

Narrower than a general coding agent.

spark-optimize can only propose code and config changes inside the one job you point it at. No new dependencies, no other files, no destructive actions. It loops on the same problem until it converges on a measurable win.

Why not do it yourself?

Optimizing Spark jobs is tedious and complex. We make it easier.

spark-optimize is not just a code rewrite. It baselines the job, explores code and Spark configuration candidates, rejects faster-but-wrong changes, and hands back artifacts your team can review.

waiting on run #4
18m elapsed · est. 42m

Easy to get distracted waiting for candidate runs.

A full re-run per attempt. The context switch wins, every time.

code × config 12 tried · 3 faster
which one's best?

Tracking what's been tried gets impossible fast.

Code × config candidates stack up, and the winners hide in one engineer's scratch doc.

baseline 1,243,891 rows
candidate 1,243,887 rows Δ −4
28% faster · fails validation

Faster-but-wrong is the usual failure mode.

Joins and partition changes can silently shift row counts. Proving equivalence is harder than finding the win.

this sprint Mar 24
  • Q1 revenue backfill deadline
  • Events v2 pipeline deadline
  • Unity Catalog cutover deadline
  • Tune expensive Spark job later

The people who can tune are doing more urgent work.

Your engineers with deep Spark expertise are already delivering this quarter's roadmap. Tuning has no deadline, so it waits.

Safe and secure by design

Layered checks, by design.

Precision safety: the agent's access and the changes it can make are tightly limited, with verification and checks built in at every step.

Limited blast radius: no shell · no tool calls · no dependency installs · no foreign filesystem access

Code anonymization

Identifiers, literals, paths, and credentials are anonymized. Only the code's structure and Spark metrics travel upstream.

Static safety analysis

Every candidate passes a static AST check. Dangerous imports, dynamic execution, and filesystem access are blocked before the runtime sees the code.
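An illustrative sketch of this kind of static check, using Python's `ast` module. The blocklists here are assumptions; the product's actual rules are its own:

```python
import ast

BLOCKED_MODULES = {"os", "subprocess", "shutil", "socket"}
BLOCKED_CALLS = {"eval", "exec", "__import__", "open"}

def is_safe(source: str) -> bool:
    """Reject candidates that import blocked modules or call dynamic execution."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Block dangerous imports, including dotted forms like os.path.
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in BLOCKED_MODULES for a in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BLOCKED_MODULES:
                return False
        # Block dynamic execution and bare filesystem access.
        elif (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
              and node.func.id in BLOCKED_CALLS):
            return False
    return True
```

Because the check runs on the parse tree rather than on strings, renaming tricks like `import subprocess as sp` are still caught.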

Output equivalence check

Outputs are matched against your baseline action by action. Anything that drifts, times out, or fails to verify is rejected, never promoted.

Rejected before execution or acceptance

  • Dangerous imports
  • Dynamic code execution
  • Bare filesystem writes
  • Unverifiable output deltas
  • Slower than baseline
  • New imports or dependencies
  • Destructive operations

FAQ

Answers to the questions Spark teams ask before trying this.

What leaves my environment?

Anonymized code structure, baseline Spark metrics, and your cluster shape (executors, memory, runtime versions). Identifiers, string literals, file paths, and credentials stay local, replaced with tokens before upload and restored when candidates come back.
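A minimal sketch of that tokenize-and-restore round trip (illustrative only, not the actual implementation):

```python
import re

TOKEN = "__T{}__"

def anonymize(code: str, names: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each sensitive name with an opaque token; keep the mapping local."""
    mapping = {}
    for i, name in enumerate(names):
        token = TOKEN.format(i)
        mapping[token] = name
        code = re.sub(rf"\b{re.escape(name)}\b", token, code)
    return code, mapping

def restore(code: str, mapping: dict[str, str]) -> str:
    """Swap tokens back to the original identifiers when candidates return."""
    for token, name in mapping.items():
        code = code.replace(token, name)
    return code

# Hypothetical job fragment: table names and literals never leave as-is.
src = 'df = spark.read.table("revenue_q1").filter(col("region") == "EMEA")'
masked, mapping = anonymize(src, ["revenue_q1", "region", "EMEA"])
```

The mapping stays on your side; only `masked` travels upstream, and `restore(masked, mapping)` recovers the original code when a candidate comes back.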

How is this different from a general coding agent?

Three ways. First, the surface area is small: spark-optimize can only propose code and Spark config changes inside the one job you point it at, never shell commands, dependency installs, or other files. Second, every candidate is AST-validated locally before it runs to ensure it stays inside that surface. Third, unlike general agents, spark-optimize runs an autonomous evaluation loop on the same job (generating, scoring, and discarding) until it converges on a measurable win or its candidate budget is exhausted, which could take hours, all without getting distracted.

Could the agent ever make a job slower?

No. Each candidate runs under a timeout derived from your baseline plus a small margin, so a candidate that runs longer than the baseline is killed before it can be considered a winner. Worst case, a run explores its budget without finding a faster candidate and exits cleanly. Your existing job is untouched.

How do you prevent "faster but wrong" changes?

The CLI records the baseline run's terminal actions (writes, collect, show, count) and reruns each candidate against the same inputs. Row counts, schema, and write payloads are compared action by action. A candidate is only promoted when every action's output matches the baseline.

Do you optimize code, configuration, or both?

Both. The product is designed to optimize PySpark code and Spark configuration together for the same job.

Is this a sandbox?

No, and we don't pretend it is. spark-optimize is a constrained agent by design: no shell, no tool calls, no foreign filesystem access. On top of that, three layered checks: anonymization before upload, AST validation before execution, output equivalence before acceptance. Layered safety beats claimed sandbox isolation.

Can it validate write-heavy ETL jobs?

Yes. The CLI redirects terminal writes (Delta, Parquet, and similar) to scratch storage and compares baseline and candidate outputs action by action across row counts, schema, and write payloads. A candidate is only promoted when equivalence is verified end-to-end.

What do I get at the end of a run?

Two reviewable artifacts: `job.optimized.py` (the rewritten PySpark) and `job.optimized.conf.json` (the Spark configuration). They drop straight into your existing code-review and release process. Nothing is auto-promoted.

Can I run this against local, staging, or production?

Yes. Candidate evaluation runs in whatever environment you point the CLI at (local, staging, or production). The winning candidate reflects real runtime conditions because it was measured against your real cluster.

What runtimes and platforms are supported today?

PySpark 3.4+ on Python 3.10+. Today: Databricks (DBR 13+ with Unity Catalog), AWS EMR 6.x, and open-source Spark on YARN or Kubernetes, with Delta Lake read/write validation. On request: Databricks/EMR Serverless, Google Dataproc, Azure Synapse/Fabric, Spark Connect, and Iceberg/Hudi write validation. Not in scope today: Structured Streaming, Delta Live Tables, and Scala/Java Spark jobs.

If I run spark-optimize twice on the same job, do I get the same winner?

Not exactly. LLM-driven candidate generation has natural variance, and runtime measurements vary with cluster noise, so close candidates can rank differently across runs. The validation checks are the same regardless of which run produces the winner, so any winner you get is held to the same correctness bar.

How much compute does an optimization run use?

You set a budget for the number of candidates the agent will try. Within that budget, the agent explores the search space looking for bigger wins: a tighter budget runs faster and cheaper, a wider budget gives the agent more room to find structurally better optimizations. The agent never runs forever, and a run that exhausts its budget without finding a winner exits cleanly.

My jobs take date ranges and partition filters. Does the winner generalize?

The winner is specific to the baseline you measured. For production jobs with variable inputs, point the CLI at a representative date range or partition and re-run periodically as data volume shifts. The artifacts are plain Python and JSON, so nothing stops you from promoting the winner through your normal code-review and release process.

Does this work on Structured Streaming or long-running jobs?

Today it targets batch PySpark jobs with bounded inputs. Structured Streaming, Delta Live Tables, and continuously running pipelines are out of scope for now. The validation approach depends on a baseline run that terminates.

Fit check

Let's see if one of your PySpark jobs is a fit.

Start with a PySpark job that runs daily or otherwise frequently. The more often it runs, the faster optimization turns into real savings.