SPEAR: Code-Augmented Agentic Prompt Optimization
In three lines: SPEAR is an agentic prompt optimizer that integrates a Python sandbox for structural error analysis (confusion matrices, clustering). Evaluated on 13 industrial LLM-as-judge tasks and BBH-7, it outperforms GEPA and TextGrad (κ 0.857 vs 0.359 on tool selection; F1-macro 0.815 vs 0.763 on filter relevance). The Python tool contributes up to +0.79κ on complex judge tasks.
## SPEAR: an APE optimizer that writes its own diagnostic code
### What SPEAR actually does
Automatic prompt engineering (APE) typically runs a fixed loop: evaluate a prompt, generate a critique, propose a rewrite. GEPA and TextGrad, the two main baselines, treat this loop as a static pipeline where the optimizer diagnoses failures solely from its context window. SPEAR breaks this model by porting the CodeAct paradigm (Wang et al., 2024) into the APE space: the agent has four tools — `evaluate`, `python`, `set_prompt`, `finish` — and autonomously decides their order and frequency.
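The four-tool loop can be sketched roughly as follows. Only the tool names (`evaluate`, `python`, `set_prompt`, `finish`) come from the summary above; `OptimizerState`, `run_agent`, the toy metric, and the scripted policy that stands in for the LLM are hypothetical and exist purely for illustration. The `python` tool here uses a bare `exec`, which is not a real sandbox.

```python
from dataclasses import dataclass, field


@dataclass
class OptimizerState:
    prompt: str
    history: list = field(default_factory=list)  # log of (tool_name, result)
    done: bool = False


def evaluate(state, _arg):
    # Toy metric (assumption): fraction of required keywords in the prompt.
    required = ("label", "examples")
    return sum(word in state.prompt for word in required) / len(required)


def python(state, code):
    # Stand-in for the sandbox tool. A real sandbox would isolate execution;
    # this bare exec does not and is only safe for trusted, scripted input.
    scope = {"prompt": state.prompt}
    exec(code, {}, scope)
    return scope.get("result")


def set_prompt(state, new_prompt):
    state.prompt = new_prompt
    return "ok"


def finish(state, _arg):
    state.done = True
    return "finished"


TOOLS = {"evaluate": evaluate, "python": python,
         "set_prompt": set_prompt, "finish": finish}


def run_agent(state, policy, tools, max_steps=20):
    """Let the policy pick the next tool until it calls finish or the budget ends."""
    for _ in range(max_steps):
        tool_name, arg = policy(state)
        result = tools[tool_name](state, arg)
        state.history.append((tool_name, result))
        if state.done:
            break
    return state


def scripted_policy(state):
    # Hypothetical replacement for the LLM: a fixed sequence of tool calls.
    script = [
        ("evaluate", None),
        ("python", "result = len(prompt.split())"),
        ("set_prompt", "Assign a label; use the examples."),
        ("evaluate", None),
        ("finish", None),
    ]
    return script[len(state.history)]


final = run_agent(OptimizerState(prompt="Classify the input."), scripted_policy, TOOLS)
for tool_name, result in final.history:
    print(tool_name, result)
```

The point of the sketch is the control flow: the order and number of tool calls are decided by the policy at run time, not fixed by the pipeline.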
The distinctive tool is the Python sandbox. The agent writes and executes arbitrary code on the current evaluation DataFrame: confusion matrices, error clustering, per-subgroup metrics. This is not a pre-wired module; it is code the agent *authors* at each iteration based on what it observes. Two guardrails turn this long-horizon agent into a monotone-improving optimizer: auto-rollback on metric regression, and an optional guard metric floor.
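A minimal sketch of how the two guardrails could combine, under stated assumptions: the summary only says that regressions trigger a rollback and that a guard metric floor is optional. The `Candidate` type, the `accept_or_rollback` helper, the choice to accept ties, and all the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    prompt: str
    primary: float                 # metric being optimized
    guard: Optional[float] = None  # secondary metric that must not collapse


def accept_or_rollback(best, candidate, guard_floor=None):
    """Keep `best` unless the candidate passes both guardrails."""
    # Guardrail 1: roll back on any regression of the primary metric.
    # Accepting ties is an assumption of this sketch.
    if candidate.primary < best.primary:
        return best
    # Guardrail 2 (optional): reject candidates whose guard metric is below the floor.
    if guard_floor is not None and candidate.guard is not None:
        if candidate.guard < guard_floor:
            return best
    return candidate


best = Candidate("v0", primary=0.60, guard=0.90)
trace = [best.primary]
for cand in [
    Candidate("v1", primary=0.55, guard=0.95),  # regresses: rolled back
    Candidate("v2", primary=0.70, guard=0.70),  # improves but breaks the floor
    Candidate("v3", primary=0.72, guard=0.85),  # passes both checks
]:
    best = accept_or_rollback(best, cand, guard_floor=0.80)
    trace.append(best.primary)

print(trace)
```

Because a rejected candidate never replaces `best`, the retained primary metric can only stay flat or rise across iterations, which is the monotonicity property described above.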
### The numbers that justify the signal score
Across 13 industrial LLM-as-judge tasks (three suites: recruiter intake, conversational memory, query refinement), SPEAR wins every task on the primary metric:

- **Tool selection**: κ 0.857 vs 0.359 (GEPA), a +0.498κ gap on a 5-class task
- **Filter relevance**: F1-macro 0.815 vs 0.763
- **Hardest extraction dimension**: κ 0.254 vs 0.218 (modest gain, but on the most resistant task)
On BBH-7, the gap widens further: average accuracy 0.938 for SPEAR vs 0.628 for GEPA and 0.484 for TextGrad. GSM8K follows the same pattern.
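For readers less familiar with the κ figures quoted above, here is a small worked example. It assumes κ denotes Cohen's kappa, which this summary does not state explicitly, and the labels are invented. Kappa corrects observed agreement $p_o$ for the agreement $p_e$ expected by chance: $\kappa = (p_o - p_e) / (1 - p_e)$.

```python
from collections import Counter


def cohen_kappa(gold, pred):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(gold)
    # Observed agreement: share of items where both labels match.
    p_o = sum(g == p for g, p in zip(gold, pred)) / n
    # Chance agreement: product of the two marginal frequencies, summed over classes.
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    p_e = sum(gold_counts[c] * pred_counts[c] for c in gold_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)


gold = ["a", "a", "b", "b"]
pred = ["a", "a", "b", "a"]
kappa = cohen_kappa(gold, pred)
print(kappa)  # p_o = 0.75, p_e = 0.5, so kappa = 0.5
```

Raw agreement in this toy case is 75%, yet kappa is only 0.5, which is why a κ gap carries more information than an accuracy gap on imbalanced multi-class judge tasks.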
The ablation is the most informative part of the paper: removing the Python tool alone costs **0.79κ on the tool-selection task** and **0.35κ on the extraction dimension**. The identified reason is specific: class-pair confusion aggregation is an operation that long-context LLMs cannot perform reliably on raw tabular DataFrames. The code is not decorative; it compensates for a structural limitation of attention over dense tabular data.
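The class-pair aggregation in question is trivial in code, which is the point. The snippet below is an illustrative sketch, not the diagnostic code the agent writes: the rows, the `gold`/`pred` field names, and the class labels are all hypothetical, and plain dictionaries stand in for DataFrame rows so the example needs only the standard library (with pandas, `pd.crosstab` would give the full confusion table).

```python
from collections import Counter

# Hypothetical evaluation rows: gold label vs. the judge's predicted label.
rows = [
    {"gold": "search", "pred": "search"},
    {"gold": "search", "pred": "filter"},
    {"gold": "search", "pred": "filter"},
    {"gold": "filter", "pred": "search"},
    {"gold": "email", "pred": "email"},
    {"gold": "email", "pred": "filter"},
    {"gold": "filter", "pred": "filter"},
]


def confusion_pairs(rows):
    """Count misclassifications per (gold, pred) class pair."""
    return Counter(
        (row["gold"], row["pred"]) for row in rows if row["gold"] != row["pred"]
    )


pairs = confusion_pairs(rows)
top_pair, top_count = pairs.most_common(1)[0]
print(top_pair, top_count)  # ('search', 'filter') 2
```

An exact count like this gives the optimizer a concrete target, for example "search is most often mistaken for filter", instead of an impression formed by reading rows in context.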
### Why this matters for practitioners
Most teams deploying LLM-as-judge in production face exactly the problem SPEAR addresses: a judge prompt that performs well on average but has systematic blind spots on specific classes or patterns. Diagnosing those blind spots manually is expensive; existing APE loops miss them because they read the DataFrame in context without extracting its structure.
The four-tool architecture also signals a direction for prompt engineering practice: the prompt is no longer the only lever — *analyzing* prompt behavior itself becomes automatable. SPEAR is evaluated on real industrial tasks (recruiting, conversational CRM), not only academic benchmarks, which strengthens transferability claims.
### Losers and limits that should not be glossed over
**GEPA and TextGrad** are the direct losers: on BBH-7, TextGrad reaches 0.484 where SPEAR reaches 0.938. Teams that have integrated these tools into MLOps pipelines face a migration question.
SPEAR's structural limits deserve scrutiny. A long-horizon agent with a Python sandbox increases token cost and latency compared to a standard APE loop — the paper does not publish an explicit cost comparison, which is a notable omission. Auto-rollback guarantees monotonicity but can trap the optimizer in a local basin if the performance landscape is multimodal. The quality of the diagnostic code generated by the agent is not independently audited: an agent writing incorrect diagnostic code could silently converge toward a bad prompt.
The code had not been released at the time of the arXiv announcement, which limits immediate reproducibility. Dependence on CodeAct also means performance is tied to the underlying model used as the optimizer, a variable teams will need to calibrate against their own stack.
Summary generated by Claude — human-verified