Back to feed
arXiv cs.AI·

LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

Signal
85
Hype
25
In three linesLEAP is an agentic framework enabling LLMs to generate mechanically verifiable formal proofs in Lean. The system decomposes complex problems into smaller units through iterative interaction with the Lean compiler. On 2025 Putnam Competition (12 problems), LEAP solves all 12; on Lean-IMO-Bench, it achieves 70% one-shot solve rate versus <10% for general-purpose LLMs.

## LEAP: an agentic framework that outperforms specialized IMO-level systems on formal proof

### What is actually happening

LEAP (arXiv:2606.03303) is an agentic framework layered on top of general-purpose LLMs — not a fine-tuned model, not a system trained on Lean-specific corpora — that produces mechanically verifiable formal proofs checked by the Lean compiler. The central empirical result: **70% solve rate on Lean-IMO-Bench**, up from below 10% for the same LLMs in one-shot mode, and critically above the **48% benchmark set by a specialized, gold-medal-caliber IMO system**. On the 2025 Putnam Competition, LEAP solves all 12 problems.

### The architecture behind the numbers

LEAP combines three mechanisms. First, **hierarchical decomposition**: complex problems are broken into tactical sub-goals, each tractable independently. Second, **informal-to-formal bridging**: the system first generates a natural-language blueprint leveraging the informal mathematical reasoning already present in general-purpose LLMs, then translates that blueprint into Lean syntax. Third, an **iterative refinement loop** using the Lean compiler as an oracle: each proof attempt is submitted to the compiler, type errors and logical inconsistencies surface as feedback signals, and the LLM corrects accordingly. This is not naive prompting — it is an agentic architecture where the compiler acts as a real-time formal verifier.

### Why Lean-IMO-Bench changes the evaluation landscape

Existing benchmarks (MiniF2F, ProofNet) are approaching saturation: top systems score high enough that discrimination becomes difficult. Lean-IMO-Bench introduces IMO-style problems formalized in Lean, characterized by **short statements but highly non-routine, multi-step proofs** spanning a wide difficulty range. This is precisely where one-shot approaches collapse: the brevity of the statement conceals structural complexity requiring long, branching reasoning chains. LEAP reaching 70% on a benchmark explicitly designed to resist saturation is the most informative number in the paper.

### The Knuth case: a research-level signal

Beyond competitive benchmarks, the authors demonstrate application to a real open problem: **Hamiltonian decomposition of even-order Cayley graphs**, a challenge connected to Knuth's work. LEAP produced a verified formal proof for a key subproblem of this combinatorial challenge. This is not anecdotal — it demonstrates the system operating outside its training regime, on mathematical structures absent from standard corpora. Lean verification guarantees the absence of silent errors, a chronic problem with unformalized LLM proofs.

### Who stands to lose

First loser: **Lean-specific fine-tuned systems**. If a generalist agentic framework outperforms a system specifically trained for IMO-level problems (70% vs. 48%), the cost justification for specialization erodes. Teams that have invested in Lean-specific fine-tuning pipelines need to reassess their roadmap.

Second loser: **informal verification approaches**. LLMs producing natural-language proofs without formal verification are exposed to undetected errors. LEAP establishes a mechanical verifiability standard that makes these approaches harder to defend in high-stakes contexts — research mathematics, critical software verification.

Third potential loser: **current benchmarks themselves**. MiniF2F and similar benchmarks lose discriminative power. Lean-IMO-Bench becomes the de facto reference for evaluating serious systems.

### What remains open

The paper does not detail the computational costs of the agentic loop — how many compiler calls per problem, total latency on hard IMO problems. Generalization to other proof assistants (Coq, Isabelle, Agda) is not addressed. The **scalability to research-length proofs** — major theorems, not subproblems — remains an open question. LEAP solves competition problems; the gap between a Putnam problem and a Fields Medal-level theorem proof remains substantial.

Read source
Your take?
AI AgentsReasoningBenchmarksPapers

Summary generated by Claude — human-verified