Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning
In three lines: Hilbert-Geo introduces a unified formal framework for solid geometry via Parse2Reason: problems are parsed into Conditional Description Language (CDL), then solved by reasoning over a theorem bank. It achieves 77.3% on SolidFGeo2k (vs. 54.2% for Gemini-2.5-pro) and 84.1% on MathVerse-Solid (vs. 62.9% for GPT-5). The authors release two expert-annotated datasets: SolidFGeo2k and PlaneFGeo3k.
## Hilbert-Geo: symbolic reasoning crushes LLMs on solid geometry
### 1. What is actually happening
Hilbert-Geo is not another fine-tune on geometry problems. It is a complete neuro-symbolic framework that formalizes solid geometry into a verifiable machine language, then executes theorem-driven reasoning rather than token prediction. The result: 77.3% on SolidFGeo2k versus 54.2% for Gemini-2.5-pro, a gap of +23.1 points on the same benchmark. On MathVerse-Solid, Hilbert-Geo reaches 84.1% against GPT-5's 62.9%, a +21.2-point margin.
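The two margins quoted above can be rechecked with exact decimal arithmetic. The snippet below is only a sanity check on the reported scores; the dictionary layout and variable names are illustrative and do not come from the paper.

```python
from decimal import Decimal

# Reported accuracies (percent), as quoted in this summary.
# Each entry: benchmark -> (Hilbert-Geo score, baseline name, baseline score).
reported = {
    "SolidFGeo2k": (Decimal("77.3"), "Gemini-2.5-pro", Decimal("54.2")),
    "MathVerse-Solid": (Decimal("84.1"), "GPT-5", Decimal("62.9")),
}

# Margin in percentage points; Decimal avoids binary floating-point rounding noise.
margins = {bench: ours - base for bench, (ours, _name, base) in reported.items()}

for bench, gap in margins.items():
    print(f"{bench}: +{gap} points")
```

Note that the two comparisons use different benchmarks and different baselines, so the margins should be read side by side rather than averaged.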
These numbers are not marginal. They suggest that pure MLLMs hit a structural ceiling on 3D spatial reasoning tasks, one that model scale and RLHF tuning alone have not lifted.
### 2. The Parse2Reason architecture: two steps, zero geometric hallucination
The method enforces a clean separation between perception and deduction:
**Step 1 — Parsing**: the problem (text + 3D diagram) is converted into CDL (Conditional Description Language), a formal predicate language specifically designed to encode geometric conditions. A prism or pyramid diagram becomes a verifiable list of predicates: incidence relations, parallelism, perpendicularity, angular measures. This is not approximate visual captioning — it is a constrained symbolic representation.
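The parsing output described above can be pictured as a flat list of typed predicates. The sketch below is a minimal illustration under stated assumptions: the `Predicate` class, the predicate names (`Pyramid`, `Perpendicular`, `LengthEq`), and the example problem are all hypothetical and do not reproduce the actual CDL syntax, which this summary does not show.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """One geometric condition, e.g. Perpendicular(PO, ABCD). Hypothetical format."""
    name: str
    args: tuple

    def __str__(self) -> str:
        return f"{self.name}({', '.join(map(str, self.args))})"


# Invented example: pyramid P-ABCD, square base of side 4, height PO = 6.
# In the real system this list would come from the parser, not be hand-written.
parsed = [
    Predicate("Pyramid", ("P", "ABCD")),
    Predicate("Square", ("ABCD",)),
    Predicate("Perpendicular", ("PO", "ABCD")),
    Predicate("LengthEq", ("AB", 4)),
    Predicate("LengthEq", ("PO", 6)),
]


def lookup(preds, name):
    """Return every predicate with the given name."""
    return [p for p in preds if p.name == name]


# Human-readable rendering of the parsed conditions.
cdl_text = [str(p) for p in parsed]
print("\n".join(cdl_text))
```

The point of such a representation is that each condition is a discrete, checkable unit: an annotator can compare the list against the diagram line by line, which is not possible with a free-form caption.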
**Step 2 — Reasoning**: from the CDL and a dedicated theorem bank, the system performs relational inference and algebraic computation. The resulting reasoning process is described by the authors as "strictly correct, verifiable, and human-readable" — meaning every step can be audited, unlike a chain-of-thought generated by an LLM.
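One common way to realize theorem-driven reasoning of this kind is forward chaining: apply each theorem to the known facts until nothing new can be derived, recording every step. The sketch below illustrates that general technique only; it is not the authors' engine, and the fact format, the two toy theorems, and the name `THEOREM_BANK` are assumptions made for the example.

```python
from fractions import Fraction

# Facts are plain tuples: (predicate_name, *args). All names are hypothetical.
facts = {
    ("Square", "ABCD"),
    ("SideLength", "ABCD", 4),
    ("Pyramid", "P", "ABCD"),
    ("Height", "P", "ABCD", 6),
}


def square_area(known):
    """Square(s) and SideLength(s, a)  =>  Area(s, a^2)."""
    squares = {f[1] for f in known if f[0] == "Square"}
    return {("Area", f[1], f[2] ** 2)
            for f in known if f[0] == "SideLength" and f[1] in squares}


def pyramid_volume(known):
    """Pyramid(p, b), Height(p, b, h), Area(b, S)  =>  Volume(p, b, S*h/3)."""
    areas = {f[1]: f[2] for f in known if f[0] == "Area"}
    pyramids = {(f[1], f[2]) for f in known if f[0] == "Pyramid"}
    return {("Volume", f[1], f[2], Fraction(areas[f[2]] * f[3], 3))
            for f in known
            if f[0] == "Height" and (f[1], f[2]) in pyramids and f[2] in areas}


THEOREM_BANK = [square_area, pyramid_volume]  # toy stand-in for a real theorem bank


def forward_chain(initial, theorems):
    """Apply theorems until a fixed point; return all facts plus an auditable trace."""
    known, trace = set(initial), []
    changed = True
    while changed:
        changed = False
        for theorem in theorems:
            for fact in sorted(theorem(known) - known, key=repr):
                known.add(fact)
                trace.append((theorem.__name__, fact))
                changed = True
    return known, trace


known, trace = forward_chain(facts, THEOREM_BANK)
for theorem_name, fact in trace:
    print(f"{theorem_name} -> {fact}")
```

The `trace` list is what makes the process auditable: each derived fact is tied to the named theorem that produced it, so a reader can verify the chain step by step.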
The theorem bank is the framework's core asset. It covers solid geometry (volumes, plane sections, projections) as well as plane geometry, where the system scores 80.2% on PlaneFGeo3k, which suggests the framework is not overfit to a single domain.
### 3. The datasets: SolidFGeo2k and PlaneFGeo3k
The absence of formally annotated benchmarks was precisely what blocked progress on solid geometry. The authors release two expert-annotated datasets:
- **SolidFGeo2k**: ~2,000 solid geometry problems with CDL annotations, solutions, and answers
- **PlaneFGeo3k**: ~3,000 plane geometry problems in the same format
The value of these datasets extends beyond Hilbert-Geo itself: they constitute evaluation infrastructure for all future systems. MathVerse-Solid already existed but represents a narrow subset; SolidFGeo2k is the first large-scale dedicated benchmark for formal solid geometry.
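As a rough picture of how such a dataset serves as evaluation infrastructure, the sketch below scores an arbitrary solver by exact-match answer accuracy. The `Problem` record layout, its field names, and the two toy problems are invented for illustration; the released schema and the paper's exact scoring protocol may differ.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class Problem:
    """Hypothetical record: field names are assumptions, not the released schema."""
    problem_id: str
    cdl: list        # formal conditions, as strings
    solution: list   # names of the theorems applied, in order
    answer: Fraction


def accuracy(problems, solver):
    """Share of problems whose predicted answer equals the annotated answer exactly."""
    if not problems:
        return 0.0
    correct = sum(1 for p in problems if solver(p) == p.answer)
    return correct / len(problems)


# Two invented toy problems (not taken from SolidFGeo2k).
toy_set = [
    Problem("toy-1", ["Cube(ABCD-EFGH)", "LengthEq(AB, 2)"],
            ["cube_volume"], Fraction(8)),
    Problem("toy-2", ["Pyramid(P, ABCD)", "Square(ABCD)", "LengthEq(AB, 4)",
                      "LengthEq(PO, 6)"],
            ["square_area", "pyramid_volume"], Fraction(32)),
]


def always_eight(problem):
    """Deliberately naive baseline solver that ignores its input."""
    return Fraction(8)


acc = accuracy(toy_set, always_eight)
print(f"accuracy: {acc:.1%}")
```

Because the solver is just a callable, the same harness could score a symbolic engine or an MLLM wrapper on identical terms, which is what makes a shared annotated benchmark useful beyond the system it was built for.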
### 4. Potential losers and limits to watch
**Generalist MLLMs** are the immediate losers on this task class. Gemini-2.5-pro and GPT-5 are large general-purpose models trained on massive corpora (their parameter counts are not public), and they lose by 20+ points to a specialized system. This reopens the question of whether scaling generalist models pays off on formal reasoning tasks.
**Plane-geometry provers** (e.g., DeepMind's AlphaGeometry 2, itself a neuro-symbolic system that targets olympiad-level plane geometry) have no direct answer for solid geometry. Hilbert-Geo occupies an uncovered space.
**Unresolved limitations**: the parsing step remains dependent on the quality of visual recognition of 3D diagrams — the paper does not report parsing error rates in isolation. If the generated CDL contains a predicate error, the downstream symbolic reasoning will be formally correct but factually wrong. Robustness to visual noise (poorly drawn diagrams, ambiguous perspectives) is not quantified.
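The failure mode described above can be made concrete with a toy case. In the sketch below, the same valid formula is applied to a correct parse and to a parse with one wrong measurement; the numbers and the misreading scenario are invented for illustration and are not drawn from the paper.

```python
from fractions import Fraction


def pyramid_volume(base_area, height):
    """V = (1/3) * base_area * height, computed exactly."""
    return Fraction(base_area * height, 3)


# Ground truth for an invented problem: square base of area 16, height 6.
true_conditions = {"base_area": 16, "height": 6}

# Hypothetical parsing error: the height label "6" is misread as "9".
misparsed = {"base_area": 16, "height": 9}

v_true = pyramid_volume(**true_conditions)
v_bad = pyramid_volume(**misparsed)

# Both derivations apply the same theorem correctly; only the inputs differ.
print(f"correct parse: V = {v_true}")
print(f"bad parse:     V = {v_bad}")
```

Nothing in the deduction itself flags the second result as wrong, which is why a separately reported parsing error rate would matter for judging end-to-end reliability.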
Furthermore, the theorem bank is not yet fully public — the authors announce code and dataset release, but the exact coverage of implemented theorems will determine the system's edge cases.
**The real test** will be out-of-distribution performance — advanced high school competition problems or undergraduate-level configurations — where geometric setups fall outside the encoded theorem scope. That is where LLMs, despite their weaknesses, retain a flexibility that symbolic systems lack.
Summary generated by Claude — human-verified