arXiv cs.AI

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

Signal: 82 · Hype: 15
In three lines: PRISM is a benchmark of 10,372 instruction-code pairs for evaluating programmatic video generation by LLMs. It proposes four metrics: code reliability, spatial coherence, visual complexity, and temporal density. An evaluation of 7 LLMs reveals a 41-percentage-point execution-spatial gap: executable code does not guarantee spatially coherent output.

## PRISM: when code executability is no longer enough

### 1. What's being announced

PRISM (arXiv:2605.19382) is a benchmark of 10,372 human-calibrated instruction-code pairs designed to evaluate LLMs on programmatic video generation — code that produces geometrically precise animations rather than pixel-level outputs from diffusion models. The dataset spans 437 subject categories, is bilingual (English and Chinese), and is positioned as 20× larger than prior benchmarks in this sub-domain. It introduces four metrics organized as a funnel: Code-Level Reliability (raw executability), Spatial Reasoning (layout correctness across the full animation sequence), Prompt-Aware Dynamic Visual Complexity (PADVC), and Temporal Density (TD).
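To illustrate what "layout correctness across the full animation sequence" can mean in practice, here is a minimal sketch of a frame-by-frame constraint check. It is not PRISM's actual scoring procedure, which the abstract does not describe; the `Box` type, the helper names, and the toy frames are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in scene coordinates (hypothetical type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def left_of(a, b):
    """True if box a lies entirely to the left of box b."""
    return a.x_max <= b.x_min


def overlaps(a, b):
    """True if the two boxes share interior area."""
    return (a.x_min < b.x_max and b.x_min < a.x_max
            and a.y_min < b.y_max and b.y_min < a.y_max)


def holds_in_every_frame(frames, constraint):
    """Check a pairwise constraint on ("circle", "square") in each sampled frame."""
    return all(constraint(f["circle"], f["square"]) for f in frames)


# Three sampled frames: the circle drifts right and ends up overlapping the square.
frames = [
    {"circle": Box(0, 0, 1, 1), "square": Box(3, 0, 4, 1)},
    {"circle": Box(1, 0, 2, 1), "square": Box(3, 0, 4, 1)},
    {"circle": Box(2.5, 0, 3.5, 1), "square": Box(3, 0, 4, 1)},
]

# The instruction "circle stays left of the square" fails on the last frame.
print(holds_in_every_frame(frames, left_of))
print(any(overlaps(f["circle"], f["square"]) for f in frames))
```

The script that produced these frames would have run without error, which is the point: an executability check alone would never see the final-frame violation.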

### 2. The number that matters: the 41-point execution-spatial gap

Evaluation of 7 mainstream LLMs reveals that execution success rate and spatial pass rate diverge by an average of **41 percentage points**. A model can produce syntactically valid, runnable code while generating visually incoherent output — mispositioned objects, disordered temporal sequences, layouts that violate the spatial constraints of the instruction.
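The arithmetic behind such a gap can be sketched as follows. The record fields (`executed`, `spatial_ok`) and the toy data are assumptions for illustration, not PRISM's schema or results. The sketch also assumes both rates are taken over all samples, which the abstract does not confirm.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    executed: bool    # hypothetical field: generated code ran without error
    spatial_ok: bool  # hypothetical field: rendered layout satisfied the instruction


def execution_spatial_gap(results):
    """Return (execution rate, spatial pass rate, gap), all in percentage points."""
    if not results:
        raise ValueError("results must be non-empty")
    n = len(results)
    exec_rate = 100.0 * sum(r.executed for r in results) / n
    # A sample can only pass the spatial check if its code actually ran.
    spatial_rate = 100.0 * sum(r.executed and r.spatial_ok for r in results) / n
    return exec_rate, spatial_rate, exec_rate - spatial_rate


# Toy data: 10 samples, 9 run, 5 of those are spatially correct.
toy = (
    [SampleResult(True, True)] * 5
    + [SampleResult(True, False)] * 4
    + [SampleResult(False, False)] * 1
)
exec_rate, spatial_rate, gap = execution_spatial_gap(toy)
print(f"execution {exec_rate:.0f}%, spatial {spatial_rate:.0f}%, gap {gap:.0f} points")
```

In this toy case the gap is 40 percentage points (90 minus 50), while the relative drop is about 44% (40 out of 90). The two readings differ, which is why the unit matters when quoting the headline figure.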

Before PRISM, the state of the art in programmatic evaluation essentially stopped at executability: the code runs or it doesn't. Existing benchmarks (significantly smaller, monolingual, with narrow thematic coverage) did not distinguish between "the code executes" and "the code produces what was visually requested." This 41-point gap quantifies for the first time the scale of this competence illusion: on average, an executability score overstates the share of spatially correct outputs by about 41 percentage points, so LLM rankings based on executability alone are potentially misleading.
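A small sketch shows how a ranking can shift once a spatial metric is added. The model names and scores below are made up for illustration and are not taken from the paper; `rank_by` and `discordant_pairs` are hypothetical helpers.

```python
# Hypothetical scores in percent; model names and numbers are made up.
scores = {
    "model_a": {"exec": 95.0, "spatial": 48.0},
    "model_b": {"exec": 88.0, "spatial": 61.0},
    "model_c": {"exec": 91.0, "spatial": 39.0},
}


def rank_by(metric):
    """Model names sorted from best to worst on the given metric."""
    return sorted(scores, key=lambda name: scores[name][metric], reverse=True)


def discordant_pairs(order_x, order_y):
    """Count model pairs that the two rankings order differently."""
    pos_y = {name: i for i, name in enumerate(order_y)}
    count = 0
    for i, first in enumerate(order_x):
        for second in order_x[i + 1:]:
            if pos_y[first] > pos_y[second]:
                count += 1
    return count


by_exec = rank_by("exec")
by_spatial = rank_by("spatial")
print("by execution:", by_exec)
print("by spatial:  ", by_spatial)
print("pairs ordered differently:", discordant_pairs(by_exec, by_spatial))
```

With these invented numbers, the model that ranks last on executability ranks first on spatial correctness, and two of the three model pairs swap order.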

### 3. Why programmatic video, and why now

Diffusion-based video generation (Sora, Runway, Kling) excels at perceptual realism but fails on geometric precision and controlled temporal coherence. For use cases like scientific visualization, educational animations, dynamic diagrams, or data simulations, code (Manim, matplotlib animations, Three.js, etc.) remains the only reliable vector for spatial precision. LLMs are increasingly tasked with generating this code from natural language instructions — hence the urgency of a benchmark that measures visual output quality, not just syntax.

The bilingual scope (EN/ZH) and the 437 categories also signal an ambition of encyclopedic coverage: PRISM aims to be the standard reference, not a niche benchmark.

### 4. Potential losers and limitations

**Models ranked highly on existing executability benchmarks** are most exposed: if their lead rests on syntactic reliability rather than spatial coherence, PRISM would push them down the rankings. Teams that have tuned their fine-tuning runs or system prompts around classic pass@k metrics will need to rework their evaluation pipelines.

**Publishers of programmatic video generation frameworks** (Manim foremost) see their ecosystem become a standardized evaluation ground — which may accelerate adoption but also expose API limitations under complex instructions.

**Notable methodological limitation**: PRISM is human-calibrated, which supports pair quality but may introduce selection bias in which scenario types are deemed representative. The PADVC metric (dynamic visual complexity) remains the hardest to interpret without implementation details: the abstract does not spell out exactly how it is computed.

**Temporal Density** as a standalone metric is promising but not validated against human judgments of temporal quality in the published abstract — a point to verify in the full paper.

In summary: PRISM raises the bar for evaluating LLMs on visual code generation. The 41-point gap reads less like a benchmark artifact than like the measure of a systematic blind spot in current evaluation practice. Any lab publishing LLM performance on visual generative code tasks without a spatial metric risks structurally underestimating its error rate.


Summary generated by Claude — human-verified