arXiv cs.CL

AstroMind: A High-Fidelity Benchmark for Spacecraft Behavior Reasoning Based on Large Language Models

Signal: 75, Hype: 15
In three lines: AstroMind is a benchmark for evaluating LLM reasoning about spacecraft behavior. Built on high-fidelity astrodynamics simulations, it tests intent inference, maneuver parameter estimation, and threat assessment. Qwen3 (32B) leads on intent inference, QwQ (32B) leads on threat assessment, and GPT-OSS (20B) produces the strongest reasoning quality.
Tags: Benchmarks, Reasoning, Qwen, GPT

Summary generated by Claude — human-verified