Archives

May 2026

3147 articles

arXiv cs.LG·

AirCast-SR: A Foundation Model for Kilometer-Scale Atmospheric Super-Resolution via Latent Consistency Diffusion

AirCast-SR is an atmospheric super-resolution foundation model that downscales global AI weather forecasts from 28 km to 1 km horizontal resolution. Built on a 3D U-Net conditioned within a Latent Consistency Model diffusion framework, trained on GraphCast forecasts and NOAA data, it produces 67-hour forecasts with near-zero bias and demonstrates zero-shot global transferability to India and Germany.

Papers · Benchmarks · Open source
SIG 78 · HYP 25
arXiv cs.LG·

On the Role of Inductive Bias in Time-Series Pretraining: A Case Study in Learning Generalizable Representations for Clinical Time Series

PathoFM, an encoder-centric transformer pretrained on clinical time series (pathological gait analysis for spinal cord injury), combines three objectives: Local Completion, Temporal Continuity, and Unsupervised In-Context Dynamics. The study shows that dynamics-centric objectives produce the most balanced transferable representations across classification and regression tasks.

Papers · Reasoning · Fine-tuning
SIG 72 · HYP 18
arXiv cs.CL·

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

A mechanistic analysis of LLM hallucinations on linearized structured knowledge (graphs, tables). Hallucinations stem from systematic internal dynamics: attention concentrates disproportionately on shortcut structural cues, feed-forward representations fail to ground the provided knowledge, and the model reverts to parametric memory. These patterns generalize to multi-hop graphs and tabular data.

Reasoning · Papers · AI safety
SIG 78 · HYP 15
arXiv cs.AI·

Advancing Creative Physical Intelligence in Large Multimodal Models

MM-CreativityBench, a new benchmark, evaluates large multimodal models' ability to solve creative problems by identifying non-obvious object uses in physically constrained environments. Current LMMs fail due to insufficient grounded exploration and hallucinations. Affordance-grounded alignment via Direct Preference Optimization reduces these errors and improves entity selection.

Benchmarks · Vision · Reasoning
SIG 75 · HYP 25
arXiv cs.CL·

FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing

FAB-Bench is an adaptive benchmarking framework for evaluating RAG systems in semiconductor manufacturing. It defines 6 diagnostic metrics (factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, reasoning consistency) across context windows of 4K-32K tokens. The benchmark comprises 200 query-answer pairs and is tested on 4 LLMs and 4 RAG frameworks.

RAG · Benchmarks · Evals
SIG 78 · HYP 15
arXiv cs.CL·

Hubness, Not Anisotropy, Drives Cross-Lingual Retrieval Asymmetry in Multilingual Embedding Models

A study of cross-lingual retrieval asymmetry in 5 multilingual embedding models from Gemini, Mistral, OpenAI, and Qwen, analyzing 6,518 idiomatic expressions in English, Bengali, Hindi, and Arabic. Finding: hubness (a few vectors appearing as nearest neighbors of disproportionately many queries) is the dominant causal driver (49.5% dominance share), far exceeding anisotropy. CSLS correction closes 63.5% of the reciprocity gap.
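
For readers unfamiliar with the correction named in this summary: CSLS (cross-domain similarity local scaling) rescales each similarity by how close both vectors are to their neighbors in general, which penalizes hubs. The following is a minimal pure-Python sketch under stated assumptions: cosine similarity, a toy 2D dataset, and illustrative function names (`csls_matrix`, `nearest`) that are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_top_k(values, k):
    """Mean of the k largest values (or of all values if fewer than k)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)

def csls_matrix(src, tgt, k=2):
    """CSLS(x, y) = 2*cos(x, y) - r_tgt(x) - r_src(y), where r_* is the mean
    similarity of a vector to its k nearest neighbors on the other side."""
    sim = [[cosine(s, t) for t in tgt] for s in src]
    r_src = [mean_top_k(row, k) for row in sim]
    r_tgt = [mean_top_k([sim[i][j] for i in range(len(src))], k)
             for j in range(len(tgt))]
    return [[2 * sim[i][j] - r_src[i] - r_tgt[j] for j in range(len(tgt))]
            for i in range(len(src))]

def nearest(scores):
    """Index of the best-scoring target for each source row."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

# Toy data (hypothetical): target 0 is a hub that sits between both sources.
sources = [(1.0, 0.0), (0.0, 1.0)]
targets = [(1.0, 1.0), (0.6, -0.8), (-0.8, 0.6)]

cos_nn = nearest([[cosine(s, t) for t in targets] for s in sources])
csls_nn = nearest(csls_matrix(sources, targets, k=2))
print(cos_nn)   # [0, 0]: both sources retrieve the hub under plain cosine
print(csls_nn)  # [1, 2]: CSLS demotes the hub and separates the two sources
```

In this toy case the hub wins every plain-cosine lookup because it is moderately close to everything; subtracting the neighborhood averages removes that advantage.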

Embeddings · Benchmarks · Multi-agent
SIG 82 · HYP 15
arXiv cs.LG·

Provably Communication-Efficient and Privacy-Preserving Federated Graph Neural Networks

CE-FedGNN is a federated framework for graph neural networks that reduces communication by infrequently exchanging aggregated node representations instead of per-round embeddings. A moving-average estimator handles cross-client dependencies and staleness. The framework provides privacy guarantees via metric-DP and achieves O(1/√T) convergence with O(T^(3/4)) communication complexity.

SIG 78 · HYP 15
arXiv cs.LG·

The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models

A study of the cost of structured output constraints for small language models (under 3B parameters). Tests on Qwen2.5-0.5B/1.5B and SmolLM2-1.7B show that enforcing JSON schema validity (61.5% → 100%) reduces answer accuracy (19.7% → 11.0%) and increases semantically invalid outputs (49.5% → 88.9%). Recommendation: report schema validity, answer accuracy, and semantic error rates separately.
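
The recommendation to report the three rates separately can be sketched as follows. This is an illustrative example only: the toy schema (`{"answer": <string>}`), the `ALLOWED` label set, and the definition of a "semantic error" as a schema-valid answer outside that set are all assumptions made here, not the paper's definitions.

```python
import json

ALLOWED = {"A", "B", "C", "D"}  # hypothetical label set for a multiple-choice task

def parse_valid(raw):
    """Return the parsed object if it matches the toy schema {"answer": str}, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and isinstance(obj.get("answer"), str):
        return obj
    return None

def report(samples):
    """samples: list of (raw_model_output, gold_answer) pairs.
    Returns the three rates as separate numbers, each over all samples."""
    n = len(samples)
    valid = correct = semantic_bad = 0
    for raw, gold in samples:
        obj = parse_valid(raw)
        if obj is None:
            continue  # not schema-valid; counts against validity only
        valid += 1
        if obj["answer"] not in ALLOWED:
            semantic_bad += 1  # well-formed JSON, meaningless content
        elif obj["answer"] == gold:
            correct += 1
    return {
        "schema_validity": valid / n,
        "answer_accuracy": correct / n,
        "semantic_error_rate": semantic_bad / n,
    }

samples = [
    ('{"answer": "A"}', "A"),      # valid and correct
    ('{"answer": "maybe"}', "B"),  # valid JSON, semantically invalid
    ("not json", "C"),             # fails the schema
    ('{"answer": "D"}', "C"),      # valid but wrong
]
print(report(samples))
```

Keeping the counters separate means a jump in `schema_validity` cannot hide a drop in `answer_accuracy`, which is the tradeoff the summary describes.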

Qwen · Code generation · Evals
SIG 78 · HYP 15
arXiv cs.AI·

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

Models with statistically indistinguishable atomic knowledge fail systematically to chain those facts in multi-hop reasoning (a gap of more than 40 percentage points). Aggregate metrics mask this 'composition collapse'. The authors introduce a double-gate protocol that decomposes post-training gains into three independent channels: atomic stability, residual composition, and critical depth.

Reasoning · Benchmarks · Evals
SIG 78 · HYP 15
arXiv cs.CL·

Evidence Absence Is Not Evidence Insufficiency: Diagnosing NEI Construction Artifacts in Fact Verification

NEI-CAP is a diagnostic protocol for auditing how "Not Enough Information" (NEI) labels are constructed in fact verification benchmarks. The researchers show that NEI competence does not transfer reliably across constructions: models trained on shortcut-prone evidence conditions fail to recognize semantically related but insufficient evidence. Tested on SciFact, FEVER, and HoVer.

Benchmarks · Evals · Papers
SIG 72 · HYP 15
arXiv cs.LG·

Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection

CoAD, a novel framework for time series anomaly detection, unifies classification (Outlier Exposure) and reconstruction (Masked Autoencoder) paradigms. The classification module generates probability-informed soft masks for the reconstruction module, addressing generalization and masking misalignment issues. Experiments on standard benchmarks demonstrate significant improvements with faster inference.

Benchmarks · Papers
SIG 72 · HYP 28
arXiv cs.AI·

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

LiveK12Bench is a dynamic multi-disciplinary benchmark that evaluates the reasoning capabilities of multimodal models on 2K+ real exam questions (Math, Physics, Chemistry, Biology). Tests reveal major performance degradation: GPT-5 drops from 79 to 53 out of 100 under realistic exam constraints. The framework includes an automated anti-contamination pipeline and an end-to-end 'Mock Exam' evaluation scheme.

Benchmarks · Vision · Reasoning
SIG 78 · HYP 25
arXiv cs.AI·

MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

MedGuideX transforms clinical practice guideline (CPG) recommendations into executable decision logic to generate question-answering training data. Post-training a medical LLM on this data improves accuracy by 10.28% across four clinical reasoning benchmarks and produces physician-preferred rationales in faithfulness, validity, completeness, and clarity.

Fine-tuning · Reasoning · Evals
SIG 78 · HYP 22
arXiv cs.AI·

From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

Theoretical and empirical work on training LLM-based dialogue agents. It identifies context distribution shift as a fundamental limitation of both Static Context RL and Interactive RL, and proposes Calibrated Interactive RL, which combines interactive RL with simulator alignment to reduce the sim-to-real gap and improve multi-turn dialogue quality.

Reinforcement learning · AI Agents · Reasoning
SIG 72 · HYP 18
arXiv cs.CL·

The Need for an External Observer Formalizing the Sufficiency Gap: A Mathematical Extension of Mixture Identifiability and Contextual Grounding in Sequence Models

A theoretical paper on the insufficiency of sequence models when facing unobserved latent states. The authors formalize a mixed-regime process in which even a perfect predictor becomes overconfident if the observed context matches the wrong latent regime. They show that the sufficiency gap can be closed only by perfect revelation of the latent state or an equivalent verification mechanism.

Reasoning · Alignment · AI safety
SIG 72 · HYP 15
Reddit r/LocalLLaMA·

Stop traumatizing AI into loops and turn hallucinations into an honest "I don't know!" by being NICE to them (Proof of Concept, Research, I don't want to sell anything)

A researcher tests the hypothesis that 'authoritarian' prompts ('IQ 200 expert') trigger thought loops in AI models, comparable to chronic stress, while 'gentle' prompts ('it's okay to fail') reduce latency and increase honest 'I don't know' responses. Reported results on Gemini, Mistral, and Claude Haiku 4.5: less confabulation and faster responses.

Prompt engineering · Reasoning · AI safety
SIG 45 · HYP 65