Archives

June 2026

485 articles

arXiv cs.CL·

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

A systematic audit of the FOLIO and MALLS benchmarks finds error rates of 39% and 36%, respectively, in their FOL formalizations. The authors release corrected annotations and an LLM-based framework to guide manual relabeling, reaching 90% dataset accuracy by reviewing <24% of instances versus >70% for unguided review. Testing on Gemma 31B, Qwen3-30B, and GPT-4o-mini shows accuracy gains of +9 to +22 percentage points.

Benchmarks · Evals · Reasoning
SIG 82 · HYP 15
arXiv cs.CL·

Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions

Researchers propose an agent economy where AI agents self-coordinate through auctions and payment exchanges without centralized control, inspired by Hayek's economic theory. This approach generates emergent multi-step reasoning strategies and outperforms baselines on five tasks including mathematical reasoning, financial research, and distributed-system optimization.

Multi-agent · AI Agents · Reasoning
SIG 72 · HYP 35
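To make the auction-and-payment idea in the summary above concrete, here is a minimal sketch of one possible mechanism: a sealed-bid second-price auction followed by a payment transfer. The summary does not say which auction format the paper uses, so the second-price rule, the `Bid` type, and the agent names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    agent: str     # hypothetical agent identifier
    amount: float  # what the agent offers to pay for the right to take the task


def run_second_price_auction(bids):
    """Return (winner, price) for a sealed-bid second-price auction.

    The highest bidder wins and pays the second-highest bid,
    or its own bid if it is the only bidder. This rule is an
    assumption, not something taken from the paper.
    """
    if not bids:
        raise ValueError("need at least one bid")
    ranked = sorted(bids, key=lambda b: b.amount, reverse=True)
    winner = ranked[0]
    price = ranked[1].amount if len(ranked) > 1 else winner.amount
    return winner.agent, price


def settle(balances, winner, price, seller):
    """Move `price` from the winner to the seller; returns a new balance dict."""
    updated = dict(balances)
    updated[winner] = updated.get(winner, 0.0) - price
    updated[seller] = updated.get(seller, 0.0) + price
    return updated
```

No central controller picks the winner here: the outcome follows only from the submitted bids, which is the kind of decentralized coordination the summary describes.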
arXiv cs.LG·

Auditable Climate Risk Intelligence from Fragmented ESG Data: Deterministic Orchestration and Imbalance-Aware Learning for Scope 1-3 Validation

Deterministic orchestration framework for validating fragmented ESG data (Scope 1-3) with temporal anomaly detection, imbalance-aware ensemble learning, and audit provenance tracing. Synthetic benchmark calibrated against GHG Protocol, PCAF, ISSB standards. Evaluation on classification, calibration, and provenance chain completeness metrics.

Benchmarks · Evals · Reinforcement learning
SIG 72 · HYP 15
arXiv cs.AI·

ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning

ChatHealthAI aligns structured EHR representations from a pretrained EHR foundation model with a frozen LLM's semantic space via a task-aware resampler. The multimodal framework integrates longitudinal patient representations with refined clinical event descriptions, improving interpretable clinical reasoning while maintaining competitive predictive performance on the EHRSHOT benchmark.

RAG · Reasoning · Evals
SIG 72 · HYP 18
arXiv cs.CL·

The Ghost Annotator: a Framework to Explore Human Label Variation in Content Moderation through Conformal Prediction

Framework combining conformal prediction with a collaborative-filtering-style annotator representation to analyze LLM behavior against human annotators in content moderation. Introduces the Ghost Prediction metric to quantify model-human divergences. Evaluation across 4 LLMs and 4 datasets shows that larger models are more confident on texts with no human alignment, revealing structural demographic bias.

Evals · AI safety · Alignment
SIG 72 · HYP 18
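For readers unfamiliar with conformal prediction, which the summary above builds on, the following is a minimal sketch of the generic split-conformal procedure for classification. It is not the paper's Ghost Prediction metric; the function names, the `toxic`/`ok` label names in the usage note, and the score `1 - p(true label)` are assumptions chosen for illustration.

```python
import math


def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Compute a split-conformal score threshold from calibration data.

    cal_probs:  list of dicts mapping label -> model probability
    cal_labels: list of true labels, same length as cal_probs
    alpha:      target miscoverage rate (0.1 aims for about 90% coverage)
    """
    # Nonconformity score: one minus the probability given to the true label.
    scores = sorted(1.0 - probs[label] for probs, label in zip(cal_probs, cal_labels))
    n = len(scores)
    # Finite-sample corrected rank of the calibration quantile.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points: every label is kept
    return scores[rank - 1]


def prediction_set(probs, threshold):
    """Return every label whose nonconformity score is within the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= threshold}
```

A text whose prediction set contains both `toxic` and `ok` is one the model cannot separate at the chosen coverage level, which is where comparing against disagreeing human annotators becomes informative.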
arXiv cs.CL·

Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics

Method to predict best-of-N inference scaling gains without running the full procedure. Ridge predictor identifies 3 stable features (prompt-level agreement spread, label-assisted first-correct-sample position, completion-length variance) plus entropy, reaching Spearman ρ=0.90 correlation with actual gains across model families and math/reasoning tasks.

Reasoning · Evals · Reinforcement learning
SIG 78 · HYP 15
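The Spearman ρ quoted in the summary above is a rank correlation between predicted and actual gains. A minimal pure-Python sketch of that statistic follows; the helper names are hypothetical and the numbers in the usage note are made up for illustration, not taken from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rho: the Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, `spearman([0.02, 0.10, 0.05, 0.20], [0.01, 0.04, 0.12, 0.30])` compares a hypothetical predictor's gains with observed ones; only the ordering matters, not the magnitudes.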
arXiv cs.AI·

SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

SkillDAG models inter-skill relationships as a typed directed graph for dynamic LLM agent skill selection at inference time. On ALFWorld and SkillsBench with MiniMax-M2.7, it achieves 67.1% success and 27.3% reward, exceeding Graph-of-Skills baselines by +12.8 and +8.6 points. The graph self-evolves during execution via a propose-then-commit protocol, accumulating structure across episodes.

AI Agents · Reasoning · Benchmarks
SIG 78 · HYP 25
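As a rough illustration of a typed skill graph with a propose-then-commit step, as mentioned in the summary above, consider the sketch below. The summary does not describe the actual protocol, so the class name, the edge type strings, and the commit rule used here (accept an edge only if it keeps the graph acyclic) are assumptions.

```python
from collections import defaultdict


class TypedSkillGraph:
    """Directed graph whose edges carry a relation type (hypothetical design)."""

    def __init__(self):
        self.edges = defaultdict(dict)  # src -> {dst: edge_type}
        self.pending = []               # staged (src, dst, edge_type) proposals

    def propose(self, src, dst, edge_type):
        """Stage an edge without changing the graph."""
        self.pending.append((src, dst, edge_type))

    def _reachable(self, start, goal):
        """Depth-first search: is `goal` reachable from `start`?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == goal:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges.get(node, {}))
        return False

    def commit(self):
        """Apply staged edges in order, skipping any that would create a cycle."""
        accepted = []
        for src, dst, edge_type in self.pending:
            if src != dst and not self._reachable(dst, src):
                self.edges[src][dst] = edge_type
                accepted.append((src, dst, edge_type))
        self.pending = []
        return accepted
```

Separating `propose` from `commit` lets an agent collect candidate relations during an episode and validate them in one pass afterwards, so a bad proposal never leaves the graph in an inconsistent state.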
arXiv cs.AI·

DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees

DeltaMem organizes LLM agent experience memory into two residual trees: one stores goal-conditioned tasks as reusable skills, another stores scene-level environment knowledge. Each tree uses root nodes for generalized base experiences and delta nodes for variations, eliminating redundancy. An autonomous consolidation mechanism distills high-frequency paths into new root nodes.

AI Agents · Reasoning · Papers
SIG 75 · HYP 25
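The root-plus-delta layout described in the summary above can be sketched as a small tree in which each child stores only what differs from its parent. This is an illustrative guess at the data structure, not the paper's implementation: the `Node` class, its field names, and the dictionary-diff rule are assumptions, and the sketch cannot express deleting a key.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    payload: dict  # root: full base experience; delta node: only the changes
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: list = field(default_factory=list)

    def resolve(self):
        """Rebuild the full experience by applying deltas from the root down."""
        merged = self.parent.resolve() if self.parent else {}
        merged.update(self.payload)
        return merged

    def add_delta(self, name, full_experience):
        """Attach a child holding only the entries that differ from this node."""
        base = self.resolve()
        diff = {
            k: v
            for k, v in full_experience.items()
            if k not in base or base[k] != v
        }
        child = Node(name, diff, parent=self)
        self.children.append(child)
        return child
```

With a root such as `Node("open_container", {"action": "open", "target": "drawer"})`, adding a variation that only changes `target` stores a single key in the delta node, which is the redundancy saving the summary refers to.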
arXiv cs.AI·

The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs

Proposes CLEAR, an optimal budget allocation method for LLM inference grounded in economic theory. Using a shifted-surge utility function and global shadow pricing, CLEAR performs rational abandonment and reallocates resources from insolvent to solvable queries. Result: a 3x improvement in global accuracy over uniform allocation under resource scarcity.

Reasoning · Benchmarks · Infrastructure
SIG 78 · HYP 25
arXiv cs.LG·

Human-in-the-Loop Contextual Bandits for Short-Term Rental Dynamic Pricing: Structural Equivalence of Historical Warm-Up and Approval-Gated Live Learning

HITL-GB framework for short-term rental dynamic pricing: a contextual bandit algorithm generates price recommendations that a human can accept, modify, or reject. Authors show historical data is structurally equivalent to on-policy warm-up, reducing cold-start from ~150 to ~30 episodes. Validated on 1,461 real nights (April 2022–2026).

AI Agents · Reinforcement learning · Benchmarks
SIG 75 · HYP 15
arXiv cs.AI·

From Long News to Accurate Forecast: Importance-Aware Fusion and PRM-Guided Reflection for Time Series Forecasting

Novel framework combining importance-aware news compression and process-level retrieval supervision for time series forecasting. A reward model estimates each article's forecasting utility for sequential fusion, while a PRM ranks supplementary-news candidates based on error profile. Experiments on finance, energy, traffic, and bitcoin benchmarks show improved accuracy and fewer refinement iterations.

Llama · Reasoning · RAG
SIG 72 · HYP 28
arXiv cs.AI·

DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration

DeskCraft is a desktop GUI benchmark for agents on long-horizon professional workflows (>50 steps) in design, video, audio, and 3D with human-agent collaboration. 18 agents tested on 538 tasks: GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Reveals persistent failures in proactive clarification and long-horizon workflow delivery.

AI Agents · Benchmarks · Evals
SIG 82 · HYP 18
arXiv cs.AI·

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

EvoTrainer co-evolves LLM policies and training harnesses via empirical feedback for autonomous agentic RL. Tested on mathematical reasoning, competitive programming code generation, and software engineering, the system matches or exceeds human-engineered RL baselines, with largest gains on long-horizon agentic SWE tasks.

AI Agents · Reinforcement learning · Code generation
SIG 78 · HYP 25