arXiv cs.AI

The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs

arXiv paper proposing CLEAR, an optimal budget allocation method for LLM inference grounded in economic theory. Using a shifted-surge utility function and global shadow pricing, CLEAR performs rational abandonment and reallocates resources from insolvent to solvable queries. Results: 3x improvement in global accuracy vs uniform allocation under resource scarcity.

Reasoning · Benchmarks · Infrastructure
SIG 78 · HYP 25

arXiv cs.AI

SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

SkillDAG models inter-skill relationships as a typed directed graph for dynamic LLM agent skill selection at inference time. On ALFWorld and SkillsBench with MiniMax-M2.7, it achieves 67.1% success and 27.3% reward, exceeding Graph-of-Skills baselines by +12.8 and +8.6 points. The graph self-evolves during execution via a propose-then-commit protocol, accumulating structure across episodes.

AI Agents · Reasoning · Benchmarks
SIG 78 · HYP 25

arXiv cs.LG

RESCAST-100K: A Comprehensive Dataset for Cross-Domain Residential Load and Indoor Temperature Forecasting

RESCAST-100K is a benchmark of 100,000 U.S. homes simulated via EnergyPlus/ResStock for evaluating cross-domain generalization in residential energy load and indoor temperature forecasting. The dataset provides 15-minute time series with 40+ static building covariates and integrates 5 real-world datasets. Cross-attention and MLP-mixer models outperform classical transformers under domain shift.

Benchmarks · Fine-tuning · Papers
SIG 78 · HYP 15

arXiv cs.CL

Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics

Method to predict best-of-N inference scaling gains without running the full procedure. Ridge predictor identifies 3 stable features (prompt-level agreement spread, label-assisted first-correct-sample position, completion-length variance) plus entropy, reaching Spearman ρ=0.90 correlation with actual gains across model families and math/reasoning tasks.

Reasoning · Evals · Reinforcement learning
SIG 78 · HYP 15

arXiv cs.AI

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

EvoTrainer co-evolves LLM policies and training harnesses via empirical feedback for autonomous agentic RL. Tested on mathematical reasoning, competitive programming code generation, and software engineering, the system matches or exceeds human-engineered RL baselines, with largest gains on long-horizon agentic SWE tasks.

AI Agents · Reinforcement learning · Code generation
SIG 78 · HYP 25

arXiv cs.LG

Are we really tilting? The mechanics of reward guidance in flow and diffusion models

Reward guidance algorithms steer generative processes toward reward-tilted measures. The paper shows reward hacking stems from finite-particle plug-in estimation of the Doob h-function in practical implementations. Authors propose a closed-form reward damping schedule and validate on Gaussian targets, 2D checkerboard, and FLUX.1 text-to-image generation.

Reinforcement learning · Reasoning · Papers
SIG 78 · HYP 15

Reddit r/MachineLearning

Backpropagation destroys V1 brain alignment in one epoch, tracking RSA alignment to fMRI across training for BP, FA, predictive coding, and STDP [R]

Comparative study of learning rules (backprop, feedback alignment, predictive coding, STDP) via RSA alignment with human V1 fMRI. Backprop destroys roughly 90% of V1 alignment after 1 epoch (r: 0.102→0.011), while PC and STDP lose only 25-31%. At epoch 40, PC and STDP remain far ahead of BP and FA. The results suggest a fundamental trade-off between global error signals (higher layers) and early-layer alignment.

Alignment · Benchmarks · Papers
SIG 78 · HYP 15

Reddit r/MachineLearning

LLM agents patch security bugs, pass all tests, but still leave the vulnerability open [R]

CVE-Bench evaluates 5 frontier models on 20 real-world CVEs (Pillow, GitPython, urllib3, etc.) across 300 runs. Max solve rate 50% (60% under advisory). Agents patch syntactically but leave vulnerabilities open. Significant cross-family gaps (OpenAI vs Laguna, p<0.05), within-family noise. Failure modes: wrong-search drift, hallucinations, context loss.

AI Agents · Benchmarks · AI safety
SIG 78 · HYP 15