arXiv cs.LG·

PASC: Pipeline-Aware Conformal Prediction with Joint Coverage Guarantees for Multi-Stage NLP and LLM Pipelines

PASC is a conformal prediction method that guarantees simultaneous coverage across all stages of multi-stage NLP pipelines (NER → NED → entity typing, RAG, agent chains). On CoNLL-2003, PASC achieves 96.4% end-to-end coverage, versus 93.4% for Bonferroni correction and 86.5% for independent CP, runs 1.7x faster, and remains robust under distribution shift (WNUT-17, WikiNEuRal).

Evals, Reasoning, AI Agents
SIG 78 · HYP 15
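
The Bonferroni baseline named in the PASC entry above can be illustrated with a minimal sketch. This is not PASC itself, whose procedure the summary does not describe. It assumes standard split conformal prediction with one list of calibration nonconformity scores per pipeline stage; the function names `conformal_quantile` and `bonferroni_thresholds` are hypothetical. Calibrating each of K stages at level alpha / K gives joint coverage of at least 1 - alpha by a union bound, at the cost of larger prediction sets.

```python
import math


def conformal_quantile(scores, alpha):
    """Split-conformal threshold for miscoverage level alpha.

    Returns the ceil((n + 1) * (1 - alpha))-th smallest calibration score,
    or infinity when the calibration set is too small for this alpha.
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


def bonferroni_thresholds(stage_scores, alpha):
    """One threshold per stage, each calibrated at alpha / K.

    stage_scores: list of K lists of calibration nonconformity scores.
    A union bound over the K stages then gives joint coverage >= 1 - alpha.
    """
    k = len(stage_scores)
    return [conformal_quantile(scores, alpha / k) for scores in stage_scores]
```

Calibrating every stage at alpha directly (the "independent CP" setting) would use `conformal_quantile(scores, alpha)` per stage, which yields tighter sets but no joint guarantee.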
arXiv cs.AI·

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

Formal Skill is a runtime abstraction for LLM agents that structures reusable capabilities via JSON metadata, action schemas, Python executors, and hook-governed control logic. Implemented in FairyClaw (open-source event-driven runtime), it replaces natural-language procedures with executable state machines, reducing token usage while improving reliability on Harness-Bench.

AI Agents, MCP, Code generation
SIG 78 · HYP 25
arXiv cs.LG·

Not All Tokens Are Worth Caching: Learning Semantic-Aware Eviction for LLM Prefix Caches

SAECache introduces a semantic-aware eviction policy for LLM prefix caches. Not all tokens are equally worth caching: different token types (system prompts, user queries, tool outputs, reasoning) show up to 756x variation in reuse rates. SAECache uses a multi-queue architecture with online learning to adapt priorities, achieving 1.4x-2.7x TTFT improvement over production baselines.

Reasoning, Infrastructure, Benchmarks
SIG 78 · HYP 15
arXiv cs.LG·

The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

An analysis of 34 frontier models (2024-2026) shows that reasoning and coding capabilities cooperate (r=+0.72), but the balance between them varies by lab. DeepSeek shifted from reasoning-rich to coding-first (+11.2→-4.7), Google maintains balance, and Anthropic oscillates. SWE-bench is saturating, while HLE and instruction-following remain discriminative. The paper offers seven falsifiable predictions for the next 12 months, along with an interactive dashboard.

Benchmarks, Evals, Reasoning
SIG 78 · HYP 22
arXiv cs.AI·

POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

POLAR-Bench is a diagnostic benchmark assessing privacy-utility trade-offs in LLM agents. A trusted model with a privacy policy interacts with an adversarial third-party model across 10 domains and 7,852 samples. Frontier models withhold 99% of protected attributes, but open-weight models in the 1–30B range, commonly used for on-device private inference, leak up to 50% of sensitive data.

AI Agents, AI safety, Alignment
SIG 78 · HYP 25
arXiv cs.LG·

Reducing Credit Assignment Variance via Counterfactual Reasoning Paths

Researchers introduce IBPO (Implicit Behavior Policy Optimization), a credit assignment method for reinforcement learning with LLMs. By comparing multiple reasoning trajectories, the framework transforms sparse terminal rewards into step-sensitive learning signals, reducing gradient variance and improving stability on mathematical and code reasoning benchmarks.

Reinforcement learning, Reasoning, Code generation
SIG 78 · HYP 25
arXiv cs.LG·

AdaGraph: A Graph-Native Clustering Algorithm That Overcomes the Curse of Dimensionality and Enables Scientific Discovery

AdaGraph is a graph-native clustering algorithm that overcomes the curse of dimensionality by operating on kNN graph topology instead of Euclidean distances. Tested on 10 synthetic benchmarks (d=10 to 5000) and three scientific domains (genomics, NLP, materials science), it outperforms HDBSCAN, WGCNA, and other methods without requiring k specification.

Benchmarks, Papers
SIG 78 · HYP 35
arXiv cs.LG·

DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models

DACA-GRPO improves reinforcement learning for diffusion language models by addressing temporal credit assignment and mean-field likelihood bias. It introduces Denoising Progress Scores and Stratified Masking Likelihood, achieving gains up to 7.4pp on code generation and 36.3pp on constraint satisfaction across seven benchmarks.

Reinforcement learning, Reasoning, Code generation
SIG 78 · HYP 15
arXiv cs.AI·

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

The study reveals a dataset visibility asymmetry in multilingual NLP: 118 languages (59% of the 200 most-spoken) have zero catalogued datasets in the LRE Map and LDC. Using LLM-assisted citation mining on Semantic Scholar, the authors identify 609 unique datasets across 53 low-visibility languages, of which 356 are publicly accessible. Data scarcity is therefore a documentation and discoverability issue, not just a production issue.

Benchmarks, Open source, Papers
SIG 78 · HYP 15
arXiv cs.AI·

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

BacktestBench is the first large-scale benchmark for automated quantitative backtesting, containing 18,246 annotated QA pairs from 6 million real market records. AutoBacktest, a multi-agent system, translates natural language strategies into reproducible backtests via Summarizer-Retriever-Coder coordination. Evaluation on 23 LLMs identifies key performance factors.

AI Agents, Multi-agent, Code generation
SIG 78 · HYP 25
arXiv cs.AI·

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

WebGameBench is a requirement-to-application benchmark evaluating whether coding agents can convert a web game specification into a browser-playable application. Across 111 tasks and 12 agents, the best configuration achieves 76.9% usable rate but only 20.2% excellent rate, revealing a gap between minimum delivery and full requirement satisfaction.

AI Agents, Code generation, Benchmarks
SIG 78 · HYP 25
arXiv cs.AI·

Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

Researchers identify Entropy-Gradient Inversion, a negative correlation between token entropy and logit gradients, as a geometric fingerprint of Large Reasoning Models' reasoning capability. They propose Correlation-Regularized Group Policy Optimization (CorR-PO), embedding this inversion signature into RL reward regularization, outperforming baselines across multiple reasoning benchmarks.

Reasoning, Reinforcement learning, Benchmarks
SIG 78 · HYP 15
arXiv cs.AI·

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

PROTEA is an interface for offline debugging and refinement of multi-agent LLM workflows. It evaluates intermediate outputs with configurable rubrics, localizes bottlenecks via workflow graph visualization, and generates targeted prompt revisions. On two production-adjacent workflows, PROTEA improves document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38.

Multi-agent, AI Agents, Prompt engineering
SIG 78 · HYP 18
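
The Hit@5 figure quoted in the PROTEA entry above can be read through a short sketch of the metric. The summary does not give the paper's exact definition, so this assumes the common one: the fraction of queries for which at least one relevant item appears among the top k recommendations. The function name `hit_at_k` and the data layout are hypothetical.

```python
def hit_at_k(ranked_lists, relevant_sets, k=5):
    """Fraction of queries with at least one relevant item in the top k.

    ranked_lists: one ranked list of recommended items per query.
    relevant_sets: one set of relevant items per query, in the same order.
    """
    if not ranked_lists or len(ranked_lists) != len(relevant_sets):
        raise ValueError("need one non-empty, equal-length entry per query")
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(item in relevant for item in ranked[:k])
    )
    return hits / len(ranked_lists)
```

Under this definition, a move from 0.30 to 0.38 means that 8 more queries out of every 100 surface a relevant item in the top five.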