Page 23 of 192

7679 articles

Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration

COSMO-Agent, a tool-augmented RL framework, trains LLMs to orchestrate iterative CAD-CAE processes. The system learns to generate parametric geometry, solve simulations, and revise designs under multiple constraints. The work includes an industry-aligned dataset covering 25 component categories. The trained small LLMs outperform large open-source and closed-source models in feasibility and stability.

AI Agents Reinforcement learning Tools

arXiv cs.LG·May 22

ConTact: Contact-First Antibody CDR Design via Explicit Interface Reasoning

ConTact explicitly decomposes antibody CDR design into three stages: surface complementarity fingerprints, CDR-antigen contact prediction, and contact-gated feature injection. On CHIMERA-Bench, the model achieves a 7% RMSD improvement, a 10% F1 gain in epitope awareness, and 0.38 AAR sequence recovery over baselines.

Papers Benchmarks Reasoning

arXiv cs.AI·May 22

ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

ScenePilot generates critical scenarios for autonomous driving testing via multi-objective reinforcement learning. The framework combines RSS-derived physical feasibility with an AV-risk predictor to target boundary-band scenarios: cases that are physically solvable yet still cause failures. Results: the collision rate on SafeBench rises by 6.2 percentage points while physical validity is preserved.

Reinforcement learning AI safety Evals

arXiv cs.CL·May 22

ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination

ArabDiscrim is a corpus of 293K Arabic Facebook posts (2014-2024) on racism and discrimination. It includes 200 curated terms with morphological families (13+ inflections), 20 discrimination axes, and native engagement signals (reactions, shares, comments). It is released under a restricted research-use license for ethical compliance.

Benchmarks AI safety Alignment

arXiv cs.CL·May 22

Token-weighted Direct Preference Optimization with Attention

Token-weighted DPO (TwDPO) and AttentionPO introduce preference optimization that weights tokens by importance. AttentionPO uses the model's own attention to estimate weights without a separate reward model. Results: improvements on AlpacaEval, MT-Bench, ArenaHard.

Reinforcement learning Alignment Benchmarks

arXiv cs.CL·May 22

ACC: Compiling Agent Trajectories for Long-Context Training

ACC converts agent trajectories (search, software engineering, database querying) into long-context QA pairs for SFT. It removes tool-response masking and creates explicit supervision over distant dependencies. Qwen3-30B-A3B achieves +18.1 on MRCR and +7.6 on GraphWalks, comparable to Qwen3-235B.

AI Agents Reasoning Fine-tuning

arXiv cs.CL·May 22

Does Slightly Mean Somewhat? Measuring Vague Intensity Words in LLM Numeric Actions

A study across 6,620 runs shows that Claude Haiku compresses 10 English intensity modifiers into 5 distinct outputs. System state context dominates the lexical effect (explained variance: 0.782 vs 0.079). Near operational boundaries, the model exhibits three modes: small adjustments for weak words, abstention for strong words, and ceiling-pushing for 'drastically'.

Claude Evals Reasoning

arXiv cs.CL·May 22

When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering

OGCaReBench is a retrieval-focused benchmark evaluating LLMs on off-guideline clinical questions extracted from published medical case reports. GPT-5.2 achieves 56% without retrieval, 82% with retrieved medical articles. Specialized models reach only 42%.

Benchmarks RAG Reasoning

arXiv cs.CL·May 22

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

RankJudge is a benchmark generator for evaluating LLMs-as-judges on multi-turn conversations grounded in reference documents. The system creates conversation pairs with a single flaw injected into one turn, enabling unambiguous labeling. The evaluation ranks 21 frontier LLM judges with a Bradley-Terry model across the machine learning, biomedicine, and finance domains.

Evals Benchmarks Multi-agent
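
For readers unfamiliar with Bradley-Terry ranking, here is a minimal sketch of how pairwise outcomes can be turned into a ranking. It is not RankJudge's code: the function name `bradley_terry`, the judge labels, and the win counts are all hypothetical, and the summary does not say how the paper derives its pairwise comparisons. The fit uses the standard iterative (minorization-maximization) update for Bradley-Terry strengths.

```python
from collections import defaultdict


def bradley_terry(wins, n_iter=200):
    """Fit Bradley-Terry strengths from pairwise win counts.

    wins maps (winner, loser) -> number of wins (hypothetical data).
    Returns a dict of strengths normalized to sum to 1.
    Assumes every item has at least one win; otherwise its strength
    collapses to zero and the model is not well identified.
    """
    items = sorted({name for pair in wins for name in pair})
    total_wins = defaultdict(float)
    games = defaultdict(float)  # unordered pair -> number of comparisons
    for (winner, loser), count in wins.items():
        total_wins[winner] += count
        games[frozenset((winner, loser))] += count

    p = {i: 1.0 for i in items}
    for _ in range(n_iter):
        new_p = {}
        for i in items:
            # Sum over opponents j of n_ij / (p_i + p_j)
            denom = sum(
                games[frozenset((i, j))] / (p[i] + p[j])
                for j in items
                if j != i and frozenset((i, j)) in games
            )
            new_p[i] = total_wins[i] / denom if denom > 0 else p[i]
        norm = sum(new_p.values())
        p = {i: v / norm for i, v in new_p.items()}
    return p


# Hypothetical pairwise results between three judges (10 comparisons per pair).
example_wins = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("B", "C"): 7, ("C", "B"): 3,
    ("A", "C"): 9, ("C", "A"): 1,
}
strengths = bradley_terry(example_wins)
ranking = sorted(strengths, key=strengths.get, reverse=True)
```

With these made-up counts, `ranking` orders the judges from strongest to weakest; the strengths themselves are only meaningful relative to each other.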

arXiv cs.CL·May 22

Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews

Sem-Detect detects AI-generated peer reviews by combining textual features with claim-level semantic analysis. The method compares a target review against multiple AI-generated reviews of the same paper, exploiting the convergence of AI models versus the diversity of human reviewers. On 20,000+ ICLR/NeurIPS reviews, Sem-Detect improves on the strongest baseline by 25.5% in TPR at 0.1% FPR.

Evals AI safety Papers

arXiv cs.AI·May 22

Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum

arXiv paper on data scaling laws: progressive coverage of a latent predictive contribution spectrum (via suffix-automaton representation) strongly correlates with empirical scaling exponent. Across 12 real corpora, log K(N) shows near-linear relationship with log N (R²≈0.96), suggesting training advances an effective frontier through a predictive state spectrum.

Benchmarks Papers Reasoning

arXiv cs.AI·May 22

DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

DeepWeb-Bench is a deep research benchmark evaluating 9 frontier models on tasks requiring massive evidence collection, cross-source reconciliation, and long-horizon multi-step derivation. Errors stem primarily from derivation and calibration (>70%), not retrieval (12-14%). Strong and weak models fail differently: incomplete derivation vs hallucinated precision.

Benchmarks Reasoning AI Agents

arXiv cs.CL·May 22

SpecHop: Continuous Speculation for Accelerating Multi-Hop Retrieval Agents

SpecHop accelerates multi-hop retrieval agents by maintaining multiple speculative threads with faster but less reliable tools, asynchronously verifying predictions and committing/rolling back branches. The framework reduces latency by up to 40% while preserving accuracy and final trajectory, approaching oracle latency gains with sufficient active threads.

AI Agents RAG Reasoning

arXiv cs.AI·May 22

PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

PALS is an LLM inference optimization system integrated into vLLM that treats GPU power caps as a controllable parameter. By combining offline power-performance models with feedback-driven control, it improves energy efficiency by up to 26.3% and reduces QoS violations by 4x to 7x across dense and mixture-of-experts models.

Infrastructure Benchmarks Tools

arXiv cs.AI·May 22

AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions

AutoRPA distills decision logic from LLM agents (ReAct paradigm) into robust RPA functions using a translator-builder pipeline and retrieval-augmented generation. On repetitive GUI tasks, generated functions reduce token usage by 82–96% while maintaining performance.

AI Agents Code generation Reasoning

ActuIA·May 21

Anthropic rents Colossus 1 for $1.25B/month from an xAI fleet that tops out at 11% capacity

Anthropic leases Colossus 1, xAI's supercomputer, for $1.25B/month through May 2029 ($40B+ total). The contract caps Anthropic's access at 11% of cluster capacity, restricting the company to a fraction of available resources.

Anthropic Infrastructure

Reddit r/LocalLLaMA·May 21

Agent Execution Tax: new procurement metric for browser agent benchmarks?

WebVoyager benchmark on 720 browser agent tasks: MiniMax M2.5 costs 2.3× less per successful task than Gemini 2.5 Flash. GLM-5 achieves 57.1% accuracy, Kimi K2.5 shows 0% parse retry rate. Open-weight models outperform Gemini not through intelligence but reliability. True cost exceeds per-token pricing once retries compound.

AI Agents Benchmarks Open source
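
The point that retries push true cost above per-token pricing can be illustrated with a small calculation. This is only a sketch under stated assumptions, not the post's metric definition: the function `cost_per_success`, the independence assumption on retries, and all numbers below are hypothetical.

```python
def cost_per_success(cost_per_call: float, success_rate: float, retry_rate: float) -> float:
    """Expected spend per successfully completed task.

    Assumes each call independently needs a retry with probability
    retry_rate, so the expected number of calls per task is
    1 / (1 - retry_rate). Spend on failed tasks is amortized over
    the successful ones by dividing by success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    if not 0 <= retry_rate < 1:
        raise ValueError("retry_rate must be in [0, 1)")
    expected_calls = 1.0 / (1.0 - retry_rate)
    return cost_per_call * expected_calls / success_rate


# Hypothetical models: A is cheaper per call but retries often and succeeds less.
model_a = cost_per_success(cost_per_call=0.010, success_rate=0.5, retry_rate=0.2)
model_b = cost_per_success(cost_per_call=0.012, success_rate=0.6, retry_rate=0.0)
```

Under these made-up inputs, the model with the lower per-call price ends up more expensive per successful task, which is the effect the post describes.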

Reddit r/LocalLLaMA·May 21

Tencent Hy 30B/7B/1.8B

Tencent releases Hy-MT2, a multilingual translation model family in three sizes (1.8B, 7B, 30B-MoE) supporting 33 languages. The 1.8B model compressed to 440 MB via 1.25-bit quantization outperforms commercial APIs from Microsoft and Doubao. The 7B and 30B variants exceed DeepSeek-V4-Pro and Kimi K2.6 performance. Includes IFMTBench benchmark and WMT26 partnership.

Code generation Benchmarks Open source

Reddit r/MachineLearning·May 21

Masked Diffusion Language Models are Strong and Steerable Text-Based World Models for Agentic RL [R]

Masked diffusion language models (MDLMs) outperform autoregressive LLMs as world models for agentic RL. Fine-tuned SDAR-8B and WeDLM-8B achieve 4x gains on BLEU-1/ROUGE-L/MAUVE. GRPO training yields +15% absolute task-success on ScienceWorld, ALFWorld, AppWorld with Qwen3, Mistral, LFM2.5 in zero-shot transfer.

AI Agents Reinforcement learning Reasoning

arXiv cs.LG·May 21

FusionCell: Cross-Attentive Fusion of Layout Geometry and Netlist Topology for Standard-Cell Performance Prediction

FusionCell predicts standard-cell performance by fusing routed layout geometry (DeiT encoder) and netlist topology (graph transformer). Trained on 19.5k 7nm cells (ASAP7), the model achieves 0.92% MAPE on delay/power metrics and accelerates characterization by orders of magnitude versus circuit simulation.

Benchmarks Papers

arXiv cs.LG·May 21

TabPFN-MT: A Natively Multitask In-Context Learner for Tabular Data

TabPFN-MT extends Prior-Data Fitted Networks to multitask in-context learning for tabular data. Trained on multi-target synthetics, the model captures inter-task dependencies and reduces inference from O(T) to O(1) forward passes. On 344 datasets (<1000 samples), it achieves rank 4.89 in multitask accuracy, competitive with single-task ensembles.

Papers Benchmarks RAG

Do No Harm? Hallucination and Actor-Level Abuse in Web-Deployed Medical Large Language Models

What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework

Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

FlowLM: Few-Step Language Modeling via Diffusion-to-Flow Adaptation

Data Scaling as Progressive Coverage of a Predictive Contribution Spectrum

Mix-Quant: Quantized Prefilling, Precise Decoding for Agentic LLMs

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

Mechanics of Bias and Reasoning: Interpreting the Impact of Chain-of-Thought Prompting on Gender Bias in LLMs

Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

OmniISR: A Unified Framework for Centralized and Federated Learning via Intermediate Supervision and Regularization

MTR-Suite: A Framework for Evaluating and Synthesizing Conversational Retrieval Benchmarks

Direct Translation between Sign Languages

HRM-Text: Efficient Pretraining Beyond Scaling

Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task

On the limits and opportunities of AI reviewers: Reviewing the reviews of Nature-family papers with 45 expert scientists

DIVE: Embedding Compression via Self-Limiting Gradient Updates

Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning