Archives

May 2026

3148 articles

arXiv cs.AI

Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents

Large-scale study of 64,380 SWE-bench runs across 126 agent configurations (43 frameworks × LLMs). Behavioral rules derived from single frameworks do not transfer: the same signal (e.g., error rate) correlates positively with issue resolution in 47 configs and negatively in 48. Framework identity explains 64% of variance vs. 10% for LLM family.

AI Agents · Benchmarks · Code generation
SIG 82 · HYP 15

arXiv cs.AI

CommitDistill: A Lightweight Knowledge-Centric Memory Layer for Software Repositories

CommitDistill is an open-source Python prototype extracting typed knowledge units (Facts, Skills, Patterns) from local git history via deterministic regex and exposing them through a TF-IDF retriever. Tested on 5 repositories (25k commits), it achieves 0.750 hit-rate at 256-character budget versus 0.333 for BM25. No statistically detectable improvement on time-travel bug-fixes in LLM-as-judge evaluation.

Code generation · RAG · AI Agents
SIG 72 · HYP 18

arXiv cs.AI

An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments

Empirical study of privacy-leakage chains via prompt injection in black-box chatbot environments. Researchers analyze how attackers can hijack LLM agent tasks by injecting malicious content into external sources. They introduce the 'exemplification' technique and demonstrate a functional data-exfiltration chain combining prompt injection, jailbreaking, and web-tool invocation.

AI Agents · Prompt engineering · AI safety
SIG 72 · HYP 25

arXiv cs.AI

Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models

First systematic study of source attribution for generated 3D assets. Researchers build a benchmark covering 22 3D generators and propose a hierarchical multi-view multi-modal Transformer detecting fingerprints (cross-view inconsistencies, geometric artifacts, frequency-domain signatures). Results: 97.22% accuracy under full supervision, 77.17% with only 1% training data.

Vision · Benchmarks · AI safety
SIG 78 · HYP 25

arXiv cs.AI

Parameterized 4-Qubit EWL Quantum Game Circuits with Dirac-Solow-Swan Hamiltonian Integration for Quadruple Helix Disruptive Innovation Recommender Systems

Proposes a parameterized 4-qubit EWL quantum game circuit for recommender systems in quadruple helix innovation ecosystems. Uses real CORDIS Horizon Europe data and integrates a Dirac-Solow-Swan Hamiltonian to simulate capital dynamics under disruptive innovation. Circuit depth 11, NISQ-compatible, Qiskit implementation provided.

Benchmarks · Papers
SIG 35 · HYP 72

arXiv cs.CL

RAGA: Reading-And-Graph-building-Agent for Autonomous Knowledge Graph Construction and Retrieval-Augmented Generation

RAGA is an LLM-based autonomous agent for knowledge graph construction and retrieval-augmented generation. It replaces stateless batch pipelines with a ReAct loop supporting full CRUD operations, hybrid KG-vector synchronization, and evidence-anchored verification linked to source text. Experiments on QASPER show measurable gains in answer and evidence quality.

AI Agents · RAG · Reasoning
SIG 72 · HYP 28

arXiv cs.AI

Quantum Sidecar Architectures for Hybrid AI Training and Inference: Stateful Protected Registers, Stateless Reset-and-Reprepare Circuits and Quantum Weight-State Outlook

Proposes quantum sidecar architectures for hybrid AI training and inference. Two operating modes: stateful protected-register mode (QND readout with ancilla) and stateless reset-and-reprepare mode (QAOA-style circuits). Simulations on 2/4/6/8 protected qubits. Positions quantum sidecars as bounded signal generators for optimization, expert selection, and routing.

Reasoning · AI Agents · Infrastructure
SIG 45 · HYP 35

arXiv cs.AI

Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise

Study of memorization-generalization coexistence in over-parameterized neural networks. With 80% label noise on arithmetic tasks, models memorize noisy labels while maintaining an internal generalization structure. Frequency-based extraction achieves near-perfect accuracy. The authors also propose a task-agnostic partitioning into generalization and memorization components.

Papers · Evals · Alignment
SIG 72 · HYP 15

arXiv cs.CL

The Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence

Critical study of LLM-based trading agents (FinCon, FinMem, TradingAgents, FinAgent, QuantAgent, FLAG-Trader). Reported Sharpe ratios do not constitute deployment evidence: temporal contamination, unmodeled frictions, and insufficient predictive calibration invalidate results. Proposes P1-P6 reporting protocol and modular architecture with LLMs as auditable information interfaces.

AI Agents · Benchmarks · Papers
SIG 78 · HYP 15

arXiv cs.AI

Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA

Study of prompt compression on LLaDA, an 8B-parameter diffusion LLM (DLLM), using LLMLingua-2. Evaluation on GSM8K, DUC2004, and ShareGPT at a 2× compression ratio shows that semantic preservation does not guarantee stability in diffusion models: mathematical reasoning degrades substantially while summarization remains robust. Autoregressive compression methods do not transfer uniformly to DLLMs.

Prompt engineering · Benchmarks · Reasoning
SIG 72 · HYP 15

arXiv cs.AI

Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation

CGPO (Curriculum Group Policy Optimization) improves text-to-image model training via adaptive curriculum based on reward variance. Method prioritizes partially-mastered prompts (high variance) and balances categories through proportional fairness optimization. Gains validated on GenEval, T2I-CompBench++, DPG Bench.

Image generation · Reinforcement learning · Benchmarks
SIG 72 · HYP 28

arXiv cs.AI

Bayesian-Monte Carlo Schedule Updating for Construction Digital Twins: A Probabilistic Framework for Dynamic Project Forecasting

Bayesian-Monte Carlo probabilistic framework for dynamic construction project schedule updating. Models activity durations with lognormal distributions, updates them via Bayesian inference, and propagates uncertainty through Monte Carlo simulation. Demonstrates improved forecasting accuracy over deterministic CPM methods on PSPLIB benchmarks.

Reasoning · Benchmarks
SIG 72 · HYP 15

arXiv cs.CL

Multilingual OCR-Aware Fine-Tuning and Prompt-Guided Chain-of-Thought Reasoning for Multimodal Large Language Models

Multilingual OCR-aware fine-tuning framework for MLLMs combining synthetic OCR-to-translation data generation, LoRA-based SFT, and structured visual chain-of-thought reasoning. Significantly improves extraction of small, blurred, or occluded text on receipts, menus, and documents under degraded visual conditions. Outperforms GPT-5 and Gemini on OCR grounding and hallucination reduction.

Vision · Reasoning · Fine-tuning
SIG 72 · HYP 28

arXiv cs.AI

MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair

MemRepair is a memory-augmented agentic framework for repository-level vulnerability repair. It combines three memory layers (History-Fix, Security-Pattern, Refinement-Trajectory) with an iterative refinement loop. Evaluated on SEC-Bench, PatchEval, and Multi-SWE-bench, MemRepair achieves 58.0%, 58.2%, and 30.58% resolution rates, outperforming OpenHands, SWE-agent, and InfCode-C++.

AI Agents · Code generation · AI safety
SIG 82 · HYP 18

arXiv cs.AI

DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

DiagEval is a trajectory-conditioned diagnostic evaluation protocol for GUI agents testing LLM-generated interactive software. It reuses failed trajectories to determine whether failures stem from the evaluator or the software itself. On WebDevJudge-Unit and RealDevBench, DiagEval recovers 45.6-62.1% of false negatives and improves accuracy from 69.9% to 78.3% and from 65.0% to 81.6%.

AI Agents · Evals · Code generation
SIG 72 · HYP 18

arXiv cs.AI

Progressive Generalization Augmentation with Deeply Coupled RND-PPO and Domain-Prioritized Noise Injection for Robust Crop Management Reinforcement Learning

Introduces Progressive Generalization Augmentation (PGA) to improve the robustness of agricultural RL systems, combining a coupled RND-PPO architecture with hierarchical noise injection. Results: +8.43% yield and +16.42% nitrogen use efficiency vs. BERT-DQN in Florida; 94.4% performance retention under combined perturbations.

Reinforcement learning · Papers · Benchmarks
SIG 72 · HYP 28

arXiv cs.AI

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

FML-bench is a benchmark of 18 ML tasks across 10 domains evaluating 6 AI research agents. Key findings: strategy complexity alone does not ensure performance (a greedy hill-climber matches tree search); effectiveness depends on the structure of improvement opportunities; an adaptive agent that detects stagnation outperforms the others. Includes 12 process-level behavioral metrics.

AI Agents · Benchmarks · Reasoning
SIG 82 · HYP 15

arXiv cs.AI

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

Proposes HRC (Hybrid Reward-Cyclic), a reward model that uses game theory to explicitly decompose human preferences into transitive (scalar) and cyclic (vector) components. Introduces DSPPO (Dynamic Self-Play Preference Optimization) for alignment. Results: +1.23% on RewardBench 2 vs. GPM, 44.75% win rate on AlpacaEval 2.0 with Gemma-2B-it.

Reinforcement learning · Alignment · Papers
SIG 72 · HYP 25

arXiv cs.AI

Learning Higher-Order Structure from Incomplete Spatiotemporal Data: Multi-Scale Hypergraph Laplacians with Neural Refinement

Multi-Scale Hypergraph Laplacians (MSHL): two-stage framework for imputing incomplete spatiotemporal sensor network data. Discovers higher-order structure via multi-scale hypergraphs, then refines with hypergraph-conditioned residual network. Theoretical guarantees and evaluation on real traffic networks with structured outages.

Papers · Benchmarks · Infrastructure
SIG 72 · HYP 15