Archives

May 2026

3147 articles

arXiv cs.AI

Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

Microservice architecture for Document AI pipelines in production: classification, OCR, and structured field extraction via LLM. Processes thousands of multi-page documents per hour. Key findings: OCR, not LLM parsing, dominates end-to-end latency, and system saturation is determined by shared GPU capacity. Presents concrete architectural patterns for production deployment.

Infrastructure · Code generation · RAG
SIG 72 · HYP 15
arXiv cs.AI

Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

ReElicit is a Bayesian optimization framework for tuning system prompts using only aggregate feedback. An LLM dynamically elicits a compact, interpretable feature space, then a Gaussian process selects optimized target vectors that are refined into deployable prompts. Across 10 tasks with a 30-evaluation budget, ReElicit outperforms aggregate-only prompt optimization baselines.

Prompt engineering · Reasoning
SIG 72 · HYP 25
arXiv cs.AI

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

DecisionBench is a benchmark for evaluating emergent delegation in long-horizon multi-agent workflows. The substrate includes 11 models (7 vendor families), GAIA/tau-bench/BFCL tasks, and multi-axis metrics (quality, cost, latency, routing fidelity). Results show quality alone masks orchestration signals, and delivery channel dominates description content.

AI Agents · Multi-agent · Benchmarks
SIG 82 · HYP 15
arXiv cs.AI

MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

MOCHA is a multi-objective optimization algorithm for refining LLM agent skills. It uses Chebyshev scalarization and exponential annealing to explore the complete Pareto front, including non-convex regions. On 6 tasks, MOCHA improves performance by 7.5% on average (up to 14.9% on FEVER) while discovering twice as many Pareto-optimal skill variants as baselines.

AI Agents · Prompt engineering · Reinforcement learning
SIG 75 · HYP 25
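
The MOCHA summary credits Chebyshev scalarization with reaching non-convex regions of the Pareto front. The sketch below illustrates only that general property on three made-up candidates with two objectives to minimize; it is not the MOCHA algorithm, and the annealing schedule is not shown. The candidate values, the ideal point, and the helper names (`weighted_sum`, `chebyshev`, `sweep`) are assumptions chosen for illustration.

```python
# Hypothetical candidates, two objectives to MINIMIZE.
# C is Pareto-optimal but lies in a non-convex region of the front.
candidates = {"A": (0.0, 1.0), "B": (1.0, 0.0), "C": (0.6, 0.6)}
ideal = (0.0, 0.0)  # assumed ideal (best achievable) point


def weighted_sum(point, weights):
    """Linear scalarization: sum_i w_i * f_i."""
    return sum(w * f for w, f in zip(weights, point))


def chebyshev(point, weights):
    """Chebyshev scalarization: max_i w_i * |f_i - z_i| for ideal point z."""
    return max(w * abs(f - z) for w, f, z in zip(weights, point, ideal))


def sweep(scalarizer, steps=11):
    """Return the set of candidates that win for some weight vector."""
    winners = set()
    for i in range(steps):
        w1 = i / (steps - 1)
        weights = (w1, 1.0 - w1)
        best = min(candidates, key=lambda name: scalarizer(candidates[name], weights))
        winners.add(best)
    return winners


linear_winners = sweep(weighted_sum)
chebyshev_winners = sweep(chebyshev)
print("weighted sum finds:", sorted(linear_winners))
print("Chebyshev finds:   ", sorted(chebyshev_winners))
```

In this toy setup no weight vector makes the weighted sum select C, while equal weights under the Chebyshev form do, which is the usual argument for preferring it when the front may be non-convex.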
arXiv cs.CL

IMLJD: A Computational Dataset for Indian Matrimonial Litigation Analysis

IMLJD is a dataset of 3,613 Indian court judgments on matrimonial disputes (IPC Section 498A, Protection of Women from Domestic Violence Act, CrPC Section 482). Data from Supreme Court of India (2000-2024, 1,474 cases) and Karnataka High Court (2018-2024, 2,139 cases). Quashing petition success rates: 57.6% at Supreme Court vs 39.7% at Karnataka High Court. Dataset, code, and knowledge graph released open-source.

Benchmarks · Papers · Open source
SIG 72 · HYP 15
arXiv cs.AI

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

Study of jailbreak attacks against Large Reasoning Models (LRMs) using reinforcement learning. Researchers show attack success rate correlates with model attention patterns. They propose an RL method incorporating attention signals into the reward function, tested on 5 LRMs with superior results in effectiveness, efficiency, and transferability.

Reasoning · Reinforcement learning · AI safety
SIG 75 · HYP 35
arXiv cs.AI

Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

arXiv paper introduces Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. On a two-stage adaptive assessment, the model's scores correlate with the intended levels at r=0.698 (r² ≈ 0.49, i.e. roughly half of the intended variance), with a systematic positive bias. GEA is strong (r>0.7) for syntactically verifiable skills but near zero for design-level skills.

Evals · Reasoning · AI safety
SIG 72 · HYP 18
arXiv cs.CL

How Do Document Parsers Break? Auditing Structural Vulnerability in Document Intelligence

Robustness study of Document Layout Analysis (DLA) pipelines used in RAG and long-document QA. Authors identify footprint bias and propose a lightweight auditing framework measuring block-level structural loss (B-SLR). On 1,000 pages with MinerU and PP-StructureV3, B-SLR correlates better with OCR instability (R²=0.727/0.916) than area-based metrics (R²=0.384/0.110).

Papers · Evals · RAG
SIG 72 · HYP 18
arXiv cs.LG

PROWL: Prioritized Regret-Driven Optimization for World Model Learning

PROWL introduces a KL-constrained adversarial curriculum to improve robustness of action-conditioned video world models. A policy exposes high-error trajectories of a diffusion-based model while a Prioritized Adversarial Trajectory (PAT) buffer re-ranks data by prediction error and learning progress. Evaluation on MineRL demonstrates improved robustness on out-of-distribution trajectories.

Reasoning · Reinforcement learning · Papers
SIG 75 · HYP 15
arXiv cs.LG

PASC: Pipeline-Aware Conformal Prediction with Joint Coverage Guarantees for Multi-Stage NLP and LLM Pipelines

PASC is a conformal prediction method guaranteeing simultaneous coverage across all stages in multi-stage NLP pipelines (NER → NED → entity typing, RAG, agent chains). On CoNLL-2003, PASC achieves 96.4% end-to-end coverage vs 93.4% for Bonferroni and 86.5% for independent CP, 1.7x faster, and maintains robustness under distribution shift (WNUT-17, WikiNEuRal).

Evals · Reasoning · AI Agents
SIG 78 · HYP 15
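
The PASC summary compares joint coverage against two baselines, independent per-stage conformal prediction and a Bonferroni correction. The arithmetic behind those baselines can be sketched as follows. This is not PASC itself; the 0.05 joint miscoverage target and the function names are assumptions for illustration, and the summary does not state which target level the paper used.

```python
def joint_coverage_if_independent(stage_coverage, n_stages):
    """Joint coverage when every stage covers with the same probability
    and stage errors are assumed independent: (1 - alpha) ** K."""
    return stage_coverage ** n_stages


def bonferroni_stage_alpha(joint_alpha, n_stages):
    """Per-stage miscoverage level under a Bonferroni split: alpha / K."""
    return joint_alpha / n_stages


def union_bound_joint_coverage(stage_alpha, n_stages):
    """Worst-case joint coverage from the union bound: 1 - K * alpha_stage."""
    return max(0.0, 1.0 - n_stages * stage_alpha)


n_stages = 3        # e.g. NER -> NED -> entity typing
joint_alpha = 0.05  # hypothetical target: 95% end-to-end coverage

# Running each stage at 95% on its own loses coverage end to end.
naive = joint_coverage_if_independent(1.0 - joint_alpha, n_stages)

# Bonferroni tightens each stage so the joint guarantee holds,
# at the cost of larger (more conservative) prediction sets per stage.
stage_alpha = bonferroni_stage_alpha(joint_alpha, n_stages)
guaranteed = union_bound_joint_coverage(stage_alpha, n_stages)

print(f"independent stages at 95% each: joint = {naive:.4f}")
print(f"Bonferroni per-stage alpha:     {stage_alpha:.4f}")
print(f"Bonferroni joint lower bound:   {guaranteed:.4f}")
```

The gap between the two numbers is the reason a pipeline-aware method is of interest: per-stage calibration alone undershoots the joint target, while Bonferroni meets it by being conservative at every stage.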
arXiv cs.LG

Efficient Conditioning: Why Pseudo Observation Batch Bayesian Optimization Works When It Does not

Theoretical study unifying batch selection methods in parallel Bayesian Optimization (Constant Liar, Kriging Believer, fantasy models). Authors identify efficient conditioning as the key surrogate property of Gaussian Processes and prove that it yields distinct batch points with separation of order l. Experimental validation on Hartmann6D, Ackley 8D, Levy10D and SVM hyperparameter tuning.

Benchmarks · Papers
SIG 78 · HYP 15
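
For readers unfamiliar with the pseudo-observation methods named in this summary, here is a minimal pure-Python sketch of Constant Liar batch selection on a 1-D grid. It is a generic textbook-style illustration, not the paper's analysis. The RBF kernel, the `length` value, the UCB acquisition, the choice of `min(ys)` as the lie, and the toy data are all assumptions.

```python
import math


def rbf(a, b, length=0.2):
    """RBF kernel; `length` is an assumed length scale."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def posterior(xs, ys, x, jitter=1e-6):
    """Zero-mean GP posterior mean and standard deviation at x."""
    gram = [[rbf(a, b) + (jitter if i == j else 0.0)
             for j, b in enumerate(xs)] for i, a in enumerate(xs)]
    k = [rbf(a, x) for a in xs]
    mean = sum(ki * ai for ki, ai in zip(k, solve(gram, ys)))
    var = rbf(x, x) - sum(ki * vi for ki, vi in zip(k, solve(gram, k)))
    return mean, math.sqrt(max(var, 0.0))


def constant_liar_batch(xs, ys, grid, batch_size, beta=2.0):
    """Pick a batch by repeatedly maximizing UCB, then conditioning the GP
    on a fake ("lie") observation at the chosen point."""
    xs, ys = list(xs), list(ys)
    lie = min(ys)  # assumed pessimistic lie for a maximization problem
    batch = []
    for _ in range(batch_size):
        def ucb(x):
            mean, std = posterior(xs, ys, x)
            return mean + beta * std
        x_next = max(grid, key=ucb)
        batch.append(x_next)
        xs.append(x_next)  # pseudo-observation shrinks variance near x_next
        ys.append(lie)
    return batch


grid = [i / 50 for i in range(51)]
batch = constant_liar_batch([0.1, 0.9], [0.2, 0.5], grid, batch_size=3)
print(batch)
```

Because each lie collapses the posterior variance around the point just chosen, the next UCB maximizer lands elsewhere, which is the separation behaviour the summary attributes to conditioning.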
arXiv cs.LG

Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

Automated framework for generating fine-grained evaluation benchmarks for foundation models. Multi-agent pipeline with solution-graph-driven strategy improves ground-truth solution reliability. Three benchmarks generated (ML, Corporate Finance, Personal Finance) show lower error rates than MMLU/GSM8K. Evaluation of 12 models reveals performance differences missed by existing benchmarks.

Benchmarks · Evals · Multi-agent
SIG 78 · HYP 25
arXiv cs.CL

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

Stepwise Confidence Attribution (SCA) diagnoses multi-step reasoning failures in closed-source LLMs by assigning step-level confidence from generated traces alone. Two methods: NIBS (non-parametric) and GIBS (graph-based). On mathematical reasoning and multi-hop QA, SCA reliably identifies error-prone steps and improves self-correction success by up to 13.5%.

Reasoning · Evals · Papers
SIG 78 · HYP 15
arXiv cs.LG

Compositional Literary Primitives in Instruction-Tuned LLMs: Cross-Architectural SAE Features for Self, Style, and Affect

Study of literary primitives in Llama 3.1 8B-Instruct and Gemma 2 9B-IT using sparse autoencoders. Four feature classes identified: naming-gates (affect tokens), self cluster (first-person register), stylistic modulators, compositional emotions. Llama achieves 27/27 emotion coverage (Cowen-Keltner taxonomy), Gemma 23/27. Validated via 5-LLM judge panel.

Llama · Gemini · Fine-tuning
SIG 78 · HYP 15