arXiv cs.CL·

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

STING is an automated red-teaming framework measuring multi-turn illicit assistance in LLM agents. It constructs step-by-step illicit plans grounded in benign personas and uses judge agents to track completion. Multilingual evaluation across six non-English languages shows attack success does not consistently increase in lower-resource languages, diverging from chatbot findings.

AI Agents · AI safety · Evals
SIG 78 · HYP 25
arXiv cs.CL·

CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

CounterRefine adds a lightweight repair layer for RAG: after an initial answer, the system issues answer-conditioned queries to retrieve candidate-specific counterevidence, then applies a deterministically validated KEEP/REVISE refinement step. On SimpleQA, it improves the baseline by up to 5.8 correct-rate points and modifies 5.6% of outputs, with 180 beneficial changes versus 8 harmful ones.

RAG · Reasoning · Evals
SIG 78 · HYP 15
arXiv cs.AI·

Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using Large Language Model Judges with Closed-Loop Reinforcement Learning Feedback

A behavioral evaluation methodology for agentic AI systems that scores intermediate decisions with an LLM judge ensemble across 6 dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery). The behavioral score correlates with Sharpe ratio at rho=0.72. Closed-loop reinforcement learning feedback (SAC) reduces MAPE from 0.61% to 0.54% on the 2017-2025 test set.

AI Agents · Reinforcement learning · Evals
SIG 78 · HYP 15
arXiv cs.AI·

Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

Softmax-attention transformers can implement preconditioned Richardson iteration for in-context Gaussian kernel regression. The authors construct a single-head transformer with O(log(1/ε)) blocks that achieves ε-accurate prediction on prompts of length N; softmax attention produces a Gaussian-kernel operator, and ReLU MLP layers perform local scalar arithmetic.

Reasoning · Papers · Benchmarks
SIG 78 · HYP 15
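
For the entry above, a minimal sketch of preconditioned Richardson iteration applied to Gaussian kernel ridge regression may help make the summary concrete. It is not the paper's transformer construction: the ridge term, the bandwidth, the row-sum diagonal preconditioner, and all function names are assumptions chosen for illustration. The row-sum preconditioner is used only because it guarantees convergence for a kernel matrix with nonnegative entries. Because the iteration converges linearly, the number of steps needed for ε accuracy grows like log(1/ε), which is consistent with the block count quoted in the summary.

```python
import math


def gaussian_kernel(x, z, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - z) ** 2) / (2.0 * bandwidth ** 2))


def richardson_kernel_regression(xs, ys, ridge=0.1, bandwidth=1.0, steps=200):
    """Approximately solve (K + ridge * I) alpha = y by preconditioned Richardson iteration.

    Update rule: alpha <- alpha + P * (y - A @ alpha), with A = K + ridge * I.
    P is a diagonal preconditioner with entries 1 / (row sum of A); this is an
    illustrative choice, not something taken from the paper.
    """
    n = len(xs)
    A = [
        [gaussian_kernel(xs[i], xs[j], bandwidth) + (ridge if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
    precond = [1.0 / sum(row) for row in A]  # diagonal of P
    alpha = [0.0] * n
    for _ in range(steps):
        # Residual of the linear system at the current iterate.
        residual = [
            ys[i] - sum(A[i][j] * alpha[j] for j in range(n)) for i in range(n)
        ]
        # Preconditioned correction step.
        alpha = [alpha[i] + precond[i] * residual[i] for i in range(n)]
    return alpha


def predict(xs, alpha, x_query, bandwidth=1.0):
    """Kernel regression prediction at a query point."""
    return sum(a * gaussian_kernel(x, x_query, bandwidth) for x, a in zip(xs, alpha))
```

A call such as `alpha = richardson_kernel_regression([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 0.0, -1.0])` followed by `predict([0.0, 1.0, 2.0, 3.0], alpha, 1.5)` gives the in-context style prediction at a new point; raising `steps` tightens the solve.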
arXiv cs.CL·

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

The study reveals a visibility asymmetry in multilingual datasets: 118 languages (59% of the 200 most-spoken) have zero catalogued datasets in the LRE Map and LDC. Using LLM-assisted citation mining on Semantic Scholar, the authors identify 609 unique datasets across 53 low-visibility languages, of which 356 are openly accessible. Data scarcity is therefore a documentation and discoverability issue, not just a production issue.

Benchmarks · Open source
SIG 78 · HYP 15
arXiv cs.CL·

Temporal Decay of Co-Citation Predictability: A 20-Year Statute Retrieval Benchmark from 396M Ukrainian Court Citations

UA-StatuteRetrieval: 20-year benchmark on 396M Ukrainian court citations. Co-citation predictability declines 33-47% (Adamic-Adar MRR 0.43→0.29). Non-uniform decay: criminal law stable (~0.40), civil law collapses (0.35→0.15) post-2017 reform. Mid-frequency articles (1K-10K citations) lose 50% predictability. E5-large detects 4.3% semantic drift.

Benchmarks · Embeddings · RAG
SIG 78 · HYP 15
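
The two quantities named in the entry above, the Adamic-Adar score and mean reciprocal rank (MRR), can be sketched as follows. This uses the standard textbook definitions only; the toy graph, the node names, and the way candidates are ranked are hypothetical, and the benchmark's actual evaluation protocol is not described in the summary.

```python
import math


def adamic_adar(graph, u, v):
    """Adamic-Adar score: sum of 1 / log(degree) over common neighbours of u and v.

    `graph` maps each node to the set of its neighbours (here: statutes that
    were co-cited with it). Neighbours of degree 1 are skipped since log(1) = 0.
    """
    common = graph[u] & graph[v]
    return sum(1.0 / math.log(len(graph[z])) for z in common if len(graph[z]) > 1)


def mean_reciprocal_rank(rankings, relevant):
    """MRR: average over queries of 1 / rank of the first relevant candidate.

    A query with no relevant candidate in its ranking contributes 0.
    """
    total = 0.0
    for query, ranking in rankings.items():
        for position, candidate in enumerate(ranking, start=1):
            if candidate in relevant[query]:
                total += 1.0 / position
                break
    return total / len(rankings)


# Hypothetical toy co-citation graph, for illustration only.
toy_graph = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"d"},
}
score_ab = adamic_adar(toy_graph, "a", "b")
```

In a link-prediction setting, one would typically rank candidate statutes for each query by such a score and then feed the rankings to `mean_reciprocal_rank`.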
arXiv cs.CL·

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

BacktestBench is the first large-scale benchmark for automated quantitative backtesting, containing 18,246 annotated QA pairs across 6 million real market records. AutoBacktest, a multi-agent system, translates natural language strategies into reproducible backtests via a Summarizer, SQL Retriever, and Python Coder. The evaluation covers 23 mainstream LLMs.

Benchmarks · Multi-agent · Code generation
SIG 78 · HYP 25
arXiv cs.CL·

LongMINT: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

LongMINT is a benchmark evaluating agents' memory management in long contexts (up to 1.8M tokens) with multi-target interference. It contains 15.6k QA pairs across 4 domains (state tracking, dialogue, Wikipedia revisions, GitHub commits). The 7 systems tested (long-context LLMs, RAG, agent frameworks) achieve 27.9% average accuracy, bottlenecked by retrieval and memory construction.

AI Agents · Benchmarks · RAG
SIG 78 · HYP 15
arXiv cs.CL·

Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture

A formal verification framework for LLM pipelines using Lean 4 certificates. It defines three certificate families (conflict-aware bilattice, embedding sensitivity, Hoare-style agent action) plus two operators (Maximal Certifiable Residue, Compositional Stability) for high-stakes deployments (regulated finance, clinical support, agentic systems). The compiled artifact covers 22 certificate types, and 17 of 46 declarations are axiom-free.

Reasoning · AI safety · AI Agents
SIG 78 · HYP 15
arXiv cs.CL·

Scale Determines Whether Language Models Organize Representation Geometry for Prediction

Study on how representation geometry organization in language models depends on scale. Subspace PGA metric tests alignment of intermediate geometry with unembedding matrix readout. Small models (≤1024) progressively lose organization at late layers during training, while large models (≥2048) preserve it throughout. Scale determines how geometry organizes for prediction.

Papers · Reasoning · Evals
SIG 78 · HYP 15