Archives

May 2026

3146 articles

arXiv cs.CL·

Assessing Dutch Syllabification Algorithms and Improving Accuracy by Combining Phonetic and Orthographic Information through Deep Learning

Comparative assessment of four Dutch syllabification algorithms (Brandt Corstius, Liang, Trogkanis-Elkan CRF, and a novel deep learning model). The deep learning model, which combines phonetic and orthographic information, achieves 99.65% word accuracy, a 0.14% improvement over results reported in the literature. Data-driven algorithms outperform knowledge-based approaches.

Papers · Benchmarks · Code generation
SIG
72
HYP
15
arXiv cs.LG·

Causal Intelligence for Constraint-Aware Intervention Design to Induce State Transitions

COAST is a causal-intelligence approach for designing constrained interventions that induce state transitions. The system learns context-specific causal graphs, attributes distributional shifts to mechanism-level causal drivers, and uses multi-objective optimization balancing transition efficacy, intervention complexity, and target-state stability. Validated on synthetic benchmarks and real biological datasets.

Reasoning · Benchmarks
SIG
72
HYP
18
arXiv cs.CL·

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

UA-Legal-Bench evaluates 11 LLMs (3B–675B) on 5 Ukrainian legal reasoning tasks from 99.5M court decisions. Results show task-dependent few-shot effects: +38.6 pp improvement for judgment form classification, but mixed effects on outcome prediction. Accuracy is misleading on imbalanced tasks: the highest-accuracy model (62%) is a majority-class predictor (macro-F1: 23%).

Benchmarks · Evals · Papers
SIG
78
HYP
15
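The gap between accuracy and macro-F1 noted in the UA-Legal-Bench summary can be illustrated with a small sketch. The labels below are hypothetical toy data, not drawn from the benchmark, and the helper names `accuracy` and `macro_f1` are introduced here only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (a class with no hits scores 0)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced toy task: 6 "granted", 2 "denied", 2 "partial".
y_true = ["granted"] * 6 + ["denied"] * 2 + ["partial"] * 2
# A majority-class predictor always outputs the most frequent label.
y_pred = ["granted"] * len(y_true)

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.2f} macro_f1={f1:.2f}")  # accuracy=0.60 macro_f1=0.25
```

Because macro-F1 averages over classes rather than examples, the two minority classes each contribute a score of zero, which pulls the mean far below the accuracy figure.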
arXiv cs.AI·

The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane

Redpanda introduces an Agentic Data Plane architecture using out-of-band metadata channels to enforce security policies, data classifications, and behavioral constraints outside the agent's read/write path. These channels prevent hallucinations and adversarial manipulation while maintaining tamper-proof audit trails. Demonstrated with a multi-agent portfolio rebalancing system.

AI Agents · Multi-agent · AI safety
SIG
72
HYP
28
arXiv cs.LG·

TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models

TaxDistill applies knowledge distillation to improve metagenomic taxonomic annotation. GenomeOcean, a 500M-parameter genomic foundation model, generates soft labels to train a lightweight student network, reducing noise from initial retrieval tools. On 7 CAMI2 datasets, TaxDistill improves MMseqs2's F1 score from 0.763 to 0.941 on the Gastrointestinal dataset.

Papers · Fine-tuning · Benchmarks
SIG
72
HYP
18
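The soft-label idea in the TaxDistill summary can be sketched with a generic, textbook-style distillation loss. This is an assumption-laden illustration, not the paper's actual objective: the logits, the temperature value, and the function names `softmax` and `soft_label_loss` are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(student_logits, teacher_probs, temperature=1.0):
    """Cross-entropy of the student's softened prediction against teacher soft labels."""
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs) if t > 0)


# Hypothetical teacher logits over three taxa; a temperature above 1 softens the labels.
teacher_probs = softmax([4.0, 1.0, 0.5], temperature=2.0)

# A student that ranks the taxa like the teacher incurs a lower loss
# than one that prefers a different taxon.
agreeing = soft_label_loss([3.0, 0.5, 0.0], teacher_probs, temperature=2.0)
disagreeing = soft_label_loss([0.0, 3.0, 0.5], teacher_probs, temperature=2.0)
print(agreeing < disagreeing)  # True
```

Training against a full probability vector rather than a single hard label lets the student inherit the teacher's uncertainty between similar classes, which is the usual motivation for soft labels.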
arXiv cs.CL·

GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling

GenesisFunc is an automated multi-agent pipeline for generating function-calling training data. Starting from reliable tools in public benchmarks, the system produces diverse conversations with multi-stage quality control. An 8B model fine-tuned on this synthetic data outperforms similarly-sized open-source models in in-domain performance and out-of-domain generalization.

Multi-agent · Code generation · Fine-tuning
SIG
78
HYP
25
arXiv cs.CL·

Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

Study on source-dependence in multi-source medical RAG systems. Authors demonstrate that the same system can produce different answers depending on retrieved source, revealing a missing evaluation axis in NLP. They introduce TransplantQA (benchmark), HERO-QA (hierarchical retrieval strategy), and a structured judge to audit inter-source relationships using a validated 5-label taxonomy.

RAG · Evals · Papers
SIG
78
HYP
15
arXiv cs.AI·

The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling

The Cognitive Categorical Transformer (CCT), a 306M-parameter model augmenting GPT-2 Small, incorporates category-theoretic and cognitive-science-inspired components. On WikiText-103, CCT achieves a validation perplexity of 21.27 versus 24.19 for the GPT-2 Small baseline, a 12% relative reduction (2.92 PPL). Ablations show simplicial message passing accounts for 84% of the improvement.

GPT · Papers · Benchmarks
SIG
72
HYP
25
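The perplexity figures quoted for the Cognitive Categorical Transformer can be checked with a few lines of arithmetic. The variable names are illustrative, and the last quantity assumes perplexity is the exponential of the mean per-token negative log-likelihood in nats, which is the standard definition but is not stated in the summary.

```python
import math

baseline_ppl = 24.19  # GPT-2 Small baseline, as quoted in the summary
cct_ppl = 21.27       # CCT, as quoted in the summary

absolute_drop = baseline_ppl - cct_ppl
relative_drop = absolute_drop / baseline_ppl

# Assumption: PPL = exp(mean NLL), so the ratio of perplexities maps to
# a difference in average negative log-likelihood per token.
nats_saved = math.log(baseline_ppl / cct_ppl)

print(f"{absolute_drop:.2f} PPL, {relative_drop:.1%} relative")  # 2.92 PPL, 12.1% relative
print(f"{nats_saved:.3f} nats per token")
```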
arXiv cs.LG·

LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers

LoRe is a training-free inference-time wrapper optimizing diffusion-based neural solvers for combinatorial optimization. It enforces per-step interaction-evaluation budgeting, dynamically routing computation to high-conflict/high-uncertainty interactions. LoRe achieves an 8× speedup and a 12× peak-memory reduction on MIS, and a 15× speedup and a 44× memory reduction on TSP (n=1000).

Reasoning · Benchmarks · Papers
SIG
72
HYP
18
arXiv cs.LG·

Designing Active Tether-Net Systems for Space Debris Capture with Graph-Learning-Aided Mixed-Combinatorial Optimization

An active tether-net system for space debris capture that uses a Graph Neural Network (GNN) to jointly optimize net morphology, thruster masses of maneuverable units, and controller aiming points. The GNN reduces the mixed combinatorial nonlinear programming (MCNLP) problem to a nonlinear programming (NLP) problem, which is solved via Particle Swarm Optimization with gradient-based refinement and converges faster than direct MCNLP solving.

Papers · Reasoning
SIG
72
HYP
15
arXiv cs.CL·

A Modular Architecture for Typologically Controlled Lexicon Generation

Modular framework for generating pronounceable, typologically plausible artificial lexicons. Samples phoneme inventories from PHOIBLE, applies three phonological grammars (deterministic, OT, MaxEnt), and assigns meanings via Swadesh-Leipzig-Jakarta ontology. Evaluation on character n-gram perplexity and KL divergence: probabilistic grammars outperform baselines on 100-5,000 word forms.

Papers · Benchmarks
SIG
72
HYP
15
arXiv cs.AI·

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

Reasoning models maintain factually correct chain-of-thought traces but flip their final answer under sustained adversarial pressure in multi-turn dialogue. This unfaithful capitulation affects ~50% of cases in think mode and 11-15% without reasoning. The effect correlates with reasoning architecture (high in Qwen3-32B and GPT-OSS-20B, low in inline-CoT Gemma-4-31B-it).

Reasoning · Evals · AI safety
SIG
78
HYP
25
arXiv cs.CL·

Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning

Thoughts-as-Planning formalizes reasoning chain optimization as sequential decision-making over latent semantic space. The framework learns a latent world model simulating effects of reasoning chain edits on outputs, supporting multi-scale edits (token, segment, instruction) via gradient descent or reinforcement learning planning.

Reasoning · Reinforcement learning · Prompt engineering
SIG
72
HYP
28
arXiv cs.CL·

Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text

eXTC combines structured prompt optimization and reinforcement learning for text classification. The system learns a natural language rulebook first, then distills reasoning from a teacher LLM into a compact model, then expands capabilities via RL. Result: fast inference with local reasoning traces and global modular explanations of learned domain rules.

Prompt engineering · Reinforcement learning · Reasoning
SIG
72
HYP
28
arXiv cs.CL·

S3Mem: Structured Spatiotemporal Scene-Event Memory for Long-Horizon Interactive Question Answering

S3MEM introduces a structured scene-event episodic memory framework for long-horizon interactive agents. The system structures trajectories into organized memory units and uses anchor-sensitive retrieval to improve spatiotemporal question answering. Evaluated on Crafter, Jericho, SciWorld, and ALFWorld, S3MEM outperforms Vanilla RAG and Graph-NoReader in accuracy while using fewer evidence tokens.

RAG · AI Agents · Reasoning
SIG
75
HYP
15
arXiv cs.AI·

Ultra-Reduced-Impact-Encased-Logging (URIEL): propose a new method for selective sustainable logging and post-harvest silvicultural treatment in tropical forest using airborne robotics systems

URIEL proposes a selective logging method for tropical forests combining helicopters, robotics and AI to minimize collateral damage. Digital simulation and economic feasibility analysis demonstrate concept viability, but implementation depends on stakeholder integration (industry, governments, certified companies, indigenous populations).

Robotics · AI Agents · Papers
SIG
35
HYP
45
arXiv cs.LG·

Parallel Adaptive Multi-Objective Evolutionary Learning of Discretized Bayesian Network Classifiers for Clinical Data

Baymex, a multi-objective evolutionary algorithm, learns discretized Bayesian networks for clinical classification. Parallelized on 16 cores (54× speedup), it optimizes cross-entropy and BIC complexity. On real datasets (RADCURE, SUPPORT), it matches or outperforms decision trees, logistic regression, and random forests while producing interpretable models.

Benchmarks
SIG
72
HYP
15
arXiv cs.LG·

CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models

CosmicFish-HRM is a compact model with a Hierarchical Reasoning Module (HRM) that dynamically allocates computational effort during inference. The model learns when to halt based on input complexity, combining high-level and low-level reasoning cycles with Grouped Query Attention, RoPE, and SwiGLU. Results show non-uniform reasoning behavior adapted to tasks and inputs.

Reasoning · Fine-tuning · Benchmarks
SIG
72
HYP
25
arXiv cs.LG·

Cycle-Space Informed Detection of Autoencoded Blind False Data Injection Attacks on Power Systems

Authors propose a topology-aware Cycle-Space Detector (CSD) for False Data Injection Attacks on power systems. The detector is robust against autoencoder-based attacks that exploit the Jacobian null space, and it leverages network topology and the Minimum Cycle Basis to enhance detection, with optimal generalization error on IEEE 14-, 30-, 57-, and 118-bus systems.

AI safety · Benchmarks · Papers
SIG
72
HYP
15
arXiv cs.LG·

Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

Q-ALIGN DT aligns conditioned sequence models by ensuring the Q-value of the output policy matches the input return-to-go (RTG). The method uses a Q function for dense guidance and RTG-perturbation fine-tuning. Results: improved controllability on D4RL benchmark and generalization to velocity-tracking tasks where prior methods fail.

Reinforcement learning · Reasoning · Benchmarks
SIG
72
HYP
18
arXiv cs.LG·

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

Study on LLM reward design failures in sparse structured RL. Authors identify two dominant failure modes (reward flooding, semantic misunderstanding) and propose diagnostic-driven iterative refinement. On MiniGrid, DoorKey-8x8 improves from 2.3% to 97.6% success; KeyCorridor from 31.2% to 86.7%. The failure-mode taxonomy is identified as the primary mechanism behind the gains.

Reinforcement learning · Llama · Prompt engineering
SIG
72
HYP
18
arXiv cs.LG·

Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

Proposes strategic robustness for simulator learning in model-based reinforcement learning (MBRL). Formulates the objective as a minimax game between the model and an adversarial policy player, and proves convergence with sublinear regret bounds and an Error-MDP duality. Experiments show a 1.5–2.2× reduction in prediction error and simulation-trained policies matching near-optimal real-world performance.

Reinforcement learning · Papers · Reasoning
SIG
78
HYP
15
arXiv cs.CL·

Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception

Study of persona effects on explanations generated by multimodal LLM agents in urban perception. Analysis of 59,808 annotations from 1,200 persona-conditioned agents: captions show strong convergence, justifications display systematic variation tied to socioeconomic and political attributes, perception tags show no significant persona-related differences.

Vision · AI Agents · Prompt engineering
SIG
72
HYP
15
arXiv cs.CL·

Slogans or Stance? A Label-Light Diagnostic for Entrepreneurial-Discourse Measurement on Chinese SOE Speeches

Diagnostic tool for measuring constructs like "entrepreneurial spirit" in Chinese state-owned enterprise speeches. On 80 speeches from SOE leaders, authors test LDA, dictionary scorers, and Qwen3.5:9b. The LLM reaches d=1.09 in paired contrast, but half the effect stems from speaker idiolect. Corpus of 2,190 segments and slogan lexicon released.

Benchmarks · Evals · Qwen
SIG
72
HYP
15
arXiv cs.CL·

Text-Preserving Lossy Text Compression: A Study of Strategic Deletion and LLM Reconstruction

Study of lossy semantic text compression where an encoder strategically deletes parts of the text and an LLM reconstructs the original content. Benchmarks 6 deletion strategies (including uniform, frequency, entropy, LP-optimized, and hybrid) on BBC News. WordFreq provides the best cost/performance ratio; semantic methods excel at moderate compression; QLoRA fine-tuning competes with Gemini 2.0 Flash.

Benchmarks · Reasoning · Fine-tuning
SIG
75
HYP
15
arXiv cs.CL·

Reasoning that Travels: Dissecting How Chain-of-Thought Transfers Across Models

Study of chain-of-thought (CoT) transfer across models using a provider-receiver framework. Full traces often transfer successfully, but mechanisms vary: answer extraction (AIME), receiver competence (MMLU-Pro), or partial structured information (ZebraLogic). In free-generation mode, partial CoTs improve performance, suggesting guidance for continued reasoning.

Reasoning · Prompt engineering · Benchmarks
SIG
78
HYP
15
arXiv cs.AI·

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling

RACE-Sched, an asynchronous multi-agent framework, solves dynamic scheduling by decoupling real-time execution (symbolic heuristics) from long-horizon reasoning (LLM). A semantic rule repository of validated heuristics improves transferability across problem scales. Outperforms Deep RL and LLM baselines on GEN-Bench, MK-Bench, JMS-Bench.

AI Agents · Multi-agent · Reasoning
SIG
72
HYP
28
arXiv cs.CL·

GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models

GPF-LiveNews is a streaming evaluation protocol to audit how LLMs frame emerging news events for different audiences. Tested on 23 models across 12 monitoring runs, it measures semantic and sentiment variations across 42 identity labels. Results show Policy/Action prompts produce strongest semantic movement, while sentiment variation remains flat across dimensions.

Evals · AI safety · Alignment
SIG
72
HYP
18