RSS

arXiv cs.CL

https://arxiv.org/list/cs.CL/recent

arXiv cs.CL

Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling

A systematic audit of the FOLIO and MALLS benchmarks finds errors in 39% and 36% of their first-order logic (FOL) formalizations, respectively. The authors release corrected annotations and an LLM-based framework that guides manual relabeling, reaching 90% dataset accuracy after reviewing <24% of instances, versus >70% for unguided review. Tests on Gemma 31B, Qwen3-30B, and GPT-4o-mini show accuracy gains of +9 to +22 percentage points.

Benchmarks · Evals · Reasoning
SIG 82 · HYP 15
arXiv cs.CL

Toward Robust In-Context Learning: Leveraging Out-of-distribution Proxies for Target Inaccessible Demonstration Retrieval

DOPA, a demonstration retrieval framework, uses an OOD proxy to approximate the inaccessible target domain and guide selection of relevant demonstrations. A Mahalanobis distance-based global diversity constraint ensures sufficient variety among retrieved examples. Positive results across multiple LLMs and tasks under severe distribution shift.

Prompt engineering · Benchmarks · Papers
SIG 72 · HYP 18
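The diversity constraint in the DOPA summary can be illustrated with a small sketch. This is not the paper's implementation: it assumes a diagonal covariance (so the Mahalanobis distance reduces to a variance-scaled Euclidean distance), a greedy selection rule, and hypothetical 2-D embeddings; the names `diag_mahalanobis`, `select_diverse`, and `min_dist` are illustrative only.

```python
from statistics import pvariance


def diag_mahalanobis(x, y, variances):
    """Mahalanobis distance between x and y, assuming a diagonal covariance."""
    return sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)) ** 0.5


def select_diverse(candidates, k, min_dist):
    """Greedily keep candidates (assumed sorted by relevance) that lie at
    least min_dist from every example already kept. Returns their indices."""
    dims = list(zip(*candidates))
    # Per-dimension variance over the candidate pool; fall back to 1.0 if zero.
    variances = [pvariance(d) or 1.0 for d in dims]
    chosen = []
    for idx, vec in enumerate(candidates):
        if all(diag_mahalanobis(vec, candidates[j], variances) >= min_dist
               for j in chosen):
            chosen.append(idx)
        if len(chosen) == k:
            break
    return chosen


# Hypothetical embeddings, ordered from most to least relevant.
embeddings = [(0.0, 0.0), (0.1, 0.0), (2.0, 0.0), (0.0, 3.0)]
picked = select_diverse(embeddings, k=3, min_dist=1.0)
print(picked)
```

In this toy run the second candidate is skipped because it is a near-duplicate of the first, which is the kind of redundancy a diversity constraint is meant to filter out.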
arXiv cs.CL

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

arXiv study on LLM adaptation limits for annotation tasks. Toxicity detection experiments across diverse datasets show 66% of zero-shot errors resist correction via prompting (rescue rate 34.8%). Models follow misaligned definitions while maintaining confidence. Definition-Specific Familiarity (DSF) metric correlates with performance (r=+0.41), outperforming memorization metrics.

Prompt engineering · Evals · Benchmarks
SIG 78 · HYP 15
arXiv cs.CL

SPADER: Step-wise Peer Advantage with Diversity-Aware Exploration Rewards for Multi-Answer Question Answering

SPADER is an RL framework for tool-augmented LLM agents in Multi-Answer QA. It introduces Step-wise Peer Advantage (SPA) for fine-grained credit assignment over long trajectories, and a diversity-aware exploration reward promoting rare entity discovery. Evaluated on QAMPARI, Mintaka, WebQSP, QUEST: improves recall and F1 vs prompting and supervised RL baselines.

AI Agents · Reinforcement learning · Reasoning
SIG 78 · HYP 18
arXiv cs.CL

Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models

Study on preventing catastrophic forgetting during continual pretraining of multilingual language models. The authors propose five parameter alignment strategies (including layer freezing, regularization, post-hoc reversion, and model merging), tested across 32 languages and four evaluation axes. Parameter alignment substantially reduces forgetting while maintaining language acquisition.

Fine-tuning · Papers · Benchmarks
SIG 78 · HYP 15
arXiv cs.CL

AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection

AEyeDE introduces an attention-based attribution framework for detecting AI-generated text using attention matrices from a proxy Transformer model. A lightweight CNN learns discriminative representations from these attribution maps. The method outperforms text-only baselines, shows strong generator-specific detection, and demonstrates robustness under cross-dataset transfer and spelling perturbations.

Papers · AI safety · Evals
SIG 72 · HYP 18
arXiv cs.CL

TCAR-Gen: Temporal Graph Retrieval with Evidence Fusion for Knowledge-Grounded Generation

TCAR-Gen combines query-conditioned graph neural networks, temporal evidence fusion, and chain-of-trees reasoning for retrieval-augmented generation. It achieves 0.3738 Recall@5 on the Victorian Crime Diaries benchmark, outperforming Vanilla RAG, Temporal RAG, and GraphRAG variants. Cross-model evaluation from GPT-OSS 20B down to TinyLlama 1.1B shows robust retrieval coverage at smaller scales.

RAG · Reasoning · Benchmarks
SIG 72 · HYP 18
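For readers unfamiliar with the Recall@5 figure quoted for TCAR-Gen, here is a minimal sketch of the usual per-query definition. It assumes recall is the fraction of a query's relevant documents that appear in the top k results; the benchmark's exact averaging scheme is not stated in the summary, and the document IDs below are hypothetical.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention chosen here for queries with no relevant items
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranking for one query: two of the three relevant documents
# ("d1" and "d4") fall inside the top five; "d2" is ranked sixth.
ranking = ["d3", "d7", "d1", "d9", "d4", "d2"]
score = recall_at_k(ranking, relevant={"d1", "d2", "d4"}, k=5)
print(round(score, 4))
```

A corpus-level number such as the one reported would typically be the mean of this per-query score, though that is an assumption here.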
arXiv cs.CL

Graph-Augmented Retrieval for Cross-Entity Financial Sentiment Analysis: A Comparative Study

Comparative study of a two-hop Graph-RAG architecture versus standard vector-only RAG for cross-entity financial sentiment analysis. On 100 queries (30 direct, 70 relational), Graph-RAG improves entity recall (+6.4%, p<0.001) and answer relevancy for complex queries (+11.7%) with no quality degradation. Latency rises by a modest 22.6%, while variance falls by 80%.

RAG · Benchmarks · Papers
SIG 78 · HYP 15
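The "two-hop" idea in the Graph-RAG summary can be sketched as a neighborhood expansion over an entity graph. This is only an illustration under assumptions: the adjacency-dict representation, the function name `two_hop_neighbors`, and the company names are all hypothetical, and the study's actual retrieval pipeline is not described in the summary.

```python
def two_hop_neighbors(graph, seed):
    """Return all entities reachable from seed in one or two hops."""
    first_hop = set(graph.get(seed, ()))
    second_hop = set()
    for neighbor in first_hop:
        second_hop.update(graph.get(neighbor, ()))
    # Exclude the seed itself, which reappears via back-edges.
    return (first_hop | second_hop) - {seed}


# Hypothetical entity graph: edges link companies that are related
# (for example, supplier relationships).
entity_graph = {
    "AcmeCorp": ["SupplierX"],
    "SupplierX": ["AcmeCorp", "RawMetalCo"],
    "RawMetalCo": ["SupplierX", "MineY"],
}
related = two_hop_neighbors(entity_graph, "AcmeCorp")
print(sorted(related))
```

Here "MineY" is three hops from "AcmeCorp" and is therefore left out, which shows how a hop limit bounds the extra context pulled in for relational queries.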
arXiv cs.CL

Enhancing BiGRU with a KAN Block for Legal Document Classification and Summarization

BiGRU architecture enhanced with a KAN (Kolmogorov-Arnold Network) block for legal document classification and summarization in a low-resource multilingual setup. Evaluation on a Bengali/English/transliterated corpus from Bangladesh: 67.96% classification accuracy (F1=0.65) and ROUGE-1/2/L scores of 0.38/0.23/0.31 for summarization. An ablation study shows the KAN block improves classification accuracy from 57.34% to 67.96%.

Benchmarks · Fine-tuning
SIG 45 · HYP 25
arXiv cs.CL

BOUTEF: A Multilingual Corpus for FakeNews in North Africa -- Language as a Weapon

BOUTEF is a multilingual corpus from two countries (Algeria and Tunisia) covering fake news, authentic narratives, comments, and debunking. It includes MSA, Algerian and Tunisian dialects, Arabizi, French, English, and code-switching. Analysis shows fake news relies on emotionally charged narratives and sensational framing, while debunking adopts a factual, verification-oriented style.

Papers · Benchmarks · AI safety
SIG 72 · HYP 18
arXiv cs.CL

Which Institutional Frameworks Do Chatbots Assume? Auditing Jurisdictional Defaults in Multilingual LLMs

Audit of 7 LLMs (US/China) on 2,520 responses to 60 legal-administrative prompts in English and Mandarin. Models default to the institutional framework of input language: 74.5% of English responses adopt US framework, 53.3% of Chinese responses adopt China framework. Risk of jurisdictional misselection when preferred language differs from applicable jurisdiction.

Benchmarks · AI safety · Regulation
SIG 78 · HYP 15
arXiv cs.CL

Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

Researchers reveal that statistical watermarks in LLMs are vulnerable to linear ensembles. Averaging probability distributions across 3-5 models cancels out watermark perturbations. WASH (Watermark Attenuation via Statistical Hybridisation) defeats detection across 6 watermarking schemes, reducing z-scores from 5-300 to <2 (threshold: 4), while improving output quality by 27.5%.

AI safety · Alignment · Papers
SIG 82 · HYP 25
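The linear-ensemble idea behind the watermark result can be sketched in a few lines. This is not the WASH method itself, only a toy illustration of averaging next-token distributions; the two-token vocabulary, the probabilities, and the name `average_distributions` are assumptions made for the example.

```python
def average_distributions(dists):
    """Uniformly average several next-token distributions (token -> prob)."""
    vocab = set().union(*dists)
    n = len(dists)
    return {tok: sum(d.get(tok, 0.0) for d in dists) / n for tok in vocab}


# Hypothetical next-token distributions from three models. Assume the
# unwatermarked probability of "a" is 0.5 and each model's watermark
# perturbs it differently.
model_a = {"a": 0.70, "b": 0.30}  # strong push toward "a"
model_b = {"a": 0.40, "b": 0.60}  # push toward "b"
model_c = {"a": 0.45, "b": 0.55}  # mild push toward "b"

mixed = average_distributions([model_a, model_b, model_c])
print({tok: round(p, 4) for tok, p in sorted(mixed.items())})
```

In this toy case the averaged probability of "a" sits much closer to 0.5 than model A's does, since perturbations that point in different directions partly offset each other. How reliably this holds for real watermarking schemes is what the paper evaluates.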
arXiv cs.CL

Auditing LLM Benchmarks with Item Response Theory

An Item Response Theory-based method detects mislabels in 7 LLM benchmarks at 95% precision on the top 200 examples across 114 models. Analysis reveals errors from mechanical labeling heuristics, inherited annotation mistakes, and fundamentally ambiguous items. Reward models specialize in stylistic preference over factual knowledge; one frontier model agrees with detected mislabels at 78% accuracy versus 38% for peers.

Benchmarks · Evals · Papers
SIG 78 · HYP 15
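As background for the Item Response Theory item, the sketch below shows how a standard two-parameter logistic (2PL) model can make an item look suspicious. The summary does not say which IRT variant or scoring rule the paper uses, so the 2PL form, the `surprise` score, and all ability and difficulty values here are assumptions for illustration.

```python
import math


def p_correct(ability, difficulty, discrimination=1.0):
    """2PL item response function: probability a model answers correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


def surprise(responses, abilities, difficulty):
    """Mean negative log-likelihood of the observed responses to one item."""
    total = 0.0
    for correct, theta in zip(responses, abilities):
        p = p_correct(theta, difficulty)
        total -= math.log(p if correct else 1.0 - p)
    return total / len(responses)


# Hypothetical abilities for three models and one nominally easy item.
abilities = [2.0, 1.5, -1.0]
easy_item = -1.0

expected_pattern = surprise([True, True, False], abilities, easy_item)
# Strong models "fail" while the weak one "passes": one possible sign
# that the gold label, rather than the models, is wrong.
suspicious_pattern = surprise([False, False, True], abilities, easy_item)
print(expected_pattern < suspicious_pattern)
```

Ranking items by a score like this and inspecting the top of the list is one plausible way to surface candidate mislabels for human review.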