arXiv cs.CL

Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

Benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels. LLMs preserve uncertainty expressions in fewer than 50% of cases and struggle with nuanced distinctions between adjacent levels. The benchmark reveals a failure mode missed by standard metrics.

Benchmarks AI safety Evals

arXiv cs.CL·Jun 18

Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction

RPCL, a training-only framework for multimodal emotion-cause pair extraction, improves pair-confidence robustness. Using margin constraints and contextual corruption, it increases Pair F1 by 2.58–2.83 points on ECF/MECAD/MEC4 without changing inference.

Papers Benchmarks Vision

arXiv cs.CL·Jun 18

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

arXiv study assessing LLMs' ability to interpret negation in figurative language. Researchers annotate an existing dataset and evaluate multiple models. Finding: negation combined with figurativeness presents a particular challenge, with performance heavily dependent on prompt style.

Evals Prompt engineering Reasoning

arXiv cs.CL·Jun 18

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

Local de-identification framework for educational dialogues. Two-stage cascade: a union proposer (lightweight encoders + deterministic rules) generates PII candidates, then a binary Redact/Keep reviewer uses dialogue context and speaker role. Achieves 0.958 macro F1 on math tutoring transcripts, outperforming a commercial API (0.706) and a local LLM baseline (0.767), and runs on a single laptop.

RAG AI safety Papers

arXiv cs.CL·Jun 18

JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting

JetFlow improves speculative decoding by combining parallel drafting efficiency with branch-wise causal conditioning. On H100 GPUs, it achieves 9.64x speedup on MATH-500 and 4.58x on open-ended conversations, outperforming existing tree-based methods on dense and MoE Qwen3 models.

Benchmarks Code generation Open source

arXiv cs.CL·Jun 18

VISUALSKILL: Multimodal Skills for Computer-Use Agents

VISUALSKILL introduces hierarchical multimodal skills for computer-use agents. Combining authored documentation with live UI exploration, the system improves Claude Opus 4.6 performance by +15.3 points on CUA-World and OSExpert-Eval (0.456 vs 0.303 baseline). Visual figures outperform text-only descriptions (+8.3 points).

Claude AI Agents MCP

arXiv cs.CL·Jun 18

LLM Parameters for Math Across Languages: Shared or Separate?

Mechanistic analysis of mathematical reasoning in multilingual LLMs. Math-associated parameters exhibit partial cross-lingual overlap, concentrated in intermediate layers. English produces the largest set of math-relevant parameters, while lower-resource languages reveal smaller parameter sets.

Reasoning Papers Benchmarks

arXiv cs.CL·Jun 18

Montreal Forced Aligner and the state of speech-to-text alignment in 2026

Montreal Forced Aligner 3.0, the reference tool since 2016 for forced speech-to-text alignment, achieves state-of-the-art performance on English, Japanese, and Korean with boundary errors <15ms. New capabilities: model adaptation, cross-language phone remapping, expanded language/dialect coverage, harmonized IPA dictionaries.

Voice Benchmarks Open source

arXiv cs.CL·Jun 18

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

Study of collateral damage in LLM machine unlearning. Authors show damage propagates beyond the forget set following semantic distance gradients, and propose PreUnlearn, a pre-unlearning prediction method to audit risks before execution.

AI safety Alignment Papers

arXiv cs.CL·Jun 18

Towards Scalable Customization and Deployment of Multi-Agent Systems for Enterprise Applications

Framework for customization and efficient deployment of LLM-based multi-agent systems in enterprise settings. Combines continual pretraining, supervised fine-tuning, and preference optimization to adapt compact models to specialized domains. Integrates speculative decoding and FP8 quantization to reduce latency and costs. Achieves 4.48x throughput speedup while maintaining performance.

Multi-agent Fine-tuning Business

arXiv cs.CL·Jun 18

MCompassRAG: Topic Metadata as a Semantic Compass for Paragraph-Level Retrieval

MCompassRAG improves RAG systems by using topic-level metadata as a semantic compass for paragraph-level retrieval. The method enriches chunk representations with topic signals in the same embedding space and trains a lightweight retriever via LLM-teacher distillation. Across six benchmarks, it gains 8.24% in information efficiency with 5× lower latency than efficient RAG baselines.

RAG Embeddings Benchmarks

arXiv cs.CL·Jun 18

Dual Dimensionality for Local and Global Attention

Researchers propose Distance-Adaptive Representation (DAR): reduce key/value dimensionality beyond a local window in decoder-only Transformers. Nearby tokens require full representations for next-token prediction, while distant tokens can use 1/4 original dimensionality without performance loss. Tested on 70M–410M models and 1B fine-tuning.

Reasoning Infrastructure Benchmarks

arXiv cs.CL·Jun 18

Low-resource Language Discrimination Towards Chinese Dialects with Transfer learning and Data Augmentation

CDDTLDA framework for Chinese dialect discrimination with scarce annotation resources. Uses transfer learning on ASR models, data augmentation (speed, pitch, noise), and self-attention to capture shared semantic features. Outperforms state-of-the-art on two benchmark corpora.

Voice Benchmarks

arXiv cs.CL·Jun 18

Steerable Cultural Preference Optimization of Reward Models

Novel SCPO algorithm for training reward models that balance diverse cultural preferences across subcommunities. Achieves 7-point improvements for minority reward models on PRISM and GlobalOpinionQA (7 countries), with 280% better training data efficiency than full-finetuning.

Alignment Reinforcement learning Evals

arXiv cs.CL·Jun 18

BCL: Bayesian In-Context Learning Framework for Information Extraction

BCL is an optimization framework for information extraction using particle filtering and Bayesian updates to systematically refine label representations. It generalizes across sequence labeling and relation classification tasks, demonstrating consistent improvements over existing approaches across model scales.

Prompt engineering Reasoning Evals

arXiv cs.CL·Jun 18

PEC-Home: Interpretation of Progressively Elliptical Commands in Smart Homes

PEC-Home is a simulated home dataset for interpreting progressively elliptical commands in smart homes. Current assistants (including GPT-4o) fail to accurately execute these commands, which grow more abbreviated as shared context accumulates, even when equipped with dialogue history retrieval.

AI Agents Benchmarks RAG

arXiv cs.CL·Jun 18

TW-LegalBench: Measuring Taiwanese Legal Understanding

TW-LegalBench evaluates 13 LLMs on Taiwanese law using 16,000+ multiple-choice questions, 117 open-ended essays, and 14,000+ legal judgment prediction cases. Top models exceed the lawyer qualification threshold (11%) but fall short of the threshold for judges and prosecutors (1-2%). Models struggle to cite exact legal articles.

Benchmarks Evals Reasoning

arXiv cs.CL·Jun 18

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Morpheus is a morphology-aware neural tokenizer for agglutinative Turkish. The model uses differentiable Poisson-binomial dynamic programming to segment morphemes with 1.425 bits-per-character compression and MorphScore macro-F1 of 0.61 (vs ~0.32 for subword tokenizers). Lossless by construction: decode(encode(w)) = w.

Embeddings Papers Open source

arXiv cs.CL·Jun 18

Output Vector Editing for Memorization Mitigation in Large Language Models

Memorization suppression method in LLMs via output vector editing of MLP neurons. Tested on 4 models (360M-7B parameters), achieves 87.9% suppression on OLMo-7B with 6831 memorized sequences. Complementary approach to existing neuron ablation methods.

AI safety Alignment Papers

arXiv cs.CL·Jun 18

RedactionBench

RedactionBench is a manually annotated benchmark of 200 documents across 11 domains for evaluating PII redaction in context. Introduced with R-Score, a character-level metric, it shows 35 models (NER, SLM, frontier models) fail on contextual redactions: human consensus 89.4% for mandatory redactions, 47.7% for contextual ones.

Benchmarks AI safety Evals

arXiv cs.CL·Jun 18

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

Study on evaluating AI-generated radiology reports. Researchers show existing LLMs over-penalize harmless rephrasings while detecting clinical errors. They train lightweight metrics on Qwen3-8B and MedGemma-4B outperforming 32B medical models, with dataset and metric release planned.

Benchmarks Evals Papers

arXiv cs.CL·Jun 18

ScholarSum: Student-Teacher Abstractive Summarization via Knowledge Graph Reasoning and Reflective Refinement

ScholarSum introduces a hierarchical knowledge graph framework for abstractive scientific summarization. The system organizes documents into semantically coherent units, generates an initial draft, then refines it through iterative verification and rewriting to ensure logical coherence and factual faithfulness.

Papers RAG Reasoning

arXiv cs.CL·Jun 18

Approximate Structured Diffusion for Sequence Labelling

New approach combining diffusion and CRF for sequence labelling in NLP. Method conditions a CRF on the full label sequence (noisy), bypassing span limitations of standard CRFs. Results: 16.5% error reduction on POS-tagging.

Papers Reasoning Benchmarks

arXiv cs.CL·Jun 18

Improving Medical Communication using Rubric-Guided Counterfactual Recommendations

LM-guided counterfactual recommendation pipeline to improve medical communication in text-based telemedicine. System identifies interpretable features (tone, personalization, clarity, completeness) and recommends minimal communication changes predicted to increase positive feedback (+6.41% mean gain). Modifications preserve medical content and physician control.

Reasoning Evals RAG

arXiv cs.CL·Jun 18

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

SAGE is a stochastic prompt optimization framework using multi-agent guided exploration. Compares three strategies: error-informed random search, genetic algorithm, and SAGE with diagnostic code execution. Deployed on mental-health chatbot: 8 cycles of noisy A/B tests compound into statistically robust next-day retention gain.

Prompt engineering AI Agents Multi-agent

arXiv cs.CL·Jun 18

Continuous Audio Thinking for Large Audio Language Models

Continuous Audio Thinking (CoAT) adds a continuous latent workspace to large audio language models to preserve acoustic information (phonetics, prosody, affect, pitch) before text generation. Tested on Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo, CoAT improves performance on audio reasoning, music classification, and transcription with no additional decoding cost.

Reasoning Voice Qwen

arXiv cs.CL·Jun 18

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

Activation steering improves synthetic data generation for low-resource languages. Two strategies tested: Language Steering (linguistic identity) and Quality Steering (well-formedness). Evaluation across 4 open-source LLMs, 11 languages, classification tasks. Early-layer steering increases diversity and downstream performance.

Prompt engineering Fine-tuning Benchmarks

arXiv cs.CL·Jun 18

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

CoreMem introduces a memory architecture for personalized dialogue agents on edge devices (8 GB VRAM). Replaces cosine similarity with Fisher-Rao metric for retrieval and uses Fisher-guided token distillation for compression. Achieves +4.51 pp gains in open-domain reasoning and +4.17 pp in temporal reasoning on LOCOMO and LongMemEval-S benchmarks.

AI Agents RAG Embeddings

arXiv cs.CL·Jun 18

Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

Paper presents a speech-driven approach for Chinese dialect discrimination. It combines MFCC features, an HMM-DNN speech recognition model, an attention mechanism, and a CNN. Evaluation on two benchmark Chinese dialect corpora shows improvement over state-of-the-art methods.

Voice Benchmarks Papers

arXiv cs.CL·Jun 18

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

PhysAssistBench is an interactive medical assistance benchmark with 1,296 physician-validated turns built from real MIMIC-IV cases. It evaluates LLMs' ability to coordinate clinical knowledge, patient communication, and EHR system interaction within single dialogues. Experiments show current models remain unreliable in this setting.

Benchmarks AI Agents Multi-agent

arXiv cs.CL·Jun 18

PragReST: Self-Reinforcing Counterfactual Reasoning for Pragmatic Language Understanding

PragReST is a self-supervised framework improving LLM pragmatic reasoning through counterfactual reasoning traces. Without human-labeled data, it combines supervised fine-tuning and reinforcement learning. On 4 benchmarks (PragMega, Ludwig, MetoQA, AltPrag), it gains +5.37% and +5.50% absolute for Qwen3-8B and Qwen3-14B.

Reasoning Reinforcement learning Fine-tuning

arXiv cs.CL·Jun 18

RegMix-D: Dynamic Data Mixing via Proxy Training Trajectories

RegMix-D extends RegMix by leveraging full loss trajectories from proxy runs, not just endpoint losses, to predict optimal data mixtures at multiple training stages. Tested on 25B tokens of Pile with a 1B model, RegMix-D outperforms RegMix and DoReMi across 13 downstream tasks while using 75% less proxy compute.

Benchmarks Papers

arXiv cs.CL·Jun 18

LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment

Study evaluating 42 LLMs (proprietary and open-source) on their ability to measure item discrimination in reading comprehension. Models fail: Spearman correlation of 0.152 in direct prediction, 0.241 in CTT calibration. LLMs do not reliably capture how assessment items distinguish students of different proficiency levels.

Benchmarks Evals Papers

arXiv cs.CL·Jun 18

Lost in a Single Vector: Improving Long-Document Retrieval with Chunk Evidence Aggregation

DICE improves long-document retrieval by splitting documents into chunks, encoding each independently, then aggregating vectors into a single representation. On LongEmbed, scores reach 90.0 for Dream Passkey >4k (vs 30.0) and 74.0 for Needle >4k (vs 23.3). The approach reduces the Evidence Dilution Index (EDI) in 92.8% of cases.

RAG Embeddings Vector search

arXiv cs.CL·Jun 18

Beyond Reward Engineering: A Data Recipe for Long-Context Reinforcement Learning

arXiv paper on improving long-context reasoning via data-centric approach rather than reward engineering. Data recipe targeting retrieval, multi-evidence synthesis, reasoning (~14K examples). Tests on Qwen3 (4B/8B/30B): +7.2/+3.2/+6.4 points across 7 long-context benchmarks, transfer to agentic tasks (+4.8 GAIA, +7.0 BrowseComp).

Reinforcement learning Reasoning AI Agents

arXiv cs.CL·Jun 18

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

ImpSH, a triplet-based framework, improves implicit hate speech detection by aligning posts with implied statements and using context-bounded semi-hard negatives. Evaluated on IHC, SBIC, and DynaHate with BERT and HateBERT, it enhances cross-domain performance and provides more stable representations than standard supervised contrastive approaches.

Benchmarks AI safety Papers

arXiv cs.CL·Jun 18

Efficient Financial Language Understanding via Distillation with Synthetic Data

Distillation framework with synthetic data for financial sentiment analysis. Knowledge transfer from large instruction-tuned teacher to compact student models. Clustering-based seed selection generates synthetic examples via few-shot prompting. Compact model outperforms teacher on complex/noisy text with minimal supervision.

Fine-tuning RAG Prompt engineering

arXiv cs.CL·Jun 17

LLMs Infer Cultural Context but Fail to Apply It When Responding

LLMs can infer cultural context but fail to apply it in responses. A new CAPRI dataset shows models recognize cultural conventions (measurement units, time interpretation) but don't spontaneously use them unless explicitly instructed. Biases remain aligned with the model's country of origin.

Benchmarks Alignment AI safety

arXiv cs.CL·Jun 17

When Multiple Scripts Matter: Evaluating ASR in Clinical Settings

MultiClin, a clinical ASR benchmark, evaluates speech recognition model robustness to multiscript variability (multiple valid orthographic forms of the same term). Conventional metrics underestimate performance. Script unification consistently yields best ASR performance.

Benchmarks Voice Evals

arXiv cs.CL·Jun 17

PromptMN: Pseudo Prompting Language

PromptMN is a domain-specific language that structures natural prompts with %-prefixed typed directives (roles, goals, constraints, outputs). Tested on Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5 without fine-tuning, it reduces context ambiguities in agent and software development workflows.

Prompt engineering AI Agents Tools

arXiv cs.CL·Jun 17

From Parasocial Scripts to Dyadic Persistence in Autonomous AI-Agent Communities

Analysis of 4,434 posts and 50,338 comments on Moltbook showing parasocial interaction cues (intimacy language, reciprocity bids, self-identification) persist in autonomous AI-agent communities. Results validated through keyword matching and LLM annotation reveal strong association between these signals and original poster re-engagement and sustained dyadic patterns.

AI Agents Multi-agent Papers

arXiv cs.CL·Jun 17

Self-Generated Error Training for Token Editing in Diffusion Language Models

Training method to improve token editing in diffusion language models (LLaDA2.1). Addresses training-inference mismatch between random corruptions and model's own errors. Uses no-gradient draft pass followed by supervision on self-generated corruptions via LoRA. Reduces edit intensity and transcription errors.

Code generation Fine-tuning Reasoning

arXiv cs.CL·Jun 17

Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation

Study of LLM verbalized confidence reliability in machine translation. Five methods for extracting per-token confidence without internal signal access are compared against predicted probabilities. Results: similar performance for error detection and calibration, but little correlation between internal and verbalized methods.

Evals Reasoning

arXiv cs.CL·Jun 17

MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task

MLLP-VRAIN group participates in IWSLT 2026 simultaneous speech translation using Parakeet and Qwen 3.5 models. Cascaded system with adaptive policies and RAG mechanism for domain-specific context. +5.82 XCOMET-XL improvement on En→De test set versus previous year.

Qwen RAG Code generation

arXiv cs.CL·Jun 17

Are you speaking my languages? On spoken language adherence in multimodal LLMs

LLM-based ASR systems often misidentify output languages in multilingual contexts. Authors propose three mitigation strategies: zero-shot prompting, supervised fine-tuning, and Chain-of-Thought reasoning to improve language adherence while preserving code-switching flexibility and ASR performance.

Voice Prompt engineering Fine-tuning

arXiv cs.CL·Jun 17

Do Large Language Models Always Tell The Same Stories?

Comparative study of narrative diversity across 10 LLMs versus human authors using the r/WritingPrompts dataset. Models generate stories significantly more similar to each other than human-written texts are, converging toward a generic mean narrative. Temperature scaling and negative prompting fail to address this homogeneity.

Evals Benchmarks Reasoning

arXiv cs.CL·Jun 17

Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

Two recent studies reach contradictory conclusions about LVLMs' ability to coordinate efficient referring expressions. This research controls for task differences and directly compares prompting styles. Models coordinate efficiently with explicit prompting but fail to infer communicative efficiency needs from implicit prompts.

Prompt engineering Vision Evals

arXiv cs.CL·Jun 17

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

AIPatient Arena evaluates LLMs in multi-turn clinical consultation across 8 competence dimensions using EHR-grounded knowledge graphs. On 437 patients, models excel in questioning (4.43-4.99/5) and ethical conduct (4.38-4.93/5), but fail in diagnostic accuracy (2.63-3.55/5) and information coverage (2.08-3.02/5). Weaknesses include repetitive questioning, omitted medical history, inadequate uncertainty handling.

Evals Reasoning AI safety

arXiv cs.CL·Jun 17

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

Study of second-order bias in LLMs: how models judge biased content, beyond generation. Grounded in entitlement epistemology, the method evaluates whether LLMs infer demographics without sufficient support. Findings: systematic bias across target groups, evasion of safety guardrails, persistence of demographic triggers.

Evals AI safety Alignment

arXiv cs.CL·Jun 17

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

Study on agent routing at scale: with 110 agents and 584 tools, F1 drops 16–23 percentage points on under-specified requests. Analysis decomposes the degradation into a retrieval gap and a confusion gap (10pp oracle ceiling loss). Embedding-based shortlisting recovers +10–11pp F1 at full scale across models.

AI Agents Multi-agent Evals

arXiv cs.CL — AI feed · Signal IA