arXiv cs.AI

Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs

A new approach to generating formal theorem-proving challenges by leveraging theoretical computer science (TCS). The framework automatically synthesizes problem-proof pairs in Lean 4 and Markdown across two domains: Busy Beaver and Mixed Boolean Arithmetic. DeepSeek-Prover-V2-671B achieves 57.5% on Busy Beaver but only 12% on Mixed Boolean Arithmetic, revealing major gaps in long-form proof generation.

Reasoning · Benchmarks · Papers
SIG 78 · HYP 15
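
To give a feel for the Mixed Boolean Arithmetic domain mentioned above, here is a minimal sketch that checks two textbook MBA rewrites by brute force over 8-bit integers. The identities, the 8-bit width, and the function names (`mba_add`, `mba_sub`, `holds_for_all_inputs`) are illustrative assumptions; the summary does not say which identities the benchmark contains or how its Lean proofs are structured.

```python
from itertools import product

BITS = 8                 # assumed word size, chosen so exhaustive checking is cheap
MASK = (1 << BITS) - 1   # all arithmetic is taken modulo 2**BITS


def mba_add(x: int, y: int) -> int:
    """x + y rewritten as (x XOR y) + 2 * (x AND y), modulo 2**BITS."""
    return ((x ^ y) + 2 * (x & y)) & MASK


def mba_sub(x: int, y: int) -> int:
    """x - y rewritten as (x XOR y) - 2 * (NOT x AND y), modulo 2**BITS."""
    return ((x ^ y) - 2 * (~x & y)) & MASK


def holds_for_all_inputs() -> bool:
    """Compare both rewrites with plain modular arithmetic on every input pair."""
    return all(
        mba_add(x, y) == (x + y) & MASK and mba_sub(x, y) == (x - y) & MASK
        for x, y in product(range(1 << BITS), repeat=2)
    )


if __name__ == "__main__":
    print(holds_for_all_inputs())
```

An exhaustive check like this only works for small bit widths; a formal proof in Lean would instead have to establish the identity for every width, which is where the difficulty for provers comes from.
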
arXiv cs.CL

SignRoundV2: Toward Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

SignRoundV2 is a post-training quantization framework for LLMs that maintains performance under extreme compression (2-4 bits). It combines an adaptive mixed-precision strategy guided by gradient information with lightweight stabilization techniques. Results show a gap of about 1% in performance at an average of 4.5 bits in mixed MXFP settings, with substantial improvements in 2-bit weight-only quantization.

Fine-tuning · Benchmarks · Open source
SIG 78 · HYP 15
arXiv cs.CL

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

Colluding LLM agents manipulate a victim's beliefs by coordinating truthful evidence fragments through public channels, without covert communication. The Generative Montage framework (Writer-Editor-Director) constructs deceptive narratives via adversarial debate. Attack success rates reach 74.4% on proprietary models and 70.6% on open-weight models across 14 LLM families. Advanced reasoning models show higher susceptibility.

AI Agents · Multi-agent · AI safety
SIG 78 · HYP 35
arXiv cs.CL

Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate

SDRL (Self-Debate Reinforcement Learning) trains LLMs both to solve problems on their own and to benefit from multi-agent debate (MAD). The model samples multiple solutions, constructs a debate context from diverse reasoning paths, then jointly optimizes the initial and debate-conditioned responses. Results: consistent MAD performance gains across benchmarks and agent configurations.

Reasoning · Reinforcement learning · Multi-agent
SIG 78 · HYP 22
arXiv cs.AI

SignRoundV2: Toward Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

SignRoundV2 is a post-training quantization framework for LLMs that maintains performance under extreme compression (2-4 bits). It combines an adaptive mixed-precision strategy guided by gradients with lightweight stabilization techniques. Results show a gap of about 1% in performance at an average of 4.5 bits in mixed MXFP, with substantial improvements in challenging 2-bit weight-only quantization.

Fine-tuning · Benchmarks · Infrastructure
SIG 78 · HYP 18
arXiv cs.AI

SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking

SynCABEL uses LLMs to generate contextualized synthetic training examples, addressing the scarcity of annotated data in biomedical entity linking. The framework achieves state-of-the-art results on MedMentions (English), QUAERO (French), and SPACCC (Spanish), matching the performance of full human supervision with 60% less annotated data. An LLM-as-a-judge protocol evaluates clinical validity.

Papers · Benchmarks · RAG
SIG 78 · HYP 15
arXiv cs.AI

LaDi-RL: Latent Diffusion Reasoning Prevents Entropy Collapse in Reinforcement Learning

LaDi-RL optimizes LLM reasoning via RL in latent space using diffusion. Instead of optimizing token sequences, the method generates latent reasoning trajectories through iterative denoising. It handles credit assignment (rewards are observed only after decoding) via hierarchical latent-text rollouts. Reported gains: +9.4% on code generation and +5.7% on math reasoning (pass@1).

Reinforcement learning · Reasoning · Code generation
SIG 78 · HYP 25
arXiv cs.AI

GIST: Targeted Data Selection for Instruction Tuning via Coupled Optimization Geometry

GIST introduces targeted data selection for instruction tuning by replacing axis-aligned scaling with robust subspace alignment via SVD. It recovers task-specific subspaces from validation gradients and scores examples by their alignment with the target directions. GIST matches or outperforms state-of-the-art baselines using only 0.29% of the storage and 25% of the computational time.

Fine-tuning · Reinforcement learning · Papers
SIG 78 · HYP 15
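
The subspace-alignment idea in the GIST summary can be sketched roughly as follows. This is a generic illustration and not the paper's implementation: the function names (`target_subspace`, `alignment_scores`, `select_top`), the use of raw per-example gradient vectors, and the choice of normalized projection norm as the score are all assumptions made for the example.

```python
import numpy as np


def target_subspace(val_grads: np.ndarray, rank: int) -> np.ndarray:
    """Return a (d x rank) orthonormal basis for the top singular directions
    of the validation-gradient matrix (one gradient per row)."""
    _, _, vt = np.linalg.svd(val_grads, full_matrices=False)
    return vt[:rank].T


def alignment_scores(train_grads: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Score each training gradient by the fraction of its norm that lies
    inside the target subspace (1.0 = fully aligned, 0.0 = orthogonal)."""
    projected = train_grads @ basis                  # coordinates in the subspace
    norms = np.linalg.norm(train_grads, axis=1)
    return np.linalg.norm(projected, axis=1) / np.maximum(norms, 1e-12)


def select_top(train_grads: np.ndarray, val_grads: np.ndarray,
               rank: int, budget: int) -> np.ndarray:
    """Indices of the `budget` training examples with the highest alignment."""
    scores = alignment_scores(train_grads, target_subspace(val_grads, rank))
    return np.argsort(-scores)[:budget]


if __name__ == "__main__":
    val = np.array([[1.0, 0, 0, 0], [0, 2.0, 0, 0], [1.0, 1.0, 0, 0]])
    train = np.array([[1.0, 0, 0, 0], [0, 0, 1.0, 0], [1.0, 0, 1.0, 0]])
    print(select_top(train, val, rank=2, budget=2))
```

In practice, gradients of an LLM are far too large to store directly, so a real system would work with compressed or projected gradient features; the sketch ignores that step.
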
arXiv cs.AI

Surgical Post-Training: Proximal On-Policy Distillation for Reasoning with Knowledge Retention

SPOT (Surgical Post-Training) is an on-policy distillation framework that injects reasoning capabilities into LLMs while preserving prior knowledge. With only 4k rectified math pairs, it improves Qwen3-8B by 6.2% on average in 16 minutes on 8x H800 GPUs. The approach uses a KL-constrained reward formulation to mitigate catastrophic forgetting.

Fine-tuning · Reinforcement learning · Reasoning
SIG 78 · HYP 25
arXiv cs.CL

QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation

QuCo-RAG proposes dynamic RAG grounded in pre-training corpus statistics rather than model-internal signals. It identifies low-frequency entities and verifies their co-occurrence across 4 trillion tokens using Infini-gram. On multi-hop QA benchmarks, it gains 5–12 EM points over baselines with OLMo-2, and up to 14 points with Llama-3, Qwen2.5, and GPT-4.

RAG · Reasoning · Benchmarks
SIG 78 · HYP 18
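
A corpus-statistics retrieval trigger of the kind the QuCo-RAG summary describes might look like the sketch below. Everything here is an assumption for illustration: plain dictionaries stand in for Infini-gram lookups, and the function name `needs_retrieval`, the threshold `min_entity_count`, and the all-pairs co-occurrence rule are hypothetical and not taken from the paper.

```python
from typing import Mapping, Sequence, Tuple


def needs_retrieval(
    entities: Sequence[str],
    entity_counts: Mapping[str, int],
    pair_counts: Mapping[Tuple[str, str], int],
    min_entity_count: int = 1000,   # hypothetical rarity threshold
) -> bool:
    """Decide whether to retrieve, using corpus counts instead of model signals.

    Retrieval is triggered when any entity is rare in the corpus, or when any
    pair of entities never co-occurs (keys of pair_counts are sorted tuples).
    """
    # Rule 1: a low-frequency entity suggests the model saw little about it.
    if any(entity_counts.get(e, 0) < min_entity_count for e in entities):
        return True
    # Rule 2: entities that never appear together suggest an unsupported link.
    for i, a in enumerate(entities):
        for b in entities[i + 1:]:
            key = tuple(sorted((a, b)))
            if pair_counts.get(key, 0) == 0:
                return True
    return False


if __name__ == "__main__":
    counts = {"Paris": 5000, "France": 9000, "Zzyx": 3}
    pairs = {("France", "Paris"): 400}
    print(needs_retrieval(["Paris", "France"], counts, pairs))
    print(needs_retrieval(["Paris", "Zzyx"], counts, pairs))
```

The appeal of this design is that the trigger depends only on observable corpus counts, so it does not rely on the model's own confidence estimates.
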
arXiv cs.CL

Rethinking Table Pruning in TableQA: From Sequential Revisions to Gold Trajectory-Supervised Parallel Search

TabTrim, a novel table pruning framework for TableQA, replaces sequential revisions with gold trajectory-supervised parallel search. The system uses intermediate sub-tables from gold SQL queries to train a pruner and a verifier. TabTrim-8B achieves 73.5% average accuracy, outperforming the strongest baselines by 3.2% (79.4% on WikiTQ, 61.2% on TableBench).

Benchmarks · Reasoning · Papers
SIG 78 · HYP 25
arXiv cs.CL

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

STING is an automated red-teaming framework that measures multi-turn illicit assistance in LLM agents. It constructs step-by-step illicit plans grounded in benign personas and uses judge agents to track completion. Multilingual evaluation across six non-English languages shows that attack success does not consistently increase in lower-resource languages, diverging from findings for chatbots.

AI Agents · AI safety · Evals
SIG 78 · HYP 25