arXiv cs.CL

BioELX: Cross-lingual Biomedical Entity Linking via Alias-based Retrieval and LLM Ranking

BioELX is a two-stage cross-lingual biomedical entity linking system that requires no annotated training data. It enriches SapBERT with multilingual aliases derived from Wikidata and uses an LLM for context-aware disambiguation. Evaluated on five benchmarks, it improves Recall@1 on XL-BEL by 19.2, with major gains for low-resource languages (Turkish +21.6, Korean +22.1, Thai +30.8).
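
For readers unfamiliar with the metric quoted above, here is a minimal sketch of how Recall@k is commonly computed for entity linking. The function name, the dictionary layout, and the placeholder concept IDs are assumptions made for illustration; they are not taken from the BioELX paper or its code.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]], gold: Dict[str, str], k: int = 1) -> float:
    """Fraction of mentions whose gold concept ID appears in the top-k candidates.

    rankings maps each mention to its candidate IDs, best first (hypothetical layout).
    gold maps each mention to the single correct concept ID.
    """
    if not gold:
        raise ValueError("gold must contain at least one mention")
    hits = 0
    for mention, gold_id in gold.items():
        top_k = rankings.get(mention, [])[:k]  # a mention with no candidates counts as a miss
        if gold_id in top_k:
            hits += 1
    return hits / len(gold)


# Placeholder mentions and concept IDs, invented for this example.
example_rankings = {
    "kalp krizi": ["C001", "C007"],
    "심근경색": ["C007", "C001"],
    "heart attack": ["C001"],
}
example_gold = {"kalp krizi": "C001", "심근경색": "C001", "heart attack": "C001"}
```

With k=1 only the top-ranked candidate counts, which is why a reranking stage can move this number even when the retriever already returns the right concept somewhere in its list.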

Benchmarks · Papers · RAG
SIG 78 · HYP 15

Reddit r/LocalLLaMA

I built a 103B-token Usenet corpus (1980–2013) — pre-web, human-only, zero AI contamination. Got strong traction on r/ML, thought this community would find it useful.

Complete Usenet corpus (1980–2013) released for local fine-tuning: 103.1B tokens, 408M posts, zero AI contamination. Pre-SEO, pre-algorithm internet writing across 33 years. Organized by domain hierarchies (comp.*, sci.*, rec.*). Free samples available, full corpus under license.

Fine-tuning · Open source · Benchmarks
SIG 78 · HYP 25

arXiv cs.LG

Online Learning on Hidden-Convex Losses via Algorithmic Equivalence: Optimal Regret, Geometric Barrier, and Bandit Feedback

Study of adversarial online learning on hidden-convex losses (nonconvex losses that become convex after reparameterization). The authors prove that online gradient descent (OGD) achieves optimal Θ(√T) regret, improving on the prior O(T^(2/3)) result. They characterize a necessary and sufficient Hessian compatibility condition and extend the analysis to bandit feedback with O(T^(3/4)) regret.
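
As background for the regret rates quoted above, here is a minimal sketch of projected online gradient descent in the classical convex setting, not the hidden-convex setting the paper studies. It assumes linear losses f_t(x) = <g_t, x> on a Euclidean ball, gradients bounded in norm by grad_bound, and the textbook fixed step size radius / (grad_bound * sqrt(T)). All function names are illustrative.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def project_to_ball(v, radius):
    """Euclidean projection onto the ball of the given radius centred at 0."""
    n = norm(v)
    return list(v) if n <= radius else [x * radius / n for x in v]


def ogd_linear_regret(gradients, radius=1.0, grad_bound=1.0):
    """Run projected OGD on linear losses f_t(x) = <g_t, x>; return regret
    against the best fixed point in the ball (assumed setting, see lead-in)."""
    T = len(gradients)
    dim = len(gradients[0])
    eta = radius / (grad_bound * math.sqrt(T))  # textbook fixed step size
    x = [0.0] * dim
    learner_loss = 0.0
    grad_sum = [0.0] * dim
    for g in gradients:
        learner_loss += sum(gi * xi for gi, xi in zip(g, x))  # loss at the current iterate
        grad_sum = [s + gi for s, gi in zip(grad_sum, g)]
        x = project_to_ball([xi - eta * gi for xi, gi in zip(x, g)], radius)
    # For linear losses the best fixed comparator is -radius * grad_sum / ||grad_sum||.
    best_fixed_loss = -radius * norm(grad_sum)
    return learner_loss - best_fixed_loss


def random_unit_gradients(T, dim, seed=0):
    """Gradients drawn uniformly from the unit sphere (an arbitrary test sequence)."""
    rng = random.Random(seed)
    grads = []
    for _ in range(T):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        n = norm(g)
        grads.append([gi / n for gi in g])
    return grads
```

Under these assumptions the standard analysis bounds the regret by radius * grad_bound * sqrt(T) for any gradient sequence, which is the √T rate the summary refers to. Calling ogd_linear_regret(random_unit_gradients(400, 3)) lets you compare the measured regret against that bound.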

Papers · Reinforcement learning · Benchmarks
SIG 78 · HYP 08

arXiv cs.LG

AirCast-SR: A Foundation Model for Kilometer-Scale Atmospheric Super-Resolution via Latent Consistency Diffusion

AirCast-SR is an atmospheric super-resolution foundation model that downscales global AI weather forecasts from 28 km to 1 km horizontal resolution. Built on a 3D U-Net conditioned within a Latent Consistency Model diffusion framework, trained on GraphCast forecasts and NOAA data, it produces 67-hour forecasts with near-zero bias and demonstrates zero-shot global transferability to India and Germany.

Papers · Benchmarks · Open source
SIG 78 · HYP 25

arXiv cs.LG

The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models

Study on the cost of structured output constraints for small language models (under 3B parameters). Tests on Qwen2.5-0.5B/1.5B and SmolLM2-1.7B show that enforcing JSON schema validity (61.5% → 100%) reduces answer accuracy (19.7% → 11.0%) and increases semantically invalid outputs (49.5% → 88.9%). Recommendation: report schema validity, answer accuracy, and semantic error rate separately.
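
The recommendation to report the three numbers separately can be sketched as follows. This is an illustration only: the schema (a JSON object with a string "answer" field), the Sample type, and the definition of a semantic error (schema-valid output whose answer is not one of the allowed choices) are assumptions for this example and may differ from the paper's definitions.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Sample:
    raw_output: str           # text produced by the model
    gold: str                 # correct answer
    choices: Tuple[str, ...]  # answers that are meaningful for this question


def parse_answer(raw: str) -> Optional[str]:
    """Return the 'answer' string if raw matches the assumed schema, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and isinstance(obj.get("answer"), str):
        return obj["answer"]
    return None


def score(samples: List[Sample]) -> Dict[str, float]:
    """Report schema validity, answer accuracy, and semantic error rate
    as three separate fractions of all samples."""
    n = len(samples)
    valid = correct = semantic_errors = 0
    for s in samples:
        answer = parse_answer(s.raw_output)
        if answer is None:
            continue  # schema-invalid: counts against all three numerators
        valid += 1
        if answer not in s.choices:
            semantic_errors += 1  # well-formed JSON, meaningless answer (assumed definition)
        elif answer == s.gold:
            correct += 1
    return {
        "schema_validity": valid / n,
        "answer_accuracy": correct / n,
        "semantic_error_rate": semantic_errors / n,
    }
```

Keeping the three counters apart makes the tradeoff visible: constrained decoding can push schema_validity to 1.0 while answer_accuracy falls, which a single combined score would hide.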

Qwen · Code generation · Evals
SIG 78 · HYP 15

arXiv cs.AI

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

This arXiv paper shows that models with statistically indistinguishable atomic knowledge systematically fail to chain those facts in multi-hop reasoning (a gap of more than 40 percentage points). Aggregate metrics mask this 'composition collapse'. The authors introduce a double-gate protocol that decomposes post-training gains into three independent channels: atomic stability, residual composition, and critical depth.

Reasoning · Benchmarks · Evals
SIG 78 · HYP 15

arXiv cs.AI

MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

MedGuideX transforms clinical practice guideline (CPG) recommendations into executable decision logic to generate question-answering training data. Post-training a medical LLM on this data improves accuracy by 10.28% across four clinical reasoning benchmarks and produces rationales that physicians prefer on faithfulness, validity, completeness, and clarity.

Fine-tuning · Reasoning · Evals
SIG 78 · HYP 22

arXiv cs.CL

FAB-Bench: A Framework for Adaptive RAG Benchmarking in Semiconductor Manufacturing

FAB-Bench is an adaptive benchmarking framework for evaluating RAG systems in semiconductor manufacturing. It defines 6 diagnostic metrics (factual accuracy, contextual utilization, completeness, retrieval relevance, technical depth, reasoning consistency) across context windows of 4K-32K tokens. Benchmark of 200 query-answer pairs tested on 4 LLMs and 4 RAG frameworks.

RAG · Benchmarks · Evals
SIG 78 · HYP 15

arXiv cs.CL

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

Mechanistic analysis of LLM hallucinations on linearized structured knowledge (graphs, tables). Hallucinations stem from systematic internal dynamics: attention concentrates disproportionately on shortcut structural cues, feed-forward representations fail to ground the provided knowledge, and the model reverts to parametric memory. These patterns generalize to multi-hop graphs and tabular data.

Reasoning · Papers · AI safety
SIG 78 · HYP 15

arXiv cs.LG

Provably Communication-Efficient and Privacy-Preserving Federated Graph Neural Networks

CE-FedGNN is a federated framework for graph neural networks that reduces communication by infrequently exchanging aggregated node representations instead of per-round embeddings. A moving-average estimator handles cross-client dependencies and staleness. The framework provides privacy guarantees via metric-DP and achieves O(1/√T) convergence with O(T^(3/4)) communication complexity.

SIG 78 · HYP 15

arXiv cs.AI

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

LiveK12Bench is a dynamic, multi-disciplinary benchmark that evaluates the reasoning capabilities of multimodal models on 2K+ real exam questions (Math, Physics, Chemistry, Biology). Tests reveal major performance degradation: GPT-5 drops from 79 to 53 out of 100 under realistic exam constraints. The framework includes an automated anti-contamination pipeline and an end-to-end 'Mock Exam' evaluation scheme.

Benchmarks · Vision · Reasoning
SIG 78 · HYP 25