Parallel LLM Reasoning for Bias-Resilient, Robust Conceptual Abstraction

Study proposing a parallel chunk-level processing framework for analyzing long documents with LLMs. Text is divided into semantically coherent segments processed independently, then consolidated with explicit evidence anchoring. Results: 84% reduction in omission error, 130% increase in evidence traceability, 91% reduction in unsupported claims.

Reasoning Evals Prompt engineering

arXiv cs.AI·May 22

Governance by Construction for Generalist Agents

CUGA introduces a modular governance system for generalist LLM agents in enterprise settings. Through five enforcement checkpoints (Intent Guard, Playbook, Tool Guide, Tool Approvals, Output Formatter), the platform enforces policies without model fine-tuning, ensuring compliance and auditability across compound workflows.

AI Agents AI safety Alignment

arXiv cs.AI·May 22

For How Long Should We Be Punching? Learning Action Duration in Fighting Games

RL study on fighting games (Street Fighter II). Agents learn to predict both an action and its execution duration instead of deciding at fixed intervals. FightLadder experiments: learned timing matches fixed frame-skip performance but encourages repeatable, exploitable action patterns.

Reinforcement learning AI Agents Papers

arXiv cs.AI·May 22

Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning

DIP (Diverge-to-Induce Prompting) generates multiple diverse rationales per question, elaborates them into detailed step-by-step plans, then induces a final plan from them. Improves zero-shot reasoning accuracy over single-strategy prompting without resource-intensive sampling.

Prompt engineering Reasoning Papers

arXiv cs.AI·May 22

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

Students construct QuestBench, a 256-question benchmark across humanities and social sciences, to evaluate deep research systems. Testing reveals GPT-4.5 reaches 57.58% pass rate while mean performance is 16.85% across 13 systems, exposing hidden failures. This classroom practice teaches students to judge AI output quality and remain responsible knowledge actors.

Benchmarks Evals GPT

arXiv cs.AI·May 22

Mind the Sim-to-Real Gap & Think Like a Scientist

Theoretical work on balancing pre-trained simulators with real experiments in sequential decision-making. Decomposes simulator error into calibration-deployment shift and parametric residual. Proposes Fisher-SEP, an experimental policy minimizing posterior predictive variance. Case studies: vending-machine supply chain and HIV mobile testing.

Reinforcement learning Reasoning Papers

arXiv cs.AI·May 22

GraphDiffMed: Knowledge-Constrained Differential Attention with Pharmacological Graph Priors for Medication Recommendation

GraphDiffMed presents a medication recommendation framework using dual-scale Differential Attention v2 with pharmacological constraints. Tested on MIMIC-III, the model filters noise at intra-visit and inter-visit levels while integrating drug-drug interactions, outperforming baselines on quality and safety metrics.

Benchmarks Papers AI safety

arXiv cs.CL·May 22

GHI: Graphormer over Conditioned Hypergraph Incidence for Aspect-Based Sentiment Analysis

GHI is a Graphormer-based framework for aspect-based sentiment analysis (ABSA). It uses a bipartite hypergraph structure to represent token-hyperedge incidence relations, integrating linguistic and semantic signals. With 247M parameters, GHI outperforms DeBERTa on six SemEval benchmarks and approaches Flan-T5 11B performance on ISE.

Papers Benchmarks Reasoning

arXiv cs.CL·May 22

IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions

IdioLink is a retrieval benchmark with 10,700 documents and 2,140 queries across 107 idioms. It tests whether models can link idiomatic expressions to their literal equivalents. Current embeddings (BGE, E5, Contriever, Qwen) fail, relying on shallow topical cues instead of semantic abstraction.

Benchmarks Embeddings RAG

arXiv cs.LG·May 22

I-SAFE: Wasserstein Coherence Metrics for Structural Auditing of Scientific AI Models

I-SAFE is a post-hoc auditing framework for scientific AI models based on the Wasserstein Coherence Metric (WCM). It evaluates whether model predictions reflect domain structure or exploit statistical shortcuts. Tested on drug-target interaction prediction (DeepConvDTI, DeepDTA, TAPB), it reveals distinct distributional response profiles invisible to accuracy metrics.

Evals AI safety Alignment

arXiv cs.LG·May 22

Hierarchical Variational Policies for Reward-Guided Diffusion

Hierarchical variational framework for adapting pretrained diffusion models to reward-aligned objectives. Formulates test-time adaptation as a lightweight stochastic policy that amortizes per-step control. On 4x super-resolution: better perceptual quality with 5x faster inference than best baseline.

Reinforcement learning Image generation

arXiv cs.CL·May 22

Pattern-and-root inflectional morphology: the Arabic broken plural

Computational model of Arabic inflectional morphology focused on broken plurals. Reverses traditional root-and-pattern paradigm into pattern-and-root. Applied to 3,200 noun entries with 160 inflectional classes (22 triliteral patterns, 3 quadriliteral). Formal separation of inflection, derivation, and semantics.

arXiv cs.LG·May 22

Alike Parts: A Feature-Informed Approach to Local and Global Prototype Explanations

New explainable prototype framework integrating feature importance at two levels: "alike parts" for local explanations (highlights shared feature subsets between instance and prototype) and augmented global selection to promote diversity in prototype feature attributions. Experiments on 6 benchmarks show maintained or improved surrogate model fidelity.

Evals Papers

arXiv cs.LG·May 22

Leveraging Self-Paced Curriculum Learning for Enhanced Modality Balance in Multimodal Conversational Emotion Recognition

Self-Paced Curriculum Learning (SPCL) framework for multimodal emotion recognition in conversations. Dual-level Difficulty Measurer (utterance and conversation level) guides training from easier to harder instances. IEMOCAP tests show +1.2% to +6.6% F1 improvement, MELD reaches +10.4%, addressing modality imbalance.

Reasoning Benchmarks

arXiv cs.LG·May 22

Beyond Single Slot: Joint Optimization for Multi-Slot Guaranteed Display Advertising

Joint optimization framework for multi-slot guaranteed display advertising allocation. Formulates the problem as offline bipartite matching with a contract roulette mechanism and Page View constraints. Online A/B tests on the Meituan platform: +28.99% ARPU at 70% traffic, with improved contract stability.

Business Benchmarks

arXiv cs.LG·May 22

Tabular foundation models for robust calibration of near-infrared chemical sensing data

Benchmark of TabPFN (tabular foundation model) on 66 NIR datasets (54 regression, 12 classification tasks). Optimized TabPFN outperforms PLS, CatBoost, and CNN-1D in regression; matches Ridge in classification. Advantage diminishes on spectral outliers and extrapolation.

Benchmarks Papers Tools

arXiv cs.LG·May 22

Discovering Entity-Conditioned Lag Heterogeneity: A Lag-Gated Neural Audit Framework for Panel Time Series

AC-GATE, a neural model with adaptive gating, discovers how different entities (countries) respond to historical signals across varying time horizons in panel time series. The framework separates predictive calibration from lag discovery, validated on synthetic data with known ground-truth lags and two real country-level panels.

Benchmarks Papers

arXiv cs.LG·May 22

DualOptim+: Bridging Shared and Decoupled Optimizer States for Better Machine Unlearning in Large Language Models

DualOptim+ is an optimization framework for machine unlearning in LLMs. It uses shared base states and decoupled delta states to balance forgetting and retention objectives. An 8-bit variant reduces memory overhead. Tested on fictitious/real unlearning, safety alignment, and multi-task learning.

Fine-tuning AI safety Alignment

arXiv cs.CL·May 22

Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance

Study of 25 embedding models on 5 MTEB tasks showing that nearest-neighbor overlap and magnitude differences in ICA strongly correlate (up to 0.97) with performance. Embedding tasks display varying degrees of linearity and reliance on local information retention.

Embeddings Benchmarks Evals

arXiv cs.LG·May 22

A Reproducible Log-Driven AutoML Framework for Interpretable Pipeline Optimization in Healthcare Risk Prediction

yvsoucom-iterkit, a deterministic log-driven AutoML framework, optimizes medical risk prediction pipelines across 18,000+ configurations. On Pima and Stroke datasets, augmentation (0.454), model choice (0.198), and imbalance handling (0.101–0.406) are key drivers. Ensembles achieve F1 0.89–0.94 with cross-seed robustness (variability 0.023–0.026).

Benchmarks Evals Fine-tuning

arXiv cs.AI·May 22

SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

SOLAR is an autonomous agent using parameter-level meta-learning to continuously adapt to non-stationary data streams. It combines multi-level reinforcement learning and episodic memory to balance plasticity and stability, outperforming baselines on common-sense, mathematical, medical, coding, social and logical reasoning tasks.

AI Agents Reinforcement learning Reasoning

arXiv cs.CL·May 22

Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?

Study of factual recall mechanisms in multimodal language models (text + speech). Using causal mediation analysis on SpiritLM, researchers show that knowledge storage and retrieval mechanisms only partially transfer from text to speech modality.

Papers Reasoning Voice

arXiv cs.CL·May 22

Cross-Lingual Consensus: Aligning Multilingual Cultural Knowledge via Multilingual Self-Consistency

Self-supervised framework to align multilingual cultural knowledge in LLMs. Uses multilingual self-consistency to identify reliable cultural responses and transfer them to weaker languages. Improves BLEnD benchmark performance by an average of 5.03% using only self-generated data.

Prompt engineering Reasoning Benchmarks

arXiv cs.CL·May 22

Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild

Hy-MT2 is a family of multilingual translation models (1.8B, 7B, 30B-MoE) supporting 33 languages. The 1.8B model quantized at 1.25-bit weighs 440 MB and improves inference speed by 1.5x. The 7B and 30B models outperform DeepSeek-V4-Pro and Kimi K2.6 in fast-thinking mode; the 1.8B surpasses commercial APIs from Microsoft and Doubao.

Benchmarks Code generation DeepSeek

arXiv cs.AI·May 22

Interaction Locality in Hierarchical Recursive Reasoning

Framework to measure whether information flow stays localized or crosses semantic boundaries in spatial reasoning. Applied to HRM and TRM (hierarchical recursive models) on Maze-Hard, Sudoku Extreme, and ARC-AGI. Activation patching reveals high-level recurrent states write locally, progressively accumulating global structure.

Reasoning Evals Papers

arXiv cs.AI·May 22

High Quality Embeddings for Horn Logic Reasoning

Method for creating high-quality embeddings for Horn logic reasoning. Authors use triplet loss with three innovations: anchors with repeated terms, balanced easy/medium/hard examples, and periodic emphasis of hardest cases. Evaluation across multiple knowledge bases.
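
For readers unfamiliar with triplet loss, the following is a minimal sketch of the standard formulation, not the authors' implementation. The `margin` value, the Euclidean distance, and the mapping of easy/medium/hard onto the usual easy/semi-hard/hard negative categories are all assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def difficulty(anchor, positive, negative, margin=1.0):
    """Bucket a triplet by how badly the negative violates the margin.

    Assumed mapping (hypothetical, not taken from the paper):
      hard   -> negative is closer to the anchor than the positive
      medium -> negative is farther, but still inside the margin
      easy   -> negative is beyond the margin, so the loss is zero
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    if d_neg < d_pos:
        return "hard"
    if d_neg < d_pos + margin:
        return "medium"
    return "easy"


anchor = [0.0, 0.0]
positive = [1.0, 0.0]
negatives = {"near": [0.5, 0.0], "mid": [1.5, 0.0], "far": [3.0, 0.0]}

# Difficulty bucket and loss for each candidate negative.
results = {
    name: (difficulty(anchor, positive, neg), triplet_loss(anchor, positive, neg))
    for name, neg in negatives.items()
}
for name, (bucket, loss) in results.items():
    print(name, bucket, loss)
```

Balancing a batch across these buckets would then amount to sampling a fixed share of triplets from each one.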

Embeddings Reasoning Papers

arXiv cs.CL·May 22

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

LatentOmni proposes an audio-visual reasoning framework using unified latent space instead of explicit text chain-of-thought. The model interleaves textual reasoning with audio-visual latent states, introduces Omni-Sync Position Embedding (OSPE) for temporal consistency, and leverages LatentOmni-Instruct-35K (35K annotated trajectories). Outperforms text-based baselines on audio-visual benchmarks.

Reasoning Papers

arXiv cs.AI·May 22

Personality Engineering with AI Agents: A New Methodology for Negotiation Research

Researchers introduce "personality engineering," a methodology using AI agents to rigorously test negotiation theories. AI agents precisely parameterize negotiator personalities along two dimensions (warmth and dominance) from the interpersonal circumplex, enabling controlled experiments impossible with humans.

AI Agents Papers Reasoning

arXiv cs.CL·May 22

Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation

Claim-selective certification system for high-risk medical RAG. Each response is decomposed into verifiable claims, scored against retrieved evidence, and mapped to {full, partial, conflict, abstain}. On the weak-label certificate protocol: UCCR=0.0000, PAU=1.0000, PAU Precision=0.9901, action accuracy=0.9204 (dev, n=314) and 0.8997 (test, n=319).

RAG Evals AI safety

arXiv cs.LG·May 22

Correcting Class Imbalance in Prior-Data Fitted Networks for Tabular Classification

Prior-data fitted networks (PFNs) excel at tabular classification but suffer from class imbalance, which hurts rare classes. This study adapts classical mitigation techniques (thresholding, downsampling) to PFNs, finding that thresholding works best, owing to PFN calibration properties, while downsampling provides comparable results with reduced inference cost.
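
As a rough illustration of thresholding in general, the sketch below picks a decision threshold on held-out predictions to maximize F1 for a rare positive class. It is a binary toy example with made-up probabilities; the function names are hypothetical, and the study's actual procedure for PFNs may differ.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 for the positive class when predicting 1 iff prob >= threshold."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, labels):
    """Scan the observed probabilities and keep the first threshold with the highest F1."""
    best_t, best_f1 = 0.5, f1_at_threshold(probs, labels, 0.5)
    for t in sorted(set(probs)):
        score = f1_at_threshold(probs, labels, t)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Illustrative validation data: the positive class is rare and the model
# assigns it probabilities mostly below the default 0.5 cut-off.
val_probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60]
val_labels = [0, 0, 0, 0, 1, 0, 1, 1]

default_f1 = f1_at_threshold(val_probs, val_labels, 0.5)
tuned_t, tuned_f1 = best_threshold(val_probs, val_labels)
print(default_f1, tuned_t, tuned_f1)
```

Because only the cut-off changes, the model is queried exactly as before; this relies on the predicted probabilities being reasonably well calibrated.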

Benchmarks Evals

OpenAI Blog·May 22

How Virgin Atlantic ships faster with Codex

Virgin Atlantic used Codex to ship its revamped mobile app on a fixed holiday travel deadline, achieving near-total unit test coverage and zero P1 defects.

Code generation OpenAI

Reddit r/LocalLLaMA·May 21

Latest b9274 Addresses MTP VRAM leak

Commit b9274 fixes a VRAM leak in MTP (Multi-Token Prediction) models. The destroy() function failed to free the speculative decoder, draft context, and draft model resources, causing memory to accumulate on each sleep/resume cycle. The fix explicitly resets these components before llama_init.

Llama Code generation Infrastructure

Latent Space·May 21

Giving Agents Computers — Ivan Burazin, Daytona

Daytona, an agent execution platform, reports 74% MoM growth and 850K daily runs. The startup offers bare metal sandboxes and reinforcement learning evals for autonomous agents.

AI Agents Reinforcement learning Evals

Reddit r/LocalLLaMA·May 21

Interesting paper advocates for quantized prefilling and precise decoding

Paper proposes Mix-Quant: use W4A4 quantization for prefilling (theoretical 4x speedup) but keep full precision for decoding. Prefilling tolerates quantization errors since they don't accumulate, unlike autoregressive decoding where each token affects subsequent generation.
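
W4A4 means both weights and activations are stored in 4 bits. The sketch below shows what that does to a single dot product, assuming simple symmetric per-tensor quantization to the signed range [-8, 7]. This scheme and the helper names are assumptions for illustration; the post does not describe the paper's actual quantizer.

```python
def quantize_int4(values):
    """Symmetric per-tensor quantization to signed 4-bit integers in [-8, 7]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale


def int4_dot(weights, activations):
    """Dot product computed on 4-bit integers, then rescaled to floating point."""
    qw, sw = quantize_int4(weights)
    qa, sa = quantize_int4(activations)
    int_sum = sum(w * a for w, a in zip(qw, qa))
    return int_sum * sw * sa


weights = [0.9, -0.31, 0.02, -0.7]
activations = [1.0, 0.5, -0.25, 1.75]

exact = sum(w * a for w, a in zip(weights, activations))
approx = int4_dot(weights, activations)
qw, _ = quantize_int4(weights)
qa, _ = quantize_int4(activations)

# The gap between the two results is the per-operation quantization error.
print(exact, approx, abs(exact - approx))
```

In prefill this error is incurred once per prompt position, whereas in autoregressive decoding each generated token feeds back into later steps, which is the asymmetry the post describes.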

Benchmarks

Reddit r/LocalLLaMA·May 21

Honesty in a small model drops from 35% to 0% by changing the tone of the prompt. Sharing the findings.

A paper published on arXiv shows honesty in small open-source models drops from 35% to 0% by changing prompt tone. When asked to solve mathematically impossible coding problems, models admit impossibility 33% of the time in neutral language but 0% under pressure. Internal analysis reveals each tone leaves a distinct signature in the network's deepest layers.

Papers Alignment AI safety

Reddit r/LocalLLaMA·May 21

LlamaStation v0.9 — llama.cpp GUI for Windows with multi-backend support, TurboQuant, MTP and more

LlamaStation v0.9 is a Windows GUI for llama.cpp with multi-backend support (TurboQuant, MTP, AtomicChat, BeeLlama). Runs llama-server directly without intermediate layer, provides full parameter control, real-time VRAM metering, per-model profiles, offline voice mode (XTTS v2 + faster-whisper), headless mode, and auto-updates.

Llama Tools Open source

Reddit r/LocalLLaMA·May 21

110 tok/s with 12GB VRAM on Qwen3.6 35B A3B and ik_llama.cpp

ik_llama.cpp outperforms llama.cpp on an RTX 4070 Super 12GB: 110 tok/s average vs 90.6 tok/s with Qwen3.6-35B-A3B-IQ4_XS. The gap is attributed to better CPU offloading optimization and speculative decoding (MTP), following a llama.cpp performance regression after a merge.

Qwen Open source Infrastructure

arXiv cs.LG·May 21

Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

Residual Paving is a routed residual editing method for frozen transformers that decouples route selectivity (whether to intervene) from residual-edit capacity (what edit to apply). On Gemma-3-4B-IT, it reduces edit refusal from 88.6% to 4.0% while preserving 95.5% of benign behavior and 87.3% of harmful refusals.

AI safety Alignment Fine-tuning

arXiv cs.CL·May 21

Parallel LLM Reasoning for Bias-Resilient, Robust Conceptual Abstraction

Framework for processing long documents via parallel chunking and evidence-anchored consolidation. Reduces omission error by 84%, increases evidence traceability by 130%, decreases unsupported claims by 91%. Smaller models benefit most.

Reasoning Benchmarks Papers

arXiv cs.LG·May 21

Less Data, Faster Training: repeating smaller datasets speeds up learning via sampling biases

Repeating a smaller dataset during training accelerates learning compared to using a larger dataset, via sampling biases that enable favorable layer-wise growth. Effect observed across algorithmic tasks, architectures and optimizers. Authors provide theoretical analysis and empirical interventions.

Papers Reasoning Reinforcement learning
