Archives

May 2026

3148 articles

arXiv cs.CL

AI Slop or AI-enhancement? Student perceptions of AI-generated media for an English for Academic Purposes course

Implementation study of Google NotebookLM generating videos, podcasts, and infographics in an English for Academic Purposes course (106 students, Hong Kong). Students reported high perceived usefulness and ease of use, and preferred visual and multimodal content. Video preference correlated positively with academic performance, while higher cognitive load was negatively associated with grades.

RAG · Tools · Evals
SIG 72 · HYP 25

arXiv cs.AI

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

PluRule is a multimodal, multilingual benchmark for moderating pluralistic communities on social media. It covers 13,371 rule violations across 1,989 Reddit communities and 2,885 rules in 9 languages. State-of-the-art vision-language models, including GPT-4.5 with advanced reasoning, only marginally outperform a trivial baseline, revealing that pluralistic moderation remains a fundamental challenge.

Benchmarks · Vision · AI safety
SIG 72 · HYP 25

arXiv cs.AI

Visual Timelines of Police Encounters in Body-Worn Camera Footage: Operational Context and Activity Cataloging for Training and Analysis in OpenBWC

An approach that segments body-worn camera (BWC) video into 10-second windows labeled by operational context and motion intensity. Models trained with CLIP and optical flow reach 78.75% accuracy for context and 88.33% for activity. The privacy-conscious protocol is intended to speed up incident review and officer training workflows.

Vision · Benchmarks · AI safety
SIG 72 · HYP 15

arXiv cs.AI

When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited

Behavior Foundation Models (BFMs) enable scalable imitation learning but fail under dynamics shifts (friction, actuation, noise). This paper formulates BFM task-inference as robust minimax optimization, enabling adaptation to worst-case dynamics perturbations without retraining. The framework outperforms standard BFM and robust offline IL baselines under dynamics shifts.

Reinforcement learning · Papers · Evals
SIG 72 · HYP 18

arXiv cs.AI

Adversarial Fragility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings

Systematic audit of two critical vulnerabilities in clinical AI: adversarial fragility and cross-lingual drift. On CheXNet (DenseNet121), accuracy collapses from 89.3% to 62.0% under imperceptible FGM perturbation (epsilon=0.021). Llama3.1:8b and NatLAS show major degradation on Nigerian Pidgin and Yoruba (80%→65%, 85%→55%). Standard defenses fail.

AI safety · Alignment · Evals
SIG 78 · HYP 25

arXiv cs.AI

Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

VLMs struggle with planning from complex visual inputs. This paper proposes Pattern Induction, an online inductive learning strategy that discovers and optimizes reusable visual patterns as composite experts. Pattern Inference enables VLMs to recognize these patterns and directly infer world model structures. Evaluated on FrozenLake, Crafter, and CubeBench.

Vision · Reasoning · Papers
SIG 65 · HYP 25

arXiv cs.CL

Ancient Greek to Modern Greek Machine Translation: A Novel Benchmark and Fine-Tuning Experiments on LLMs and NMT Models

New AG-MG parallel corpus with 132,481 sentence pairs for Ancient-to-Modern Greek translation. The creation pipeline combines web scraping, VecAlign alignment with fine-tuned LaBSE embeddings, and LLM-based correction with Gemini 2.5 Flash. In a benchmark of NMT models (NLLB, M2M100) and a Greek LLM (Llama-Krikri-8B), full fine-tuning achieves 13.16 BLEU, with gains of up to 10.3 points.

Benchmarks · Fine-tuning · Embeddings
SIG 78 · HYP 15

arXiv cs.AI

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

AgentKernelArena is an open-source benchmark for evaluating AI coding agents on GPU kernel optimization. It contains 196 tasks (HIP-to-HIP, Triton-to-Triton, PyTorch-to-HIP) and tests generalization to unseen configurations. Cursor Agent, Claude Code, and Codex Agent achieve speedups up to 6.89x, but PyTorch-to-HIP optimizations show correctness drops on unseen configurations.

AI Agents · Code generation · Benchmarks
SIG 78 · HYP 15

arXiv cs.AI

Cross-modal Affinity-aligned Multimodal Learning Analytics for Predicting Student Collaboration Satisfaction in Game-Based Learning

AAMLA, a multimodal learning analytics framework, predicts student collaboration satisfaction in game-based educational environments. The CAMA module aligns modalities (gaze, action units, pose) via affinity matrices and contrastive learning, adaptively suppressing uninformative modalities. Tests on 50 middle school students in EcoJourneys show improvement over unimodal baselines.

Vision · Multi-agent · Evals
SIG 62 · HYP 18

arXiv cs.AI

Causely: A Causal Intelligence Layer for Enterprise AI: A Benchmark Study on SRE and Reliability Workflows

Causely is a causal intelligence layer for SRE workflows that structures environment topology and causal dependencies. In a benchmark across 4 agent configurations (Claude Code, OpenAI Codex, HolmesGPT), adding Causely reduced mean time-to-diagnosis by 63%, token consumption by 60%, tool calls by 78%, and API cost per run by 57%, and raised root-cause accuracy from 75% to 100%.

AI Agents · Benchmarks · Claude Code
SIG 78 · HYP 25

arXiv cs.AI

GraViti: Graph-Level Variational Autoencoders with Relaxed Permutation Invariance

GraViti is a transformer-based variational autoencoder for entire graphs, producing a true graph-level latent space. On molecular benchmarks, the model learns to decode valid samples respecting chemical constraints. The work shows that enforcing permutation invariance can be detrimental for consistent reconstruction when a reliable canonical node ordering exists.

Papers · Benchmarks · Code generation
SIG 72 · HYP 18

arXiv cs.AI

MATE: Solving Contextual Markov Decision Processes with Memory of Accumulated Transition Embeddings

MATE is a memory architecture for solving Contextual Markov Decision Processes (CMDPs). It replaces the intractable posterior belief with sum-aggregated memory, avoiding growing computational costs of Transformers and gradient issues of RNNs. Evaluations demonstrate computational advantages while achieving performance comparable to standard sequence-model baselines.

Reasoning · Reinforcement learning · Papers
SIG 72 · HYP 15

arXiv cs.AI

Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

A study of controlled removal of safety alignment in language models to evaluate cybersecurity capabilities. It compares authorized-context prompting, refusal-direction projection, and LoRA-based de-alignment. On 60 tasks (Security-AR), task-only LoRA reaches a security score of 0.87 with a general capability score of 0.83, but increases out-of-scope unsafe compliance.

AI safety · Alignment · Fine-tuning
SIG 72 · HYP 15

Reddit r/MachineLearning

We built a tool that installs frameworks like ComfyUI, Ollama, OpenWebUI etc on any cloud GPU in one command and saves your whole setup between sessions [R]

swm is an open-source tool automating framework installation (ComfyUI, Ollama, OpenWebUI, vLLM) on cloud GPUs in one command. It aggregates pricing across 10+ providers (RunPod, Vast.ai, Lambda), syncs workspaces via S3, and auto-terminates idle instances after 30 min to cut costs.

Tools · Open source · Infrastructure
SIG 72 · HYP 35