
Reddit r/LocalLLaMA

I built a computer use sandbox framework for codex on headless linux. GPU passthrough, computer use, and sudo access for codex all work. It's the perfect dev sandbox to allow full auto work while minimizing the "rm -rf /" risk

Developer builds sandbox framework for AI agents on headless Linux with GPU passthrough, sudo access, and host OS isolation. VM-based architecture enables autonomous web browsing, Docker execution, and parallel sessions. Code released on GitHub.

AI Agents · Code generation · Infrastructure
SIG 72 · HYP 28

arXiv cs.CL

What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

Empirical study on curriculum effects for RL memory agents in multi-session dialogue with external memory banks. Three training conditions tested (LoCoMo only, LoCoMo + LongMemEval, LongMemEval only) show curriculum composition shapes specialized skills rather than uniform performance scaling. Mixed curriculum achieves strongest overall F1.

Reinforcement learning · AI Agents · Reasoning
SIG 72 · HYP 15

arXiv cs.CL

The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management

Unified framework for cost-performance optimization in LLM context management. Jointly evaluates task performance, token cost, and preprocessing reuse on 5,000 HotpotQA instances. Reduces effective token usage by 25% at comparable performance (F1≈0.78) and achieves 50% lower token cost with memory compression versus full-context prompting.

RAG · Benchmarks · Infrastructure
SIG 72 · HYP 18

arXiv cs.AI

Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning

VDSS, a multi-agent system for ventilator decision support, coordinates modular components through structured interfaces and contextual bandit preference learning from clinician feedback. Structured rejection triggers targeted replanning. Retrospective ICU validation shows higher recommendation acceptability and fewer interaction cycles.

Multi-agent · Reinforcement learning · AI Agents
SIG 72 · HYP 18

arXiv cs.CL

When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening

A SCID-anchored benchmark of 555 semi-structured interviews evaluates 5 LLMs (including GPT-4.1 Mini and GPT-5 Mini) on psychiatric screening for anxiety, depression, and PTSD. Accuracy ranges from 0.49 to 0.86 and MCC from 0.16 to 0.38. False negatives show that models downweight symptoms when functioning is preserved or social support is present, so clinical validation is needed before deployment.

Benchmarks · GPT · AI safety
SIG 72 · HYP 25

arXiv cs.LG

Open Multimodal Datasets and Open-Source Software for Data-Driven Modeling of Multiphase Transport and Thermal Systems

NED3 Laboratory releases an open-source ecosystem of multimodal datasets and software for data-driven modeling of multiphase transport and thermal systems. S+TD framework classifies datasets from 0+0D to 3+0D dimensionality; 7 software packages (BubbleID, SeqReg, CFDTwin, IRISApp, decode-wfs, AELab, FlowLab) cover computer vision, sequence regression, and multimodal diagnostics.

Open source · Benchmarks · Tools
SIG 72 · HYP 18

arXiv cs.LG

FederatedRSF: Federated Random Survival Forests for Partially Overlapping Medical Data

FederatedRSF is a Python package implementing federated random survival forests for multi-center survival prediction without sharing raw patient data. The system handles feature-space heterogeneity (different covariates across sites) by redistributing only compatible trees. Evaluation on the GBSG2 breast cancer cohort shows performance comparable to centralized training.

Papers · Open source · AI safety
SIG 72 · HYP 15

arXiv cs.LG

WeCon: An Efficient Weight-Conditioned Neural Solver for Multi-Objective Combinatorial Optimization Problems

WeCon is a neural solver for Multi-Objective Combinatorial Optimization Problems (MOCOPs). It introduces Gated Residual Fusion blocks to better integrate weights and features, a Residual Fusion block in the decoder, and an Efficient Preference Optimization method. On 4 MOCOP variants, WeCon matches POCCO-W's HyperVolume while reducing inference time by 40%.

Benchmarks · Reasoning
SIG 72 · HYP 18

arXiv cs.LG

Approximate Machine Unlearning through Manifold Representation Forgetting Guided by Self Mode Connectivity

ManiF-SMC performs machine unlearning via manifold representation forgetting, using an adaptive margin-based triplet loss guided by self-mode connectivity. The method pushes erased samples away from the originally learned manifold centroids and toward retained-data neighbors, operating purely in representation space. In experiments on 4 datasets, it matches state-of-the-art approximate unlearning effectiveness.

Papers · AI safety · Alignment
SIG 72 · HYP 15

arXiv cs.CL

ClimateChat-300K: A Multi-Modal Facebook Dataset for Understanding Diverse Perspectives in Climate Communication

ClimateChat-300K: dataset of 299,329 public Facebook posts on climate change (May 2020–May 2024), collected via CrowdTangle. 41 metadata features, 26,000+ global pages. Topic modeling and sentiment analysis identify 10 themes across 5 domains; emotionally charged and visually rich content drives highest engagement. Open resource for studying polarization and misinformation.

Benchmarks · Papers · Open source
SIG 72 · HYP 25

arXiv cs.CL

DFKI-MLT at SemEval-2026 Task 7: Steering Multilingual Models Towards Cultural Knowledge

DFKI-MLT applies activation steering to multilingual LLMs to improve cultural awareness in SemEval-2026 Task 7. The method adds language-specific steering vectors to the residual stream without parameter updates. The system reaches 86.96% accuracy on the MCQ track (7th of 17), but improvements are modest and heterogeneous, varying by language-region pair and layer selection.

Prompt engineering · Reasoning · Fine-tuning
SIG 72 · HYP 18

arXiv cs.LG

A mathematical theory of balancing relational generalization and memorization

Theoretical study on balancing relational generalization and memorization in learning systems. The authors introduce a transitive-inference-with-exceptions task and analytically characterize kernel ridge regression models across representations. Validation on pretrained language models shows that successful generalization depends on representational geometry, with systematic errors predicted by the theory.

Papers · Reasoning · Evals
SIG 72 · HYP 15

arXiv cs.AI

KPI2KVI: A Multi Agent Workflow for Calculating Key Value Indicators from Service Descriptions

KPI2KVI transforms natural language service descriptions into Key Value Indicator (KVI) estimates using a deterministic multi-agent LLM workflow. The system elicits missing context, extracts relevant KVI categories, generates service-specific KPIs, collects values through interactive dialogue with intelligent estimation, and computes interval-valued KVIs with traceable explanations.

AI Agents · Multi-agent · Prompt engineering
SIG 72 · HYP 25