Topic

#AI safety

AI safety covers the practices aimed at making AI systems reliable, aligned with human intentions, and free from harmful behaviors. Anthropic, for instance, builds Claude around explicit safety principles and alignment research.

40 Articles
12 Sources
64 Avg. signal
GitHub Trending·

NVIDIA / OpenShell

OpenShell is NVIDIA's secure, private runtime for autonomous AI agents. The project is available on GitHub and aims to provide controlled execution infrastructure for multi-agent systems.

AI Agents, Multi-agent, Infrastructure
SIG 45
HYP 00
Reddit r/MachineLearning·

LLM agents patch security bugs, pass all tests, but still leave the vulnerability open [R]

CVE-Bench evaluates 5 frontier models on 20 real-world CVEs (Pillow, GitPython, urllib3, etc.) across 300 runs. The maximum solve rate is 50% (60% under advisory). Agents produce syntactically valid patches that still leave the vulnerability open. Cross-family gaps are significant (OpenAI vs Laguna, p<0.05), while within-family differences amount to noise. Failure modes include wrong-search drift, hallucinations, and context loss.

AI Agents, Benchmarks, AI safety
SIG 78
HYP 00
arXiv cs.AI·

Product-Aware Deep Autoencoders for Robust Process Monitoring in Multi-Product Cyber-Physical Systems

Academic paper proposing product-aware autoencoders for anomaly detection in multi-product cyber-physical systems. Traditional global models create blind spots where attacks can evade detection. On the Tennessee Eastman Process benchmark, the product-aware model achieves 100% detection accuracy in attack scenarios, versus 22.2% for the global baseline.

Benchmarks, AI safety, Evals
SIG 72
HYP 00
arXiv cs.LG·

Adversarially Robust Control of Conditional Value-at-Risk via Rockafellar-Uryasev Conformal Inference

Online, distribution-free framework for controlling Conditional Value-at-Risk (CVaR) in non-stationary and adversarial environments. Combines conformal tail risk control, online learning, and Rockafellar-Uryasev variational representation. Provable safety guarantees for nonlinear tail risk under arbitrary data-generating processes. Applications: portfolio risk management and LLM toxicity mitigation.
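For readers unfamiliar with the Rockafellar-Uryasev variational representation mentioned above, the sketch below illustrates it on a fixed sample of losses. It relies on the standard identity CVaR_alpha(L) = min over t of { t + E[max(L - t, 0)] / (1 - alpha) }. This is not the paper's method: it has no conformal or online component, and the function names `cvar_rockafellar_uryasev` and `tail_mean` as well as the toy data are assumptions made for illustration only.

```python
import math


def cvar_rockafellar_uryasev(losses, alpha):
    """Empirical CVaR at level alpha via the Rockafellar-Uryasev objective.

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(losses)
    if n == 0:
        raise ValueError("losses must be non-empty")

    def objective(t):
        # t plus the scaled average excess of the losses over t
        excess = sum(max(x - t, 0.0) for x in losses)
        return t + excess / ((1.0 - alpha) * n)

    # The objective is convex and piecewise linear with kinks at the sample
    # values, so its minimum is attained at one of the observed losses.
    return min(objective(t) for t in losses)


def tail_mean(losses, k):
    """Mean of the k largest losses, used here only as a cross-check."""
    return sum(sorted(losses)[-k:]) / k


# Toy data (assumed): ten losses, alpha = 0.8, so the tail holds 2 samples.
sample = list(range(1, 11))
ru_value = cvar_rockafellar_uryasev(sample, 0.8)
direct_value = tail_mean(sample, 2)
print(ru_value, direct_value, math.isclose(ru_value, direct_value))
```

When (1 - alpha) times the sample size is a whole number, as in the toy data, the minimised objective agrees with the plain average of the worst losses up to floating-point rounding, which is why the two values are compared with `math.isclose` rather than `==`.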

Papers, AI safety, Reasoning
SIG 72
HYP 00
arXiv cs.CL·

AEyeDE: An Attention-Based Attribution Framework for AI-Generated Text Detection

AEyeDE introduces an attention-based attribution framework for detecting AI-generated text using attention matrices from a proxy Transformer model. A lightweight CNN learns discriminative representations from these attribution maps. The method outperforms text-only baselines, shows strong generator-specific detection, and demonstrates robustness under cross-dataset transfer and spelling perturbations.

Papers, AI safety, Evals
SIG 72
HYP 00
arXiv cs.CL·

BOUTEF: A Multilingual Corpus for FakeNews in North Africa -- Language as a Weapon

BOUTEF is a multilingual corpus from 2 countries (Algeria, Tunisia) covering fake news, authentic narratives, comments, and debunking. Includes MSA, Algerian/Tunisian dialects, Arabizi, French, English, and code-switching. Analysis shows fake news relies on emotionally charged narratives and sensational framing, while debunking adopts a factual, verification-oriented style.

Papers, Benchmarks, AI safety
SIG 72
HYP 00
arXiv cs.CL·

Which Institutional Frameworks Do Chatbots Assume? Auditing Jurisdictional Defaults in Multilingual LLMs

Audit of 7 LLMs (US/China) on 2,520 responses to 60 legal-administrative prompts in English and Mandarin. Models default to the institutional framework associated with the input language: 74.5% of English responses adopt a US framework, and 53.3% of Chinese responses adopt a Chinese framework. This creates a risk of jurisdictional misselection when a user's preferred language differs from the applicable jurisdiction.

Benchmarks, AI safety, Regulation
SIG 78
HYP 00
arXiv cs.AI·

Capability Self-Assessment: Teaching LLMs to Know Their Limits

Modern LLMs systematically overestimate their competence and attempt unsolvable queries. Researchers propose Capability Self-Assessment (CSA), formulated as a policy-learning problem using reinforcement learning, to teach models to recognize their limits. RL significantly outperforms supervised fine-tuning, preserves original capabilities, and generalizes out-of-distribution.

Reinforcement learning, Alignment, Evals
SIG 78
HYP 00