Topic

#AI safety

AI safety covers the practices aimed at making AI systems reliable, aligned with human intentions, and free from harmful behaviors. Anthropic, for instance, builds Claude around explicit safety principles and alignment research.

40 Articles

5 Sources

70 Avg. signal

arXiv cs.CL·Jun 18

Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text

Benchmark of 1,200 clinical documents with 9,184 uncertainty annotations across five levels. LLMs preserve uncertainty expressions in fewer than 50% of cases and struggle to distinguish adjacent levels. The benchmark reveals a failure mode missed by standard metrics.

Benchmarks AI safety Evals

arXiv cs.CL·Jun 18

Redact or Keep? A Fully Local AI Cascade for Educational Dialogue De-Identification

Local de-identification framework for educational dialogues. Two-stage cascade: union proposer (lightweight encoders + deterministic rules) generates PII candidates, then binary Redact/Keep reviewer uses dialogue context and speaker role. Achieves 0.958 macro F1 on math tutoring transcripts, outperforms commercial API (0.706) and local LLM baseline (0.767), runs on single laptop.

RAG AI safety Papers
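
The propose-then-review cascade summarized above can be illustrated with a toy sketch. Everything here is an assumption for illustration and not the paper's implementation: the regex patterns, the word lists (`GREETINGS`, `CURRICULUM_TERMS`), and the rule-based `should_redact` reviewer are hypothetical stand-ins for the lightweight encoders and the learned Redact/Keep reviewer.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str


# Hypothetical patterns and word lists, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CAPITALIZED = re.compile(r"\b[A-Z][a-z]+\b")
GREETINGS = {"Hi", "Hello", "Thanks"}
CURRICULUM_TERMS = {"Euler", "Pythagoras"}


def rule_proposer(text):
    """Deterministic rules: propose every email-like string."""
    return {Span(m.start(), m.end(), "EMAIL") for m in EMAIL.finditer(text)}


def name_proposer(text):
    """Stand-in for a lightweight encoder: propose every capitalized word."""
    return {Span(m.start(), m.end(), "NAME") for m in CAPITALIZED.finditer(text)}


def propose(text):
    """Stage 1: union of all proposers (high recall, low precision)."""
    return rule_proposer(text) | name_proposer(text)


def should_redact(span, text, speaker):
    """Stage 2: binary Redact/Keep decision using context and speaker role."""
    if span.label != "NAME":
        return True
    token = text[span.start:span.end]
    if token in GREETINGS:
        return False
    # A tutor naming a mathematician is treated as course content, not PII.
    if speaker == "tutor" and token in CURRICULUM_TERMS:
        return False
    return True


def deidentify(text, speaker):
    """Run both stages and replace redacted spans with [LABEL] tags."""
    redact = [s for s in propose(text) if should_redact(s, text, speaker)]
    out, last_start = text, len(text)
    # Replace from right to left so earlier offsets stay valid.
    for s in sorted(redact, key=lambda s: (s.start, s.end), reverse=True):
        if s.end > last_start:  # skip spans overlapping one already replaced
            continue
        out = out[:s.start] + f"[{s.label}]" + out[s.end:]
        last_start = s.start
    return out
```

In this sketch the proposer deliberately over-generates, and precision is recovered only in the review step, which is the division of labor the summary describes.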

arXiv cs.CL·Jun 18

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

Study of collateral damage in LLM machine unlearning. Authors show damage propagates beyond the forget set following semantic distance gradients, and propose PreUnlearn, a pre-unlearning prediction method to audit risks before execution.

AI safety Alignment Papers

arXiv cs.CL·Jun 18

Output Vector Editing for Memorization Mitigation in Large Language Models

Method for suppressing memorization in LLMs by editing the output vectors of MLP neurons. Tested on 4 models (360M-7B parameters), it achieves 87.9% suppression on OLMo-7B with 6,831 memorized sequences. Complements existing neuron ablation methods.

AI safety Alignment Papers

arXiv cs.CL·Jun 18

RedactionBench

RedactionBench is a manually annotated benchmark of 200 documents across 11 domains for evaluating PII redaction in context. Introduced alongside R-Score, a character-level metric, it shows that 35 models (NER, SLM, and frontier models) fail on contextual redactions. Human consensus is 89.4% for mandatory redactions and 47.7% for contextual ones.

Benchmarks AI safety Evals

arXiv cs.CL·Jun 18

Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports

Study on evaluating AI-generated radiology reports. Researchers show existing LLMs over-penalize harmless rephrasings while detecting clinical errors. They train lightweight metrics on Qwen3-8B and MedGemma-4B outperforming 32B medical models, with dataset and metric release planned.

Benchmarks Evals Papers

arXiv cs.LG·Jun 18

Self-CTRL: Self-Consistency Training with Reinforcement Learning

Self-CTRL optimizes consistency between language models' self-explanations and behavior via reinforcement learning. On probabilistic reasoning tasks, the method improves R² correlation from 0.24 to 0.64. In constitutional AI, it increases refusal prediction from 36% to 92% and reduces HarmBench failure rate from 15.0% to 0.5%.

Reinforcement learning Alignment AI safety

arXiv cs.AI·Jun 18

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

SciRisk-Bench is a safety evaluation benchmark for LLMs in AI4Science workflows. It covers 7 disciplines, 31 sub-disciplines, and 10 risk dimensions. The authors evaluate mainstream and science-oriented LLMs to diagnose safety gaps across risk categories.

Benchmarks AI safety Evals

arXiv cs.AI·Jun 18

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Safety Reflection Pretraining inserts short safety reflections into pretraining corpora to establish self-monitoring directly in language modeling. On 1.7B models pretrained on FineWeb-Edu, the method improves safety classification accuracy and substantially reduces success rates of inference-stage and finetuning attacks.

AI safety Alignment Reinforcement learning

arXiv cs.CL·Jun 18

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

PhysAssistBench is an interactive medical assistance benchmark with 1,296 physician-validated turns built from real MIMIC-IV cases. It evaluates LLMs' ability to coordinate clinical knowledge, patient communication, and EHR system interaction within single dialogues. Experiments show current models remain unreliable in this setting.

Benchmarks AI Agents Multi-agent

arXiv cs.CL·Jun 18

Aligning Implied Statements for Implicit Hate Speech Generalizability with Context-Bounded Semi-hard Negative Mining

ImpSH, a triplet-based framework, improves implicit hate speech detection by aligning posts with implied statements and using context-bounded semi-hard negatives. Evaluated on IHC, SBIC, and DynaHate with BERT and HateBERT, it enhances cross-domain performance and provides more stable representations than standard supervised contrastive approaches.

Benchmarks AI safety Papers

arXiv cs.LG·Jun 18

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

Sparse Autoencoders (SAEs) decompose activations into interpretable features, but this study shows that clamping a 'harmful' feature does not eliminate the behavior—it can recover via other residual pathways. Even with active intervention, 95.8% behavior recovery is achievable in refusal-steering, exposing a gap between feature-level control and behavioral completeness.

AI safety Alignment Evals

arXiv cs.LG·Jun 18

The Illusion of Improvement: Reject Inference Strategies in Credit Scoring

Reject inference methods, used in credit scoring to correct survivorship bias, mask a structural failure: accuracy can improve while the ability to correctly reject defaulters collapses. The authors propose a controlled exploration strategy (approving 2-5% of rejected applicants) to diagnose this deterioration without strong statistical assumptions.

Benchmarks AI safety Evals
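
The controlled exploration idea in the summary above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the function names, the score-threshold policy, and the uniform random sampling of rejected applicants are all hypothetical.

```python
import random


def split_with_exploration(scores, threshold, explore_rate=0.03, seed=0):
    """Return (accepted, explored) applicant indices.

    accepted: applicants at or above the score threshold.
    explored: a random sample of the rejected applicants who are approved
              anyway so that their repayment outcome can be observed.
    """
    rng = random.Random(seed)
    accepted = [i for i, s in enumerate(scores) if s >= threshold]
    rejected = [i for i, s in enumerate(scores) if s < threshold]
    k = round(explore_rate * len(rejected))
    explored = sorted(rng.sample(rejected, k))
    return accepted, explored


def observed_default_rate(indices, defaulted):
    """Share of the given applicants who defaulted, or None if there are none."""
    if not indices:
        return None
    return sum(1 for i in indices if defaulted[i]) / len(indices)
```

Because the explored applicants are drawn at random from the rejected pool, their observed default rate is an estimate for that pool that does not rely on a model of the missing outcomes. Comparing it with the rate among accepted applicants gives one way to check whether the scorer still separates defaulters.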

arXiv cs.LG·Jun 18

Beyond AHI: An Interpretable Causal-Discovery-Guided Framework for Sleep Recovery in Connected Health

Causal framework for sleep recovery scoring from multimodal polysomnography. Uses DAG learning on two cohorts (MESA n=1540, MrOS n=825) to identify five physiological domains (respiratory burden, hypoxia, fragmentation, architecture, autonomic regulation). Sleep Recovery Score (SRS) achieves 2.5× stronger alignment with perceived recovery than standard AHI.

Papers Reasoning Evals

Hacker News (AI)·Jun 17

License Plate Cameras Will Soon Track Phones, Wearables, Infotainment and Pets

License plate cameras will soon track phones, wearables, infotainment systems and pets via Bluetooth and WiFi. Mass surveillance technology in development.

AI safety Regulation

Hacker News (AI)·Jun 17

I scored 200 blockchain NPM packages for deprecation and hijack risk

Security audit of 200 blockchain-related NPM packages: assessment of deprecation and hijack risks. Scoring methodology applied to critical dependency ecosystem.

AI safety Open source

Hacker News (AI)·Jun 17

The hacker sent by Anthropic to calm the government's nerves about AI safety

Anthropic has sent a security expert to meet with government officials and address AI safety concerns. The move aims to establish direct dialogue between the company and regulators on safety and alignment issues.

Anthropic AI safety Regulation

Hacker News (AI)·Jun 17

Only 16 Percent of Americans Think AI Will Have a Positive Impact on Society

Poll: Only 16% of Americans believe AI will have a positive societal impact. Majority expresses concerns about economic and social effects, while experts remain more optimistic.

AI safety Regulation

The Decoder·Jun 17

OpenAI researchers want to predict how often AI models will fail before launch

OpenAI researchers propose a method to predict how often a new AI model will make mistakes after release. This approach could fill gaps left by standard safety testing.

OpenAI Evals AI safety

Hacker News (AI)·Jun 17

AI demands more engineering discipline. Not less

Article arguing for increased engineering discipline in AI development, against trends minimizing technical standards. Criticizes 'move fast and break things' approach applied to critical systems.

AI safety Alignment

arXiv cs.CL·Jun 17

LLMs Infer Cultural Context but Fail to Apply It When Responding

LLMs can infer cultural context but fail to apply it in responses. A new CAPRI dataset shows models recognize cultural conventions (measurement units, time interpretation) but don't spontaneously use them unless explicitly instructed. Biases remain aligned with the model's country of origin.

Benchmarks Alignment AI safety

arXiv cs.LG·Jun 17

Sum-of-Squares Degree Barriers for the Reweighted-Hinge Method in Robust Halfspace Learning: A Christoffel-Function Characterization

Theoretical paper on Sum-of-Squares degree barriers for robust halfspace learning under malicious noise. The Christoffel function exactly characterizes corruption hidden from bounded-degree certificates. Proves a margin-degree tradeoff and a degree-2t algorithm achieving the frontier η^(1-1/2t).

Papers Reasoning AI safety

arXiv cs.LG·Jun 17

Rift: A Conflict Signature for Deception in Language Models

Researchers identify an internal signature of deception in language models: deceptive responses show 2.1-2.3x higher residual rank than naively false answers. This signature detects deception with 100% accuracy on GPT-2, Qwen2.5, and Phi-3, and transfers zero-shot across model families and languages (AUC 0.933-1.0).

AI safety Alignment Evals

arXiv cs.CL·Jun 17

AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

AIPatient Arena evaluates LLMs in multi-turn clinical consultation across 8 competence dimensions using EHR-grounded knowledge graphs. On 437 patients, models excel in questioning (4.43-4.99/5) and ethical conduct (4.38-4.93/5), but fail in diagnostic accuracy (2.63-3.55/5) and information coverage (2.08-3.02/5). Weaknesses include repetitive questioning, omitted medical history, inadequate uncertainty handling.

Evals Reasoning AI safety

arXiv cs.CL·Jun 17

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

Study of second-order bias in LLMs: how models judge biased content, beyond generation. Grounded in entitlement epistemology, the method evaluates whether LLMs infer demographics without sufficient support. Findings: systematic bias across target groups, evasion of safety guardrails, persistence of demographic triggers.

Evals AI safety Alignment

arXiv cs.CL·Jun 17

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

Study of 450 chest X-ray reports showing LLM rewriting for standardization preserves image-text alignment (2.5% degradation) but erodes 26.8–29.3% of clinical entities and 14.9–16.5% of uncertainty language. The paradox: tasks producing 'cleaner' text pull content away from images.

Vision RAG Evals

arXiv cs.LG·Jun 17

The Critical Role of Model Selection in Causal Inference: A Comparative Analysis of Classification Models within the InferBERT Framework for Pharmacovigilance

InferBERT combines transformers with Do-calculus to detect causal adverse drug events in pharmacovigilance. Comparative study on AILF and TRAM benchmarks: BioBERT outperforms XGBoost, ALBERT, and Med-LLaMA. Finding: domain-specific pre-training outweighs model size.

Benchmarks Fine-tuning AI safety

arXiv cs.LG·Jun 17

MM++: Unsupervised Scale-Invariant Multilayer OOD Detection via Top-K Gated Feature Fusion

MM++ is an unsupervised, post-hoc method for out-of-distribution detection. It fuses intermediate layers selected by entropy density with the final representation using Ledoit-Wolf regularized covariance, requiring no auxiliary OOD data, fine-tuning, or architectural changes.

Evals AI safety

arXiv cs.LG·Jun 17

Memory-Efficient Meta-Reinforcement Learning for Adaptive Safety-Critical Control in Adversarial Spacecraft Proximity Operations

Comparative study of three recurrent architectures (LSTM, GRU, Mamba) and two algorithms (PPO, SAC) for meta-reinforcement learning applied to input-constrained control barrier functions (ICCBF) in spacecraft proximity operations. Mamba + PPO outperforms other setups in safety, task completion, and fuel savings across cooperative and adversarial scenarios.

Reinforcement learning AI safety Robotics

arXiv cs.LG·Jun 17

MorphStrata: Layer-Specific Perturbations for Generating Morphence Students in Time-Series Moving Target Defense

MorphStrata enhances Moving Target Defense for time-series forecasting models via selective layer-specific stochastic noise injection. Tested on Transformer with FGSM, BIM and PGD attacks, the approach reduces adversarial RMSE by up to 97.97% on AEP data with training overhead <1%.

Benchmarks AI safety Papers

arXiv cs.LG·Jun 17

Credibility-Weighted Pricing of Autonomous Vehicle Liability Under Operational Design Domain Shift

Hierarchical Bayesian credibility framework for pricing autonomous vehicle liability under operational design domain shifts. Tested on 648 verified Waymo crashes (4 US cities, 116M miles): credibility weights moderate (0.12-0.46), partial pooling decisively outperforms no pooling, learned kernel advantage detectable at ~12 deployed cities.

AI safety Benchmarks Regulation

arXiv cs.AI·Jun 17

Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation

Clinical decision support AI system using Digital Twins, Treatment Effect estimation, and Reinforcement Learning for adaptive real-time treatment recommendations. Validated on synthetic simulator and TCGA ovarian cancer dataset. Safety module with rule-based vital sign monitoring and clinician escalation for high-uncertainty cases.

Reinforcement learning Reasoning AI safety

arXiv cs.CL·Jun 17

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

MODE-RAG is a multi-agent system driven by Variational Free Energy to reduce hallucinations in Multimodal Retrieval-Augmented Generation. It uses Monte Carlo Tree Search, logit perturbations, and specialized agents to route high-risk queries and perform post-hoc factual verification. Authors introduce ModeVent, a challenging subset of MultiVent dataset, to evaluate M-RAG robustness.

RAG Multi-agent Vision

arXiv cs.CL·Jun 17

Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

STATEWITNESS, an activation explainer, detects deception in reasoning LLMs by reading the target model's hidden states and answering natural-language queries. Achieves 0.916 mean AUROC, 11.6% relative gain over best black-box text monitor, 25.0% over best activation-probe baseline. Provides token- and sentence-level evidence traces for human inspection.

Reasoning AI safety Alignment

arXiv cs.LG·Jun 17

Probing, Fusion, and Trustworthiness: A Systematic Evaluation of Foundation Model Representations for Multimodal Cancer Analysis

Systematic evaluation of foundation model representations (5 FMs) on computational pathology tasks using whole-slide images and transcriptomic profiles (IH-BC, IH-NSCLC cohorts). Multimodal fusion improves performance when no single modality dominates. Conformal prediction shows true diagnosis remains recoverable in prediction sets for majority of failed predictions.

Vision Benchmarks AI safety
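
For readers unfamiliar with conformal prediction, the sketch below shows how a prediction set can still contain the true label when the top-1 prediction is wrong. It is a generic split-conformal example, not the paper's pipeline: the calibration probabilities, the class labels `A`/`B`/`C`, and the score `1 - p` are assumptions chosen for illustration.

```python
import math


def conformal_threshold(true_class_probs, alpha):
    """Split-conformal threshold from calibration data.

    true_class_probs: the model's probability for the correct class on each
                      calibration example.
    alpha: target miscoverage rate (0.2 aims for roughly 80% coverage).
    """
    scores = sorted(1.0 - p for p in true_class_probs)
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: include every class.
        return float("inf")
    return scores[k - 1]


def prediction_set(class_probs, qhat):
    """All labels whose nonconformity score 1 - p is within the threshold."""
    return {label for label, p in class_probs.items() if 1.0 - p <= qhat}


# Hypothetical calibration probabilities for the true class.
calibration = [0.9, 0.2, 0.7, 0.6, 0.95, 0.3, 0.75, 0.65, 0.55]
qhat = conformal_threshold(calibration, alpha=0.2)

# The top-1 prediction is "A"; if the true label is "B", the point prediction
# fails, yet "B" is still inside the prediction set.
labels = prediction_set({"A": 0.55, "B": 0.35, "C": 0.10}, qhat)
```

With these numbers the set is `{"A", "B"}`, so a case whose true label is `B` counts as a failed top-1 prediction that remains recoverable from the set.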

arXiv cs.LG·Jun 17

CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models

CheckMIABench introduces a benchmark for principled evaluation of membership inference attacks (MIAs) on language models. Leveraging intermediate checkpoints from open-source models (Pythia, OLMo, 70M–7B parameters), the authors construct reliable testbeds where training data before and after a fixed point share the same distribution. They evaluate six published attacks and release a modular library (pandora_llm).

Papers Benchmarks AI safety

arXiv cs.AI·Jun 17

SpeechDx: A Multi-Task Benchmark for Clinical Speech AI

SpeechDx is a multi-task benchmark for clinical speech AI covering 12 datasets and 27 tasks across diverse health conditions. Tasks are structured by speech production stages (conceptualization, formulation, articulation). Evaluation of 12 audio encoders shows large-scale speech models outperform domain-specific ones, but none generalize reliably across clinical speech.

Benchmarks Voice Evals

arXiv cs.AI·Jun 17

Surrogate Assisted Pedestrian Protection Design via a Foundation Model Orchestrated Workflow

Foundation model-orchestrated workflow for pedestrian protection design. Integrates ML surrogate (R²=0.87), multi-objective evolutionary search, geometry generator, and LLM interface. Reduces evaluation time from hours to seconds; generates 35 safety-compliant alternatives in automotive bumper case study.

AI Agents Vision Reasoning

arXiv cs.AI·Jun 17

Using Cognitive Models to Improve Language Model Simulation of Human Persuasion Games

Researchers propose Equation-to-Behavior Prompting to guide LLMs to simulate diverse cognitive models (Bayesian, motivated reasoning, Grether's α-β model). Large models approximate these specifications via prompting, but small models fail. RL training reduces belief error by 26.5% and improves performance by 2.5–12% on legal persuasion games.

Reasoning Reinforcement learning Evals

arXiv cs.AI·Jun 17

DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue

DiagFlowBench evaluates how language models handle off-procedure inputs in industrial diagnostic dialogue. A dataset of 1,676 multi-turn conversations derived from 50 diagnostic flowcharts reveals models often select a real but contextually inadequate step rather than hallucinate, exposing a vulnerability: plausible but wrong advice grounded in documentation.

Benchmarks Evals Reasoning
