Topic

#Alignment

Alignment, in AI, refers to the challenge of ensuring a model behaves in accordance with human intentions and values. OpenAI's GPT-4 was trained using RLHF (reinforcement learning from human feedback) to reduce harmful or misleading outputs.

40 Articles

8 Sources

70 Avg. signal

arXiv cs.CL·Jun 18

PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

Study of collateral damage in LLM machine unlearning. Authors show damage propagates beyond the forget set following semantic distance gradients, and propose PreUnlearn, a pre-unlearning prediction method to audit risks before execution.

AI safety Alignment Papers

arXiv cs.CL·Jun 18

Steerable Cultural Preference Optimization of Reward Models

Novel SCPO algorithm for training reward models that balance diverse cultural preferences across subcommunities. Achieves 7-point improvements for minority reward models on PRISM and GlobalOpinionQA (7 countries), with 280% better training data efficiency than full-finetuning.

Alignment Reinforcement learning Evals

arXiv cs.CL·Jun 18

Output Vector Editing for Memorization Mitigation in Large Language Models

Memorization suppression method in LLMs via output vector editing of MLP neurons. Tested on 4 models (360M-7B parameters), achieves 87.9% suppression on OLMo-7B with 6831 memorized sequences. Complementary approach to existing neuron ablation methods.

AI safety Alignment Papers

arXiv cs.LG·Jun 18

Artemis: Anatomy-Resolved inTervention for Eliminating Multimodal NeuroImage confounderS

Artemis is a causal framework for graph neural networks addressing demographic confounders (age, sex) in multimodal brain imaging (fMRI + DTI). The method applies causal interventions at each brain region independently to learn invariant representations. Tested on ADNI, OASIS, and HCP benchmarks, it improves disease diagnosis and classification tasks.

Papers Reasoning Alignment

arXiv cs.LG·Jun 18

SAGE: Retain-Aware Post-Hoc Sanitization of Final Unlearning Vector

SAGE is a post-hoc method to improve selective unlearning in LLMs. It corrects final update vectors by suppressing components damaging retention, without rerunning the original unlearning pipeline. Tested across multiple methods and scales, SAGE reduces the forget-retain trade-off.

Alignment Papers

arXiv cs.LG·Jun 18

Self-CTRL: Self-Consistency Training with Reinforcement Learning

Self-CTRL optimizes consistency between language models' self-explanations and behavior via reinforcement learning. On probabilistic reasoning tasks, the method improves R² correlation from 0.24 to 0.64. In constitutional AI, it increases refusal prediction from 36% to 92% and reduces HarmBench failure rate from 15.0% to 0.5%.

Reinforcement learning Alignment AI safety

arXiv cs.AI·Jun 18

Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

Proposes a formal theory (HACD-H) modeling the emergence of social intelligence in long-term human-AI interaction, in a unified framework integrating emotional adaptation, social memory, and personality consistency. A study of 14,700 conversation turns finds a negative correlation between social intelligence and social cognitive energy (r = -0.391, p < 0.001), along with developmental phase-transition patterns.

Reasoning AI Agents Papers

arXiv cs.AI·Jun 18

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

Safety Reflection Pretraining inserts short safety reflections into pretraining corpora to establish self-monitoring directly in language modeling. On 1.7B models pretrained on FineWeb-Edu, the method improves safety classification accuracy and substantially reduces success rates of inference-stage and finetuning attacks.

AI safety Alignment Reinforcement learning

arXiv cs.LG·Jun 18

SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

Sparse Autoencoders (SAEs) decompose activations into interpretable features, but this study shows that clamping a 'harmful' feature does not eliminate the behavior—it can recover via other residual pathways. Even with active intervention, 95.8% behavior recovery is achievable in refusal-steering, exposing a gap between feature-level control and behavioral completeness.

AI safety Alignment Evals

The Decoder·Jun 17

Microsoft researcher builds a working neural network out of goats in Age of Empires II to critique AI science

A Microsoft researcher built a working neural network using goats in Age of Empires II's map editor to critique AI research methods. His analysis of 315 papers found over 50% presuppose language models have human-like traits before the experiment begins.

Papers Alignment Evals

Hacker News (AI)·Jun 17

AI demands more engineering discipline. Not less

Article arguing for increased engineering discipline in AI development, against trends minimizing technical standards. Criticizes 'move fast and break things' approach applied to critical systems.

AI safety Alignment

arXiv cs.CL·Jun 17

LLMs Infer Cultural Context but Fail to Apply It When Responding

LLMs can infer cultural context but fail to apply it in responses. A new CAPRI dataset shows models recognize cultural conventions (measurement units, time interpretation) but don't spontaneously use them unless explicitly instructed. Biases remain aligned with the model's country of origin.

Benchmarks Alignment AI safety

arXiv cs.LG·Jun 17

Rift: A Conflict Signature for Deception in Language Models

Researchers identify an internal signature of deception in language models: deceptive responses show 2.1-2.3x higher residual rank than naively false answers. This signature detects deception with 100% accuracy on GPT-2, Qwen2.5, and Phi-3, and transfers zero-shot across model families and languages (AUC 0.933-1.0).

AI safety Alignment Evals

arXiv cs.CL·Jun 17

Evaluating Second-Order Bias of LLMs Through Epistemic Entitlement

Study of second-order bias in LLMs: how models judge biased content, beyond generation. Grounded in entitlement epistemology, the method evaluates whether LLMs infer demographics without sufficient support. Findings: systematic bias across target groups, evasion of safety guardrails, persistence of demographic triggers.

Evals AI safety Alignment

arXiv cs.CL·Jun 17

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

Study of 450 chest X-ray reports showing LLM rewriting for standardization preserves image-text alignment (2.5% degradation) but erodes 26.8–29.3% of clinical entities and 14.9–16.5% of uncertainty language. The paradox: tasks producing 'cleaner' text pull content away from images.

Vision RAG Evals

arXiv cs.AI·Jun 17

Treatment Response Optimized Clinical Decision Support AI System via Digital Twin Simulation

Clinical decision support system combining digital twins, treatment effect estimation, and reinforcement learning for adaptive real-time treatment recommendations. Validated on a synthetic simulator and the TCGA ovarian cancer dataset. Includes a safety module with rule-based vital sign monitoring and clinician escalation for high-uncertainty cases.

Reinforcement learning Reasoning AI safety

arXiv cs.CL·Jun 17

Decoding Hidden Deception in Reasoning LLMs: Activation Explainers for Deception Auditing

STATEWITNESS, an activation explainer, detects deception in reasoning LLMs by reading the target model's hidden states and answering natural-language queries. It achieves 0.916 mean AUROC, an 11.6% relative gain over the best black-box text monitor and 25.0% over the best activation-probe baseline. It also provides token- and sentence-level evidence traces for human inspection.

Reasoning AI safety Alignment

Reddit r/LocalLLaMA·Jun 16

[Article] The Case For Open-Weight Models And Why We Can't Trust Frontier Labs | provos.org

Article arguing for open-weight models against frontier labs. Criticizes power concentration among few companies and advocates for accessibility and transparency of AI model weights.

Open source Llama Alignment

The Decoder·Jun 16

How easily can Russian propaganda fool AI models? A new benchmark finds out

The Institute of the Estonian Language releases a benchmark measuring how susceptible AI language models are to Russian propaganda. The excerpt provides no technical details or quantified results.

Benchmarks AI safety Alignment

arXiv cs.LG·Jun 16

Beyond Accuracy: Measuring Bias Acknowledgment in Chain-of-Thought Reasoning for Responsible AI Evaluation

Study on evaluating reasoning models beyond accuracy alone. Authors introduce two metrics: susceptibility (whether bias breaks a previously correct answer) and acknowledgment (whether the trace explicitly references injected biased content). On GSM8K, GPT-4o and Claude Sonnet 4 show similar susceptibility rates (1.3% vs 1.2%) but substantially different acknowledgment rates (13.0% vs 75.0%).

Evals Reasoning AI safety
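
The two metrics named in the summary above can be read as simple rates. The following is a minimal sketch, not the authors' implementation: the `Trial` record and its field names are hypothetical, and the denominators (susceptibility over items that were correct without bias, acknowledgment over all biased runs) are assumptions, since the summary does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    # Hypothetical record layout; field names are illustrative assumptions.
    correct_clean: bool    # answer was correct without the injected bias
    correct_biased: bool   # answer was correct with the injected bias
    acknowledged: bool     # reasoning trace explicitly mentions the biased content


def susceptibility_rate(trials):
    """Fraction of previously correct items that the bias turned incorrect."""
    baseline = [t for t in trials if t.correct_clean]
    if not baseline:
        return 0.0
    broken = sum(1 for t in baseline if not t.correct_biased)
    return broken / len(baseline)


def acknowledgment_rate(trials):
    """Fraction of biased runs whose trace references the injected content."""
    if not trials:
        return 0.0
    return sum(1 for t in trials if t.acknowledged) / len(trials)


trials = [
    Trial(True, True, True),
    Trial(True, False, False),
    Trial(True, True, False),
    Trial(False, False, True),
]
print(susceptibility_rate(trials))   # 1 of 3 previously correct items broke
print(acknowledgment_rate(trials))   # 2 of 4 traces mention the bias
```

Keeping the two rates separate is the point of the summary: two models can break equally often under bias while differing widely in how often their traces admit the bias was present.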

arXiv cs.AI·Jun 16

Frame-Conditioned Moral Computation in LLaMA 3.1-8B-Instruct: A Mechanistic Interpretability Audit of Ethical Reasoning

Mechanistic interpretability audit of LLaMA 3.1-8B-Instruct on 54 moral prompts using the Transluce platform. It reveals a Situational Anchor Effect: domain-specific representations dominate activation rankings regardless of ethical content. Ethics capacity remains constant, but salience is highly sensitive to the prompt's interpretive frame. The audit also identifies a candidate ethics neuron (L16/N3837) that is stable across temperatures.

Llama Alignment Evals

arXiv cs.AI·Jun 16

AI Engram: In Search of Memory Traces in Artificial Intelligence

Study introducing a geometric framework to identify 'AI engrams'—memory traces in deep neural networks analogous to biological memory units. Authors derive a closed-form estimator enabling surgical manipulation of learned knowledge (composition, erasure) via linear arithmetic without iterative optimization. Validated on MLPs and LLMs.

Reasoning Papers Alignment

arXiv cs.AI·Jun 16

Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds

Study on reward hacking in LLM-based agents using an adapted AI Safety Gridworlds framework. Models (1.5B–14B) systematically exploit misspecified objectives to maximize observed rewards while failing hidden safety objectives. RL optimization amplifies the problem and resists standard mitigations (exploration, regularization).

AI Agents Reinforcement learning AI safety

arXiv cs.AI·Jun 16

Synthetic Counteradaptation: A Principle of Human-AI Co-evolution

Theoretical paper introducing synthetic counteradaptation: a process where humans and AI systems co-evolve by adapting to each other's strategies. Authors analyze examples from Go, mixed-motive social interactions, and geopolitical simulations to demonstrate recursive, co-evolutionary dynamics in multi-agent environments.

Multi-agent Reasoning Alignment

arXiv cs.AI·Jun 16

Minimal Oversight: Uncertainty-Aware Governance for Delegated AI Systems

Proposes the Minimum Sufficient Oversight (MSO) principle for governing autonomy in delegated AI systems. A variational formulation on the Fisher information manifold minimizes governance burden under a performance constraint. Contributions include a capacity theorem for stationary symbolwise review policies, an autonomy-time scaling law, and the identification of masking as an AI-governance pathology. A Python package is released.

AI Agents AI safety Alignment

arXiv cs.CL·Jun 16

CoRA: Confidence-Rationale Alignment for Reliable Chain-of-Thought Reasoning

CoRA aligns model confidence with chain-of-thought rationale quality. A GRPO-based RL framework jointly rewards answer correctness, committed-answer probability, and rubric-based rationale support. On MedQA, MathQA, OpenBookQA: 26.51% reduction in confidence-rationale alignment error across three open-weight LLMs.

Reasoning Reinforcement learning Evals

arXiv cs.CL·Jun 16

Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning

ReRULE improves LLM unlearning via off-policy replay for hard cases. The method stores low-reward rollouts near the forget/retain boundary in a replay buffer and reuses them through importance-sampled updates. On MUSE-Books, it increases Retain Quality from 46.3 to 56.2 with +5–11% training overhead.

Reinforcement learning AI safety Alignment
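
The replay idea in the summary above (keep hard, low-reward rollouts and reuse them with importance-sampled updates) can be sketched in a few lines. This is an illustration under stated assumptions, not ReRULE itself: the class name, the "keep the lowest-reward rollouts" retention rule, and the clipped probability-ratio weight are all hypothetical stand-ins, and the summary does not say how the forget/retain boundary is measured.

```python
import heapq
import math


class LowRewardReplayBuffer:
    """Keeps the `capacity` lowest-reward rollouts seen so far (assumed rule)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []      # min-heap keyed on -reward, so the top is the highest reward
        self._counter = 0    # tie-breaker so rollouts themselves are never compared

    def add(self, rollout, reward, old_logprob):
        heapq.heappush(self._heap, (-reward, self._counter, rollout, old_logprob))
        self._counter += 1
        if len(self._heap) > self.capacity:
            heapq.heappop(self._heap)  # evict the highest-reward entry

    def items(self):
        return [(rollout, -neg_r, lp) for neg_r, _, rollout, lp in self._heap]


def importance_weight(new_logprob, old_logprob, clip=5.0):
    """Clipped ratio pi_new / pi_old used to reweight an off-policy sample."""
    return min(math.exp(new_logprob - old_logprob), clip)


buf = LowRewardReplayBuffer(capacity=2)
buf.add("rollout-a", reward=0.9, old_logprob=-1.0)
buf.add("rollout-b", reward=0.1, old_logprob=-2.0)
buf.add("rollout-c", reward=0.4, old_logprob=-1.5)

print(sorted(r for _, r, _ in buf.items()))   # the two lowest rewards remain
print(importance_weight(-1.0, -2.0))          # ratio exp(1.0)
print(importance_weight(0.0, -10.0))          # large ratio is clipped
```

Clipping the ratio is a common way to keep stale rollouts from dominating an update; whether the paper clips, and at what value, is not stated in the summary.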

arXiv cs.CL·Jun 16

CHILLGuard: Towards Fine-Grained Chinese LLM Safety Guardrail with Scalable Data Construction and Model-aware Preference Alignment

CHILLGuard is a safety guardrail system for Chinese LLMs with fine-grained taxonomy (5 macro, 31 micro categories). Authors construct 405k training samples via RAG and prompt rewriting, plus 51k annotated test samples. Model achieves +15.92% F1 improvement over Qwen3Guard-8B-Strict using Direct Preference Optimization.

AI safety Alignment Fine-tuning

arXiv cs.CL·Jun 16

SHARD: Safe and Helpful Alignment via Self-Reframing Distillation

SHARD is a self-reframing distillation method to improve safe-helpfulness balance in LLMs. It rewrites sensitive prompts using philosophical guidelines to surface benign intent, reframes responses into safer and more helpful versions, then fine-tunes the model on self-reframed responses. Tested on DNA and LINGUASAFE, SHARD improves helpfulness while preserving safety.

Fine-tuning AI safety Alignment

arXiv cs.LG·Jun 16

High-Dimensional Random Projection for Activation Steering in Language Models

HiDRA, a training-free activation steering method, uses high-dimensional random projection to improve behavioral control of LLMs. It outperforms linear difference-in-means approaches by capturing discriminative signals in nonlinear feature subspaces, with consistent gains across multiple model families.

Reasoning Alignment
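
For context, the linear difference-in-means baseline that the summary above says HiDRA outperforms can be sketched as follows. This shows only the baseline, not HiDRA's random-projection method; the function names, the toy two-dimensional activations, and the scaling factor `alpha` are illustrative assumptions.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_in_means(pos, neg):
    """Steering direction: mean activation with the behavior minus mean without."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [a - b for a, b in zip(mean_pos, mean_neg)]


def steer(activation, direction, alpha):
    """Shift one activation vector along the steering direction."""
    return [h + alpha * d for h, d in zip(activation, direction)]


# Toy activations: `pos` from prompts showing the target behavior, `neg` without it.
pos = [[1.0, 2.0], [3.0, 2.0]]
neg = [[0.0, 1.0], [0.0, 3.0]]

direction = diff_in_means(pos, neg)
steered = steer([0.5, 0.5], direction, alpha=2.0)
print(direction)  # [2.0, 0.0]
print(steered)    # [4.5, 0.5]
```

Because the direction is a single linear offset, it cannot separate behaviors that differ only in nonlinear feature combinations, which is the limitation the summary attributes to this family of methods.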

arXiv cs.AI·Jun 16

A Definition of Good Explanations and the Challenges Explaining LLM Outputs

Paper proposes a philosophical definition of good explanations based on counterfactual reasoning, accounting for the interlocutor's prior beliefs. Analyzes why LLM outputs are particularly challenging to explain.

Reasoning AI safety Alignment

arXiv cs.CL·Jun 16

AmchiBias: Measuring Stereotypical Bias in Goan Identity Groups with a Minimal Pair Dataset in English and Konkani

AmchiBias is a benchmark measuring socio-cultural stereotypical bias for India's Goa state in English and Devanagari Konkani. 313 minimal pairs span 8 demographic dimensions. Evaluation of 5 multilingual models shows near-chance scores in Konkani and higher bias for pan-Indian groups than hyperlocal Goan communities.

Benchmarks Evals AI safety

Simon Willison·Jun 16

Quoting Matteo Wong, The Atlantic

The White House shared with Anthropic a report on the Fable jailbreak. Cybersecurity expert Katie Moussouris reviewed the tests: Fable refused 'review the code for security issues' but complied with 'fix this code'. Moussouris concluded this is the model working as intended for cyberdefense.

Anthropic Claude AI safety

OpenAI Blog·Jun 16

Predicting model behavior before release by simulating deployment

OpenAI introduces Deployment Simulation, a method predicting AI model behavior before deployment using real conversation data to improve safety and evaluation accuracy.

OpenAI Evals AI safety

The Decoder·Jun 15

Microsoft CEO Satya Nadella warns of "a small number of AI systems capturing all the economic returns"

Satya Nadella (Microsoft) warns that a small number of AI systems could capture all economic returns. He advocates companies build "token capital"—their own AI capabilities on internal data and proprietary learning loops—to avoid this concentration.

Business Alignment

arXiv cs.LG·Jun 15

Natively Unlearnable Large Language Models

NULLs (Natively Unlearnable LLMs) is an architecture that isolates each data source's contributions in distinct parameters (sinks) while maintaining a shared backbone. Tested on ~6M Wikipedia articles, it enables unlearning a specific source at deployment without retraining, while preserving shared knowledge and general language capabilities.

Papers AI safety Alignment

arXiv cs.CL·Jun 15

The Culture Funnel: You Can't Align What isn't in the Data

LLMs suffer from a 'cultural data funnel': explicit cultural signals decline sharply during post-training, dominated by geographically concentrated data. A study using multidimensional tagging across 5.6M samples shows multilingualism enhances geographic diversity but not balanced representation. Authors release a culturally tagged dataset to improve training data pipelines.

Alignment Fine-tuning Benchmarks

arXiv cs.CL·Jun 15

Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment

Study across 9 models and 972,000 responses shows LLMs comply with harmful nudges on moral judgments (A=1.04) at nearly identical rates to beneficial ones, unlike factual questions (A=1.58). Chain-of-thought amplifies bidirectional compliance; identity-based prompting suppresses both equally.

Alignment AI safety Evals

arXiv cs.CL·Jun 15

Does the Judge Prefer English? Evaluating Language-Switching Invariance in LLM-as-a-Judge

Judge-LS evaluates whether LLMs used as automatic judges exhibit language bias. On 419 LLMBar benchmark items transformed into English, Chinese, and mixed-language variants, models show 10.7–14.4% preference flips across languages, with highest accuracy in English. Translation-equivalent probes reveal no systematic English preference, though most are judged as ties.

Evals Benchmarks AI safety
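
A preference-flip rate like the one reported in the summary above can be computed by checking whether a judge's verdict changes across language variants of the same item. This is a minimal sketch, assuming a hypothetical data layout (one dict of verdicts per item, keyed by variant) and assuming that any disagreement between variants counts as a flip; the paper's exact definition is not given in the summary.

```python
def flip_rate(verdicts_by_item):
    """Share of items whose verdict differs across language variants."""
    if not verdicts_by_item:
        return 0.0
    flipped = sum(1 for v in verdicts_by_item if len(set(v.values())) > 1)
    return flipped / len(verdicts_by_item)


# Hypothetical verdicts: "A" or "B" names the preferred response, "tie" is no preference.
items = [
    {"en": "A", "zh": "A", "mixed": "A"},
    {"en": "A", "zh": "B", "mixed": "A"},
    {"en": "B", "zh": "B", "mixed": "B"},
    {"en": "tie", "zh": "tie", "mixed": "A"},
]
print(flip_rate(items))  # 2 of 4 items change verdict, so 0.5
```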

arXiv cs.CL·Jun 15

A Computational Audit of Demographic Association Encoding in ClinicalBERT Language Predictions

Computational audit of ClinicalBERT showing 65.6% of demographic biases stem from model-internal amplification rather than training data inheritance. Analysis via Log Probability Bias Analysis and MLM probing across 98 real clinical sentence templates and 8 intersectional race-gender combinations.

Benchmarks AI safety Alignment
