Topic

#Alignment

Alignment, in AI, refers to the challenge of ensuring that a model behaves in accordance with human intentions and values. OpenAI's GPT-4, for example, was fine-tuned using RLHF (reinforcement learning from human feedback) to reduce harmful or misleading outputs.

40 Articles
7 Sources
71 Avg. signal

arXiv cs.AI

ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning

ChatHealthAI aligns structured EHR representations from a pretrained EHR foundation model with a frozen LLM's semantic space via a task-aware resampler. The multimodal framework integrates longitudinal patient representations with refined clinical event descriptions, improving interpretable clinical reasoning while maintaining competitive predictive performance on the EHRSHOT benchmark.

RAG, Reasoning, Evals
SIG 72 · HYP 00

Reddit r/MachineLearning

Backpropagation destroys V1 brain alignment in one epoch, tracking RSA alignment to fMRI across training for BP, FA, predictive coding, and STDP [R]

Comparative study of learning rules (backprop, feedback alignment, predictive coding, STDP) via RSA alignment with human V1 fMRI. Backprop destroys about 90% of V1 alignment after 1 epoch (r: 0.102→0.011), while PC and STDP lose only 25-31%. By epoch 40, PC and STDP retain far more alignment than BP and FA. The results suggest a fundamental trade-off between global error signals (higher layers) and early-layer alignment.
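For readers unfamiliar with RSA, the comparison can be sketched as follows. This is a minimal illustration, not the study's pipeline: the correlation-distance RDM, the rank-based comparison, the function names, and the random toy data are all assumptions. Only the two r values passed to `relative_loss` come from the summary above.

```python
import numpy as np


def rdm(responses: np.ndarray) -> np.ndarray:
    """Dissimilarity matrix: 1 - Pearson r between every pair of stimulus rows."""
    return 1.0 - np.corrcoef(responses)


def ranks(values: np.ndarray) -> np.ndarray:
    """Ordinal ranks (ties are not averaged; acceptable for continuous data)."""
    return np.argsort(np.argsort(values)).astype(float)


def rsa_alignment(model_acts: np.ndarray, brain_resps: np.ndarray) -> float:
    """Rank correlation between the upper triangles of the two RDMs."""
    upper = np.triu_indices(model_acts.shape[0], k=1)
    model_rdm = rdm(model_acts)[upper]
    brain_rdm = rdm(brain_resps)[upper]
    return float(np.corrcoef(ranks(model_rdm), ranks(brain_rdm))[0, 1])


def relative_loss(r_before: float, r_after: float) -> float:
    """Fraction of alignment lost between two checkpoints."""
    return (r_before - r_after) / r_before


# Toy placeholders: 20 stimuli, 64 model units, 30 voxels (random, not real data).
rng = np.random.default_rng(0)
model_acts = rng.normal(size=(20, 64))
brain_resps = rng.normal(size=(20, 30))
score = rsa_alignment(model_acts, brain_resps)

# The reported backprop drop after one epoch, r: 0.102 -> 0.011.
bp_loss = relative_loss(0.102, 0.011)
```

Here `bp_loss` works out to roughly 0.89, which matches the "about 90%" figure quoted for backprop.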

Alignment, Benchmarks, Papers
SIG 78 · HYP 00

arXiv cs.CL

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

arXiv study on LLM adaptation limits for annotation tasks. Toxicity detection experiments across diverse datasets show 66% of zero-shot errors resist correction via prompting (rescue rate 34.8%). Models follow misaligned definitions while maintaining confidence. Definition-Specific Familiarity (DSF) metric correlates with performance (r=+0.41), outperforming memorization metrics.

Prompt engineering, Evals, Benchmarks
SIG 78 · HYP 00

arXiv cs.AI

Capability Self-Assessment: Teaching LLMs to Know Their Limits

Modern LLMs systematically overestimate their competence and attempt unsolvable queries. Researchers propose Capability Self-Assessment (CSA), formulated as a policy-learning problem using reinforcement learning, to teach models to recognize their limits. RL significantly outperforms supervised fine-tuning, preserves original capabilities, and generalizes out-of-distribution.

Reinforcement learning, Alignment, Evals
SIG 78 · HYP 00

arXiv cs.CL

Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

Researchers reveal that statistical watermarks in LLMs are vulnerable to linear ensembles. Averaging probability distributions across 3-5 models cancels out watermark perturbations. WASH (Watermark Attenuation via Statistical Hybridisation) defeats detection across 6 watermarking schemes, reducing z-scores from 5-300 to <2 (threshold: 4), while improving output quality by 27.5%.
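A toy sketch of why averaging can dilute a watermark, under loudly stated assumptions: it uses a simplified green-list scheme (a fixed green set per model rather than one re-seeded per token), gives every model its own independent key, and uses made-up values for `vocab`, `gamma`, `delta`, and `n_tokens`. It is not the WASH method itself, only an illustration of the averaging effect on a detector's z-score.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = np.exp(logits - logits.max())
    return shifted / shifted.sum()


def watermarked_probs(base_logits, green_mask, delta):
    """Boost the logits of 'green' tokens by delta, then normalise."""
    return softmax(base_logits + delta * green_mask)


def z_score(green_count, total, gamma):
    """One-proportion z-test: green tokens observed vs. gamma * total expected."""
    return (green_count - gamma * total) / np.sqrt(total * gamma * (1 - gamma))


rng = np.random.default_rng(0)
vocab, gamma, delta, n_models, n_tokens = 1000, 0.5, 2.0, 5, 200

base_logits = rng.normal(size=vocab)
# Assumption: each model draws its own independent green list.
masks = [(rng.random(vocab) < gamma).astype(float) for _ in range(n_models)]
dists = [watermarked_probs(base_logits, m, delta) for m in masks]

# Expected probability mass on model 0's green list, alone vs. in the ensemble.
single_green_mass = float(dists[0] @ masks[0])
ensemble = np.mean(dists, axis=0)
ensemble_green_mass = float(ensemble @ masks[0])

# Expected detector z-scores for a sequence of n_tokens tokens.
single_z = z_score(single_green_mass * n_tokens, n_tokens, gamma)
ensemble_z = z_score(ensemble_green_mass * n_tokens, n_tokens, gamma)
```

Because the other models' green lists overlap model 0's only by chance, their contributions pull the averaged green mass back toward `gamma`, so `ensemble_z` comes out well below `single_z`.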

AI safety, Alignment, Papers
SIG 82 · HYP 00

arXiv cs.AI

LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability

LLM-FACETS is an open-source framework for evaluating LLM factuality, epistemic calibration, and reproducibility. It provides a web interface, a plugin architecture, deterministic metrics (BLEU, ROUGE, BERTScore) that run locally, log-probability visualization, multi-judge consensus, and RAG Triad metrics. It is designed for technical experts, domain experts, and compliance officers, in line with the EU AI Act and NIST standards.

Evals, AI safety, Alignment
SIG 78 · HYP 00

arXiv cs.CL

COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models

COFT is a training-free decoding method that reduces biases in LLM chain-of-thought generation. It uses masked counterfactual prompts and logit fusion to attenuate attribute-driven biases, with distribution-free marginal validity guarantees. Across 6 models, it reduces bias by 30-55% (median 38%) with negligible utility loss and ≤11% computational overhead.
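The general idea of fusing logits from a factual prompt and an attribute-masked counterfactual prompt can be sketched as below. The summary does not give COFT's actual fusion rule, so the convex combination, the `weight` parameter, the function names, and the toy logits are all hypothetical, and the conformal calibration step is omitted entirely.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D logit vector."""
    shifted = np.exp(logits - logits.max())
    return shifted / shifted.sum()


def fuse_logits(factual: np.ndarray, counterfactual: np.ndarray, weight: float) -> np.ndarray:
    """Hypothetical fusion rule: convex combination of the two logit vectors."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (1.0 - weight) * factual + weight * counterfactual


# Toy next-token logits over a 4-token vocabulary (made-up numbers).
base = np.array([1.0, 0.5, 0.0, -0.5])
attribute_shift = np.array([1.5, 0.0, 0.0, 0.0])  # bias toward token 0

factual = base + attribute_shift  # prompt with the sensitive attribute present
counterfactual = base             # same prompt with the attribute masked

fused = fuse_logits(factual, counterfactual, weight=0.5)
p_factual = softmax(factual)
p_fused = softmax(fused)
```

With `weight=0.5`, the attribute-driven shift on token 0 is halved in `fused`, so `p_fused[0]` is lower than `p_factual[0]` while the ordering of the remaining tokens is unchanged.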

Reasoning, AI safety, Alignment
SIG 78 · HYP 00

arXiv cs.LG

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

A new counterfactual evaluation metric (CSS) reveals that six frontier models ranked similarly on traditional coverage-based metrics rank in nearly opposite order when assessed on their ability to update clinical recommendations in response to oncology case mutations. All models fail on surgery-status interventions, a safety blind spot invisible to coverage metrics.

Benchmarks, Evals, AI Agents
SIG 82 · HYP 00

arXiv cs.LG

Supervised Training Rapidly Degrades Early Visual Cortex Alignment Across Biologically Plausible Learning Rules

Untrained neural networks match early visual cortex better than trained networks. Study on 720 THINGS images and fMRI from 3 subjects shows one training epoch reduces V1 alignment by 25-90% depending on learning rule. Backpropagation degrades most (Δr = -0.080), while predictive coding and STDP preserve alignment better (Δr ~ -0.04).

Papers, Reasoning, Alignment
SIG 75 · HYP 00