arXiv cs.LG · 1 June 2026

idSCD: Identifying Training Datasets through Semantic Correlation Descriptors

In three lines: A new method identifies whether a dataset was used to train a model by analyzing the semantic correlation descriptors (SCDs) the model learns internally. This white-box approach outperforms the black-box baselines RMIA and LiRA, with gains of up to 60% in ROC-AUC on NLI, emotion, and medical text classification tasks.
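
For readers unfamiliar with the metric, here is a minimal sketch of how ROC-AUC could be computed for a dataset-identification score. It does not reproduce the paper's method: the function name `roc_auc`, the scores, and the labels are all illustrative assumptions, with label 1 standing for "used in training" and label 0 for "not used".

```python
def roc_auc(scores, labels):
    """ROC-AUC as the probability that a randomly chosen positive (label 1)
    receives a higher score than a randomly chosen negative (label 0).
    Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical identification scores (higher = more likely used in training).
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(f"ROC-AUC: {auc:.3f}")
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two groups.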

Papers AI safety Evals

Summary generated by Claude — human-verified
