arXiv cs.CL

Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

Signal
72
Hype
18
In three lines
An arXiv study of implicit Chinese toxicity attacks (CITA). A three-stage red-teaming framework (harmful intent learning, implicit toxicity enhancement, obfuscation rewriting) generates the evaluation data. The seven detectors tested show an average miss-detection rate of 69.48%. CITD, a defense model fine-tuned on CITA data, improves robustness.
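To make the headline metric concrete, here is a minimal sketch of how an average miss-detection rate could be computed. It assumes every evaluation sample is toxic (so each unflagged sample counts as a miss) and that the average is an unweighted mean over detectors; the summary does not say which averaging scheme the paper uses. The function names, detector names, and toy flags are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def miss_detection_rate(flags: Sequence[bool]) -> float:
    """Fraction of toxic samples that a detector failed to flag.

    flags[i] is True when the detector flagged toxic sample i.
    Assumption: every sample is toxic, so each False is a miss.
    """
    if not flags:
        raise ValueError("need at least one sample")
    misses = sum(1 for flagged in flags if not flagged)
    return misses / len(flags)


def average_miss_rate(per_detector: Dict[str, Sequence[bool]]) -> float:
    """Unweighted mean of per-detector miss rates (assumed averaging scheme)."""
    if not per_detector:
        raise ValueError("need at least one detector")
    rates = [miss_detection_rate(flags) for flags in per_detector.values()]
    return sum(rates) / len(rates)


# Made-up flags for two hypothetical detectors; not data from the paper.
toy = {
    "detector_a": [True, False, False, False],
    "detector_b": [False, False, True, True],
}
print(average_miss_rate(toy))
```

Under this definition, a higher value means the detectors let more attack samples through, which is the sense in which the summary reports 69.48%.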
AI safety · Alignment · Evals · Benchmarks

Summary generated by Claude — human-verified