Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting
Signal: 72
Hype: 18
In three lines
An arXiv study on implicit Chinese toxicity attacks (CITA). A three-stage red-teaming framework (harmful intent learning, implicit toxicity enhancement, obfuscation rewriting) generates the evaluation data. Seven tested detectors show an average miss-detection rate of 69.48%. CITD, a defense model fine-tuned on CITA data, improves robustness.
Summary generated by Claude — human-verified