arXiv cs.CL

Your Multimodal Speech Model Says I Have a Face for Radio

Signal: 72
Hype: 15
In three lines
A bias evaluation of audio-visual (multimodal) speech recognition models. The researchers create videos that pair different faces with identical audio, then measure how transcription accuracy varies with the face shown. Findings: quality-of-service gaps of up to 4.05 word error rate (WER) points across gender, ethnicity, and their intersections on Whisper-Flamingo and Gemini.
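To make the metric concrete, here is a minimal Python sketch of how a word error rate (WER) gap between face groups could be computed when the audio, and therefore the reference transcript, is held fixed. The summary does not describe the paper's exact pipeline, so everything here is an assumption for illustration: the function names (`word_errors`, `wer_by_group`, `max_gap`), the group labels, the lowercase-and-split normalization, the pooling of errors per group, and the definition of the gap as the highest group WER minus the lowest. The sample transcripts are toy data, not results from the paper.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Pool errors per group and return WER in percent for each group.

    samples: iterable of (group, reference, hypothesis) triples.
    Assumption: errors are pooled over all utterances in a group.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        err, n_words = word_errors(reference, hypothesis)
        errors[group] += err
        words[group] += n_words
    return {g: 100.0 * errors[g] / words[g] for g in errors if words[g] > 0}


def max_gap(wer):
    """Gap in WER points between the worst and best served groups."""
    return max(wer.values()) - min(wer.values())


# Toy data: identical audio (same reference), different hypothetical faces.
samples = [
    ("face_a", "the cat sat on the mat", "the cat sat on the mat"),
    ("face_b", "the cat sat on the mat", "the cat sat on a mat"),
]
group_wer = wer_by_group(samples)
print(group_wer, max_gap(group_wer))
```

Because the reference transcript is the same for every face, any difference in a group's WER under this setup can only come from the visual input, which is the quantity a "gap in WER points" is meant to capture.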
Tags: Vision, Voice, Benchmarks, AI safety, Alignment

Summary generated by Claude — human-verified