The Decoder · May 24, 2026

ByteDance study finds that asking LMMs questions beats making them transcribe text for long-document training

In 3 lines: ByteDance Seed shows that a 7B model answers questions about long, visually rich documents better than much larger models, even on documents 4× longer than those seen during training. The key approach: training by question answering rather than text transcription.

Vision Benchmarks Fine-tuning

Summary generated by Claude, verified by a human
