arXiv cs.CL · 4 June 2026

Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues

In three lines: A new Fine-grained Fragment Retrieval (FFR) approach retrieves coherent multi-utterance, multi-image fragments from long-form multimodal dialogues. Two models are proposed: F2RVLM (generation plus RL with multi-objective rewards) for single-dialogue retrieval, and FFRS (two-stage indexing plus retrieval) for corpus-scale retrieval. The MLDR dataset is introduced, and the authors report superior performance on benchmarks.

RAG · Vision · Embeddings · Reinforcement learning · Benchmarks

Summary generated by Claude — human-verified
