LFRAG: Layout-oriented Fine-grained Retrieval-Augmented Generation on Multimodal Document Understanding
LFRAG is a multimodal RAG system that retrieves at the block level rather than the page level. A semantic-layout fusion encoder integrates local semantics with global context. On the LFDocQA benchmark, LFRAG improves answer accuracy by 7.20% and reduces token consumption by 73.07%.
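
To make the block-level versus page-level distinction concrete, the following is a minimal sketch and not the LFRAG implementation. It assumes a toy bag-of-words embedding in place of a learned encoder, and a hypothetical fusion rule (a weighted sum of a block's own vector and its page's vector, with an illustrative weight `alpha`) standing in for the semantic-layout fusion encoder. The names `Block`, `fused`, `retrieve_blocks`, and `retrieve_pages`, as well as the sample document, are all invented for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    page: int  # index of the page the block sits on
    text: str  # text content of the layout block


def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a stand-in for a learned encoder (assumption).
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def page_text(blocks: List[Block], page: int) -> str:
    return " ".join(b.text for b in blocks if b.page == page)


def fused(block: Block, blocks: List[Block], alpha: float = 0.7) -> Counter:
    # Hypothetical fusion: mix the block's local vector with its page-level
    # context vector. This is an illustrative rule, not the paper's encoder.
    local = embed(block.text)
    context = embed(page_text(blocks, block.page))
    return Counter({t: alpha * local[t] + (1 - alpha) * context[t] for t in context})


def retrieve_blocks(query: str, blocks: List[Block], k: int = 1) -> List[Block]:
    # Block-level retrieval: rank individual blocks by similarity to the query.
    q = embed(query)
    return sorted(blocks, key=lambda b: cosine(q, fused(b, blocks)), reverse=True)[:k]


def retrieve_pages(query: str, blocks: List[Block], k: int = 1) -> List[int]:
    # Page-level retrieval: rank whole pages by similarity to the query.
    q = embed(query)
    pages = sorted({b.page for b in blocks})
    return sorted(pages, key=lambda p: cosine(q, embed(page_text(blocks, p))), reverse=True)[:k]


def token_count(text: str) -> int:
    # Whitespace tokens as a rough proxy for context length.
    return len(text.split())


# Invented sample document with two pages.
blocks = [
    Block(0, "quarterly revenue grew to 12 million dollars"),
    Block(0, "the company was founded in 1998 in oslo"),
    Block(0, "figure 1 shows the office floor plan"),
    Block(1, "employee headcount reached 240 in december"),
    Block(1, "table 2 lists travel expenses by region"),
]
query = "how much quarterly revenue"

top_blocks = retrieve_blocks(query, blocks, k=1)
top_pages = retrieve_pages(query, blocks, k=1)
block_tokens = sum(token_count(b.text) for b in top_blocks)
page_tokens = sum(token_count(page_text(blocks, p)) for p in top_pages)
print(top_blocks[0].text, block_tokens, page_tokens)
```

In this toy setting both strategies locate the same page, but the block-level path passes only the matching block to the generator while the page-level path passes the entire page. The token counts here illustrate the mechanism only and say nothing about the figures reported on LFDocQA.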