arXiv cs.CL

LEDGER: A Long-Context Benchmark of Corporate Annual Reports for Grounded Financial Retrieval and Extraction

**In three lines:** LEDGER is a benchmark of 4,999 digitized corporate annual reports to evaluate LLM long-context capabilities in finance. The corpus includes 31 consolidated financial KPIs, 118,048 TREC-style retrieval questions, and extraction tasks on numerically dense documents. Case study: correlation between CEO rhetoric and post-publication market impact.

## LEDGER: Why This Long-Context Financial Benchmark Matters

### 1. The Prior State — and Why It Was Inadequate

LLM evaluation on financial documents relied almost exclusively on plain-text SEC 10-K filings, often truncated, paired with a few dozen question-answer pairs. The most-cited benchmarks (FinQA, ConvFinQA, TAT-QA) operate on isolated excerpts of a few paragraphs, not full documents. The practical consequence: researchers were measuring a model's ability to read a fragment, not to navigate a real 150-300 page annual report mixing tables, charts, footnotes, and CEO narrative prose.

With context windows now reaching 128K to 1M tokens (GPT-4o, Gemini 1.5 Pro, Claude 3.5), this gap became critical. Models could theoretically ingest an entire report, but no serious benchmark verified what they actually extracted from it.

### 2. What LEDGER Delivers

**The corpus**: 4,999 complete corporate annual reports — not sanitized regulatory 10-K filings, but the full shareholder-distributed documents with figures, tables, and CEO letters. Each report is labeled with 31 consolidated financial KPIs (revenue, EBITDA, capex, net debt, etc.) directly linked to market reaction at the earnings publication date.

**Three evaluation tiers**:

- *TREC-style retrieval*: 118,048 natural language questions with page-level relevance judgments. This is comparable in scale to major generalist retrieval benchmarks (MS MARCO, BEIR), but applied to numerically dense documents.
- *Conversational needle-in-a-haystack*: single-value lookup within a long document, a direct test of attention precision over extended context.
- *Full KPI extraction*: end-to-end task on complete documents, with a provided scoring toolchain.
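To make the KPI extraction tier concrete, here is a minimal sketch of how extracted values could be scored against gold labels. It is not LEDGER's scoring toolchain: the function names, the 1% relative tolerance, and the toy values are all assumptions chosen for illustration.

```python
from typing import Dict, Optional


def kpi_match(predicted: Optional[float], gold: float, rel_tol: float = 0.01) -> bool:
    """Return True if the prediction is within rel_tol of the gold value.

    The 1% default tolerance is an illustrative assumption, not a LEDGER setting.
    """
    if predicted is None:  # the model returned nothing for this KPI
        return False
    if gold == 0:
        return abs(predicted) <= rel_tol
    return abs(predicted - gold) / abs(gold) <= rel_tol


def score_report(pred: Dict[str, Optional[float]],
                 gold: Dict[str, float],
                 rel_tol: float = 0.01) -> float:
    """Fraction of gold KPIs that the prediction matches within tolerance."""
    if not gold:
        return 0.0
    hits = sum(kpi_match(pred.get(name), value, rel_tol) for name, value in gold.items())
    return hits / len(gold)


# Made-up values for one hypothetical report (same unit for every KPI).
gold = {"revenue": 1250.0, "ebitda": 310.0, "capex": 95.0, "net_debt": 420.0}
pred = {"revenue": 1250.0, "ebitda": 312.0, "capex": None, "net_debt": 4200.0}

accuracy = score_report(pred, gold)
print(f"KPI accuracy: {accuracy:.2f}")
```

In this toy case the missing capex and the net debt value that is off by a factor of ten both count as misses, which is the kind of unit or scale error a tolerance-based scorer is meant to catch.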

**The infrastructure**: OCR annotations with quantified inter-annotator agreement, plus a complete extraction/validation/scoring pipeline. This is not a throwaway dataset — it is a reproducible protocol.
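The summary does not say which agreement statistic the authors report. As one common option, Cohen's kappa for two annotators could be computed as in the sketch below; the labels and annotator data are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical region labels assigned to six OCR blocks by two annotators.
annotator_1 = ["table", "text", "table", "figure", "text", "table"]
annotator_2 = ["table", "text", "text", "figure", "text", "table"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```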

### 3. The CEO Rhetoric Case Study: Signal or Noise?

LEDGER's most original demonstration links CEO letter rhetorical register to post-publication market impact. This type of analysis (narrative sentiment vs. abnormal returns) existed in behavioral finance literature but required fragile ad hoc pipelines. LEDGER provides the substrate to run it at scale across 4,999 documents with market KPIs already temporally aligned. For quant teams and systematic funds, this is a potentially actionable alternative signal — provided one controls for collinearity with the fundamental KPIs already in the corpus.
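One simple way to apply the collinearity control mentioned above is a partial correlation: regress both the rhetoric score and the market reaction on a fundamental KPI, then correlate the residuals. The sketch below uses a single control variable and made-up numbers; the variable names (`tone`, `revenue_growth`, `abnormal_return`) are assumptions, not fields from the LEDGER corpus.

```python
from statistics import mean
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def residualize(y: Sequence[float], x: Sequence[float]) -> List[float]:
    """Residuals of a one-variable ordinary least squares fit of y on x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


# Made-up toy values, one entry per hypothetical report.
revenue_growth = [0.02, 0.05, -0.01, 0.08, 0.03, 0.00]
tone = [0.1, 0.4, -0.2, 0.6, 0.3, 0.1]
abnormal_return = [0.01, 0.02, -0.015, 0.03, 0.005, 0.0]

raw = pearson(tone, abnormal_return)
# Strip out the part of each series that the control variable already explains.
tone_resid = residualize(tone, revenue_growth)
return_resid = residualize(abnormal_return, revenue_growth)
partial = pearson(tone_resid, return_resid)

print(f"raw correlation: {raw:.3f}, partial correlation: {partial:.3f}")
```

If the partial correlation is much weaker than the raw one, the rhetoric score is mostly restating what the fundamentals already say. With 31 KPIs available, a real analysis would use a multivariate regression rather than a single control.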

### 4. Winners and Losers

**Potential losers**: proprietary financial RAG solution vendors who differentiated on complex document parsing quality. LEDGER now provides a public yardstick for objective comparison. Models that performed well on FinQA (short excerpts, simple arithmetic) may prove mediocre on long-context KPI extraction — exposing commercially inconvenient gaps.

**Winners**: research teams working on financial RAG finally have a benchmark commensurate with real document complexity. The 118,048 TREC-style questions enable statistically robust retrieval system evaluation (precision, recall, nDCG) where prior financial benchmarks offered far smaller question sets. MLOps practitioners in finance can integrate LEDGER into regression pipelines to detect model degradation on real business tasks.
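As an illustration of the retrieval metrics named above, the following sketch computes recall@k and nDCG@k for a single question with binary page-level relevance. The page numbers and function names are hypothetical, and the actual LEDGER judgments may be graded rather than binary.

```python
import math
from typing import List, Set


def recall_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """Share of relevant pages that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for page in ranked[:k] if page in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """nDCG@k with binary gains: 1 for a relevant page, 0 otherwise."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, page in enumerate(ranked[:k]) if page in relevant)
    # Ideal ranking places every relevant page at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical system output: page numbers ranked for one question.
ranked_pages = [47, 12, 48, 103, 5]
relevant_pages = {47, 48}  # hypothetical page-level relevance judgments

recall = recall_at_k(ranked_pages, relevant_pages, k=3)
ndcg = ndcg_at_k(ranked_pages, relevant_pages, k=3)
print(f"recall@3 = {recall:.2f}, nDCG@3 = {ndcg:.3f}")
```

Averaging these per-question scores across the full question set gives the system-level numbers a regression pipeline would track from one model version to the next.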

**Caveats to watch**: the corpus covers annual reports — annual cadence, not quarterly. Sectoral and temporal coverage is not detailed in the abstract. If the corpus skews toward large-cap US companies or a specific period (pre/post-COVID, for instance), model performance conclusions may not generalize to small caps or non-English markets. OCR quality on complex financial tables remains a hard problem — the inter-annotator agreement figures will be the first indicator to examine in the full paper.

Summary generated by Claude — human-verified