arXiv cs.LG

Learn from your own latents and not from tokens: A sample-complexity theory

Signal: 82 | Hype: 15
In three lines: Theoretical paper on the sample complexity of models that predict their own latent representations (data2vec, JEPA). It proves that latent prediction reduces sample complexity from exponential in the depth L to constant in L (up to logarithmic factors), compared with token prediction. The result is validated on probabilistic grammars and neural networks.

## Latent prediction vs. token prediction: theory catches up with empirics

### 1. What is proved — and why it's non-trivial

Since data2vec (Meta, 2022) and JEPA (LeCun, 2022-2024), the community has observed empirically that predicting one's own latent representations converges faster than predicting raw tokens. But "faster" remained quantitatively vague. arXiv:2605.27734 establishes the first formal bound: for data generated by a probabilistic context-free grammar (PCFG) of depth L, supervised learning and token-level self-supervised learning (SSL) each require a number of samples **exponential in L** to recover the underlying latent tree. Latent prediction (JEPA/data2vec-style) recovers the tree with a number of samples **constant in L** (up to logarithmic factors).
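
Stated schematically, the two regimes look as follows. The symbols $n_{\text{token}}$ and $n_{\text{latent}}$ are notation introduced here for illustration only; the exact constants and the dependence on other grammar parameters (vocabulary size, branching factor) are not given in this summary.

$$
n_{\text{token}}(L) = e^{\Omega(L)}, \qquad n_{\text{latent}}(L) = \tilde{O}(1) \quad \text{as a function of } L
$$

Here $\tilde{O}$ hides the logarithmic factors mentioned above.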

The PCFG is a deliberately tractable model choice: it generates visible sequences by recursively applying production rules along a tree of hidden symbols of depth L. It serves as a formal proxy for the compositional structure of natural language and images — precisely the regime where LLMs consume orders of magnitude more data than biological learners, as the abstract notes.
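
To make the generative process concrete, here is a minimal sketch of a depth-L grammar sampler. It is not the paper's exact construction: the fixed branching factor, the uniform choice among production rules, the shared alphabet size at every level, and the names `make_grammar` and `sample` are all assumptions made for illustration.

```python
import random


def make_grammar(num_symbols, branching, rules_per_symbol, depth, rng):
    """Draw random production rules for each level of the hierarchy.

    grammar[level][symbol] is a list of alternative right-hand sides, each a
    tuple of `branching` symbols belonging to the next level down.
    (Hypothetical toy setup, not the paper's grammar.)
    """
    grammar = []
    for _ in range(depth):
        level_rules = {}
        for symbol in range(num_symbols):
            level_rules[symbol] = [
                tuple(rng.randrange(num_symbols) for _ in range(branching))
                for _ in range(rules_per_symbol)
            ]
        grammar.append(level_rules)
    return grammar


def sample(grammar, root, rng):
    """Expand `root` top-down and return (levels, leaves).

    levels[0] is [root]; levels[-1] is the visible sequence. Everything in
    between is hidden from the learner.
    """
    levels = [[root]]
    for level_rules in grammar:
        expanded = []
        for symbol in levels[-1]:
            # Pick one of the alternative productions uniformly at random.
            expanded.extend(rng.choice(level_rules[symbol]))
        levels.append(expanded)
    return levels, levels[-1]


rng = random.Random(0)
L = 3
grammar = make_grammar(num_symbols=4, branching=2, rules_per_symbol=2, depth=L, rng=rng)
levels, leaves = sample(grammar, root=0, rng=rng)
print(len(leaves))  # branching ** L = 8 visible tokens
```

The learner only ever sees `leaves`; the intermediate entries of `levels` play the role of the hidden tree that must be recovered.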

### 2. The central result and its three validations

The proof rests on the observation that latent prediction short-circuits the need to infer all L levels of the tree from observable leaves. By directly predicting intermediate representations, the model only needs to estimate local statistics at each level, breaking the exponential dependence on depth.

The authors validate this result on three fronts:

- **Hierarchical clustering algorithm**: direct implementation of the theoretical bound, confirms constant complexity in L.
- **End-to-end neural network**: predictor-clusterer modules that predict their own latents at each level via gradient descent, proof that the bound holds in a realistic parametric setting.
- **data2vec analysis**: the first formal sample-complexity analysis of data2vec, showing it *implicitly* performs hierarchical latent prediction. This was not obvious from the original architecture description.
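
For readers unfamiliar with what "predicting one's own latents" means mechanically, the following is a minimal sketch of a data2vec-style update: a student sees a masked input and regresses onto the latent produced by a teacher whose weights are an exponential moving average of the student's. It is a toy under loud assumptions: linear encoders, a squared-error loss, hand-written gradients, and made-up numbers. It does not reproduce the paper's predictor-clusterer modules or the real data2vec architecture, and the function names are hypothetical.

```python
def matvec(weights, x):
    """Apply a linear encoder: one latent coordinate per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def latent_loss(student_latent, teacher_latent):
    """Mean squared error between predicted and target latents."""
    n = len(teacher_latent)
    return sum((s - t) ** 2 for s, t in zip(student_latent, teacher_latent)) / n


def student_step(student, x_masked, teacher_latent, lr):
    """One gradient-descent step on the latent loss for the linear student."""
    pred = matvec(student, x_masked)
    n = len(pred)
    return [
        [w - lr * 2.0 * (p - t) / n * xi for w, xi in zip(row, x_masked)]
        for row, p, t in zip(student, pred, teacher_latent)
    ]


def ema_update(teacher, student, decay):
    """Teacher weights track an exponential moving average of the student."""
    return [
        [decay * tw + (1.0 - decay) * sw for tw, sw in zip(trow, srow)]
        for trow, srow in zip(teacher, student)
    ]


# Illustrative values only.
x = [1.0, 2.0, 3.0, 4.0]
x_masked = [1.0, 0.0, 3.0, 0.0]  # the student sees a masked view
student = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]
teacher = [[0.2, 0.1, 0.0, 0.1], [0.0, 0.1, 0.2, 0.3]]

target = matvec(teacher, x)  # the target is a latent, not a token
before = latent_loss(matvec(student, x_masked), target)
student = student_step(student, x_masked, target, lr=0.01)
after = latent_loss(matvec(student, x_masked), target)
teacher = ema_update(teacher, student, decay=0.99)
print(before > after)  # True: the loss drops after one step
```

The point of the sketch is the training target: the student is never asked to reproduce the raw input `x`, only the teacher's representation of it.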

### 3. The conclusion on H-JEPA: an unexpected loser

The most operationally significant result concerns H-JEPA (Hierarchical JEPA), the explicitly hierarchical variant proposed to stack multiple levels of latent prediction. The authors conclude that **H-JEPA is "largely redundant"**: data2vec already implicitly achieves this hierarchy, and the constant-in-L complexity bound is reachable without explicit stacking.

This is a direct negative signal for teams investing in complex H-JEPA architectures. If the theory holds in regimes more general than PCFGs, the additional engineering of H-JEPA yields no sample-efficiency gain — it adds architectural complexity without provable benefit.

Other potential losers: pure contrastive SSL approaches (SimCLR, MoCo) and standard autoregressive LLMs, which remain in the exponential-in-L regime under this framework. Token prediction — whether masked (BERT) or causal (GPT) — does not benefit from the complexity reduction demonstrated here.

### 4. Limits and actual scope

The PCFG is a toy model relative to the actual training data distribution of LLMs. The authors do not claim their bounds apply directly to GPT-4 or V-JEPA. The open question is how closely the compositional structure of real data resembles a depth-L PCFG — and whether representations learned by current transformers behave like the theoretical latents in the model.

Nevertheless, the result has significant calibration value: it offers a formal, mechanistic candidate explanation for the empirical advantage of JEPA observed on visual benchmarks (V-JEPA vs. MAE on Kinetics, for instance) and lends theoretical support to LeCun's intuition about world model architecture. For practitioners choosing between pre-training paradigms, this paper provides the first solid theoretical argument in favor of latent prediction, where previously there was only empirical observation.


Summary generated by Claude — human-verified