Dynamic Reflections: Probing Video Representations with Text Alignment

Tyler Zhu¹*, Tengda Han², Leonidas Guibas², Viorica Pătrăucean², Maks Ovsjanikov²
¹Princeton University, ²Google DeepMind
*Work done while at Google DeepMind

Recent work has shown that models trained independently on different data types, like images and text, learn surprisingly similar internal structures. We conduct the first comprehensive study of video-text alignment and find that more data at test time dramatically improves alignment, completely training-free!

Abstract

The alignment of representations from different modalities has recently been shown to provide insights into the structural similarities and downstream capabilities of different encoders across diverse data types. While significant progress has been made in aligning images with text, the temporal nature of video data remains largely unexplored in this context. In this work, we conduct the first comprehensive study of video-text representation alignment, probing the capabilities of modern video and language encoders. Our findings reveal several key insights. First, we demonstrate that cross-modal alignment highly depends on the richness of both visual (static images vs. multi-frame videos) and text (single caption vs. a collection) data provided at test time, especially when using state-of-the-art video encoders. We propose parametric test-time scaling laws that capture this behavior and show remarkable predictive power against empirical observations. Second, we investigate the correlation between semantic alignment and performance on both semantic and non-semantic downstream tasks, providing initial evidence that strong alignment against text encoders may be linked to general-purpose video representation and understanding. Finally, we correlate temporal reasoning with cross-modal alignment, providing a challenging test-bed for vision and language models. Overall, our work introduces video-text alignment as an informative zero-shot way to probe the representation power of different encoders for spatio-temporal data.

Background: The Platonic Representation Hypothesis

A fascinating idea in machine learning is the Platonic Representation Hypothesis (PRH) [1]. It suggests that as we train larger and more capable neural networks on diverse data, their internal representations start to look alike. Even if one model is trained on images and another on text, they begin to organize their "understanding" of the world in a similar way. This shared structure is called a "Platonic representation."

This has been demonstrated for static data like images and text [2]. But what about video? Video adds the dimension of time, which contains rich information about dynamics, causality, and motion. Does the PRH extend to this more complex, dynamic domain? That's the central question our work explores.

How Do We Measure Alignment?

To quantify how "similar" two representation spaces are, we need a metric that can compare their geometric structures without requiring them to be in the same coordinate system. We use a method called Mutual $k$-Nearest Neighbors ($k$-NN) Alignment. The intuition is simple: if two spaces are well-aligned, then points that are close to each other in one space should also be close to each other in the other space.


Our pipeline for measuring alignment. We encode videos and their corresponding captions, build nearest-neighbor graphs in each embedding space, and measure the overlap between these graphs.

Here's how it works for a dataset of paired videos and text captions:

  1. We encode all videos using a vision model and all captions using a text model, creating two distinct embedding spaces.
  2. In each space, we build a nearest-neighbor graph, where each item (a video or caption) is connected to its $k$ closest neighbors.
  3. The alignment score is the fraction of edges that are shared between the two graphs. For example, if video A is a neighbor of video B, is caption A also a neighbor of caption B?

A higher score means the neighborhood structures are more consistent across the two modalities, indicating stronger alignment. This metric is powerful because it's "zero-shot"—it requires no training or fine-tuning to compare the models.
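
To make the metric concrete, here is a minimal NumPy sketch of a mutual $k$-NN alignment score. It assumes cosine similarity and the same $k$ in both spaces; the function and variable names are illustrative, not taken from the paper's code.

```python
import numpy as np

def mutual_knn_alignment(video_emb: np.ndarray, text_emb: np.ndarray, k: int = 10) -> float:
    """Fraction of shared k-NN edges between two embedding spaces.

    Row i of video_emb and text_emb must describe the same video/caption pair;
    the two spaces may have different dimensionality.
    """
    def knn_sets(emb: np.ndarray) -> list[set]:
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # cosine similarity
        sim = emb @ emb.T
        np.fill_diagonal(sim, -np.inf)            # never count a point as its own neighbor
        idx = np.argsort(-sim, axis=1)[:, :k]     # k nearest neighbors per row
        return [set(row.tolist()) for row in idx]

    nn_video = knn_sets(video_emb)
    nn_text = knn_sets(text_emb)
    # Average per-item overlap between the two neighbor sets, normalized by k.
    return float(np.mean([len(a & b) / k for a, b in zip(nn_video, nn_text)]))
```

Given paired embeddings of shape (N, d), `mutual_knn_alignment(video_emb, text_emb, k=10)` returns a score in [0, 1], where higher means the two spaces share more neighborhood structure.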

Key Result 1: More Data at Test Time = Better Alignment

Our first major finding is that video-text alignment is highly dependent on the amount of data provided at test time. This holds for both the visual data (number of frames) and the textual data (number of captions). This "test-time scaling" effect is surprisingly strong and consistent.


Alignment scales with the number of frames (left) and captions (right) at test time. Video models (like VideoMAEv2) benefit more from extra frames than image models (like DINOv2).

As shown above, increasing the number of frames from a video consistently improves alignment. This makes sense: more frames provide a richer temporal context. Interestingly, dedicated video models like VideoMAEv2 benefit much more from additional frames than image models applied to video, like DINOv2.

Perhaps more surprisingly, using more captions for the same video also provides a significant boost. On the VaTeX dataset, which has 10 different human-written captions per video, we found that using all 10 captions improves alignment by 60% on average compared to using just one. This suggests that the diversity of language in multiple descriptions helps to ground the visual concepts more robustly. We even found that using an LLM to generate multiple synthetic captions from a single detailed one can improve alignment over using the original caption alone.
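
One simple way to feed several captions per video into the same pipeline is to pool their embeddings before building the text-side neighbor graph. The sketch below uses mean pooling purely for illustration; the paper's exact aggregation may differ.

```python
import numpy as np

def pool_captions(caption_embs: list[np.ndarray]) -> np.ndarray:
    """Collapse a variable number of caption embeddings per video into one
    text embedding per video by mean pooling. caption_embs[i]: (n_i, d)."""
    pooled = np.stack([embs.mean(axis=0) for embs in caption_embs])
    # L2-normalize so cosine similarities downstream stay well-behaved.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Example: 3 videos with 10, 4, and 1 captions (random stand-ins for real embeddings).
rng = np.random.default_rng(0)
caption_embs = [rng.normal(size=(n, 512)) for n in (10, 4, 1)]
text_emb = pool_captions(caption_embs)  # shape (3, 512), ready for mutual_knn_alignment
```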

Predictable Scaling at Test Time

This relationship between test-time data and alignment is so consistent that we can model it with a "scaling law." We found that a saturation-based model fits our observations remarkably well:

$$ \text{score}(n_f, n_c) = S_{\infty} - (C_f n_f^{-\alpha} + C_c n_c^{-\beta}) $$

Here, $n_f$ and $n_c$ are the number of frames and captions. The formula suggests that alignment approaches a maximum, "infinite data" score ($S_{\infty}$) as we add more data. The score is penalized by two terms that decay with the number of frames ($C_f n_f^{-\alpha}$) and captions ($C_c n_c^{-\beta}$). This model fits our empirical data with high fidelity (e.g., $R^2 > 0.97$ for VideoMAEv2), demonstrating that the benefit of additional test-time data is predictable. This is conceptually similar to scaling laws for model training, but applied for the first time to inference.
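
Fitting the five parameters ($S_{\infty}$, $C_f$, $\alpha$, $C_c$, $\beta$) to a grid of measured scores is a small least-squares problem. Below is a sketch using `scipy.optimize.curve_fit` on synthetic scores generated from the law itself; substitute measured alignment values to reproduce such a fit.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, s_inf, c_f, alpha, c_c, beta):
    """score(n_f, n_c) = S_inf - (C_f * n_f**-alpha + C_c * n_c**-beta)."""
    n_f, n_c = X
    return s_inf - (c_f * n_f ** (-alpha) + c_c * n_c ** (-beta))

# Grid of test-time settings: number of frames x number of captions.
n_f, n_c = np.meshgrid([1, 2, 4, 8, 16], [1, 2, 5, 10])
n_f, n_c = n_f.ravel().astype(float), n_c.ravel().astype(float)

# Synthetic scores for illustration only; replace with measured alignment values.
rng = np.random.default_rng(0)
scores = scaling_law((n_f, n_c), 0.30, 0.08, 0.6, 0.05, 0.7) + rng.normal(0, 0.002, n_f.size)

params, _ = curve_fit(scaling_law, (n_f, n_c), scores,
                      p0=[scores.max(), 0.1, 0.5, 0.1, 0.5], maxfev=10_000)
pred = scaling_law((n_f, n_c), *params)
r2 = 1 - np.sum((scores - pred) ** 2) / np.sum((scores - scores.mean()) ** 2)
print(f"S_inf={params[0]:.3f}, alpha={params[2]:.2f}, beta={params[4]:.2f}, R^2={r2:.3f}")
```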

The Power of Temporal Dynamics

A key question is whether dedicated video models, which are designed to understand temporal dynamics, show better emergent alignment than powerful image models simply applied frame-by-frame. Our results suggest they do.

When we provide only a single frame from a video, top-tier image models like DINOv2 often show stronger alignment than video models. This is expected, as they are specialists in static scenes. However, as we increase the number of frames, the picture changes dramatically. Video models like VideoMAEv2 show a much steeper improvement in alignment score as more frames are added. Their ability to process spatio-temporal information allows them to build a more comprehensive and semantically rich representation that aligns better with language.

This suggests that the temporal dynamics in video are not just noise, but a powerful signal for grounding semantics. Learning about motion, causality, and object interactions over time appears to help models build an internal structure that is more congruent with the concepts expressed in natural language. We also tested this on datasets designed to probe temporal understanding, like Test of Time and VideoComp, and found that while current models show some sensitivity to temporal order, there is still significant room for improvement.

Key Result 2: Alignment is a Proxy for Model Quality

If this emergent alignment is meaningful, it should correlate with how well a video model performs on other tasks. We tested this by comparing the alignment scores of various self-supervised video models (trained without text) to their performance on a range of downstream tasks.


Stronger video-text alignment correlates with better performance on both semantic (action classification) and geometric (tracking, depth) downstream tasks.

As shown above, we found a strong positive correlation. Models that align better with text also tend to perform better on semantic tasks like action classification (on datasets like Kinetics and SSv2) and, surprisingly, even on non-semantic, geometric tasks like depth estimation, camera pose estimation, and object tracking.

This is a crucial finding. It suggests that video-text alignment can serve as a powerful, zero-shot indicator of a video model's general capabilities. Evaluating large video models is notoriously expensive, often requiring extensive fine-tuning on multiple benchmarks. Our alignment metric provides a cheap and fast alternative: simply by measuring how well a video model's representations align with a standard text model, we can get a strong signal about its potential performance on a wide array of tasks, saving significant time and computational resources.
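
Checking this relationship for your own set of encoders takes only a few lines once alignment scores and downstream numbers are in hand. A minimal sketch using Spearman rank correlation; the values below are placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder numbers for a handful of video encoders (not the paper's results).
alignment_scores = np.array([0.12, 0.18, 0.21, 0.26, 0.31])
downstream_top1  = np.array([55.0, 61.2, 63.5, 67.9, 70.4])  # e.g. action classification accuracy

rho, pval = spearmanr(alignment_scores, downstream_top1)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```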

Conclusion

Our work is the first comprehensive study of emergent alignment between video and text representations. We've shown that the principles of the Platonic Representation Hypothesis extend to the dynamic world of video. Our key takeaways are:

  1. Test-time data matters: Richer visual and textual data at inference time significantly boosts alignment. This holds for both the number of video frames and the number of diverse captions.
  2. Predictable scaling: This relationship follows predictable scaling laws, which can help in designing data collection and model evaluation strategies.
  3. Alignment as a proxy for quality: Stronger alignment with text is a good indicator of a video model's general-purpose representation capabilities, across both semantic and geometric tasks.

Overall, we introduce video-text alignment as an informative, zero-shot way to probe the representation power of different encoders for spatio-temporal data. The temporal dimension in video appears to provide a powerful signal for grounding semantics, opening up exciting avenues for future research in multimodal learning.

Related Literature

  1. Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. "The Platonic Representation Hypothesis." In Proceedings of the 41st International Conference on Machine Learning, 2024.
  2. Jishnu Maniparambil, Matthew Gwilliam, Puyuan Gu, Suraj Srinivas, Abhinav Shrivastava, and Sudeep Mallya. "Vision-Language Models are Zero-Shot Representation Aligners." arXiv preprint arXiv:2402.19427, 2024.

BibTeX

@article{zhu2025dynamic,
      title={Dynamic Reflections: Probing Video Representations with Text Alignment},
      author={Tyler Zhu and Tengda Han and Leonidas Guibas and Viorica Pătrăucean and Maks Ovsjanikov},
      journal={arXiv preprint arXiv:2511.02767},
      year={2025},
}