Xie Saining's Team Presents New Research: Spatial Structure as the Linchpin of iREPA

12/16/2025

Xie Saining's research team has just unveiled a new study, one that grew out of a lively debate with a netizen more than four months ago.

The netizen posited that self-supervised learning (SSL) models ought to be tailored for dense tasks (REPA, VLMs, and the like), since these tasks hinge on the spatial and local information carried by patch tokens rather than on the global classification ability represented by the [CLS] token.

Xie Saining disagreed, arguing that using patch tokens does not by itself make a task "dense." He pointed out that VLM and REPA performance tracks IN1K scores closely but patch-level correspondence only loosely, and suggested that the real distinction is not the [CLS] token versus patch tokens but high-level semantics versus low-level pixel similarity.

Three months later, however, Xie Saining conceded that his initial assessment had been incomplete, and that the new paper, iREPA, offers a deeper insight.

On X, Xie Saining remarked that diffusion models act as renderers of their underlying representations, a framing that, he argued, makes it far clearer what those representations actually are.

He called the exchange a "small-scale experiment in the new internet watercooler effect": people argue, discuss, and then try to turn the debate into genuine scientific inquiry.

Representation alignment (REPA) steers generative training by distilling representations from powerful pre-trained visual encoders into intermediate diffusion features.
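To make the recipe concrete, here is a minimal PyTorch sketch of a REPA-style alignment objective; the `RepaProjector` module, the feature dimensions, and the loss form are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of a REPA-style alignment loss (illustrative, not the paper's code).
# `hidden` is an intermediate diffusion-transformer activation; `target` holds
# frozen patch features from a pretrained visual encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepaProjector(nn.Module):
    """Maps diffusion features into the encoder's feature space (hypothetical dims)."""
    def __init__(self, dit_dim: int = 1152, enc_dim: int = 768):
        super().__init__()
        # Baseline REPA uses an MLP projection head.
        self.proj = nn.Sequential(
            nn.Linear(dit_dim, dit_dim), nn.SiLU(), nn.Linear(dit_dim, enc_dim)
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.proj(hidden)  # (B, N, enc_dim)

def repa_loss(hidden: torch.Tensor, target: torch.Tensor,
              projector: RepaProjector) -> torch.Tensor:
    """Negative patch-wise cosine similarity between projected diffusion
    features and frozen encoder features."""
    pred = F.normalize(projector(hidden), dim=-1)
    tgt = F.normalize(target.detach(), dim=-1)  # the encoder is not trained
    return -(pred * tgt).sum(dim=-1).mean()
```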

Before that, a pivotal question needed answering: which facet of the target representation matters most for generation, the global semantic information or the spatial structure?

The prevailing consensus held that stronger global semantic performance yields better generative outcomes. To scrutinize this notion, the team ran a large-scale empirical analysis spanning 27 distinct visual encoders across a range of model scales.

The team discovered that although PE-Core-G achieved an impressive 82.8% accuracy on ImageNet-1K, its efficacy as a target representation for REPA was subpar.

Furthermore, larger variants within the same encoder family often deliver similar (e.g., DINOv2) or even worse generative performance when used for representation alignment.

In essence, more global semantic information does not automatically translate into better REPA performance: multiple trends show that global performance correlates only weakly with generative FID under REPA.

For instance, SAM2-S, with a mere 24.7% validation accuracy, outperformed models whose validation accuracy is roughly 60 points higher when used in REPA.

Within the same encoder family, larger encoders might boast higher validation accuracy but exhibit worse generative performance.

Infusing patch tokens with global information via the CLS token can enhance global performance but at the expense of generative performance.

The study substantiates that spatial structure, rather than global performance, is the more reliable indicator of generative performance.

It also shows that, across diverse model scales, spatial structure correlates far more strongly with generative FID (gFID) than linear-probing accuracy does.
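As a toy illustration of the kind of correlation analysis behind this claim, the snippet below compares how two per-encoder scores track gFID; all values are random placeholders, not the paper's measurements.

```python
# Toy comparison of how well two encoder-level scores predict gFID.
# In the real analysis, each array would hold one value per visual encoder.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
linear_probe = rng.uniform(0.2, 0.85, size=27)   # placeholder IN1K accuracies
spatial_score = rng.uniform(0.0, 1.0, size=27)   # placeholder spatial-structure scores
gfid = rng.uniform(5.0, 40.0, size=27)           # placeholder generative FIDs

for name, score in [("linear probe", linear_probe),
                    ("spatial structure", spatial_score)]:
    r, p = pearsonr(score, gfid)
    print(f"{name}: Pearson r = {r:+.2f} (p = {p:.3f})")
```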

The researchers introduced two direct modifications to the original REPA training scheme, improving the transfer of spatial features from the teacher (the visual encoder) to the student (the diffusion transformer).

The first modification replaces the multi-layer perceptron (MLP) projection with a lightweight convolutional layer that operates directly on the spatial grid. REPA's conventional MLP projector tends to lose spatial information when aligning diffusion features with the target representation; a simple convolutional layer preserves it better.
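Below is a minimal sketch of such a convolutional projector; the 3x3 kernel, channel widths, and square patch grid are assumptions made for illustration.

```python
# Sketch of a convolutional projector replacing REPA's MLP head (illustrative).
import torch
import torch.nn as nn

class ConvProjector(nn.Module):
    """Projects (B, N, C) diffusion tokens to the encoder dim with a 3x3 conv
    applied on the 2D patch grid, preserving local spatial structure."""
    def __init__(self, dit_dim: int = 1152, enc_dim: int = 768, grid: int = 16):
        super().__init__()
        self.grid = grid
        self.conv = nn.Conv2d(dit_dim, enc_dim, kernel_size=3, padding=1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, N, C = tokens.shape
        h = w = self.grid  # assumes a square patch grid, N == h * w
        x = tokens.transpose(1, 2).reshape(B, C, h, w)  # tokens -> feature map
        x = self.conv(x)                                # local spatial mixing
        return x.flatten(2).transpose(1, 2)             # back to (B, N, enc_dim)
```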

The second modification adds a spatial normalization layer to the image patch tokens of the target representation: by sacrificing some global information to amplify the spatial contrast between patch tokens, it yields better generative performance.
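The exact normalization operator isn't detailed here, but one plausible reading, sketched below, standardizes each feature channel across the spatial (token) axis, stripping out the shared global component so that per-patch contrast dominates; the paper's formulation may differ.

```python
# Hedged sketch of a spatial normalization over the teacher's patch tokens.
import torch

def spatial_normalize(target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """target: (B, N, C) patch tokens from the frozen visual encoder."""
    mean = target.mean(dim=1, keepdim=True)  # per-channel mean over patches
    std = target.std(dim=1, keepdim=True)    # per-channel spread over patches
    return (target - mean) / (std + eps)     # amplifies spatial contrast
```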

The results show that, across target representations and model scales, iREPA consistently converges faster than baseline REPA and improves generative quality for every visual encoder tested.

The spatial enhancements not only deliver consistent gains but also yield larger relative improvements at larger model scales, suggesting they scale in tandem with model size.

In ablation experiments, the spatial normalization layer and the convolutional projection layer each significantly accelerate convergence, with the best results achieved when the two are combined.

Finally, the team also incorporated the spatial enhancements into REPA-E and MeanFlow w/ REPA, obtaining consistent performance improvements in both.

References:

https://x.com/sainingxie/status/2000709656491286870

https://arxiv.org/abs/2512.10794

https://x.com/YouJiacheng/status/1957073253769380258
