Kaiming He's Latest Contribution: Just Image Transformer Revives Fundamental Denoising Models

12/01 2025

In the current landscape, denoising diffusion models have deviated from the traditional "denoising" concept.

Rather than directly predicting clean images, these models train neural networks to predict the noise or other noise-contaminated quantities.
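To make the distinction concrete, the sketch below writes out the three common prediction targets. It assumes a linear interpolation between the clean image and Gaussian noise, a common flow-matching-style convention; the paper's exact noise schedule may differ, so treat the formulas as illustrative rather than the authors' precise setup.

```python
import torch

def make_targets(x, t):
    """Illustrative prediction targets for a denoising model.

    Assumes x_t = (1 - t) * x + t * eps, a linear interpolation between the
    clean image x and Gaussian noise eps (a common flow-matching convention;
    the paper's exact schedule may differ). t is in (0, 1].
    """
    eps = torch.randn_like(x)                  # injected Gaussian noise
    x_t = (1 - t) * x + t * eps                # noisy input seen by the network
    v = eps - x                                # velocity, d(x_t)/dt
    return x_t, {"x": x, "eps": eps, "v": v}   # the three candidate targets

# x-prediction asks the network for the clean image, which is expected to lie
# on a low-dimensional manifold; eps- and v-prediction ask for quantities that
# carry the full-dimensional Gaussian noise component.
```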

Kaiming He, the visionary behind ResNet and an associate professor at MIT, has brought this discrepancy to light in his latest research paper.

There exists a fundamental distinction between predicting clean data and forecasting noisy quantities. According to the manifold hypothesis, natural data is expected to reside on a low-dimensional manifold, a characteristic that noisy quantities lack.

Drawing on this hypothesis, Kaiming He champions models that directly predict clean data. This allows networks that might appear to lack capacity to operate effectively in high-dimensional spaces.

It turns out that plain, pixel-based Transformer models, even rather coarse large-patch ones, can serve as strong generative models. They require no tokenizer, no pre-training, and no extra loss functions.

The overarching architecture is dubbed the "Just Image Transformer" (abbreviated as JiT). The researchers conducted studies on JiT/16 (with an image patch size of p=16) using 256×256 images, and on JiT/32 (p=32) with 512×512 images.
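The patch size translates directly into sequence length and per-token dimensionality. A quick back-of-the-envelope calculation for RGB images:

```python
def token_stats(image_size, patch_size, channels=3):
    """Tokens per image and raw pixel dimensions per token for a ViT-style model."""
    tokens = (image_size // patch_size) ** 2
    dims_per_token = channels * patch_size ** 2
    return tokens, dims_per_token

print(token_stats(256, 16))  # JiT/16 at 256x256 -> (256 tokens, 768 pixel dims per token)
print(token_stats(512, 32))  # JiT/32 at 512x512 -> (256 tokens, 3072 pixel dims per token)
```

At 512×512 with p=32, each token packs 3,072 raw pixel values, well above the 768-dimensional hidden size of the base model discussed below; this is exactly the high-dimensional regime where the choice of prediction target matters.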

Experiments have demonstrated that ordinary Vision Transformers (ViTs) operating on pixels can achieve remarkable performance using only x-prediction.

The core idea of ViT is the **Transformer on Patches (ToP)**, and the architecture in this work adheres to that principle.
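A rough sketch of that idea, splitting the image into non-overlapping patches and projecting each one into the Transformer's hidden space (class and variable names here are illustrative, not the paper's code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly project each one."""
    def __init__(self, patch_size=16, channels=3, hidden_dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(channels * patch_size ** 2, hidden_dim)

    def forward(self, images):                                      # (B, C, H, W)
        p = self.patch_size
        b, c, h, w = images.shape
        x = images.unfold(2, p, p).unfold(3, p, p)                  # (B, C, H/p, W/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)   # (B, tokens, C*p*p)
        return self.proj(x)                                         # (B, tokens, hidden_dim)
```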

The researchers have outlined nine potential combinations of loss space and prediction space. For each combination, they trained a base model (JiT-B) featuring a hidden layer size of 768 dimensions per token.
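Decoupling what the network outputs (the prediction space) from where the error is measured (the loss space) is what yields the 3×3 grid. Here is a minimal sketch for the x-prediction column, reusing the interpolation and targets dictionary from the earlier snippet (still an illustrative convention, not the paper's exact formulation):

```python
import torch.nn.functional as F

def convert_x_pred(x_hat, x_t, t):
    """Derive eps- and v-estimates from an x-prediction, assuming
    x_t = (1 - t) * x + t * eps with t > 0."""
    eps_hat = (x_t - (1 - t) * x_hat) / t      # invert the interpolation
    v_hat = eps_hat - x_hat                    # implied velocity estimate
    return {"x": x_hat, "eps": eps_hat, "v": v_hat}

def loss_in_space(x_hat, x_t, t, targets, loss_space="v"):
    """The network predicts x, but the squared error is measured in loss_space."""
    preds = convert_x_pred(x_hat, x_t, t)
    return F.mse_loss(preds[loss_space], targets[loss_space])
```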

The findings indicate that when model performance is already at a high level, introducing an appropriate amount of noise can be advantageous.

In the context of x-prediction, there is no necessity to increase the number of hidden units.

Similar to many other neural network applications, network design can be largely independent of the observed dimensionality. While increasing the number of hidden units may offer benefits, it is not a critical factor.

The researchers incorporated several widely used general-purpose enhancements: SwiGLU, RMSNorm, RoPE, and qk-norm, all originally developed for language models. They also explored in-context class conditioning.
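These are standard building blocks in recent language models. A compact, generic sketch of two of them follows (generic implementations, not the paper's code); RoPE and qk-norm are only described in comments to keep it short:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square normalization without mean subtraction or bias."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    """Gated MLP: SiLU(x W_gate) * (x W_up), followed by a down projection."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

# RoPE rotates query/key pairs by position-dependent angles, and qk-norm applies
# a normalization to queries and keys before attention; both are omitted here.
```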

For the ImageNet 256×256 results, FID and IS were evaluated on 50,000 samples. In the comparison, the "pre-training" entry refers to reliance on a pre-trained VGG classifier, and the reported parameter counts cover the generator and token decoder while excluding other pre-trained components; these counts are roughly proportional to the computational cost per training and inference iteration.

The ImageNet 512×512 results revealed that JiT, with its larger image patch sizes, can achieve favorable results with reduced computational effort.

Although v-prediction appears to be the "natural" parameterization for v-loss, its loss value is approximately 25% higher than that of x-prediction. This comparison implies that the task of x-prediction is inherently simpler, as the data lies on a low-dimensional manifold.

The researchers also noted that the loss value for ε-prediction is roughly three times higher and exhibits instability.

While the study deliberately refrains from using any additional loss functions, it's worth noting that methods based on latent variables typically depend on tokenizers trained with adversarial and perceptual losses. Consequently, their generation process is not entirely driven by diffusion.

A classifier head was appended after a specific Transformer module (the 4th module in JiT-B and the 8th module in JiT-L). The classifier comprises a global average pooling layer and a linear layer, applying a cross-entropy loss function for the 1000-class ImageNet classification task.
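A minimal sketch of such an auxiliary head (layer names and usage are illustrative, not the authors' implementation):

```python
import torch.nn as nn

class AuxClassifierHead(nn.Module):
    """Global average pooling over tokens followed by a linear classifier."""
    def __init__(self, hidden_dim=768, num_classes=1000):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):           # tokens: (batch, num_tokens, hidden_dim)
        pooled = tokens.mean(dim=1)      # global average pooling over the sequence
        return self.fc(pooled)           # (batch, num_classes) logits

# During training, the token features coming out of the chosen block (e.g. the
# 4th block of JiT-B) would be passed through this head and trained with
# cross-entropy (torch.nn.functional.cross_entropy) against the ImageNet label,
# alongside the main denoising loss.
```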

This minor adjustment led to substantial improvements, and the researchers plan to further explore this issue in their future work.

References:

https://arxiv.org/pdf/2511.13720v1
