![](https://crypto4nerd.com/wp-content/uploads/2023/07/1FkHqq020kqGebvZJ6kg2eQ-1024x596.png)
The Vision Transformer (ViT) has come to dominate the field of computer vision, demonstrating superior performance and flexibility in handling various input sequence lengths. Its strong performance has positioned it as a formidable contender to displace the conventional convolutional neural network (CNN).
In a new paper, *Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution*, a Google DeepMind research team introduces Native Resolution ViT (NaViT), an advanced version of ViT designed to handle input sequences of arbitrary resolutions and aspect ratios, further broadening its potential applications across diverse computer vision tasks.
The team summarizes their main findings in this work as follows:
- Randomly sampling resolutions at training time significantly reduces training cost.
- NaViT achieves high performance across a wide range of resolutions, enabling a smooth cost-performance trade-off at inference time, and can be adapted to new tasks at low cost.
- Fixed batch shapes enabled by example packing lead to new research ideas, such as aspect-ratio preserving resolution-sampling, variable token dropping rates, and adaptive computation.
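Because packed sequences free each image from a fixed square shape, resolutions can be sampled while keeping each image’s native aspect ratio. The helper below is a minimal sketch of this idea (the function name, side-length bounds, and snapping policy are assumptions for illustration, not the paper’s code): it samples a random size for the longer side, scales the shorter side to preserve the aspect ratio, and rounds both to multiples of the patch size so the image tiles cleanly into patches.

```python
import random

def sample_resolution(orig_w, orig_h, patch=16, min_side=64, max_side=512):
    """Sample a training resolution that preserves the image's aspect ratio.

    Illustrative sketch: pick a random length for the longer side, derive
    the shorter side from the original aspect ratio, then snap both sides
    to multiples of the patch size (at least one patch each).
    """
    aspect = orig_w / orig_h
    long_side = random.randint(min_side, max_side)
    if aspect >= 1:  # landscape: width is the longer side
        w, h = long_side, long_side / aspect
    else:            # portrait: height is the longer side
        w, h = long_side * aspect, long_side
    w = max(patch, round(w / patch) * patch)
    h = max(patch, round(h / patch) * patch)
    return int(w), int(h)
```

Each call yields a different resolution, so across training steps the model sees the same image at many scales without ever distorting its aspect ratio.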
NaViT extends ViT with the ability to pack patches from multiple different images into a single sequence, a technique the researchers term Patch n’ Pack. To enable this capability, the team makes two modifications to the original ViT: 1) masked self-attention and masked pooling, which prevent examples from attending to one another; and 2) factorized and fractional positional embeddings, which support variable aspect ratios and readily extrapolate to unseen resolutions.
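The masking idea can be sketched as follows (an illustrative reconstruction, not DeepMind’s implementation; the function names are assumptions): each token in the packed sequence carries an example ID, the attention mask only allows pairs of tokens with matching IDs, and pooling averages only over one example’s tokens.

```python
import numpy as np

def packing_attention_mask(example_ids):
    """Boolean (seq, seq) mask for a packed sequence.

    example_ids[i] identifies which image token i came from; a token may
    attend to token j only when both belong to the same image, yielding a
    block-diagonal attention mask.
    """
    ids = np.asarray(example_ids)
    return ids[:, None] == ids[None, :]

def masked_pool(tokens, example_ids, target_id):
    """Mean-pool only the tokens belonging to one packed example."""
    keep = np.asarray(example_ids) == target_id
    return tokens[keep].mean(axis=0)
```

Multiplying attention logits by this mask (or adding a large negative value where it is False) keeps the images in a packed sequence fully independent, just as if each had its own batch entry.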
Moreover, Patch n’ Pack makes several new and effective training techniques applicable. It enables continuous token dropping, whereby the token dropping rate can be varied…
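A minimal sketch of per-example token dropping (hypothetical helper names and rates, for illustration only): because packed sequences need not have equal lengths per image, each image in the pack can keep a different fraction of its patch tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_tokens(tokens, drop_rate):
    """Randomly keep a (1 - drop_rate) fraction of an image's patch tokens.

    With example packing, each image in a sequence may use a different
    drop rate, since per-image token counts no longer need to match.
    """
    n = len(tokens)
    n_keep = max(1, int(round(n * (1.0 - drop_rate))))
    keep_idx = np.sort(rng.choice(n, size=n_keep, replace=False))
    return tokens[keep_idx]

# Pack two images with different drop rates into one sequence.
img_a = np.ones((16, 4))       # 16 patch tokens, embedding dim 4
img_b = np.full((9, 4), 2.0)   # 9 patch tokens
packed = np.concatenate([drop_tokens(img_a, 0.5), drop_tokens(img_b, 0.2)])
```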