Abstract: Vision Transformers (ViTs) have proven effective in solving 2D image understanding tasks by training over large-scale image datasets and, meanwhile, as a somewhat separate track, in modeling the 3D visual world such as voxels or point clouds. However, with the growing hope that transformers can become the ``universal'' modeling tool for heterogeneous data, ViTs for 2D and 3D tasks have so far adopted vastly different architecture designs that are hardly transferable. That invites an (over-)ambitious question: can we close the gap between the 2D and 3D ViT architectures? As a pilot study, this paper demonstrates the appealing promise of understanding the 3D visual world using a standard 2D ViT architecture, with only minimal customization at the input and output levels and without redesigning the pipeline. To build a 3D ViT from its 2D sibling, we ``inflate'' the patch embedding and token sequence, accompanied by new positional encoding mechanisms designed to match the 3D data geometry. The resultant ``minimalist'' 3D ViT performs surprisingly robustly on popular 3D tasks such as object classification, point cloud segmentation, and indoor scene detection, compared to highly customized 3D-specific designs. It can hence act as a strong baseline for new 3D ViTs.
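To illustrate the kind of input-level "inflation" the abstract describes, here is a minimal sketch of the 3D analogue of a ViT's 2D patchify step: a voxel grid is split into non-overlapping cubes, each flattened into one token of the input sequence. The helper name `patchify_3d` and the specific shapes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def patchify_3d(voxels, patch=4):
    # Hypothetical sketch: voxels has shape (D, H, W, C). Split it into
    # non-overlapping patch x patch x patch cubes and flatten each cube
    # into one token -- the 3D counterpart of ViT's 2D patch embedding
    # input (a linear projection would then map tokens to embed_dim).
    D, H, W, C = voxels.shape
    d, h, w = D // patch, H // patch, W // patch
    x = voxels.reshape(d, patch, h, patch, w, patch, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)           # (d, h, w, p, p, p, C)
    tokens = x.reshape(d * h * w, patch ** 3 * C)  # (num_tokens, token_dim)
    return tokens

# A 32^3 single-channel grid yields 8*8*8 = 512 tokens of dim 4^3 = 64.
tokens = patchify_3d(np.zeros((32, 32, 32, 1)), patch=4)
```

After this tokenization, the rest of a standard 2D ViT (self-attention blocks, MLP heads) can in principle consume the sequence unchanged, which is the point the abstract makes; only the positional encoding must additionally respect the 3D geometry.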