Existing RGB-based imitation learning approaches typically employ conventional vision encoders such as ResNet or ViT, which lack explicit 3D reasoning capabilities. Recent geometry-grounded vision models such as VGGT provide robust spatial understanding and are promising candidates for addressing this limitation. This work investigates the integration of geometry-aware visual representations into robotic manipulation. Our results suggest that incorporating a geometry-aware vision encoder into imitation learning frameworks, including ACT and DP, yields up to a 6.5% improvement in success rate over standard vision encoders across single- and bi-manual manipulation tasks in both simulation and real-world settings. Despite these benefits, most geometry-grounded models incur high computational costs, limiting their deployment in practical robotic systems. To address this challenge, we propose eVGGT, an efficient geometry-aware encoder distilled from VGGT. eVGGT is nearly 9 times faster and 5 times smaller than VGGT while preserving strong 3D reasoning capabilities.
We present an efficient geometry-aware vision encoder for improving robotic manipulation.
This paper explores replacing conventional 2D vision encoders in robotic manipulation with a geometry-grounded encoder to more effectively capture global 3D context.
As shown in the figure below, incorporating our geometry-aware vision encoder can improve performance by up to 6.5%.
Figure: Relationship between leveraging geometry-aware representations and robotic manipulation success rate.
eVGGT is our proposed geometry-aware encoder, designed to facilitate the transfer of knowledge from geometry-aware networks. It is a lightweight variant of VGGT that is 5× smaller and 9× faster, while still maintaining robust geometric reasoning capabilities.
We apply knowledge distillation to train a compressed eVGGT from a pretrained VGGT, keeping the same overall architecture but reducing the number of transformer blocks from 24 to 4 and substituting DINO-ViT-S for DINO-ViT-L.
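The distillation setup described above can be sketched as follows. This is a minimal illustration, not the actual training code: the page does not specify the distillation loss, so we assume a common feature-matching objective (MSE between teacher and student features), and we use small stand-in MLP modules in place of the real VGGT teacher and eVGGT student transformers.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: in the real setup the teacher is the frozen
# 24-block VGGT and the student is the compact 4-block eVGGT.
teacher = nn.Sequential(nn.Linear(32, 64), nn.GELU(), nn.Linear(64, 16))
student = nn.Sequential(nn.Linear(32, 16))  # far fewer parameters

for p in teacher.parameters():  # teacher is frozen during distillation
    p.requires_grad_(False)

opt = torch.optim.Adam(student.parameters(), lr=1e-2)
x = torch.randn(256, 32)        # surrogate for encoded image patches

losses = []
for step in range(200):
    with torch.no_grad():
        target = teacher(x)     # teacher's geometry-aware features
    loss = nn.functional.mse_loss(student(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

After training, only the student is deployed inside the imitation-learning policy, which is what makes the encoder cheap enough for real-time control.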
Figure: Efficiency comparison of eVGGT and VGGT.
This website template is adapted from HyperNeRF.