Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation

UC San Diego
Accepted at RA-L 2022

Our method improves vision-based robotic manipulation by fusing information from multiple cameras using transformers. The learned RL policy transfers from simulation to a real robot, and solves precision-based manipulation tasks directly from uncalibrated cameras, without access to state information, and with a high degree of variability in task configurations.


Learning to solve precision-based manipulation tasks from visual feedback using Reinforcement Learning (RL) could drastically reduce the engineering efforts required by traditional robot systems. However, performing fine-grained motor control from visual inputs alone is challenging, especially with a static third-person camera as often used in previous work.

We propose a setting for robotic manipulation in which the agent receives visual feedback from both a third-person camera and an egocentric camera mounted on the robot's wrist. While the third-person camera is static, the egocentric camera enables the robot to actively control its vision to aid in precise manipulation. To fuse visual information from both cameras effectively, we additionally propose to use Transformers with a cross-view attention mechanism that models spatial attention from one view to another (and vice-versa), and use the learned features as input to an RL policy. Our method improves learning over strong single-view and multi-view baselines, and successfully transfers to a set of challenging manipulation tasks on a real robot with uncalibrated cameras, no access to state information, and a high degree of task variability. In a hammer manipulation task, our method succeeds in 75% of trials versus 38% and 13% for multi-view and single-view baselines, respectively.


Real-robot Setup

Image depicting our real-world environment for the Peg in Box task. The third-person camera is static, and the ego-centric camera moves along with the robot arm.

Camera views

Samples from our simulation and real world environments for each view and two tasks. We emphasize that the real world differs in both visuals, dimensions, dynamics, and camera views, but roughly implement the same task as in simulation. Also note that there is only an approximate correspondence between the samples from simulation and the real world as shown here.

Real-robot evaluations

We train policies entirely in simulation and transfer without any fine-tuning to a real robot setup with uncalibrated cameras, no access to state information, and a high degree of task variability.

Task: Reach

Task: Push

Task: Peg-in-box

Task: Hammer


Egocentric and third-person views \(O_{1},O_{2}\) are augmented using stochastic data augmentation, and are encoded using separate ConvNet encoders to produce spatial feature maps \(Z_{1}, Z_{2}\). We perform cross-view attention between views using a Transformer such that features in \(Z_{1}\) are used as queries for spatial information in \(Z_{2}\), and vice-versa. Features are then aggregated using simple addition (\(\oplus\) in the figure), and used as input for a Soft Actor-Critic (SAC) policy. We learn to solve precision-based manipulation tasks directly in simulation, and successfully transfer to a real robot setup with uncalibrated cameras, without any fine-tuning nor access to state information.


Sim2Real results for Reach (top) and Push (bottom) tasks.

Sim2Real results for Peg-in-box (top) and Hammer (bottom) tasks.

Attention Maps

We visualize attention maps learned by our method, and find that it learns to relate concepts shared between the two views, e.g. when querying a point on an object shown the egocentric view, our method attends strongly to the same object in the third-person view, and vice-versa.

Real-robot Results

Success rate comparions of the methods when trained in simulation and transferred to a real robot. We find that while the success of the single-view and multi-view baselines drops considerably when transferring to the real world, our method achieves successful transfer. We conjecture that this drop in performance is due to the reality gap -- and the lack of camera calibration in particular.


  author={Jangir, Rishabh and Hansen, Nicklas and Ghosal, Sambaran and Jain, Mohit and Wang, Xiaolong},
  journal={IEEE Robotics and Automation Letters}, 
  title={Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation},