Robotics and Biology Laboratory

Active Learning to Manipulate Objects From Human Demonstration Videos


Adrian Pfisterer

Xing Li

Oliver Brock


Robots operating in human-centric environments frequently encounter articulated objects that must be manipulated to accomplish a task, such as opening a door. Rather than programming these manipulation actions directly, we let a human demonstrate the motions to the robot through videos. This technique is known as Learning from Demonstration (LfD) or Learning from Observation [1]: the robot acquires manipulation skills by analyzing a demonstration video. It offers a straightforward and practical way to learn manipulation strategies, compared to the more complex alternatives of writing code or defining mathematical models. As a result, the person performing the demonstration needs no prior training or robot-programming expertise. In addition, identifying kinematic structures from passive observation does not require establishing a direct correspondence between human actions and possible robot actions, as is otherwise often needed when learning from demonstration.

However, most observation-only LfD approaches assume that visual information alone is sufficient for accurately identifying kinematic structures [2]–[4]. This assumption does not hold when the observations are subject to noise or contain occlusions. An illustrative example is the observation of a human turning a volume knob: the subject's hand is prone to obscuring the knob, making it challenging to observe the demonstration accurately. Sensor noise further complicates the observation of such fine motor movements. Another limitation of observation-only LfD approaches is their inability to correctly identify kinematic structures when the applied forces are critical. For instance, an observation-only approach cannot differentiate between an observation of applying force against a rigid body and one of taking no action on a free body, as it would estimate rigidity in both scenarios.

This thesis aims to tackle these challenges of extracting kinematic structures from passive observations (i.e., videos) by actively commanding the robot to perform exploratory actions and gather additional sensory information, addressing the aforementioned difficulties directly. During the demonstration, the robot dynamically adjusts its viewpoint to maximize the information content of its observations and minimize occlusions. After observing the demonstration, the robot actively interacts with the kinematic structure, selecting informative actions. These deliberate interactions yield supplementary dynamic information, improving the accuracy of kinematic-structure identification and reducing the uncertainty associated with the observations. Moreover, the information obtained through interaction also helps to find an appropriate interaction point for successful manipulation.
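One common way to formalize "selecting informative actions" is to maintain a belief over kinematic hypotheses (e.g., rigid, revolute, prismatic) and pick the exploratory action with the highest expected information gain. The following is a minimal illustrative sketch of that idea; the discrete hypothesis set, the candidate actions, and the observation-likelihood numbers are all made-up assumptions for illustration, not part of this thesis.

```python
import numpy as np

# Illustrative assumption: three kinematic hypotheses for the object.
HYPOTHESES = ["rigid", "revolute", "prismatic"]

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def expected_information_gain(belief, likelihoods):
    """belief[h] = P(h); likelihoods[h, o] = P(observation o | h, action)."""
    prior_entropy = entropy(belief)
    p_obs = np.einsum("h,ho->o", belief, likelihoods)  # P(o) marginal
    eig = prior_entropy
    for o, po in enumerate(p_obs):
        if po > 0:
            posterior = belief * likelihoods[:, o] / po  # Bayes update
            eig -= po * entropy(posterior)  # subtract expected posterior entropy
    return eig

def select_action(belief, action_models):
    """Return the action whose observation model is most informative."""
    gains = {a: expected_information_gain(belief, L)
             for a, L in action_models.items()}
    return max(gains, key=gains.get), gains

# Uniform prior over hypotheses; two hypothetical exploratory pushes.
# Observation columns: no motion / rotation / translation.
belief = np.array([1 / 3, 1 / 3, 1 / 3])
action_models = {
    "push_tangential": np.array([[0.90, 0.05, 0.05],   # rigid
                                 [0.10, 0.85, 0.05],   # revolute
                                 [0.10, 0.05, 0.85]]), # prismatic
    "push_normal":     np.array([[0.80, 0.10, 0.10],
                                 [0.60, 0.30, 0.10],
                                 [0.60, 0.10, 0.30]]),
}
best, gains = select_action(belief, action_models)
```

Under these assumed models, the tangential push separates the three hypotheses much more sharply, so it is the one an information-gain criterion would select.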

This thesis requires:

  • Background in robotics and computer vision
  • Familiarity with ROS (expertise not required)
  • Proficiency in Python 3
  • Experience with PyTorch or TensorFlow
  • Knowledge of probabilistic AI is a plus

How to apply

You can find all the necessary information here.