Humans navigate and interact with the 3D world robustly without complex 3D sensors such as lidars, relying instead on the 2D sensors in their eyes. Compared (rather naively) with widely available camera sensors, the human retina has markedly inferior specifications in measures such as resolution and refresh rate. How, then, can humans interact with the 3D world so robustly?
This thesis aims to develop a system that allows robots to acquire manipulation skills directly from human demonstration videos. The novelty of this system lies in actively commanding the robot to perform exploratory actions and gather additional sensory information, rather than relying solely on passively observed information from demonstrations.