Easy grasping with a fixated robot

Humans interact robustly with objects in the 3D world without complicated 3D sensors such as lidar. Instead, they rely only on the 2D sensors in their eyes. Compared (rather naively) to widely available camera sensors, the human retina has vastly diminished capabilities in terms of resolution, refresh rate, and so on. How, then, can humans interact with the 3D world so robustly?