In this thesis, we propose a method to estimate the kinematic state of mechanical objects not from vision alone, but from vision together with audio. Audio sensing has the advantage that it is not subject to the same shortcomings as vision: a drawer being opened or closed can still be heard even if the light is switched off, or when somebody steps into the line of sight and occludes the drawer.
We explore how such multimodal fusion of audio-based and vision-based estimates of kinematics can be performed in a Bayesian framework. For this, we build on a kinematics estimation framework previously developed in our lab.
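As a rough sketch of the kind of Bayesian fusion meant here (the concrete model is developed later in the thesis; the symbols below are illustrative, not the thesis' actual notation), one can combine an audio measurement $z_a$ and a visual measurement $z_v$ of the kinematic state $x$ by assuming they are conditionally independent given $x$:

```latex
p(x \mid z_v, z_a) \;\propto\; p(z_v \mid x)\, p(z_a \mid x)\, p(x)
```

Under this assumption, each modality contributes its own likelihood term, so the posterior remains informative even when one modality (e.g. vision under occlusion or darkness) carries little information.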