Multi-target Localization and Tracking using Audio-visual Signals
Abstract
This thesis concerns the problem of target localization and tracking in an indoor environment
using audio and video signals. We first tackle the problem of audio-only single-speaker 3D
localization and tracking using a large planar microphone array, where the main challenge is
resolving on which side of the symmetric array the target is located. As a solution, we
propose a novel method that post-processes the peak variations of the acoustic map and exploits
the attenuation introduced by the array frame. Motivated by the benefits of fusing complementary
modalities, we then focus on 3D multi-target tracking using audio-visual
signals. To facilitate research in human-robot interaction, we use a compact platform, which
consists of an eight-element circular microphone array of 20-cm diameter with a standard RGB
camera mounted on top of it. The main challenges include the estimation of target distance and
a varying number of concurrent speakers. To tackle these challenges, we propose a novel tracker
using the particle filtering framework. This tracker exploits 3D visual observations derived from
image face detections to assist audio processing by constraining the acoustic likelihood to the
horizontal plane defined by the predicted height of a target. This solution allows the tracker
to estimate the sound source distance. In addition, we apply a color-based visual likelihood on
the image plane to compensate for missed detections caused by faces turned away from the
camera. To handle the more realistic scenario of a varying number of concurrent speakers,
we then extend the tracker by introducing track birth and death. A de-emphasis technique
is applied to the acoustic map for iterative multi-speaker localization. The resulting 3D audio observations
are helpful for initializing new tracks and for following targets when they are not visible in
the image. Moreover, given the lack of annotated audio-visual datasets, we collect and annotate
a new one that includes a variety of challenges, such as occlusion, illumination changes, and target
clapping and stomping, to validate our work and to contribute to the research community.
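The iterative multi-speaker localization step described above can be sketched as follows. This is a minimal illustration only, assuming a 2D acoustic power map on a discrete grid; the square de-emphasis neighbourhood, its radius, and the suppression factor are hypothetical choices, not the thesis's actual parameters:

```python
import numpy as np

def iterative_peak_localization(acoustic_map, num_sources, radius=3, attenuation=0.0):
    """Pick peaks one at a time from an acoustic map; after each pick,
    de-emphasise (suppress) the map values around the detected peak so
    the next-strongest source can emerge in the following iteration."""
    m = acoustic_map.astype(float).copy()
    peaks = []
    for _ in range(num_sources):
        # Strongest remaining grid cell is the next source estimate.
        idx = np.unravel_index(np.argmax(m), m.shape)
        peaks.append(idx)
        # De-emphasise a square neighbourhood around the detected peak.
        r0, r1 = max(idx[0] - radius, 0), idx[0] + radius + 1
        c0, c1 = max(idx[1] - radius, 0), idx[1] + radius + 1
        m[r0:r1, c0:c1] *= attenuation
    return peaks
```

With `attenuation=0.0` the neighbourhood is zeroed outright; a value between 0 and 1 would instead damp it, trading robustness to closely spaced speakers against the risk of re-detecting the same source.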
Authors
Qian, Xinyuan