Multi-target Localization and Tracking using Audio-visual Signals
Abstract
This thesis concerns the problem of target localization and tracking in an indoor environment
using audio and video signals. We first tackle the problem of audio-only single-speaker 3D
localization and tracking using a large planar microphone array, where the main challenge is
resolving on which side of the symmetric array the target is located. As a solution, we
propose a novel method that post-processes the peak variations of the acoustic map and exploits
the attenuation introduced by the array frame. Motivated by the benefits of fusing complementary
modalities, we then focus on 3D multi-target tracking using audio-visual
signals. To facilitate research in human-robot interaction, we use a compact platform, which
consists of an eight-element circular microphone array of 20-cm diameter with a standard RGB
camera mounted on top of it. The main challenges include the estimation of target distance and
a varying number of concurrent speakers. To tackle these challenges, we propose a novel tracker
using the particle filtering framework. This tracker exploits 3D visual observations derived from
image face detections to assist audio processing by constraining the acoustic likelihood to the
horizontal plane defined by the predicted height of a target. This solution allows the tracker
to estimate the sound source distance. In addition, we apply a color-based visual likelihood on
the image plane to compensate for missed detections caused by faces turned away from the
camera. To handle the more realistic scenario of a varying number of concurrent speakers,
we then extend the tracker by introducing track birth and death. A de-emphasis technique
is applied to the acoustic map for iterative multi-speaker localization. The resulting 3D audio observations
are helpful for initializing new tracks and for following targets when they are not visible in
the image. Moreover, given the lack of annotated audio-visual datasets, we collect and annotate
a new one that includes a variety of challenges, such as occlusion, illumination changes, and target
clapping and stomping, to validate our work and to contribute to the research community.
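The iterative multi-speaker localization step described above can be sketched as follows. This is a minimal illustration only, assuming a 2D acoustic power map on a discrete grid; the square de-emphasis neighbourhood, its radius, and the suppression factor are hypothetical choices, not the thesis's actual parameters:

```python
import numpy as np

def iterative_peak_localization(acoustic_map, num_sources, radius=3, attenuation=0.0):
    """Pick peaks one at a time from an acoustic map; after each pick,
    de-emphasise (suppress) the map values around the detected peak so
    the next-strongest source can emerge in the following iteration."""
    m = acoustic_map.astype(float).copy()
    peaks = []
    for _ in range(num_sources):
        # Strongest remaining grid cell is the next source estimate.
        idx = np.unravel_index(np.argmax(m), m.shape)
        peaks.append(idx)
        # De-emphasise a square neighbourhood around the detected peak.
        r0, r1 = max(idx[0] - radius, 0), idx[0] + radius + 1
        c0, c1 = max(idx[1] - radius, 0), idx[1] + radius + 1
        m[r0:r1, c0:c1] *= attenuation
    return peaks
```

With `attenuation=0.0` the neighbourhood is zeroed outright; a value between 0 and 1 would instead damp it, trading robustness to closely spaced speakers against the risk of re-detecting the same source.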
Authors
Qian, Xinyuan