
dc.contributor.author: Qian, Xinyuan
dc.date.accessioned: 2020-11-13T16:26:43Z
dc.date.available: 2020-11-13T16:26:43Z
dc.date.issued: 2020
dc.identifier.uri: https://qmro.qmul.ac.uk/xmlui/handle/123456789/68289
dc.description: PhD Thesis [en_US]
dc.description.abstract: This thesis concerns the problem of target localization and tracking in an indoor environment using audio and video signals. We first tackle the problem of audio-only single-speaker 3D localization and tracking using a large planar microphone array, where the main challenge is the ambiguity in deciding on which side of the symmetric array the target is located. As a solution, we propose a novel method that post-processes the peak variations of the acoustic map and exploits the attenuation introduced by the array frame. Considering the fusion benefits that result from multi-modal complementarity, we then focus on 3D multi-target tracking using audio-visual signals. To facilitate research in human-robot interaction, a compact platform is used, consisting of an eight-element circular microphone array of 20 cm diameter with a standard RGB camera mounted on top. The main challenges include the estimation of target distance and a varying number of concurrent speakers. To tackle these challenges, we propose a novel tracker based on the particle filtering framework. This tracker exploits 3D visual observations derived from image face detections to assist audio processing by constraining the acoustic likelihood to the horizontal plane defined by the predicted height of a target. This solution allows the tracker to estimate the sound-source distance. In addition, we apply a color-based visual likelihood on the image plane to compensate for missed detections caused by non-frontal face orientations with respect to the camera. Considering a more realistic scenario with a varying number of concurrent speakers, we then extend the tracker by introducing track birth and death. A de-emphasis technique is applied to the acoustic map for iterative multi-speaker localization. The resulting 3D audio observations help to initialize new tracks and to follow targets when they are not visible in the image. Moreover, owing to the lack of annotated audio-visual datasets, we collect and annotate a new one that includes a variety of challenges, such as occlusion, illumination changes, and target clapping and stomping, to validate our work and to contribute to the research community. [en_US]
dc.language.iso: en [en_US]
dc.publisher: Queen Mary University of London [en_US]
dc.title: Multi-target Localization and Tracking using Audio-visual Signals [en_US]
dc.type: Thesis [en_US]
rioxxterms.funder: Default funder [en_US]
rioxxterms.identifier.project: Default project [en_US]
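
The abstract above describes two algorithmic ideas concretely enough to sketch: a particle filter whose weights fuse an acoustic likelihood, evaluated on the horizontal plane at a target's predicted height, with an image-derived likelihood; and iterative multi-speaker localization that de-emphasizes the acoustic map around each detected peak. The Python sketches below are illustrative reconstructions under simplified assumptions (Gaussian stand-in likelihoods, random-walk dynamics, a synthetic acoustic map); they are not the thesis' actual implementation, and all function names and parameters are hypothetical.

```python
# Minimal particle-filter sketch of the audio-visual fusion idea: the
# acoustic likelihood is a stand-in evaluated only on (x, y), i.e. on the
# horizontal plane at the particle's predicted height, while a 3D visual
# observation (from a face detection) constrains the full state.
import numpy as np

rng = np.random.default_rng(0)
N = 500  # number of particles

# Particle state: 3D position (x, y, z) in metres, initialised uniformly.
particles = rng.uniform([-2.0, -2.0, 1.0], [2.0, 2.0, 2.0], size=(N, 3))
weights = np.full(N, 1.0 / N)

def acoustic_likelihood(p, src_xy, sigma=0.3):
    # Assumed Gaussian stand-in for the acoustic-map likelihood on (x, y).
    return np.exp(-np.sum((p[:, :2] - src_xy) ** 2, axis=1) / (2 * sigma ** 2))

def visual_likelihood(p, face_xyz, sigma=0.2):
    # Assumed Gaussian stand-in for a 3D observation from a face detection.
    return np.exp(-np.sum((p - face_xyz) ** 2, axis=1) / (2 * sigma ** 2))

# One predict-update-resample cycle with dummy observations.
particles += rng.normal(scale=0.05, size=particles.shape)  # random-walk predict
w = (weights
     * acoustic_likelihood(particles, src_xy=np.array([0.5, 0.0]))
     * visual_likelihood(particles, face_xyz=np.array([0.5, 0.0, 1.6])))
weights = w / w.sum()

# Systematic resampling keeps the particle set focused on the fused estimate.
u = (rng.random() + np.arange(N)) / N
idx = np.minimum(np.searchsorted(np.cumsum(weights), u), N - 1)
particles, weights = particles[idx], np.full(N, 1.0 / N)
print("fused 3D estimate:", particles.mean(axis=0))
```

The second sketch illustrates the de-emphasis step for iterative multi-speaker localization: after each peak is taken from the acoustic map, the map is attenuated around that peak so the next iteration finds a different source. The inverted-Gaussian suppression window is an assumption, not the thesis' specific de-emphasis function.

```python
# Hedged sketch of iterative peak picking with de-emphasis on a 2D acoustic map.
import numpy as np

def localize_speakers(acoustic_map, n_speakers, sigma=5.0):
    """Return up to n_speakers peak coordinates; after each detection the
    map is attenuated around the peak so later iterations find new sources."""
    amap = acoustic_map.astype(float).copy()
    ii, jj = np.indices(amap.shape)
    peaks = []
    for _ in range(n_speakers):
        i, j = np.unravel_index(np.argmax(amap), amap.shape)
        peaks.append((i, j))
        # De-emphasis: multiply by an inverted Gaussian centred on the peak.
        d2 = (ii - i) ** 2 + (jj - j) ** 2
        amap *= 1.0 - np.exp(-d2 / (2 * sigma ** 2))
    return peaks

# Toy acoustic map with two synthetic sources.
grid = np.zeros((50, 50))
ii, jj = np.indices(grid.shape)
for (ci, cj) in [(10, 12), (35, 40)]:
    grid += np.exp(-((ii - ci) ** 2 + (jj - cj) ** 2) / (2 * 3.0 ** 2))
print(localize_speakers(grid, 2))  # -> [(10, 12), (35, 40)]
```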


Files in this item


There are no files associated with this item.

This item appears in the following Collection(s)

  • Theses [4235]
    Theses Awarded by Queen Mary University of London
