Recognition of Sound Sources and Acoustic Events in Music and Environmental Audio
Abstract
Hearing, together with our other senses, enables us to perceive the surrounding world through the sensory data we constantly receive. The information carried in these data allows us to classify the environment and the objects in it. In modern society, the loud and noisy acoustic environment that surrounds us makes the task of "listening" quite challenging, probably more so than ever before. A great deal of information has to be filtered to separate the sounds we want to hear from unwanted noise and interference. And yet humans, like other living organisms, have a remarkable ability to identify and track the sounds they want, irrespective of their number, the degree of overlap and the interference that surrounds them. To this day, building systems that "listen" to the surrounding environment and identify sounds in it the way humans do remains a challenging task, and even though we have made steps towards reaching human performance, we are still a long way from building systems able to identify and track most, if not all, of the different sounds within an acoustic scene.
In this thesis, we deal with the task of recognising sound sources or acoustic events in two distinct cases of audio: music and more generic environmental sounds. We reformulate the problem and redefine the task associated with each case. Music can be regarded as a multi-source sound environment in which the different sound sources (musical instruments) activate at different times, and recognising the musical instruments is then a central part of the more generic process of automatic music transcription. The principal question we address is whether we can develop a system able to recognise musical instruments in a multi-instrument scenario where many different instruments are active at the same time, and for that we draw inspiration from human performance. The proposed system is based on missing-feature theory, and we find that the method retains high performance even under the most adverse listening conditions (i.e. low signal-to-noise ratios). Finally, we propose a technique to fuse this system with an automatic music transcription system in an attempt to inform and improve the overall performance.
For more generic environmental audio scenes, things are less clear, and research conducted in the area remains scarce. The central issue here is to formulate the problem of sound recognition and to define its subtasks and associated difficulties. We have set up and run a worldwide challenge and created datasets that are intended to enable researchers to perform better-quality research in the field. We have also developed systems that could serve as baseline techniques for future research, and we have compared existing state-of-the-art algorithms against one another, and against human performance, in an effort to highlight the strengths and weaknesses of existing methodologies.
Authors
Giannoulis, Dimitrios