A User-assisted Approach to Multiple Instrument Music Transcription
Automatic music transcription has been studied for several decades and is regarded as an enabling technology for a multitude of applications such as music retrieval and discovery, intelligent music processing and large-scale musicological analyses. It refers to the process of identifying the musical content of a performance and representing it in a symbolic format. Despite this long research history, fully automatic music transcription systems are still error-prone and often fail when more complex polyphonic music is analysed. This raises the question of how human knowledge can be incorporated into the transcription process. This thesis investigates ways to involve a human user in that process. More specifically, it investigates how user input can be employed to derive timbre models for the instruments in a music recording, which are then used to obtain instrument-specific (parts-based) transcriptions. A first investigation studies different types of user input for deriving instrument models by means of a non-negative matrix factorisation framework. The transcription accuracy of the different models is evaluated, and a method is proposed that refines the models by allowing each pitch of each instrument to be represented by multiple basis functions. A second study aims at limiting the amount of user input to make the method more applicable in practice. Different methods are considered for estimating missing non-negative basis functions when only a subset of basis functions can be extracted from the user information.
A method is proposed to track the pitches of individual instruments over time by means of a Viterbi framework in which the states at each time frame contain several candidate instrument-pitch combinations. A transition probability is employed that combines three criteria: the frame-wise reconstruction error of each combination, a pitch continuity measure that favours similar pitches in consecutive frames, and an explicit activity model for each instrument. The method is shown to outperform other state-of-the-art multi-instrument tracking methods. Finally, the extraction of instrument models that include phase information is investigated as a step towards complex matrix decomposition. The phase relations between the partials of harmonic sounds are explored as a time-invariant property that can be employed to form complex-valued basis functions. The application of the model to a user-assisted transcription task is illustrated with a saxophone example.
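The Viterbi-based tracking idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it handles a single instrument rather than joint instrument-pitch combinations, it works with additive costs (read as negative log-probabilities) instead of probabilities, and the function name `track_pitches`, the weights `w_pitch` and `w_switch`, and the use of `None` for a silent instrument are all assumptions made for illustration.

```python
import numpy as np

def track_pitches(candidates, errors, w_pitch=0.1, w_switch=1.0):
    """Pick one candidate per frame with a Viterbi search (illustrative sketch).

    candidates[t]: list of MIDI pitches, or None for "instrument silent", in frame t.
    errors[t]:     reconstruction error of each candidate in frame t.
    w_pitch, w_switch: hypothetical weights for pitch jumps and activity changes.
    """
    def transition(prev, cur):
        if prev is None and cur is None:
            return 0.0                      # instrument stays silent
        if prev is None or cur is None:
            return w_switch                 # activity change (onset or offset)
        return w_pitch * abs(cur - prev)    # pitch continuity penalty

    n_frames = len(candidates)
    cost = [np.asarray(errors[0], dtype=float)]
    back = []
    for t in range(1, n_frames):
        # trans[i, j]: cost of moving from previous candidate j to current candidate i
        trans = np.array([[transition(p, c) for p in candidates[t - 1]]
                          for c in candidates[t]])
        total = trans + cost[-1][None, :]
        back.append(total.argmin(axis=1))
        cost.append(total.min(axis=1) + np.asarray(errors[t], dtype=float))

    # Backtrack from the cheapest final state.
    idx = int(cost[-1].argmin())
    path = [idx]
    for t in range(n_frames - 1, 0, -1):
        idx = int(back[t - 1][idx])
        path.append(idx)
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]

# Toy input: the middle frame slightly prefers pitch 72 on reconstruction error alone.
candidates = [[60, 72], [60, 72], [60, 72]]
errors = [[0.1, 0.5], [0.5, 0.4], [0.1, 0.5]]
smooth = track_pitches(candidates, errors)                 # continuity enforced
framewise = track_pitches(candidates, errors, w_pitch=0.0)  # continuity ignored
```

In this toy input the continuity term outweighs the small reconstruction-error advantage of the octave jump in the middle frame, so the smoothed path stays on pitch 60, whereas setting `w_pitch=0.0` reduces the search to a frame-wise minimum.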