Delayed Decision-making in Real-time Beatbox Percussion Classification

Abstract

Real-time classification applied to a vocal percussion signal holds potential as an interface for live musical control. In this article we propose a novel approach to resolving the tension between the needs for low-latency reaction and reliable classification, by deferring the final classification decision until after a response has been initiated. We introduce a new dataset of annotated human beatbox recordings, and use it to study the optimal delay for classification accuracy. We then investigate the effect of such delayed decision-making on the quality of the audio output of a typical reactive system, via a MUSHRA-type listening test. Our results show that the effect depends on the output audio type: for popular dance/pop drum sounds the acceptable delay is on the order of 12–35 ms.


Introduction
In real-time signal processing it is often useful to identify and classify events represented within a signal.
With music signals this need arises in applications such as live music transcription [Brossier, 2007] and human-machine musical interaction [Collins, 2006, Aucouturier and Pachet, 2006].
Yet to respond to events in real time presents a dilemma: often we wish a system to react with low latency, perhaps as soon as the beginning of an event is detected, but we also wish it to react with high precision, which may imply waiting until all information about the event has been received so as to make an optimal classification. The acceptable balance between these two demands will depend on the application context. In music, the perceptible event latency can be held to be around 30 ms, depending on the type of musical signal [Mäki-Patola and Hämäläinen, 2004].
We propose to deal with this dilemma by allowing event triggering and classification to occur at different times, thus allowing a fast reaction to be combined with an accurate classification. Triggering prior to classification implies that for a short period of time the system would need to respond using only a provisional classification, or some generic response.
It could thus be used in reactive music systems if it were acceptable for some initial sound to be emitted even if the system's decision might change soon afterwards and the output updated accordingly. To evaluate such a technique applied to real-time music processing, we need to understand not only the scope for improved classification at increased latency, but also the extent to which such delayed decision-making affects the listening experience, when reflected in the audio output.
In this paper we investigate delayed decision-making in the context of musical control by vocal percussion in the "human beatbox" style [Stowell, 2010, Section 2.2]. We consider the imitation of drum sounds commonly used in Western popular music such as kick (bass) drum, snare and hihat (for definitions of drum names see Randel [2003]). The classification of vocal sounds into such categories offers the potential for musical control by beatboxing, and some work has explored this potential in non-real-time [Sinyor et al., 2005] and in real-time [Hazan, 2005, Collins, 2004]. This paper investigates two aspects of the delayed decision-making concept. In Section 2 we study the relationship between latency and classification accuracy: we present an annotated dataset of human beatbox recordings, and describe classification experiments on these data. Then in Section 3 we describe a perceptual experiment using sampled drum sounds as could be controlled by live beatbox classification. The experiment investigates bounds on the tolerable latency of decision-making in such a context, and therefore the extent to which delayed decision-making can help resolve the tension between a system's speed of reaction and its accuracy of classification.

Classification experiment
We wish to be able to classify percussion events in an audio stream such as beatboxing, for example a three-way classification into kick/hihat/snare event types.
We might apply an onset detector to detect events, then use acoustic features measured from the audio stream at the time of onset as input to a classifier which has been trained using appropriate example sounds [Hazan, 2005]. In such an application there are many options which will bear upon performance, including the choice of onset detector, acoustic features, classifier and training material. In the present work we factor out the influence of the onset detector by using manually-annotated onsets, and we introduce a real-world dataset for beatbox classification which we describe below.
We wish to investigate the hypothesis that the performance of some real-time classifier would improve if it were allowed to delay its decision so as to receive more information. In order that our results may be generalised we will use a classifier-independent measure of class separability, as well as results derived using a specific (although general-purpose) classifier.
To estimate class separability independent of a classifier we use the Kullback-Leibler divergence (KL divergence, also called the relative entropy) between the continuous feature distributions for classes [Cover and Thomas, 2006, section 9.5]:

D_KL(f||g) = ∫ f(x) log(f(x)/g(x)) dx

where f and g are the densities of the features for two classes. The KL divergence is an information-theoretic measure of the amount by which one probability distribution differs from another. It can be estimated from data with few assumptions about the underlying distributions, so has broad applicability.
It is nonnegative and non-symmetric, although it can be symmetrised by taking the value D_KL(f||g) + D_KL(g||f) [Arndt, 2001, section 9.2]; in the present experiment we will further symmetrise over multiple classes by averaging D_KL over all class pairs to give a summary measure of the separability of the distributions. Because of the difficulties in estimating high-dimensional densities from data [Hastie et al., 2001, chapter 2] we will use divergence measures calculated for each feature separately, rather than in the high-dimensional joint feature space. Note that treating each feature separately will fail to detect some effects on separability caused by feature interactions.
Such interaction effects rarely have a large impact, but would be worth studying in future.
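As an illustration, a per-feature symmetrised divergence of this kind can be estimated from labelled feature samples using a simple histogram density estimate. The following sketch (function names are our own; the histogram estimator with add-one smoothing is one of several reasonable choices, not necessarily the estimator used in our experiments) averages the symmetrised KL divergence over all class pairs:

```python
import numpy as np
from itertools import combinations

def kl_divergence(f_samples, g_samples, bins=32):
    """Histogram estimate of D_KL(f || g) for 1-D feature samples."""
    lo = min(f_samples.min(), g_samples.min())
    hi = max(f_samples.max(), g_samples.max())
    f_hist, _ = np.histogram(f_samples, bins=bins, range=(lo, hi))
    g_hist, _ = np.histogram(g_samples, bins=bins, range=(lo, hi))
    # Add-one smoothing so no bin has zero probability (avoids log(0) and 0-division)
    f_p = (f_hist + 1) / (f_hist.sum() + bins)
    g_p = (g_hist + 1) / (g_hist.sum() + bins)
    return float(np.sum(f_p * np.log(f_p / g_p)))

def mean_symmetrised_kl(class_samples, bins=32):
    """Average of D_KL(f||g) + D_KL(g||f) over all pairs of classes."""
    pairs = list(combinations(class_samples, 2))
    total = sum(kl_divergence(f, g, bins) + kl_divergence(g, f, bins)
                for f, g in pairs)
    return total / len(pairs)
```

Well-separated class distributions yield a large value of this summary measure, while near-identical distributions yield a value near zero.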
To provide a more concrete study of classifier performance we will also apply a Naïve Bayes classifier [Langley et al., 1992], which estimates distributions separately for each input feature and then derives class probabilities for a datum simply by multiplying together the probabilities due to each feature. This classifier is selected for multiple reasons:
• It is a relatively simple, generic and well-studied classifier, and so may be held to be a representative choice;
• Despite its simplicity and its unrealistic assumption that the input features are independent, it often performs competitively in practice.
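As a sketch of the classifier's mechanics (an illustrative Gaussian variant, not necessarily the exact implementation used in our experiments), each class is modelled by a per-feature mean and variance, and per-feature log-likelihoods are summed, which is equivalent to multiplying the per-feature probabilities:

```python
import numpy as np

class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes: one (mean, variance) per class per feature;
    a class score is the class prior times the product of per-feature likelihoods."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-9
        self.prior_ = np.array([np.mean(y == c) for c in self.classes_])
        return self

    def predict(self, X):
        # log P(c | x) is proportional to log P(c) + sum_i log N(x_i; mu_ci, var_ci)
        diff = X[:, None, :] - self.mu_[None, :, :]              # shape (n, k, d)
        log_lik = -0.5 * (np.log(2 * np.pi * self.var_)[None]
                          + diff ** 2 / self.var_[None]).sum(axis=2)
        return self.classes_[np.argmax(np.log(self.prior_)[None] + log_lik, axis=1)]
```

Because the features are treated independently, adding or removing a feature (or a delayed copy of a feature) simply adds or removes one term from the sum, which is what makes the stacking and progressive-update strategies discussed below straightforward.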

Human beatbox dataset: beat-boxset1
To facilitate the study of human beatbox audio we have collected and published a dataset which we call beatboxset1. The labelling scheme we propose in Table 1 was developed to group sounds into the main categories of sound heard in a beatboxing stream, and to provide for efficient data entry by annotators. For comparison, the table also lists the labels used for a five-way classification by Sinyor et al. [2005], as well as symbols from Standard Beatbox Notation (SBN, a simplified type of score notation for beatbox performers).3 Our labelling is oriented around the sounds produced rather than the mechanics of production (as in SBN), but aggregates over the fine phonetic details of each realisation (as would be shown in an International Phonetic Alphabet transcription).
The final column in Table 1 gives the frequency of occurrence of each of the class labels, confirming that the majority (74%) of the events fall broadly into the kick, hihat, and snare categories.

Method
To perform a three-way classification experiment on beatboxset1 we aggregated the labelled classes into the three main types of percussion sound:
• kick (label k; 1623 instances),
• snare (labels s, sb, sk; 1675 instances),
• hihat (labels hc, ho; 2216 instances).
The events labelled with other classes were not included in the present experiment.
3 http://www.humanbeatbox.com/tips/

Figure 1: Numbering the "delay" of audio frames relative to the temporal location of an annotated onset.
We analysed the soundfiles to produce the set of 24 features listed in Table 2. Features were derived using a 44.1 kHz audio sampling rate, and a frame size of 1024 samples (23 ms) with 50% overlap (giving a feature sampling rate of 86.1 Hz).
Each manually-annotated onset was aligned with the first audio frame containing it (the earliest frame in which an onset could be expected to be detected in a real-time system). In the following, the amount of delay will be specified in numbers of frames relative to that aligned frame, as illustrated in Figure 1. We investigated delays of zero through to seven frames, corresponding to a latency of 0-81 ms.
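The frame and delay arithmetic can be made concrete as follows (a sketch using the parameter values stated above; function names are our own):

```python
SR = 44100          # audio sampling rate (Hz)
FRAME = 1024        # analysis frame length in samples (~23 ms)
HOP = FRAME // 2    # 50% overlap, giving a feature rate of 44100/512 = 86.1 Hz

def onset_frame(onset_sample):
    """Index of the first analysis frame containing the annotated onset.

    Frame i covers samples [i*HOP, i*HOP + FRAME); the aligned frame is the
    earliest frame whose span includes the onset sample."""
    return max(0, (onset_sample - FRAME) // HOP + 1)

def delay_latency_ms(n_frames):
    """Latency in milliseconds added by waiting n_frames after the aligned frame."""
    return 1000.0 * n_frames * HOP / SR
```

With these values, a delay of 2 frames corresponds to about 23 ms and the maximum delay of 7 frames to about 81 ms, matching the 0-81 ms range stated above.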
In applying the Naïve Bayes classifier, we investigated four different strategies for choosing features as input, using ten-fold cross-validation [Witten and Frank, 2005].

Table 2: Acoustic features measured (for definitions of many of these see Peeters [2004]; HFC and flux are as in Brossier [2007]).

Results
The class separability measured by average KL divergence between classes is given in Figure 2, and the peak values for each feature in Table 3. The values of the divergences cover a broad range depending on both the feature type and the amount of delay, and in general a delay of around 2 frames (23 ms) appears under this measure to give the best class separation.
Note that this analysis considers each amount of delay separately, ignoring the information available in earlier frames. The separability at zero delay is generally the poorest of all the delays studied here, which is perhaps unsurprising, as the audio frame containing the onset will often contain a small amount of unrelated audio prior to the onset plus some of the quietest sound in the beginning of the attack. The peak separability for the features appears to show some variation, occurring at delays ranging from 1 to 4 frames. The highest peaks occur in the spectral 25- and 50-percentiles (at 3 frames' delay), suggesting that the distribution of energy in the lower part of the spectrum may be the clearest differentiator between the classes.
The class separability measurements are reflected in the performance of the Naïve Bayes classifier on our three-way classification test (Figure 3). When using only the information from the latest frame at each delay the data show a similar curve: poor performance at zero delay, rising to a strong performance at 1 to 3 frames' delay (peaking at 75.0% for 2 frames), then tailing off gradually at larger delays.
When using feature stacking the classifier is able to perform strongly at the later delays, having access to information from the informative early frames, although a slight curse-of-dimensionality effect is visible in the very longest delays we investigated: the classification accuracy peaks at 5 frames (77.6%) and tails off afterwards, even though the classifier is given the exact same information plus some extra features.
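Feature stacking here simply means concatenating the per-frame feature vectors for all frames up to the chosen delay, so the dimensionality grows with delay (e.g. 24 features per frame gives 144 dimensions at 5 frames' delay). A minimal sketch:

```python
import numpy as np

def stack_features(frames, delay):
    """Concatenate per-frame feature vectors from frames 0..delay into one vector.

    `frames` has shape (n_frames, n_features) for a single event, with row 0
    aligned to the onset frame. At delay d the classifier sees
    (d + 1) * n_features dimensions."""
    return np.concatenate(frames[:delay + 1], axis=0)
```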
Overall, the improvement due to feature stacking is small compared against the single-frame peak performance. Such a small advantage would need to be balanced against the increased memory requirements and complexity of a classifier implemented in a real-time system.

We also performed feature selection as described earlier, first using the peak-performing delays given in Table 3 and then using features/delays selected using Information Gain (Table 4). In both cases some of the selected features are unavailable in the earlier stages, so the feature set is of low dimensionality, only reaching 24 dimensions at the 5- or 6-frame delay point. The performance of these sets shows a similar trajectory to the full stacked feature set, although consistently slightly inferior to it. The Information Gain approach is in a sense less constrained than the former approach, since it may select a feature more than once at different delays; yet it does not show superior performance, suggesting that the variety of features is more important than the variety of delays for classification performance.
The Information Gain feature selections (Table 4) also suggest which of our features may be generally best for the beatbox classification task. The 25-and 50-percentile are highly ranked (confirming our observation made on the divergence measures), as are the spectral centroid and spectral flux.
A confusion matrix for the classifier output at the peak-performing delay of 2 frames (for the non-stacked feature set) is given in Table 5, revealing a particular tendency for snare sounds to be misclassified. The two-class sub-tasks (Figure 4, upper) show that the hihat vs. others distinction can be made with some degree of success at 1 or 2 frames' delay, while the classification of kicks peaks at around 2-3 frames, and of snares around 4 frames. The snare vs. others sub-task shows bimodal results. When we plot the performance of the two-class sub-tasks created by excluding one class of events entirely (Figure 4, lower), we see the bimodality seems due to the strong hihat/snare distinction, which can be made as early as 1 frame, with the kick/snare distinction peaking much later (4 frames, ∼50 ms) and at a lower accuracy.
These results suggest either that the attack segments of kick and snare beatboxing sounds are broadly similar to each other and different from those of hihat sounds, and the differences emerge mainly during the decay segment; or that there are differences which are not captured by our feature set.
We suggest the former may be the dominating factor, because both kick and snare sounds can be produced with bilabial plosive onsets (k and sb in Table 1). Others have studied classification of non-beatbox drum sounds based on brief attack segments, with acceptable results (depending on the exact task) [Tindale et al., 2004, Pachet and Roy, 2009]. Beatboxing may be a more challenging classification task than other percussion because all sounds are produced by the same apparatus in various configurations, rather than by different sounding bodies.
In summary, we find that with this dataset of beatboxing recorded under heterogeneous conditions, a delay of around 2 frames (23 ms) relative to onset leads to stronger performance in a three-way classification task.

Since the Naïve Bayes classifier treats features independently, a real-time system could progressively update the classification decision as each new frame arrives, progressively increasing the amount of stacking. In fact, the two-way classification results indicate that the classification task could be spread across frames, using a decision-tree approach [Murthy, 1998] in which a hihat-vs.-others decision could be made at a low latency, and the snare-vs.-kick decision made slightly later. In Section 3 we will study the perceptual quality of a system whose decision is only updated once, in order to create a clear experimental measure of the relationship between delay and quality. However we note that a progressively-updated decision is a useful possibility for the real-time classification task discussed here, to be explored in future work.

Perceptual experiment
In Section 2 we confirmed that beatbox classification can be improved by delaying decision-making relative to the event onset. Adding this extra latency to the audio output may be undesirable in a real-time percussive performance, hence our proposal that a low-latency low-accuracy output could be updated some milliseconds later with an improved classification. This two-step approach would affect the nature of the output audio, so we next investigate the likely effect on audio quality via a listening test.
Our test will be based on the model of a reactive musical system which can trigger sound samples, yet which allows that the decision about which sound sample to trigger may be updated some milliseconds later. Between the initial trigger and the final classification the system might begin to output the most likely sample according to initial information, or a mixture of all the possible samples, or some generic "placeholder" sound such as pink noise. The resulting audio output may therefore contain some degree of inappropriate or distracting content in the attack segments of events. It is known that the attack portion of musical sounds carries salient timbre information, although that information is to some extent redundantly distributed across the attack and later portions of the sound [Iverson and Krumhansl, 1993].
Our research question here is the extent to which the inappropriate attack content introduced by delayed decision-making impedes the perceived quality of the audio stream produced.

Method
We first created a set of audio stimuli for use in the listening test. The delayed-classification concept was implemented in the generation of a set of drum loop recordings as follows: for a given drum hit, the desired sound (e.g. kick) was not output at first, but rather an equal mixture of kick, hihat and snare sounds. Then after the chosen delay time the mixture was crossfaded (with a 1 ms sinusoidal crossfade) to become purely the desired sound. The resulting signal could be considered to be a drum loop in which the onset timings were preserved, but the onsets of the samples had been degraded by contamination with other sound samples. We investigated amounts of delay corresponding to 1, 2, 3 and 4 frames as in the earlier classifier experiment (Section 2), approximately 12, 23, 35 and 46 ms.
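The rendering of a single degraded drum hit can be sketched as follows (a simplified version of the stimulus generation; we assume a raised-sine fade here, and the function names are our own):

```python
import numpy as np

SR = 44100  # sampling rate (Hz)

def delayed_decision_hit(samples, target, delay_ms, fade_ms=1.0, sr=SR):
    """Render one drum hit: an equal mix of all candidate samples, crossfaded
    to the target sample after delay_ms, with a short sinusoidal crossfade."""
    n = max(len(s) for s in samples)
    pad = lambda s: np.pad(s, (0, n - len(s)))        # zero-pad to equal length
    mix = sum(pad(s) for s in samples) / len(samples) # equal mixture of all sounds
    tgt = pad(samples[target])

    d = int(sr * delay_ms / 1000)  # decision point in samples
    f = int(sr * fade_ms / 1000)   # crossfade length in samples
    gain = np.ones(n)              # gain applied to the target sample...
    gain[:d] = 0.0                 # ...zero until the decision point
    gain[d:d + f] = np.sin(np.linspace(0.0, np.pi / 2, f)) ** 2  # raised-sine fade
    return (1.0 - gain) * mix + gain * tgt
```

Before the decision point the output is the undifferentiated mixture; after the crossfade it is purely the desired sample, so the onset timing is preserved while the attack is contaminated.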
Sound excerpts generated by this method therefore represent a kind of idealised and simplified delayed decision-making in which no information is available at the moment of onset (hence the equal balance of all drum types) and 100% classification accuracy is reached after the specified delay. Our classifier experiment (Section 2) indicates that in a real-time classification system, some information is available soon after onset, and also that classification is unlikely to achieve perfect accuracy. The current experiment factors out such issues of classifier performance to focus on the perceptual effect of delayed decision-making in itself.
The reference signals were each 8 seconds of drum loops at 120 bpm with one drum sample (kick/snare/hihat) being played on every eighth-note.
Three drum patterns were created using standard dance/pop rhythms, such that the three classes of sound were equally represented across the patterns.
The patterns were specified using the notation k=kick, h=hihat, s=snare. Three sets of drum sounds were used:

• Immediate-onset samples, designed by the first author using SuperCollider to give kick/hihat/snare sounds, but with short duration and zero attack time, so as to provide a strong test for the delayed classification. This drum set was expected to provide poor acceptability at even moderate amounts of delay.

• Roland TR909 samples, taken from one of the most popular drum synthesisers in dance music [Butler, 2006, p. 326], with a moderately realistic sound. This drum set was expected to provide moderate acceptability results.

• Amen break, originally sampled from "Amen brother" by The Winstons and later the basis of jungle, breakcore and other genres, now the most popular breakbeat in dance music [Butler, 2006, p. 78]. The sound samples are much less "clean" than the other sound samples (all three samples clearly contain the sound of a ride cymbal, for example). Therefore this set was expected to provide more robust acceptance results than the other sets, yet still represent a commonly-used class of drum sound.

Tests took around 20-30 minutes in total to complete, including initial training, and were performed using headphones.
Post-screening was performed by numerical tests combined with manual inspection. For each participant, any set of gradings showing a low correlation with the other participants' gradings was inspected as a possible outlier. Any set of gradings in which the hidden reference was not always rated at 100 was also inspected manually. (Ideally the hidden reference should always be rated at 100 since it is identical to the reference; however, participants tend to treat MUSHRA-type tasks to some extent as ranking tasks [Sporer et al., 2009], and so if they misidentify some other signal as the highest quality they may penalise the hidden reference slightly. Hence we did not automatically reject these.) We also plotted the pairwise correlations between gradings for every pair of participants, to check for subgroup effects. No subgroups were found, and one outlier was identified. The remaining 22 participants' gradings were analysed as a single group.

However, the grading scale is bounded (between 0 and 100), which can lead to difficulties using the standard normality assumption to calculate confidence intervals, especially at the extremes of the scale. To mitigate these issues we applied the logistic transformation [Siegel, 1988, chapter 9]:

z = log((x + δ) / (100 − x + δ))

where x is the original MUSHRA score and the δ is added to prevent boundary values from mapping to ±∞ (we used δ = 0.5). Such a transformation allows standard parametric tests to be applied more meaningfully (see also Lesaffre et al. [2007]).
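The transformation is a one-liner; the following sketch maps bounded MUSHRA scores on [0, 100] to an unbounded z scale:

```python
import numpy as np

def mushra_to_z(x, delta=0.5):
    """Logistic transform of MUSHRA scores (0-100) to an unbounded scale.

    delta keeps the boundary scores 0 and 100 from mapping to -inf/+inf."""
    x = np.asarray(x, dtype=float)
    return np.log((x + delta) / (100.0 - x + delta))
```

A score of 50 maps to exactly zero, while 0 and 100 map to large but finite negative and positive values.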

Results
For each kit, we investigated the differences pairwise between each of the six conditions (the four delay levels plus the reference and anchor). To determine whether the differences between conditions were significant we performed the paired-samples t-test (in the logistic z domain; d.f. = 65) with a significance threshold of 0.01, applying Holm's procedure to control for multiple comparisons [Shaffer, 1995]. All differences were found to be significant, with the exception of a few pairs among the immediate-onset samples.

When applied in a real-world implementation, the extent to which these perceptual quality measures reflect the amount of delay acceptable will depend on the application. For a live performance in which real-time controlled percussion is one component of a complete musical performance, the delays corresponding to good or excellent audio quality could well be acceptable, in return for an improved classification accuracy without added latency.
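A sketch of this testing procedure (pairwise paired t-tests with Holm's step-down correction; the function name and dictionary interface are our own, using scipy):

```python
import numpy as np
from scipy import stats

def paired_tests_holm(conditions, alpha=0.01):
    """Pairwise paired-sample t-tests with Holm's step-down correction.

    `conditions` maps condition name -> array of per-trial scores (same trials
    in each condition). Returns {(a, b): bool} indicating significance at
    family-wise level alpha."""
    names = list(conditions)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    pvals = [stats.ttest_rel(conditions[a], conditions[b]).pvalue for a, b in pairs]
    order = np.argsort(pvals)
    m = len(pvals)
    significant = {}
    rejecting = True
    for rank, idx in enumerate(order):
        # Holm: compare the k-th smallest p-value against alpha / (m - k);
        # once one test fails, all larger p-values are declared non-significant.
        if rejecting and pvals[idx] <= alpha / (m - rank):
            significant[pairs[idx]] = True
        else:
            rejecting = False
            significant[pairs[idx]] = False
    return significant
```

Holm's procedure controls the family-wise error rate while being uniformly more powerful than a plain Bonferroni correction at the same alpha.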

Conclusions
We have investigated delayed decision-making in real-time classification, as a strategy to allow for improved characterisation of events in real time without increasing the triggering latency of a system. This possibility depends on the notion that small signal degradations introduced by using an indeterminate onset sound might be acceptable in terms of perceptual audio quality.
We introduced a new real-world beatboxing dataset beatboxset1 and used it to investigate the improvement in classification that might result from delayed decision-making on such signals. A delay of 23 ms generally performed strongly out of those we tested. Neither feature stacking nor feature selection across varying amounts of delay led to strong improvements over this performance, though some of the classification sub-tasks (hihat vs. others) showed peak performance at a lower delay compared to others (kick vs. snare), suggesting that the acoustic signal properties of the classes separate out at different stages.
In a MUSHRA-type listening test we then investigated the effect on perceptual audio quality of a degradation representative of delayed decision-making. We found that the resulting audio quality depended strongly on the type of percussion sound in use. The effect of delayed decision-making was readily perceptible in our listening test, and for some types of sound delayed decision-making led to unacceptable degradation (poor/bad quality) at any delay; but for common dance/pop drum sounds, the maximum delay which preserved an excellent or good audio quality varied from 12 ms to 35 ms.