Disentangling the Horowitz Factor: Learning Content and Style From Expressive Piano Performance

In the Western art music tradition, expressive piano performance consists of two kinds of information: the score, with pitch and timing expressed in simple musical units along with occasional expression instructions, and the performer's interpretation of the score, involving variations in tempo, dynamics and articulation. In this paper, we present a novel framework for learning representations that disentangle musical content and performance style from expressive piano performances in an unsupervised manner. Our method is based on an extension of the vector-quantized variational autoencoder (VQ-VAE) with individual content and style branches, along with mutual information (MI) minimization techniques and self-supervising strategies. We performed experiments and ablation studies on the ATEPP dataset, a large set of automatically transcribed virtuosic piano performances with rich stylistic variations, and evaluated content reconstruction and style discrimination in a style-transfer manner. Our experiments demonstrate that the model learnt separate latent variables encoding musical content (such as pitch and relative timing) and stylistic attributes: generated samples align well with the content input, with low note error rates (NER), and the 40-way style discrimination proxy task outperformed the baseline with a top-1 accuracy of 0.168.


INTRODUCTION AND RELATED WORK
Expressive music performance is the art of shaping a musical piece by continuously varying interpretative parameters such as tempo and dynamics. Human musicians do not play a piece of music mechanically as written in the printed music score. When we hear a performance, two pieces of information are heard: the conceptualized composition described by strict musical units with occasional, general expression instructions, and the performer's interpretative input that consists of variations like speeding up, slowing down, stressing certain notes or passages, and so on. More importantly, such artistic decisions are often highly specific to each individual performer, and there have been numerous attempts [1, 2] to characterise the individual styles of performers (e.g., the so-called "Horowitz factor" [3]). In Fig. 1, a visualization is shown to demonstrate such expressive factors.
Such division between structure and aesthetics has also manifested in other domains: in visual arts and general image processing, geometric information can be distilled and isolated from textural properties [4]; in speech and audio, voice and speaker information is learnt separately from the text content. Understanding and disentangling such components leads to applications like image completion [5], artistic style transfer [6], and speech synthesis [7]. However, few researchers attempt to disentangle content and style for expressive music performance.

Fig. 1. A visual comparison between the metrical score (middle) and Brendel's expressive rendering (bottom) of the opening phrase of Mozart's K310 sonata slow movement (top). Expressive devices such as the asynchrony of the chords and ritardando towards the end of the phrase are clearly identifiable.
Previous work on music transformation using audio data [8, 9] focuses on isolating timbre from pitch in a similar fashion to speech transfer learning. For work in the symbolic music realm [10, 11, 12], the focus usually lies in disentangling aspects of the compositional content such as harmony, texture and arrangement, especially with the aim of controllable music generation. Moreover, unlike compositional disentanglement works that utilize pop music MIDI and MusicXML datasets with annotations of chords and meter [10, 13], our work requires expressive performance data that contains simpler (no annotations) but more nuanced information (such as precise onset timing and pedals) [14], and thus brings higher complexity to the model.
Meanwhile, many methods have been applied to learn disentangled representations of style and content from data. Under the variational autoencoder (VAE) framework, models like FH-VAE [15], DSAE [16] and TS-DSAE [9] have been proposed for encoding and generating high-dimensional sequential data. We also followed this encoder-decoder framework, but the usual approach based on decoupling global and local tokens [9, 17] does not align well with our task: unlike voice or timbre, which can be summarized at the sequence level, expressive deviation is not a global but a time-varying attribute. Other work on disentanglement is based on generative adversarial networks (GANs) [18], but these can be hard to train and require careful hyperparameter tuning [19]. Various techniques have also been applied to guide models towards disentanglement, such as minimizing the mutual information between latent variables [20, 7] and adversarial training [6]. Another viable strategy is to introduce additional information, such as chord progression reconstruction [10].
To our knowledge, this is the first work that addresses music style translation from a performance interpretation perspective. Our contributions can be summarized as follows:
• We present the first neural framework for learning content and style representations in expressive piano performance.
• We propose new evaluation metrics for this specific task, such as NER for validating content reconstruction and a proxy performer recognition task for style discrimination.
• Using a dataset [14] of 11742 transcribed classical piano performances with rich stylistic variation, our model learnt separate latent representations in an unsupervised manner, outperforming the baseline in both style and content evaluations.

Problem Formulation and Loss Objectives
Based on the assumption that each performance rendering is a combination of musical content and interpretative input, the likelihood of observing the performance sample X given content information Zc and style information Zs is pθ(X|Zc, Zs), where θ denotes the model parameters. In the VAE framework, we use variational inference to learn an approximate posterior for each latent variable through encoder functions qc(Zc|X) and qs(Zs|X), with optimization performed via the evidence lower bound (ELBO). The base loss function for the two-branch VAE is shown in Eq. 1, with reconstruction and Kullback-Leibler (KL) divergence terms.
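A standard form of this two-branch ELBO loss, consistent with the description above (a sketch: a reconstruction term plus one KL term per branch; the exact weighting in Eq. 1 may differ), is:

```latex
\mathcal{L}_{\mathrm{VAE}} =
  \mathbb{E}_{q_c(Z_c|X)\,q_s(Z_s|X)}\!\left[-\log p_\theta(X \mid Z_c, Z_s)\right]
  + D_{\mathrm{KL}}\!\left(q_c(Z_c|X) \,\|\, p(Z_c)\right)
  + D_{\mathrm{KL}}\!\left(q_s(Z_s|X) \,\|\, p(Z_s)\right)
```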
Mutual Information Minimization We followed the MINE [21] method to construct a lower bound of mutual information based on the Donsker-Varadhan representation of KL divergence, as shown in Eq. 2. By minimizing the mutual information I(Zc, Zs) between the hidden representations Zc and Zs, which equals the KL divergence between their joint distribution P(Zc,Zs) and the product of marginals PZc × PZs, we alleviate possible content leakage and ensure disentanglement. In the equation, the supremum is taken over all functions G such that the two expectations are finite. Given that there is no closed-form computation of mutual information, we use a neural network G to approximate this lower bound, and it is optimized along with the main network.
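In the Donsker-Varadhan form, this estimate (a sketch consistent with the MINE formulation; the supremum is over functions G with finite expectations) reads:

```latex
I(Z_c; Z_s) = D_{\mathrm{KL}}\!\left(P_{(Z_c, Z_s)} \,\|\, P_{Z_c} \times P_{Z_s}\right)
  = \sup_{G}\; \mathbb{E}_{P_{(Z_c, Z_s)}}\!\left[G\right]
  - \log \mathbb{E}_{P_{Z_c} \times P_{Z_s}}\!\left[e^{G}\right]
```

Any fixed network G yields a lower bound on I(Zc, Zs); MINE maximizes over G to tighten it while the main network minimizes the resulting estimate.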
Vector Quantization The technique of vector quantization (VQ) [22] has been proven effective in multiple disentanglement tasks [23]. The VQ layer encourages the content encoder output ze(x) to minimize the distance between itself and the nearest codebook vector e. The VQ loss in Eq. 3 is added, where sg(•) is the stop-gradient operation. In our experiments, we set the commitment loss weight α to 1.
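The VQ loss described above takes the usual VQ-VAE form (a sketch; Eq. 3 should match term-by-term, with sg(•) the stop gradient and α the commitment weight):

```latex
\mathcal{L}_{\mathrm{VQ}} =
  \left\| \mathrm{sg}\!\left[z_e(x)\right] - e \right\|_2^2
  + \alpha \left\| z_e(x) - \mathrm{sg}\!\left[e\right] \right\|_2^2
```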
Our overall loss objective comprises the above elements, where β1 and β2 are weighting parameters:
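A plausible form of the combined objective in Eq. 4 (a sketch; Î denotes the MINE estimate of mutual information, and the exact grouping of terms may differ):

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{VAE}}
  + \beta_1\, \hat{I}(Z_c; Z_s)
  + \beta_2\, \mathcal{L}_{\mathrm{VQ}}
```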

Input Representation
Each piece of symbolic data is represented by four token sequences P, V, O, D, corresponding to pitch, velocity, onset and duration. The four sequences are each fed through an embedding layer and then concatenated into input X ∈ R^(T×embDim), similar to the compound word (CP) symbolic music tokenization scheme [24]. In inference, four separate projection layers invert this process and output generated token sequences P̄, V̄, Ō, D̄.
Vocabulary-wise, following the MIDI standard, P and V both take on 128 values, while O and D take on 2300 and 700 values, respectively. The time tokens are quantized at 10 ms resolution, and we use sequence length T = 128.
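A minimal numpy sketch of this tokenization (the helper name `tokenize` and the note-tuple layout are illustrative assumptions, not the paper's API; vocabulary sizes and the 10 ms grid follow the text above):

```python
import numpy as np

# Vocabulary sizes from the paper: 128 pitch/velocity classes (MIDI standard),
# 2300 onset and 700 duration classes at 10 ms resolution.
ONSET_VOCAB, DUR_VOCAB, TIME_RES = 2300, 700, 0.010  # seconds per time step

def tokenize(notes):
    """notes: list of (pitch, velocity, onset_sec, duration_sec) tuples.
    Returns the four token sequences P, V, O, D as integer arrays."""
    P = np.array([n[0] for n in notes])  # 0..127
    V = np.array([n[1] for n in notes])  # 0..127
    # Quantize times to the 10 ms grid and clip to the vocabulary range.
    O = np.minimum(np.round([n[2] / TIME_RES for n in notes]).astype(int),
                   ONSET_VOCAB - 1)
    D = np.minimum(np.round([n[3] / TIME_RES for n in notes]).astype(int),
                   DUR_VOCAB - 1)
    return P, V, O, D
```

Each sequence would then pass through its own embedding layer before concatenation into X, as described above.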
Although the MIDI vocabularies of P, V, O and D are discrete, they are not truly categorical, as they carry the continuous semantic meaning of pitch and timing. Thus, for the reconstruction loss we experimented with regressing the outputs under an L2 loss instead of cross-entropy classification, so that the distances between vocabulary classes are incorporated into training.

Model Architecture and Training Details
Our overall model architecture is summarized in Fig. 2. As described in section 2.2, the symbolic music input and output sequences are processed via an embedding layer and a projection layer, respectively, from their tokenized representation of MIDI events.
The content encoder EC(•) aims to extract a sequence of latent variables Zc ∈ R^(T×LatentDim) that represents only the content of the input X. The content encoder is built on top of a convolutional stack and two layers of bidirectional gated recurrent units (GRU) to represent the musical content in a context-aware fashion. As mentioned in section 2.1, an information bottleneck is applied on top of the content encoder via a vector quantization layer with a codebook size of 4096, guiding the branch to focus on localized information.
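The nearest-codebook lookup at the heart of the VQ layer can be sketched in plain numpy (illustrative only; the actual layer also carries the straight-through gradient and the VQ losses of Eq. 3):

```python
import numpy as np

def vq_lookup(z_e, codebook):
    """Nearest-neighbour codebook lookup as in VQ-VAE.
    z_e: (T, d) encoder outputs; codebook: (K, d) learned vectors.
    Returns the quantized frames and the chosen code indices."""
    # Squared Euclidean distance from every frame to every codebook entry.
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    idx = d2.argmin(axis=1)            # index of the nearest code per frame
    return codebook[idx], idx
```

In the model, each of the T content frames is replaced by its nearest entry of the 4096-vector codebook, bottlenecking the content branch.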
ES(•) functions as the style branch in our architecture, and aims to factor out a style latent that represents only the expressive deviations. It is built with a similar architecture, but without the VQ layer. Both branches have a variational layer at the end, and the latents are sampled according to Zμ and Zσ². We train the model using Adam to minimize the loss from Eq. 4. We trained for 450 epochs, taking about 46 hours in total on two RTX 2080 GPUs. We take embDim = 128 and LatentDim = 512, and for the loss weighting parameters we used β1 = 0.5 and β2 = 0.5. Ablation studies on other parameters are presented in the results section.
Baseline Given the limited prior work on our topic, we set our baseline as the vanilla VAE framework with the loss objective described in Eq. 1.
Self-Supervised Training Inspired by Cífka et al. [8], we also explore a self-supervised training technique. To ensure that the style encoder only encodes style, we feed the style encoder another segment Xj from the same training set recording as the content input Xi. The rationale is that, given the same expressive style throughout a recording, even if Xj has different content, the model should be able to reconstruct Xi with the style latent from Xj and the content latent from Xi. Besides the paired segment input, other training objectives and the model architecture remain the same as for the main experiment.
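The pairing scheme above can be sketched as follows (the data layout `{recording_id: [segments]}` and the helper name are assumptions for illustration, not the paper's implementation):

```python
import random

def paired_batch(segments_by_recording, rng=random):
    """Sample (content X_i, style X_j) segment pairs, with X_j drawn from the
    same recording as X_i but (where possible) with different content."""
    pairs = []
    for rec_id, segs in segments_by_recording.items():
        if len(segs) < 2:
            continue  # need at least two segments to form a pair
        xi = rng.choice(segs)                               # content input X_i
        xj = rng.choice([s for s in segs if s is not xi])   # style input X_j
        pairs.append((xi, xj))
    return pairs
```

During training, the model is then asked to reconstruct X_i from the content latent of X_i and the style latent of X_j.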

Dataset
The content and style experiments are supported by the ATEPP dataset [14], which contains 11742 tracks of virtuosic solo piano performances in MIDI format obtained via automatic transcription. The transcribed MIDI files contain detailed expressive information such as key velocities and pedal depths. With 49 pianists performing an overlapping corpus of standard Western classical repertoire, rich stylistic variations are represented in this dataset. The training segments and input representations are generated following the procedure in Sec. 2.2. We split the data into train/valid/test sets by track of music rather than by individual segment, as repetition in the music might otherwise compromise the test set.
In this project, we simplify the labelling of expressive style by using performer identity as a proxy. We acknowledge that, from a musical perspective, there is no bijective mapping between performer and interpretation style. However, given the subjective nature of interpretation and the fact that very few objective parameters of performance style have been proposed [25], this is a reasonable approximation.

Evaluation
We evaluate the effectiveness of our disentanglement model from a style-transfer perspective. In test-time generation, the decoder takes a content input Xc and a style input Xs from a different excerpt, concatenates their hidden representations, and decodes an expressive rendering X̄. Considering the effects that the proximity of inputs may have on the results, the following input shuffling schemes are proposed:
1. SR: Xc and Xs are taken from the Same Recording
2. SD: Same performer but Different piece
3. DP: Different recordings from Different Performers
At test time, a set of samples is generated for each scheme by selecting inputs Xc and Xs accordingly.
Content Preservation: To evaluate the faithfulness of content reconstruction, we introduce the note error rate (NER), analogous to the word error rate (WER) used in speech recognition [7, 20]. An alignment of the generated X̄ and content input Xc is produced by Nakamura's algorithm [26], which employs hidden Markov models (HMMs) to align two symbolic performances and correct errors. The NER is then calculated from the alignment outputs, where Sextra, Swrong and Smissing denote the sets of extra, pitch-incorrect and missing notes of the generated MIDI.
Style Fit: As mentioned in section 3.1, stylistic characterization is subjective and no standardized measure exists. Thus, we evaluate style fitness via neural approximation. A neural network discriminator D, acting as a probe [27], is trained to evaluate how well the generated samples simulate the ground-truth distribution [28]. D is first trained on generated samples and then discriminates on ground-truth test data in a 40-way style classification task (a few pianists are not present in the test split). D is a simple recurrent neural network consisting of an embedding layer, 2 layers of biGRUs and a softmax projection. The discriminator is trained on the generated data for 300 epochs with an early-stopping patience of 10 epochs to prevent overfitting. Top-1 and Top-5 accuracy are reported.
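Given the alignment sets, the NER can be computed as in the sketch below (the normalizer is an assumption, mirroring WER's use of the reference word count; the paper's exact formula may differ):

```python
def note_error_rate(s_extra, s_wrong, s_missing, n_ref):
    """NER analogous to WER: (extra + pitch-incorrect + missing) notes,
    divided by the number of notes in the reference (content input)."""
    return (len(s_extra) + len(s_wrong) + len(s_missing)) / n_ref
```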

Results and Discussions
Table 1 shows the results of our experiments. Both proposed models, PERFVAE and PERFVAESS, outperform the vanilla VAE baseline. In terms of NER, both models achieved less than 0.2, meaning the generated content is roughly 80 percent aligned with the desired content. In terms of style discrimination, on the 40-way classification task we achieved a highest accuracy of 0.168, demonstrating that the style-transferred generative output partially matched the ground-truth style distribution. With the content and style evaluation results combined, we can informally say that the disentanglement is partially successful and can be viewed as a starting point for this novel task.
We also note that the self-supervised model performs less accurately in NER (content reconstruction) than the unsupervised version, possibly because different content goes through the style branch during training. In terms of reconstruction accuracy with respect to the original input, the proposed models actually perform worse than the vanilla VAE baseline, possibly because more regularization is placed on the proposed models' training objective. Another interesting observation involves the different shuffle groups. The results for both content and style measures decrease from the SR group to the DP group. Given that these groups correspond to high and low proximity, respectively, of the pair of inputs, we can infer that the model struggles as the content and style inputs become more distant and less musically plausible (for example, blending an Ashkenazy performance of a Chopin ballade with Gould's Bach Inventions).
Our subjective observations upon examining the outputs mostly match the objective evaluation. We find that the musical content from the content input is generally well-preserved in the style-transferred output. Also, under the shuffle group DP, the output is more disorganized compared to the other two groups, demonstrating that the disentanglement quality is still quite limited. See a subset of examples at https://tinyurl.com/csd-examples.
Ablation Study We also performed an ablation study on the effect of VQ codebook size as well as the use of the mutual information loss. As shown in Table 2, incorporating the mutual information loss helped the model produce better disentangled results in both content and style measures in all configurations. There is some positive correlation between increasing codebook size and improved NER and discrimination results, but no strong significance was observed, even when we increased the codebook size to 8192. This might be attributed to the codebook collapse [29] issue that is common in VQ-VAE.

Latent Space Analysis
In Fig. 3, we analyze the information content learnt in the latent variables ZC and ZS by projecting them onto their first three principal components using PCA. We prepared a set of data samples created from combinations of five different styles and five different musical excerpts. In Fig. 3a and 3b the colors are based on style labels, and in 3c and 3d on musical content. In the dimension-reduced latent space, data points sharing a style label have style latents grouped more closely together (Fig. 3a) than their content latents (3b). Similarly, in the bottom two plots, for data points containing the same musical content, no correlation of style latents is observed (3c), but the content latents show some clustering (3d).
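The projection can be reproduced with a plain numpy SVD (a sketch; the paper's exact PCA settings are not specified, and the helper name is illustrative):

```python
import numpy as np

def pca_project(Z, k=3):
    """Project latent vectors Z (N, d) onto their first k principal
    components via SVD of the centred data matrix."""
    Zc = Z - Z.mean(axis=0)                              # centre the data
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)    # rows of Vt = PCs
    return Zc @ Vt[:k].T                                 # (N, k) coordinates
```

Applying this separately to the style latents ZS and content latents ZC, colored by style or content label, reproduces the kind of clustering comparison shown in Fig. 3.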

CONCLUSION AND FUTURE WORK
In this paper, we proposed a framework for content and style disentanglement for expressive piano performance. Under the vector-quantized variational autoencoder architecture with mutual information minimization, our model demonstrated effective decoupling of musical content information and performance style. Unlike previous work, we demonstrate the feasibility of unsupervised learning from expressive performance data without score annotation, thus enabling much larger-scale analysis of performance style. We hope this work can shed light on the realm of expressive performance understanding, especially on the relationship between composition elements and interpretative inputs.
In the future, we plan to extract more musically-grounded features by guiding the training, as well as setting up a more standardized profile for style characterization evaluation with the support of subjective assessment.

Fig. 3. Visualization of latent variables, showing greater proximity of performers by style (a) than content (b), and of pieces by content (d) than style (c).
This work is supported by the UKRI Centre for Doctoral Training in Artificial Intelligence and Music, funded by UK Research and Innovation [grant number EP/S022694/1].

Table 1. Comparison of different methods in terms of both content and style measures, with 0.95 confidence. PERFVAE is the proposed model with the loss objective from Eq. 4; PERFVAESS is the proposed model with self-supervised training as described in Sec. 2.

Table 2. Results of ablation studies on codebook size and the mutual information loss, all performed on the SR shuffle group without the self-supervising strategy.