Predictive Uncertainty Underlies Auditory Boundary Perception

Anticipating the future is essential for efficient perception and action planning. Yet the role of anticipation in event segmentation is understudied because empirical research has focused on retrospective cues such as surprise. We address this concern in the context of perception of musical-phrase boundaries. A computational model of cognitive sequence processing was used to control the information-dynamic properties of tone sequences. In an implicit, self-paced listening task (N = 38), undergraduates dwelled longer on tones generating high entropy (i.e., high uncertainty) than on those generating low entropy (i.e., low uncertainty). Similarly, sequences that ended on tones generating high entropy were rated as sounding more complete (N = 31 undergraduates). These entropy effects were independent of both the surprise (i.e., information content) and phrase position of target tones in the original musical stimuli. Our results indicate that events generating high entropy prospectively contribute to segmentation processes in auditory sequence perception, independently of the properties of the subsequent event.

Humans make sense of a complex, dynamic world by segmenting sequences of events into manageable units (Kurby & Zacks, 2008; Richmond & Zacks, 2017; Zacks & Swallow, 2007). Past work on segmentation has focused on retrospective cues for boundary identification, often conceptualizing group boundaries as coinciding with instances of increased relative change in stimulus features or low transition probabilities (e.g., speech: Saffran & Kirkham, 2018; action sequences: Hard et al., 2011; music: Hartmann et al., 2017; Pearce et al., 2010). However, the sophisticated prediction capabilities of the human mind (Hutchinson & Barrett, 2019) suggest that event boundaries are also anticipated. For example, in natural conversation, turn taking happens so rapidly that speakers likely anticipate the end of their conversation partner's sentence (Levinson, 2016). Here, we investigated the role of entropy, or the degree of uncertainty about an upcoming event, in determining the perception of group boundaries in auditory sequences. We define prediction as the psychological process of generating an expectation about a future event in terms of how likely various possible outcomes are. We define uncertainty as the imprecision (or extent of equiprobability) of such a prediction.
Though most previous work has focused on retrospective boundary identification, anticipatory processing has some preliminary support. When self-pacing through sequential images of action sequences, participants tend to dwell (or pause) on perceived boundary images (Hard et al., 2011, 2019; Kosie & Baldwin, 2019a, 2019b). Kosie and Baldwin (2019b) proposed that this dwell-time effect resulted from selective attention to moments of uncertainty afforded by perceiving a goal-completion event. No cognitive model was devised to test this theory, however, possibly because of challenges related to modeling expectancy in event processing of action sequences. Indeed, one methodological drawback was demonstrated by participants' dwelling on boundary slides even when those slides were out of order, suggesting that they were responding to conceptual salience rather than to underlying expectancy dynamics (Hard et al., 2011). Cohen et al. (2007) proposed an entropy-based segmentation model for language, but because it computes statistics from the corpus it is segmenting (including parts it has not yet seen), it does not fully capture segmentation processing in real time (Christiansen & Chater, 2016).
Because music is not only hierarchically structured (Lerdahl & Jackendoff, 1983) but also statistically well defined, it is an ideal domain for testing psychological theories of probabilistic perception (Koelsch et al., 2019). As with nonmusical sequences (Zacks et al., 2001), there is generally high interparticipant agreement regarding the location of musical-phrase boundaries (Deliège, 1987; but see Pearce et al., 2010), and as with action sequences, listeners self-pacing through musical chords dwell on boundary chords (Kragness & Trainor, 2016). Because entropy correlates strongly with phrase boundaries in music, however (Hansen et al., 2017), previous studies were not optimized to separate prospective effects of expectancy dynamics from effects of canonical boundary features on perceptual grouping. Information dynamics of music (IDyOM; Pearce, 2005) is a computational model of auditory expectation that enables the modeling of boundary perception quantitatively using the information-theoretic concepts of entropy and information content, computed in reference to preexisting long-term knowledge (Hansen et al., 2016; Hansen & Pearce, 2014). Entropy facilitates a test of uncertainty as a prospective mechanism for boundary perception that can be pitted directly against information content (a measure of surprise) as a retrospective cue. For example, an individual may form a highly certain prediction about the next note in a melody but then be surprised when a different note follows. Another advantage of melodic sequences is that any given note has little intrinsic meaning in isolation from its preceding musical context, ensuring that observed effects on perception reflect the statistical structure of the sequence and not inherent features of the boundary stimulus itself.
However, because uncertainty is not always available for explicit introspection (Hansen et al., 2016), implicit measures are paramount for investigating the cognitive mechanisms underlying boundary perception.
The present study used IDyOM to control the information-dynamic properties of melodic sequences in two experiments assessing the role of uncertainty in sequence processing. We measured participants' dwell times (Experiment 1) and explicit ratings of phrase completeness (Experiment 2) for tones that afforded high or low entropy and were phrase beginning or phrase ending in the melodies from which they were drawn. We predicted that tones generating high uncertainty would lead to longer dwell times and higher ratings of phrase completeness, regardless of original phrase status, and that this effect would be independent of retrospective surprise.

Statement of Relevance
In everyday contexts such as traffic or manual labor, we are constantly required to interact rapidly and appropriately with complex sensory input. Past experience crucially helps us by enabling us to predict what is about to unfold rather than merely react to what already happened. Although much research has characterized the consequences of making correct (or incorrect) predictions, here we investigated how uncertainty about upcoming events informs people's behavior. Participants heard sequences of musical tones, which varied in the uncertainty they conveyed about what came next. Uncertainty was defined in terms of information-theoretic entropy. Tones affording high uncertainty resulted in more implicit attention and greater perceived phrase completeness. Focusing attention toward points of local uncertainty may thus facilitate efficient perceptual grouping and learning in a complex, dynamic world.

[...] years, SD = 3.78, 1 participant declined to report age). None of the participants were professional musicians (for more information about musical-training levels, see Table S1 in the Supplemental Material available online). This sample size exceeds or corresponds to those of previous studies using this methodology to assess comparable effects (e.g., Hard et al., 2011; Kragness & Trainor, 2016). All participants were fluent in English.
Stimuli. Fifty-six monophonic stimulus sequences were selected from the soprano (i.e., highest) part in 370 four-part chorales by Johann Sebastian Bach (Dörffel, 1875; see Section S1 in the Supplemental Material for details of the stimulus-selection procedure). These chorale melodies are not generally known by present-day listeners in Canada. Unfamiliarity was made more likely by giving participants control over tone durations in the self-paced dwell-time paradigm, thus completely removing any rhythmic information that might allow them to recognize the original piece. All chords, interference tones, and self-pacing tones were generated using the grand-piano timbre of Max/MSP (Version 5.0; Cycling '74, 2008).
Each stimulus context contained a full phrase (musical group) of seven to 17 pitches followed by the initial tone of the subsequent phrase in the original chorale melody. Tones associated with phrase beginnings and endings were unambiguously identified from notations in the musical score. This practice seems at least as objective as reliance on trained expert coders to determine event boundaries in research using visual action sequences (e.g., Hard et al., 2019; Kosie & Baldwin, 2019a, 2019b). We included both phrase endings and phrase beginnings as target tones to provide a strong test of entropy's role in segmentation, controlling for compositional cues in the melodies that might signal melodic phrase endings in other ways. Fourteen stimulus contexts were selected for each of the four experimental conditions, consisting of phrase beginnings with high entropy (BegHi) or low entropy (BegLo) and phrase endings with high entropy (EndHi) or low entropy (EndLo). Entropy, in this setting, quantifies the level of uncertainty governing a listener's expectations about what the pitch of the next tone following the relevant phrase beginning or phrase ending would be. Thus, Western-enculturated listeners are expected to be relatively sure about which pitch will follow the target tone in BegLo and EndLo contexts but relatively unsure in BegHi and EndHi contexts. Target tone, in this setting, refers to the final tone in BegLo and BegHi contexts and the penultimate tone in EndLo and EndHi contexts.
The entropy level generated by each tone in the corpus was estimated by IDyOM (Version 1.3; Pearce, 2005). This variable-order n-gram model uses unsupervised statistical learning to generate probability distributions governing a relevant feature of each tone in a monophonic melody. IDyOM was trained on a large data set of 5,332 German folk songs (Schaffrath, 1995), 152 Nova Scotian songs and ballads (Creighton, 1966), and 120 English hymns (Nicholson et al., 1950; see Note 1). For each tone in the chorale melody, IDyOM generated a probability distribution (summing to 1) over the 44 pitch values occurring in the training corpus (i.e., MIDI Pitches 45-89 corresponding to A2-F6) by combining n-gram models of varying order. Entropy then quantifies the shape of these probability distributions with high entropy for flat (relatively uniform) distributions, in which there is high uncertainty about the next event, and low entropy for "spiky" (relatively nonuniform) distributions, in which one or a small number of continuations are highly probable.
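The contrast between flat and "spiky" distributions can be made concrete with a minimal sketch of the two information-theoretic quantities. The code below is illustrative only: the function names and the two four-pitch distributions are hypothetical and are not IDyOM output. Entropy is computed from the whole predictive distribution before the next tone is heard, whereas information content is computed from the probability of the tone that actually follows.

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a predictive distribution over next pitches."""
    return -sum(p * math.log2(p) for p in probs.values() if p > 0)

def information_content(probs, observed):
    """Surprise (bits) of the pitch that actually occurred: -log2 p(observed)."""
    return -math.log2(probs[observed])

# Hypothetical predictive distributions over the next MIDI pitch
# (made up for illustration; not taken from IDyOM or the stimuli).
flat = {60: 0.25, 62: 0.25, 64: 0.25, 65: 0.25}    # high uncertainty
spiky = {60: 0.85, 62: 0.05, 64: 0.05, 65: 0.05}   # low uncertainty

print("entropy, flat distribution :", entropy(flat))
print("entropy, spiky distribution:", entropy(spiky))
# A confident (low-entropy) prediction can still be followed by a surprising tone.
print("IC of expected pitch 60    :", information_content(spiky, 60))
print("IC of unexpected pitch 62  :", information_content(spiky, 62))
```

In this toy case the spiky distribution has lower entropy than the flat one, yet an unexpected continuation under the spiky distribution carries high information content, which is why the two measures can be dissociated experimentally.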
The set of 56 stimulus contexts was selected in a way that prioritized extreme high or low entropy values while ensuring that two conditions were met. For the secondary analysis of all tones in the stimulus set, information content and entropy were reestimated by rerunning IDyOM with the same configuration on the final stimulus contexts. This was done because information content and entropy estimates for the initial tones in each stimulus context sometimes relied on tones from the preceding phrase in the original chorales, which was excluded from the stimuli used. Although this was not problematic for stimulus selection based on target tones, it did present a problem for tone-level analysis. Note that because of their late position in the tone sequences, target-tone entropy and information-content values were identical for the two models (one was used in stimulus generation and analyses of target tones, and the other was used in the analysis of all tones).

Procedure. The experimental procedures received prior approval from the McMaster University Research Ethics Board and were carried out in accordance with the provisions of the World Medical Association Declaration of Helsinki. Participants were seated facing a computer screen in a sound-attenuated room. They were instructed to press the space bar on a computer keyboard with the index finger of their dominant hand to elicit the onset of each subsequent tone in the sequence. Tones decayed naturally but were not terminated until the space bar was pressed again to initiate the next tone. Participants were instructed to progress as quickly or slowly as they liked while listening carefully, and they could not repeat previously heard tones. To motivate them to attend to the task, we falsely led them to believe that their memory for the sequences would be tested afterward (Kragness & Trainor, 2016). No other instructions regarding timing, pacing, rhythmicity, or expressivity were given. If a participant asked for further information, they were told to play through the piece in a way that would maximize their performance in the subsequent memory task.
Prior to each trial, participants saw three flashes of a fixation cross and then heard forty 50-ms tones (for a total of 2,000 ms). These tones, which were chosen randomly on each trial from the range E2 to A5 to minimize carryover from the context of the previous sequence, were followed by three context-establishing chords with durations of 800 ms, 800 ms, and 1,600 ms (Fig. 1). The context-establishing chords were played in the key of the relevant melody. Throughout each trial, a circle on the screen indicated when to begin self-pacing through the melody (light green) and when to stop (dark green).
Data processing and statistical analysis. Despite systematic efforts to avoid duplicate stimulus contexts (e.g., multiple occurrences of a repeated phrase from a single melody or identical phrases across melodies), it was discovered after data collection that one melodic context occurred in both the BegHi and EndHi stimulus sets (with different target tones). Given that results did not differ substantially when we excluded dwell times for these stimuli, we report statistical analyses including the full data set, which consisted of 56 total tone sequences (i.e., 14 per condition).
To mitigate effects of extreme data points, we adopted a minimum dwell-time threshold of 100 ms for inclusion. Dwell times greater than 3 standard deviations above a participant's own average (across all target and nontarget dwell times) were also omitted (Kosie & Baldwin, 2019a, 2019b). These exclusion criteria eliminated an average of 1.31% of all tones and 1.70% of target tones per participant (range = 0 to 4 target tones).
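The two exclusion criteria can be sketched as a short per-participant filter. This is a hedged illustration, not the analysis script used in the study: the function name is hypothetical, and the choices to use the sample standard deviation and to compute the mean and standard deviation before applying the 100-ms floor are assumptions, because the text does not specify them.

```python
from statistics import mean, stdev

MIN_DWELL_MS = 100  # minimum dwell time for inclusion (from the text)
SD_CUTOFF = 3       # upper cutoff, in participant-specific standard deviations

def apply_exclusions(dwell_ms):
    """Return one participant's dwell times (ms) that survive both criteria.

    Assumption: mean and sample SD are computed over all of the participant's
    dwell times (target and nontarget) before any exclusion is applied.
    """
    upper = mean(dwell_ms) + SD_CUTOFF * stdev(dwell_ms)
    return [d for d in dwell_ms if MIN_DWELL_MS <= d <= upper]
```

Applying the function separately to each participant's full list of dwell times keeps the upper cutoff relative to that participant's own pacing, as described above.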
For the main analysis of target tones, target dwell times were averaged by condition, resulting in four condition-wise means per participant. A 2 × 2 repeated measures analysis of variance (ANOVA; including the within-subjects factors boundary status and entropy) was run on target-tone dwell times.
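Because every effect in a 2 × 2 fully within-subjects design has one numerator degree of freedom, each F ratio equals the squared one-sample t statistic of a per-participant contrast score. The sketch below uses that identity to illustrate the structure of the analysis; it is not the software used in the study, the function names are hypothetical, and the input is assumed to be one dictionary of four condition-wise mean dwell times per participant.

```python
from math import sqrt
from statistics import mean, stdev

def contrast_f(scores):
    """F(1, n - 1) for a 1-df within-subject effect: squared one-sample t."""
    n = len(scores)
    t = mean(scores) / (stdev(scores) / sqrt(n))
    return t * t

def rm_anova_2x2(cell_means):
    """cell_means: one dict per participant with keys BegHi, BegLo, EndHi, EndLo."""
    entropy_c = [(p["BegHi"] + p["EndHi"]) - (p["BegLo"] + p["EndLo"])
                 for p in cell_means]
    boundary_c = [(p["EndHi"] + p["EndLo"]) - (p["BegHi"] + p["BegLo"])
                  for p in cell_means]
    interaction_c = [(p["EndHi"] - p["EndLo"]) - (p["BegHi"] - p["BegLo"])
                     for p in cell_means]
    return {
        "entropy": contrast_f(entropy_c),
        "boundary": contrast_f(boundary_c),
        "interaction": contrast_f(interaction_c),
    }
```

Each returned value is an F ratio with 1 and n - 1 degrees of freedom, where n is the number of participants.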
For the secondary analysis of all tones, dwell times were first log transformed to minimize the positive skew inherent to timing data (cf. Kragness & Trainor, 2018). Subsequently, using the lmer() function from the lme4 package (Version 1.1-23; Bates et al., 2015) in R (Version 3.6.2; R Core Team, 2019), we fitted linear mixed-effects models with restricted maximum likelihood estimates. Because previous experiments have found that dwell times change systematically throughout trials (Kragness & Trainor, 2016), tone index in the sequence was always included as a predictor. Thus, whereas the null model included only tone index as a fixed effect, two further increasingly complex models added, first, the retrospective cue information content (IC) and, second, the prospective cue entropy. Consequently, we could determine whether prospective predictive processing explained unique variance not already accounted for by retrospective surprise. Random intercepts and slopes of tone index were included for each participant. For all models, this random-effects structure produced the lowest Bayesian information criterion (BIC) values while avoiding singular fits.
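The logic of comparing increasingly complex nested models can be illustrated with a deliberately simplified sketch. It swaps in ordinary least squares fitted with NumPy for the lme4 mixed-effects models described above, so it omits the participant-level random intercepts and slopes. All data are simulated, and the variable names, effect sizes, and the `ols_bic` helper are hypothetical.

```python
import numpy as np

def ols_bic(y, predictors):
    """BIC of an ordinary least-squares fit with intercept (Gaussian likelihood)."""
    n = len(y)
    X = np.column_stack([np.ones(n)] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1] + 1  # regression coefficients plus residual variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return k * np.log(n) - 2 * log_lik

# Simulated tone-level data (illustration only; not the study's data).
rng = np.random.default_rng(0)
n = 500
tone_index = rng.integers(1, 18, size=n).astype(float)
ic = rng.normal(3.0, 1.0, size=n)
entropy = rng.normal(3.0, 0.5, size=n)
log_dwell = (6.0 - 0.01 * tone_index + 0.02 * ic + 0.30 * entropy
             + rng.normal(0.0, 0.2, size=n))

bic_null = ols_bic(log_dwell, [tone_index])            # tone index only
bic_ic = ols_bic(log_dwell, [tone_index, ic])          # + information content
bic_full = ols_bic(log_dwell, [tone_index, ic, entropy])  # + entropy

print("BIC null:", bic_null)
print("BIC + IC:", bic_ic)
print("BIC + IC + entropy:", bic_full)
```

A lower BIC for the full model than for the model with tone index and IC alone indicates that entropy accounts for variance beyond what surprise already explains, which is the inference the nested comparison is designed to support.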
We conducted post hoc correlational analyses to examine whether participants' musical sophistication was associated with the magnitude of their dwell-time effect. No significant associations were observed (see Section S2 in the Supplemental Material for more details).
All tones. If uncertainty provides a cognitive cue for phrase segmentation, its effect on dwell times should generalize beyond the target tones occupying the extreme ranges of entropy values. Analyzing dwell times for all tones also allowed us to directly compare the effects of prospective entropy with the effects of retrospective information content. Recall that information content was matched across target tones in the previous analysis.

Experiment 2: Explicit Completeness Ratings
In Experiment 1, participants dwelled longer on tones affording high-entropy continuations than on tones affording low-entropy continuations, regardless of whether they were originally phrase beginnings or endings. This suggests that when rhythmic and metrical cues are removed from the musical surface, entropic peaks in prospective pitch expectancy elicit implicit segmentation. Longer dwell times coincide with perceived boundaries (e.g., Hard et al., 2011), but Experiment 1 did not guarantee that participants were segmenting the stimuli. Therefore, Experiment 2 was designed to provide converging evidence for effects of prediction on segmentation using an explicit self-report measure of phrase completeness (Palmer & Krumhansl, 1987).
Fig. 1. Depiction of a trial from Experiment 1. In each trial, participants saw three flashes of a fixation cross, followed by interference tones, and then three context-establishing chords. The chords were followed by a signal (shown here as a white circle) to begin self-pacing. They then self-paced through the tone sequence until they saw the stop signal (shown here as a black circle). The inset depicts examples of tone sequences from each of the four conditions, consisting of melodic phrases whose target tones formed phrase beginnings with high entropy (BegHi) or low entropy (BegLo) or phrase endings with high entropy (EndHi) or low entropy (EndLo) in the original chorale melodies. The target tones (boxed) generated relatively uncertain (high entropy) or relatively certain (low entropy) expectations about the pitch of the next tone, matched on information content (IC) of the current tone. The double slash indicates whether target tones were phrase beginnings (after double slash) or phrase endings (prior to double slash) in the original notation.

Method
Participants. Thirty-one McMaster University students (not participants in Experiment 1) took part in Experiment 2. Again, none were professional musicians (see Section S2 in the Supplemental Material for more information). This sample size exceeds those from previous studies using this methodology to assess a comparable contrast (e.g., Palmer & Krumhansl, 1987). One participant declined to report gender and age, but of the remaining participants (7 men and 23 women), the average age was 18.93 years (SD = 2.51). Of the 31 participants, responses from five individuals were omitted because their response sheets were uninterpretable (i.e., multiple answers for each sequence, lacking answers for certain sequences).
Stimuli. Melodic stimulus sequences were identical to those for Experiment 1, except that all notes were played with a constant duration of 400 ms, and the target tone was always the final tone in the sequence. Unfamiliarity with Bach's chorale melodies was made more likely by presenting stimuli with equal tone durations, thus completely removing any rhythmic information that might allow participants to recognize the original piece.
Procedure. As in Experiment 1, the experimental procedures received prior approval from the McMaster University Research Ethics Board and were carried out in accordance with the provisions of the World Medical Association Declaration of Helsinki. The procedure took place in a sound-attenuating room. Rather than self-pacing through the sequences as in Experiment 1, participants listened to all 56 sequences in randomized order. After each sequence, participants rated how complete the sequence sounded (ranging from 1, totally incomplete, to 7, totally complete). If the end of the melody was completely satisfactory, that would constitute a score of 7, but if the melody ended in a way that was implausible and unsatisfactory, that would constitute a score of 1. Participants were encouraged to use the full range of the scale.
Again, no significant associations with musical sophistication were observed (see Section S2 in the Supplemental Material for more details).

General Discussion
Although prediction is a fundamental component in influential theories of perceptual organization (Hutchinson & Barrett, 2019), evidence for the role of uncertainty remains weak because of the empirical focus on retrospective measures of surprise (Hansen & Pearce, 2014). Here, we tested the hypothesis that uncertainty relates to boundary perception in auditory sequences, using stimuli from Western tonal music with well-defined phrase boundaries. Sequences ending on tones generating high-entropy expectations were perceived as more complete than those ending on tones generating low-entropy expectations (Experiment 2). This was also indicated by longer dwell times on high-entropy target tones; indeed, across all tones in the stimulus sequences, entropy explained unique variance in dwell times not accounted for by event probability (Experiment 1).
Our work raises the key question of why segmentation follows peaks of uncertainty. Christiansen and Chater's (2016) now-or-never bottleneck posits that information in working memory needs to be processed now or be forever lost. This constraint necessitates "chunk-and-pass" processing whereby fleeting input, such as the content of music, speech, or action sequences, is quickly segmented and encoded as higher-level representational units. Following this theory, we reason that events that afford high-entropy predictions may require more bits to encode and thus may require higher working memory deployment. The likelihood of exceeding memory capacity is higher after high-uncertainty events than after low-uncertainty events, causing higher probability of chunking and perceiving a segment boundary.
This framework may also explain previously demonstrated dwell-time effects (Hard et al., 2011, 2019; Kosie & Baldwin, 2019a, 2019b; Kragness & Trainor, 2016), because there is a time delay associated with segmentation and reintegration into previous knowledge. This reintegration process, however, may have a cost. Specifically, taking in new information is harder while reintegration takes place. Because the human mind aims to be one step ahead, it will attempt to balance this cost optimally. Therefore, pauses in the stimulus stream may induce a chunk to be processed even if it ends on low uncertainty (without fully exceeding working memory capacity). This may constitute one potential mechanism explaining why Gestalt-like principles of temporal proximity generally seem to apply to auditory sequence processing (Lerdahl & Jackendoff, 1983).
The relatively high working memory capacity required at phrase boundaries may explain previously observed phrase-final lengthening. Specifically, across various languages, musical instruments, and performance contexts, speakers and performers tend to slow down at phrase endings (speech: Wightman et al., 1992; music: Palmer, 1989; Repp, 1992). Although originally interpreted as a communicative gesture in music (Palmer, 1989), phrase-final lengthening is exhibited by piano players even when they attempt to play without expression (Penel & Drake, 1998). In addition to the observation that listeners are less prone to detect lengthening on boundary tones than on within-phrase tones (Repp, 1992), Penel and Drake (1998) hypothesized that perceptual biases contribute to group-final lengthening, although the source of this bias remained unspecified. One such source could be processing constraints due to uncertainty, which likely apply across domains of sequential perception and production.
Here, we specifically focused on modeling the uncertainty of a single feature, pitch, as a cue for phrase closure. Of course, the probabilistic characteristics of many other features (e.g., temporal, spectral, syntactic) might affect completeness perception. In music, these might include duration, intensity, interonset intervals, and performer gestures (Lerdahl & Jackendoff, 1983). Whether uncertainty in temporal features influences musical-phrase grouping remains to be tested. However, given that sensory systems prioritize anticipatory over reactive processing (Christiansen & Chater, 2016; Hutchinson & Barrett, 2019), it seems plausible that our findings should extend to the temporal domain. On the other hand, nonprobabilistic and non-pitch-related features may also constrain statistical learning, giving rise to the entropy effects found here, as observed in speech segmentation (Yang, 2004). Incorporating metrical structure and previously heard motifs and limiting the number of accented tones per phrase would, for example, most likely improve the predictive power of our entropy-based model. Future work should more directly contrast the effect of anticipatory versus adaptive cues and of probabilistic (top-down) versus Gestalt-related (bottom-up) cues to establish their relative contributions and investigate how those contributions may vary under different experimental conditions. Another concern is whether IDyOM accurately reflects listener expectations. Morgan et al. (2019) found that IDyOM predictions entailed higher entropy than that computed across several participants providing single-tone sung continuations to melodic contexts. Task constraints likely explain this discrepancy, because expectations for multiple continuations were not assessed. Furthermore, by manipulating entropy of upcoming events rather than simply analyzing the entropy of instantiated continuations, the present study differs crucially from Morgan et al.'s study.
Moreover, Morgan et al. recruited self-identified musicians, who make melodic predictions with demonstrably lower average entropy than nonmusicians (Hansen et al., 2016;Hansen & Pearce, 2014), whereas IDyOM was configured to model expectations of the general population. At the same time, Morgan et al. made an important contribution by demonstrating a greater contribution of statistical learning than of Gestalt-based principles in predicting listener expectations. This supports IDyOM's suitability in predicting auditory boundary perception.
The finding that uncertainty influences phrase-boundary perception suggests a pertinent role for training effects. Expertise effects may be particularly prominent in the musical domain, where skills and experience differ substantially between individuals. Although some studies suggest limited effects of musical expertise on melodic-segmentation processes (Palmer & Krumhansl, 1987; but see Hartmann et al., 2017), expertise levels have not always been widely sampled or manipulated systematically. The same limitation applies to the current study, in which no significant effects of expertise were seen (see Tables S2 and S3 in the Supplemental Material for details). Yet recent research shows that stylistic specialization results in expectations about melodic continuations that are generally lower in entropy whenever greater confidence is warranted (Hansen et al., 2016; Hansen & Pearce, 2014). The transformation of high-entropy predictions into low-entropy predictions with domain-relevant training or implicit exposure should allow musicians to perceive phrasal coherence across longer time spans. This would be consistent with observations that experts have access to more abstract and deeper levels of hierarchical structure (Chaffin & Imreh, 2002; Chi et al., 1981), which, in turn, may be associated with larger working memory capacity (Meinz & Hambrick, 2010). Although we are still awaiting sampling across more diverse expertise levels in future research, our results relating chunk size to underlying expectancy dynamics enable a novel interpretation of classical findings pertaining to expertise and working memory.
By offering an empirical challenge to the view that segmentation primarily relies on retrospective processes, the present work contributes to the emergence of an increasingly coherent model of the human mind as an eager predictive processor of sensory input. Embedded in the constant flux of time, the mind is continually forced to evaluate and recombine retrospective and prospective cues according to their immediate usefulness, and we hypothesize that sequential input in such varied domains as language, music, and visual action sequences is subject to the constraints arising from this mental machinery.

Notes
1. For more information about the IDyOM implementation and parameters, see the Supplemental Material.
2. The robustness of this analysis was confirmed by fitting linear mixed-effects models using the glmmTMB package in R (Brooks et al., 2017) with entropy, boundary status, entropy by boundary status, melody length, and trial index (within the experimental session) as fixed effects and random intercepts for participant and melody. These analyses showed a consistent, significant effect of entropy and are reported in greater detail in Section S3 in the Supplemental Material.

Transparency