When listening to speech sounds, listeners are able to exploit acoustic features that mark the boundaries between successive words, the so-called segmentation cues. These cues are typically investigated by directly manipulating features that are hypothetically related to segmentation. The current study uses a different approach based on reverse correlation, where the stimulus manipulations are based on minimal assumptions. The method was evaluated using pairs of phonemically identical sentences in French, whose prosody was changed by introducing random f0 trajectories and segment durations. Our results support a prominent perceptual role of the f0 rise and vowel duration at the beginning of content words.
1. Introduction
A substantial part of the speech perception literature is devoted to the identification of perceptual cues that are conveyed in speech sounds and decoded by the human auditory system. In particular, segmentation cues correspond to the acoustic features that mark the boundaries between successive words in a continuous sound stream. The present study reports an experimental method for assessing these cues, considering only minimal assumptions about their location.
Several sources of information are used by the human brain to find the boundaries between words. In particular, the phonetic and lexical contexts undoubtedly play a role in this process. In the extreme case, lexical competition models such as TRACE1 consider segmentation as a mere by-product of lexical access, where the selected set of boundaries is the one that forms a sequence of valid words. At the phonetic level, phonotactic rules impose constraints on the position of word boundaries in a sequence of phonemes. For example, /mr/ and /ls/ are not permissible sequences at the onset of words in Dutch.2 However, the lexical and phonetic levels are not sufficient to account for the ability of humans to segment speech. For instance, listeners are able to discriminate between phonemically identical sentences, such as “c'est l'amie” and “c'est la mie” in French3 or “known ocean” and “no notion” in English,4 showing that they can also exploit fine-grained acoustic cues available in the speech sounds.
The acoustic features correlated with the presence of word boundaries are language-specific and multidimensional. In French, they are mainly related to the fundamental frequency (f0) trajectories and the duration of syllables. Due to the multiple and partly redundant features conveyed in natural speech, researchers have mainly followed a hypothesis-driven approach to identify the segmentation cues, using carefully controlled stimuli where one single cue is varied at a time.
The study by Banel and Bacri5 is an example of such an approach. Separately recorded monosyllables were assembled together into ambiguous disyllabic stimuli like /ba.ɡaʒ/. The authors found that participants were more likely to hear one (“bagage”) or two words (“bas gage”) depending on whether the second or the first syllable was lengthened, respectively. This result suggests that listeners are sensitive to segmental duration and use this information to modulate the lexical interpretation of spoken French. However, the use of concatenated monosyllables does not allow us to dissociate the perceptual role of segmental duration from the role of other covarying cues. More recently, Shoemaker6 tackled this limitation by using stimuli where only the duration dimension was manipulated, while ensuring that all other stimulus features remained unchanged. She confirmed the previous finding that French listeners can exploit segmental duration as a cue for the interpretation of ambiguous phonemically identical sentences.
In parallel, other researchers have studied the role of the f0 trajectory in the segmentation of an input stream into words. Based on the study of pitch patterns from recorded utterances, sophisticated theories have been developed to explain the complex structure of f0 fluctuations in French.7 The role of these hypothetical cues during speech comprehension was then confirmed using perceptual experiments carried out with manipulated stimuli. In particular, Spinelli et al.8 tested the use of a word-initial (“early”) f0 rise as a segmentation cue by native French listeners. Based on ambiguous phoneme sequences arising from elision, such as /se.la.mi/ (“c'est la mie” or “c'est l'amie”), they generated new stimuli where the mean f0 on the /a/ segment was artificially modified. This variable was shown to influence the perceived segmentation of the sentence, with a higher f0 leading to more “c'est l'amie” answers. This result supported the view of the early f0 rise as a cue to a content word beginning. A similar finding had been obtained by Welby using nonsensical sentences,9 where the manipulation process was applied to the whole f0 contour of the stimuli. Consistent with the notion of early f0 rise, Welby demonstrated that listeners interpreted nonsense sequences like /me.la.mɔ̃.din/ as a single nonword (“mélamondine”) when the f0 rise occurred in the first syllable and as a determiner followed by a nonword (“mes lamondines”) when the rise occurred in the second syllable. However, that study also indicated that the early f0 rise was not a necessary cue and that a simple inflection or “early elbow” in the f0 contour was also interpreted as a marker for word beginning.
The above-mentioned studies employed a variety of psycholinguistic methods to investigate the role of different cues in the segmentation process, including cross-splicing,3–5 duration manipulation,6 and manipulation of the f0 information in one phonetic segment8 or in more complex f0 contours.9 An argument in favor of these manipulation-based experiments is that the speech features can be carefully controlled. However, their major disadvantage is that they require prior knowledge of the specific cues that can play a role in the segmentation process. Consequently, they cannot probe the large space of all possible cues available to the listeners, which limits the possibility of investigating potential interactions between cues.
In the present study, we introduce a new method for the investigation of segmentation cues in perception, based on the reverse-correlation (revcorr) paradigm. Contrary to the previously described experiments, the proposed method offers a comprehensive approach to explore the cues to word boundaries, where the test stimuli are processed with minimal assumptions. In this sense, the method is efficient, because it allows the investigation of several acoustic dimensions at the same time.
The revcorr approach consists in introducing random fluctuations into a stimulus and measuring how specific patterns of fluctuations affect recognition. In psycholinguistics, revcorr has been used to reveal the acoustic cues underlying phoneme identification10,11 or the prosodic cues to paralinguistic information, such as the intention and emotional state of the speaker.12,13 In the latter case, participants are asked to make sense of utterances resynthesized with a random prosody. For example, Ponsot et al.12 used a voice-processing algorithm to systematically randomize the prosody of existing speech recordings,14 asking their participants to judge the newly generated sounds along a predefined criterion of trustworthiness. The authors then associated the specific random prosody with the corresponding behavioral responses by computing the difference between the mean pitch contour of voices classified as trustworthy and non-trustworthy. This analysis returned a first-order kernel, interpreted as a “mental template,” that is able to capture which aspects of the random prosody critically affected the participants' responses.
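The difference-of-means computation described above can be illustrated with a minimal sketch. It is not the analysis code of any of the cited studies: the function name, the array layout, and the simulated listener (who relies only on the f0 shift at one time point) are assumptions made for illustration.

```python
import numpy as np

def first_order_kernel(perturbations, responses):
    """Mean perturbation for option-1 responses minus mean for option-2 responses.

    perturbations: (n_trials, n_points) random f0 shifts, e.g., in cents
    responses: (n_trials,) booleans, True when the listener chose option 1
    """
    perturbations = np.asarray(perturbations, dtype=float)
    responses = np.asarray(responses, dtype=bool)
    mean_option1 = perturbations[responses].mean(axis=0)
    mean_option2 = perturbations[~responses].mean(axis=0)
    return mean_option1 - mean_option2

# Hypothetical simulated listener: only the shift at index 3 (plus internal
# noise) drives the answer, so the kernel should peak at that index.
rng = np.random.default_rng(0)
shifts = rng.normal(0.0, 100.0, size=(2000, 7))
internal_noise = rng.normal(0.0, 100.0, size=2000)
answers = (shifts[:, 3] + internal_noise) > 0
kernel = first_order_kernel(shifts, answers)
```

In this simulation, the kernel is close to zero everywhere except at the time point that the simulated listener actually used, which is the sense in which a first-order kernel exposes the perceptually relevant part of the random prosody.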
The objective of the experiment reported here was to evaluate the efficiency of a prosody revcorr protocol in the study of segmentation cues. Using a segmentation task based on the work by Spinelli et al.,3 we measured the perceptual kernels of two pairs of phonemically identical sentences for a group of participants.
2. Materials and methods
The segmentation task was implemented as a listening experiment using the fastACI toolbox.15,16 In each condition, the test sounds were two phonemically identical sentences whose details are given next. Complementary details for replicating the experiment or reproducing any of the analyses presented in this study are given in the supplementary material.
2.1 Target sentences
Four ambiguous sentences from Spinelli et al.3 were used as targets in the present experiment. The sounds are recordings of the sentences “c'est l'amie/la mie” (“this is the friend/the crumb,” condition LAMI) or “c'est l'appel/la pelle” (“this is the call/the shovel,” condition LAPEL), all uttered by a female speaker with an average f0 of 210.8 Hz (LAMI) and 216.5 Hz (LAPEL). In each condition, the two test sentences were aligned temporally and set to have the same total duration and root-mean-square level. The temporal alignment was done manually to roughly match the phonetic segments of each sentence pair. The spectrograms of all sentences are shown in Figs. 1(A) and 1(B).
Fig. 1. Targets and mean results for the segmentation task in the LAMI (left panels) and LAPEL conditions (right panels). Top panels [(A) and (B)]: Spectrograms of test sentences. These spectrograms have a constant frequency (vertical) scaling in the equivalent-rectangular bandwidth (ERB) scale. Bottom panels [(C) and (D)]: f0 and time kernels averaged across all participants. Error bars correspond to ± 1.96 standard error of the mean (SEM), and red markers (and asterisks) indicate the weights that are significantly different from zero (significance threshold p < 0.005).
2.2 Stimuli preparation: Prosody resynthesis
The target sentences were divided into segments of 100 ms, irrespective of their phonetic content, as indicated by the vertical dashed lines in Fig. 1. In each trial, one of the two targets was randomly selected and resynthesized with a new prosody, using the WORLD toolbox.17 A random f0 vector, with one value per segment edge, was drawn from a Gaussian distribution with a mean f0 of 210.8 Hz (LAMI) or 216.5 Hz (LAPEL) and a standard deviation (SD) of 100 cents (one musical semitone), with the constraint that new f0 values should not deviate by more than 2.2 times the SD (i.e., 220 cents) from the corresponding mean f0. For each 100-ms-long segment, the original f0 trajectory of the sentence was replaced by a monotonic f0 transition between the two random f0 values of the corresponding segment edges. Simultaneously, the edge positions (excluding the first and last edges) were shifted by a random Gaussian value (zero mean and SD = 15 ms, bounded to ±2.2 SD), leading to segment durations with an SD of about 20 ms. These shifts were used by WORLD to resynthesize sentences with randomly compressed or stretched segments, which were eventually used in the segmentation experiment. This process was repeated 400 times for each sentence, resulting in speech sounds (800 per condition) with unreliable duration cues and entirely neutralized f0 cues, while keeping the natural intensity and formants of the original sentences. This choice of resynthesis parameters is in line with previous prosody revcorr studies.12,13
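The random draw described in this section can be sketched as follows. This is an illustration under stated assumptions, not the fastACI or WORLD implementation: the function names and the number of segment edges (`n_edges=8`) are hypothetical, and the bound at ±2.2 SD is implemented here by redrawing out-of-range values, whereas the text does not specify whether values were redrawn or clipped. The resynthesis step itself is not shown.

```python
import numpy as np

def truncated_normal(rng, sd, bound_sd, size):
    """Zero-mean Gaussian values, redrawn while they exceed +/- bound_sd * sd.

    Redrawing (rather than clipping) is an assumption of this sketch.
    """
    values = rng.normal(0.0, sd, size=size)
    out_of_range = np.abs(values) > bound_sd * sd
    while out_of_range.any():
        values[out_of_range] = rng.normal(0.0, sd, size=int(out_of_range.sum()))
        out_of_range = np.abs(values) > bound_sd * sd
    return values

def draw_random_prosody(rng, mean_f0_hz, n_edges, segment_s=0.1):
    """Random f0 values (Hz) and shifted edge times (s) for one trial."""
    # f0: SD of 100 cents around the mean f0, bounded to +/-220 cents
    f0_cents = truncated_normal(rng, sd=100.0, bound_sd=2.2, size=n_edges)
    f0_hz = mean_f0_hz * 2.0 ** (f0_cents / 1200.0)
    # Timing: inner edges shifted with SD = 15 ms, bounded to +/-33 ms;
    # the first and last edges stay in place
    edge_times = np.arange(n_edges) * segment_s
    shifts = np.zeros(n_edges)
    shifts[1:-1] = truncated_normal(rng, sd=0.015, bound_sd=2.2, size=n_edges - 2)
    return f0_hz, edge_times + shifts

rng = np.random.default_rng(1)
f0_hz, edge_times = draw_random_prosody(rng, mean_f0_hz=210.8, n_edges=8)
```

Because each inner segment is delimited by two independently shifted edges, its duration has an SD of roughly √2 × 15 ms ≈ 21 ms before truncation and slightly less after it, consistent with the value of about 20 ms given above. The shifted edges also remain in order, since two adjacent edges can approach each other by at most 66 ms.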
2.3 Participants and experimental protocol
Data were collected for two independent groups of 16 and 18 participants for the LAMI and LAPEL conditions, respectively. The participants were all students of the bachelor program in psychology from Grenoble Alpes University. All students had self-reported normal hearing and were rewarded with course credits for their participation in the experiment.
The listening experiment was implemented using a one-interval two-alternative forced choice paradigm. In each trial, the participant was instructed to indicate whether the presented stimulus was “l'aX” (“l'amie” or “l'appel,” option 1) or “la X” (“la mie” or “la pelle,” option 2). Feedback was provided after each trial. Each experiment consisted of two sessions of 400 trials, with a short training preceding the first experimental session. The stimuli were presented at a comfortable level, with a random roving—drawn from a uniform distribution between ±2.5 dB—to partly discourage the use of absolute loudness cues during the task.
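As an illustration of the level roving, the sketch below applies a random gain drawn from a uniform distribution between −2.5 and +2.5 dB. The function name and the test tone are hypothetical, and the calibration of the “comfortable level” is not represented.

```python
import numpy as np

def apply_level_roving(signal, rng, rove_db=2.5):
    """Scale a signal by a random gain drawn uniformly in [-rove_db, +rove_db] dB."""
    gain_db = rng.uniform(-rove_db, rove_db)
    return signal * 10.0 ** (gain_db / 20.0), gain_db

rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 220 * np.arange(4410) / 44100)  # 100-ms test tone
roved, gain_db = apply_level_roving(tone, rng)
```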
2.4 Analysis
3. Results and discussion
3.1 Psychophysical kernels
The derived f0 and time kernels averaged across all participants (N = 16 for LAMI, N = 18 for LAPEL) are shown in Figs. 1(C) and 1(D) (left for LAMI, right for LAPEL), respectively. The kernels provide a visualization of the listening strategy of the participants, i.e., they emphasize the prosody aspects they relied on during the task. For instance, the above-mentioned effect of the f0 shift on /a/ is captured by the GLM, which attributes a significant positive weight to the f0 information on the 0.5-s segment edge in LAMI and on the 0.4-s segment edge in LAPEL [Fig. 1(C)].
In the LAMI condition, the f0 kernel [Fig. 1(C), left] reveals three critical regions that are significantly different from zero at the group level: a positive weight at 0.5 s, surrounded by negative weights at 0.3 and 0.6 s. This indicates that when presented with a high f0 at 0.5 s and/or low f0 at 0.3 and 0.6 s, the participants were more likely to perceive the stimulus as “c'est l'amie.” On the contrary, the opposite patterns of f0 elicit more “c'est la mie” responses. In the time kernel [Fig. 1(D), left], the weights associated with the segment edges at 0.3 and 0.6 s were found to be significant.
In the LAPEL condition, the f0 kernel [Fig. 1(C), right] was found to have only one critical region, with a positive weight at 0.4 s. This indicates that when presented with a high f0 at 0.4 s, the participants were more likely to perceive the stimulus as “c'est l'appel.” In the time kernel [Fig. 1(D), right], the weight associated with the segment edge at 0.5 s was found to significantly bias the participants' response toward “la pelle.”
An indication of the kernels' goodness of fit can be obtained from Eq. (1) and the average kernels from Fig. 1, resulting in response predictions significantly above chance for both conditions with respect to the participants' data [55.7% (SD = 4.8%) for LAMI; 54.5% (SD = 2.2%) for LAPEL] (further details can be found in the supplementary material).
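To make the idea of kernel-based response prediction concrete, the sketch below fits a logistic GLM to simulated trials and computes the proportion of responses that the fitted weights predict correctly. It is not the model of Eq. (1): the logistic link, the gradient-ascent fit, the in-sample evaluation, and the simulated weights are all assumptions of this example.

```python
import numpy as np

def fit_logistic_glm(X, y, n_iter=500, lr=0.5):
    """Fit P(option 1) = sigmoid(b + X @ w) by gradient ascent on the mean log-likelihood."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(b + X @ w)))
        w += lr * (X.T @ (y - p)) / len(y)
        b += lr * np.mean(y - p)
    return w, b

def prediction_accuracy(X, y, w, b):
    """Proportion of trials where the sign of the linear predictor matches the response."""
    predicted = (b + np.asarray(X, dtype=float) @ w) > 0
    return float(np.mean(predicted == np.asarray(y, dtype=bool)))

# Hypothetical data: 800 trials, 7 standardized f0 perturbations per trial
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 7))
true_w = np.array([0.0, 0.0, 0.0, -0.5, 0.0, 1.0, -0.5])
y = rng.random(800) < 1.0 / (1.0 + np.exp(-(X @ true_w)))
w, b = fit_logistic_glm(X, y)
accuracy = prediction_accuracy(X, y, w, b)
```

The simulated listener in this sketch is much more consistent than the participants of the experiment, so the resulting accuracy is not comparable with the percentages reported above.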
3.2 Interpretation of the significant weights
The positive f0 weight in the two test conditions [Fig. 1(C), edges at 0.5 and 0.4 s for LAMI and LAPEL, respectively] coincides with the segment containing the vowel /a/. As also shown in Fig. 2, this positive weight indicates that a higher f0 in this segment is more likely related to a “l'aX” response, supporting the notion of early f0 rise cue for segmentation.3,9 The significant weight at time 0.6 s of the LAMI f0 kernel [Fig. 1(C), not visible for LAPEL] can be similarly interpreted in terms of the early f0 rise. This segment roughly corresponds to the /i/ vowel in the targets. The negative f0 weight indicates that a downward f0 shift elicits more “l'amie” responses or, in other words, that an upward f0 shift elicits more “la mie” responses. Consequently, an /i/ vowel with a high f0 supports the segmentation at the syllable that contains it, resulting in the consonant-initial word “mie.”
Fig. 2. Proportion of “l'amie” (panel (A)) and “l'appel” (panel (B)) responses, displayed as a function of the random f0 shift at t = 0.5 s (A) and t = 0.4 s (B), corresponding to the temporal position of the vowel /a/. Error bars correspond to ±1.96 SEM, comparable to 95% confidence intervals under the assumption of normality. The dotted line indicates the absence of bias, where “l'aX” and “la X” were equally chosen by the participants. The shaded area corresponds to the Gaussian distribution from which the f0 shifts were drawn (no unit).
The significant f0 weight at time 0.3 s [Fig. 1(C), /e/ vowel in the LAMI targets, not visible for LAPEL] is harder to interpret. This weight may correspond to a form of reference point for the assessment of the f0 height on the subsequent /a/ vowel. The explanation could be related to the “early elbow,”9 where the presence of an inflection in the f0 contour is equivalent, from an intonational perspective, to an early f0 rise cue. In our case, a heightened f0 on /e/ creates an artificial elbow in the f0 contour, which in turn influences the segmentation process.
For the time kernels [Fig. 1(D)], the significant weights are located at times 0.3 and 0.6 s for LAMI and at time 0.5 s for LAPEL, indicating that participants relied on the relative durations of the first and second syllables. This is in line with the durational differences observed in the production of “l'aX” and “la X” sentences, where the first syllable (/la/) was found to be shorter in “la X” than in “l'aX,” while the reverse was true for the second syllable.3 In our experiment, a shift of the 0.6 s (or 0.5 s) segment edge affects the perceived duration of the /mi/ (or /pel/) syllable and, hence, the segmentation process. Also, in line with Spinelli et al.'s observations, the positive weight at time 0.3 s (only significant for LAMI) indicates that longer /la/ segments (and, thus, shorter /e/ segments) are associated with more “l'amie” responses. More generally, the measured time kernels support the conclusions that listeners use segmental durations near word boundaries as a cue for segmentation.5,6
3.3 Performance in the task
The prosody resynthesis method (Sec. 2.2) was designed to neutralize the prosody cues of the original target sentences, reducing the performance in the segmentation task from an expected score of 75% (Spinelli et al.,8 ambiguous condition) to an average percentage of correct responses close to chance [57.3% (SD = 6.7%) for LAMI; 49.3% (SD = 4.3%) for LAPEL], despite the presence of intensity and formant cues that remained available in the stimuli. Nevertheless, this performance does not imply that the participants were responding at random. Given the strong f0 weighting shown in Fig. 1 for /a/ segments at 0.5 s (LAMI) and 0.4 s (LAPEL), we depict the percentage of “l'aX” responses as a function of the random f0 shift in those segments, irrespective of the fluctuations in the other segments. The monotonically increasing proportions (blue traces in Fig. 2) indicate that shifts closer to –200 or 200 cents systematically biased the participants' answers toward “la X” or “l'aX,” respectively. These curves suggest that the segmentation judgments were driven by the random prosody imposed on the speech target, and not by the target itself. The fact that a null f0 shift led to an overall score of 50% indicates that there was no bias in favor of one response or the other.
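The response proportions as a function of the f0 shift in a single segment can be obtained by binning the trials, as in the sketch below. The function name, the bin edges, and the simulated responses are hypothetical, and the error bars are binomial SEMs computed on pooled trials, which may differ from the across-participant SEMs shown in Fig. 2.

```python
import numpy as np

def response_proportion_by_shift(shift_cents, chose_option1, bin_edges):
    """Proportion of option-1 responses and 1.96 * SEM within each f0-shift bin."""
    shift_cents = np.asarray(shift_cents, dtype=float)
    chose_option1 = np.asarray(chose_option1, dtype=float)
    # Bin k holds shifts with bin_edges[k] <= shift < bin_edges[k + 1]
    bin_index = np.digitize(shift_cents, bin_edges) - 1
    proportions, half_widths = [], []
    for k in range(len(bin_edges) - 1):
        in_bin = chose_option1[bin_index == k]
        if in_bin.size == 0:
            proportions.append(np.nan)
            half_widths.append(np.nan)
            continue
        p = in_bin.mean()
        proportions.append(p)
        half_widths.append(1.96 * np.sqrt(p * (1.0 - p) / in_bin.size))
    return np.array(proportions), np.array(half_widths)

# Hypothetical data: the probability of an "l'aX" answer grows with the f0 shift
rng = np.random.default_rng(0)
shifts = np.clip(rng.normal(0.0, 100.0, size=4000), -219.0, 219.0)
answers = rng.random(4000) < 1.0 / (1.0 + np.exp(-shifts / 60.0))
edges = np.array([-220.0, -110.0, 0.0, 110.0, 220.0])
proportions, half_widths = response_proportion_by_shift(shifts, answers, edges)
```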
The finding that raising the f0 in the /a/ vowel leads to more vowel-initial segmentation [with the proportion of “l'aX” responses increasing with the f0 shift in both panels of Fig. 2] is consistent with the results obtained by Spinelli et al.8 (see their Fig. 4). This observation is usually interpreted as evidence for the role of the early f0 rise in French, where the presence of a heightened f0 on the first syllable of a content word is used as a perceptual cue by the listener to segment this word.8,9
4. Conclusion and outlook
The revcorr approach allows us to test different prosody dimensions (f0 and duration cues) at the same time. Although the stimulus preparation was based on a resynthesis algorithm, the set of parameters used in the algorithm did not require a priori assumptions about the location of the cues under investigation, contrary to more traditional hypothesis-driven approaches.
Our results show that first-order kernels can be derived for a segmentation experiment using a prosodic revcorr approach. The estimated weights can be interpreted by relating the arbitrary time positions of the segment edges to the phonetic content of the target speech sounds. To show that our observations did not critically depend on the specific set of target sentences or the adopted resynthesis parameters, we tested three additional control conditions in a reduced group of participants. All details of these extra conditions are presented in the supplementary material. These tests showed that (1) our conclusions are not specific to the selected target sentences or the arbitrary segment edge positions (supplemental Fig. 1), and (2) the target-specific kernels within each tested sentence pair are very similar (supplemental Fig. 2), indicating that potential subtle acoustic differences between contrasting sentences did not strongly influence our main results.
It is important to note that, although the kernels are presented as contours, the prosody revcorr method assesses the effect of f0 and timing at each segment edge independent of the other segment edges. However, the auditory system does not process information from different time points independently. For this reason, we should refrain from interpreting kernels as “prosodic prototypes” in general. Nevertheless, as a means of confirmation, we resynthesized the targets by employing the f0 and time kernels as prosodic contours. As expected, this resynthesis indeed led to highly confusing stimuli.
In general, the prosodic revcorr method presented in this study showed high replicability across sentences (Fig. 1; supplemental Fig. 1), across targets (supplemental Fig. 2), and across segment edge positions (Fig. 1, left; supplemental Fig. 1, right). Despite the modest sample size in this study, our results suggest that the estimated kernels truly reflect a general “segmentation strategy” of the listeners in the task. Consequently, the prosody revcorr approach seems to be well suited for the investigation of psycholinguistic processes. Compared to previous methods that rely on the manipulation of a single cue, the prosody revcorr method requires minimal assumptions about the potential cues that participants could use. Probing the large space of all possible cues available to the listeners is less time-consuming and can also allow the investigation of potential interactions between cues.
SUPPLEMENTARY MATERIAL
See supplementary material at https://doi.org/10.1121/10.0021022 for the evaluation of the prosody revcorr paradigm in control conditions using additional pairs of sentences.
Acknowledgments
This study was supported by the French National Research Agency through the grants fastACI (Grant No. ANR-20-CE28-0004; A.O., L.V.), FrontCog (Grant No. ANR-17-EURE-0017; A.O., L.V.), CeLyA (Grant No. ANR-10-LABX-0060; E.G.), and IDEXLYON (Grant No. 16-IDEX-0005; E.G.).
AUTHOR DECLARATIONS
Conflict of interest
The authors declare they have no conflicts of interest.