The speech signal is inherently variable, and listeners need to recalibrate when local, short-term distributions of acoustic dimensions deviate from long-term representations. The present experiment investigated the specificity of this perceptual adjustment, addressing whether the perceptual system can track different simultaneous short-term acoustic distributions of the same speech categories, conditioned by context. The results indicated that, instead of aggregating over the contextual variation, listeners tracked separate distributional statistics for instances of speech categories experienced in different phonetic/lexical contexts, suggesting that perceptual learning is influenced not only by distributional statistics but also by external factors such as contextual information.
1. Introduction
Speech perception shows remarkable sensitivity and rapid adaptation to changes in the short-term statistics of the speech signal. Recent research demonstrates that the relationship between experienced acoustic dimensions and phonetic categories is influenced by the evolving distribution of acoustic regularities [e.g., Idemaru and Holt (2011)], as well as factors like lexical information [e.g., Norris et al. (2003)], talker information [e.g., Theodore and Miller (2010)], visual information from the speaker's lip movements [e.g., Bertelson et al. (2003)], and the degree of ambiguity in the input [e.g., Babel et al. (2019)]. The adaptive nature of phonetic perception is likely what aids listeners in accommodating the systematic variability of the acoustic signal [e.g., Liberman et al. (1967)] and achieving stable and consistent phonetic recognition.
Our prior work focused on the extent to which the relationship between acoustic dimensions and phonetic categories is influenced by changing distributions of acoustic regularities of English stop voicing categories [e.g., [p] versus [b], Idemaru and Holt (2011, 2014, 2020)]. Typical English voiceless stops in onset position show a positive correlation between voice onset time (VOT) and the fundamental frequency (F0) of the following vowel (Abramson and Lisker, 1985). This is reflected in perception such that, while VOT exerts a robust influence as the primary cue, F0 functions as a secondary cue to the distinction (Whalen et al., 1993): at a given VOT, listeners are more likely to hear a stop as voiceless when F0 is higher. The speech materials in our work manipulated the way acoustic dimensions (VOT and F0) were mapped to phonetic categories. Whereas typical English stops show a positive correlation between VOT and F0 (called "canonical correlation"), we also introduced stops with a negative correlation (called "reversed correlation"): voiceless stops with longer VOTs have lower F0, and voiced stops with shorter VOTs have higher F0, on the following vowel. This type of deviation in the cue correlation is designed to resemble what listeners might encounter when hearing an unfamiliar L2 accent, for example. We have demonstrated that short-term exposure to this reversed F0/VOT relationship in words like beer, pier, deer, and tear causes listeners to downweight the role of the F0 dimension, diminishing its influence on perception [e.g., Idemaru and Holt (2011)].
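To make the manipulation concrete, the following minimal R sketch (with illustrative VOT and F0 values, not the actual stimulus set) shows how the sign of the F0/VOT correlation differs between canonical and reversed materials:

```r
# Hypothetical illustration of canonical vs reversed F0/VOT correlations.
# Short VOTs signal voiced stops; long VOTs signal voiceless stops.
vot <- c(0, 10, 40, 50)                       # ms (values are illustrative)
f0_canonical <- ifelse(vot >= 40, 250, 180)   # Hz; high F0 with long VOT
f0_reversed  <- ifelse(vot >= 40, 180, 250)   # Hz; low F0 with long VOT
cor(vot, f0_canonical)   # positive, as in typical English productions
cor(vot, f0_reversed)    # negative, the short-term deviation used in exposure
```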
Our prior research also showed considerable specificity in this perceptual learning. Learning to downweight the F0 dimension through exposure to the reversed correlation in stops at one place of articulation (e.g., [b] and [p]) did not generalize to stops at another place [i.e., [d] and [t]] (Idemaru and Holt, 2014). Listeners can even maintain distinct patterns of F0 weighting simultaneously across [b]/[p] categorization and [d]/[t] categorization (i.e., robust reliance in one and downweighting in the other) when exposed to distinct F0/VOT correlation statistics across these stop place categories (Idemaru and Holt, 2014). Furthermore, we have recently found that this learning does not robustly generalize from one pair of [b] and [p] words to another: learning to downweight F0 with reversed-correlation beer and pier generalized to bear and pear, but the extent of generalization was considerably reduced, even though the generalization tokens possessed identical VOT and F0 values and were spoken by the same talker (Idemaru and Holt, 2020). If dimension-based statistical learning operated fully at the level of speech categories (e.g., [b] and [p]), the learning should have generalized robustly from one instance to the other.
The fact that generalization is partial from one pair of [b] and [p] words to another indicates that factors outside of speech categories influence this learning. There is evidence that such factors include talker-related information. For example, Zhang and Holt (2018) showed that listeners can simultaneously maintain separate patterns of F0 weighting when distinct F0/VOT correlation statistics are paired with different voices or with silent videos depicting different talkers. These results suggest that the parsing of separate distributional statistics can be cued by information related to talker identity. However, the finding that exposure to one [b] did not completely generalize to another [b] spoken by the same talker (Idemaru and Holt, 2020) suggests that separate parsing and tracking of cue statistics may occur even within a single talker. The current study puts this possibility to the test by exposing listeners to distinct F0/VOT correlation statistics in [p] and [b] across two phonetic/lexical contexts (i.e., the [−ɛəɹ] context, bear and pear, and the [−ɪəɹ] context, beer and pier), with identical VOT and F0 values, spoken by the same talker. Within a single block, listeners experienced the canonical F0/VOT relationship in [b] and [p] in one pair of words (e.g., bear and pear) and the reversed relationship in [b] and [p] in the other pair (e.g., beer and pier). Across blocks, this relationship then reversed, such that the F0/VOT correlations were conditioned by word in the opposite way. If listeners track F0/VOT statistics separately depending on the phonetic/lexical context in which the speech categories appear, reliance on F0 across blocks should shift in opposing directions for the two pairs of words. If, however, listeners track correlation statistics more generally for [b] and [p] at the category level, the opposing statistics should cancel each other out, resulting in no observed relationship between F0 and VOT and no modulation of F0 weighting by context or block.
2. Method
Please see the supplementary material for further details of stimulus creation, number of experimental trials, task instructions, data, and analysis models.1
2.1 Participants
Forty native English listeners (24 female; mean age = 19.7 years) with normal hearing participated. All were university students who received partial course credit for participation. Participants were randomly assigned to the Bear-CRC/Beer-RCR condition (n = 20) or the Bear-RCR/Beer-CRC condition (n = 20) in the adaptation task. See Sec. 2.4 below for a description of the conditions.
2.2 Stimuli
Stimuli from Idemaru and Holt (2020) were used in this experiment. The bear-pear and beer-pier stimuli with VOT values of −5, 0, 5, 10, 15, 25, 35, 40, 45, 50, and 55 ms were selected. The onset F0 of the stimuli varied from 170 to 190 Hz (low F0s) and from 240 to 260 Hz (high F0s), in three 10-Hz steps each. From its onset value, F0 fell to 150 Hz by the end of the word. All stimuli were scaled to 75 dB.
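For concreteness, a short R sketch (values taken from the text; the full crossing shown here is illustrative, not a claim that every combination appeared in every task) enumerates this stimulus space:

```r
# Stimulus space described above: 2 word pairs x 11 VOT steps x 6 onset F0s.
vot_ms <- c(-5, 0, 5, 10, 15, 25, 35, 40, 45, 50, 55)
f0_onset_hz <- c(170, 180, 190,   # low F0s
                 240, 250, 260)   # high F0s
stim <- expand.grid(pair = c("bear-pear", "beer-pier"),
                    vot = vot_ms, f0 = f0_onset_hz)
nrow(stim)  # 2 x 11 x 6 = 132 stimulus cells
```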
2.3 Baseline task
Prior to the adaptation task, listeners categorized bear-pear and beer-pier continua to measure the baseline perceptual weight of F0 in voicing categorization. Stimuli in the baseline task varied along VOT in seven steps (−5, 5, 15, 25, 35, 45, and 55 ms) and along F0 at two levels (180 and 250 Hz). These stimuli were presented in random order 10 times each, except that, due to a programming error, the middle 25-ms VOT tokens were presented 20 times for both bear-pear and beer-pier. The stimuli were blocked by word pair (bear-pear vs beer-pier), with block order counterbalanced across participants. The programming error increased the frequency of the ambiguous part of the VOT distribution, which could potentially enhance the influence of F0. However, we proceeded with the analysis for two reasons. First, the error affected both word pairs; thus, any influence of the additional ambiguous VOT tokens applied to both bear-pear and beer-pier categorization, the comparison of which was the general focus of this study. Second, the results of the baseline task were not part of the critical observation of this study, which was the modulation of the effect of F0 across Canonical and Reverse blocks for the two word pairs presenting competing cue statistics in the adaptation task.
Seated in front of a computer monitor in a sound-attenuated room, participants heard a spoken word presented diotically over headphones (Beyer DT-150) and saw a simultaneous display of response choices (visual icons of a bear and a pear, or of a beer and a pier) on the monitor. The experiment was controlled by E-Prime experiment software (Psychology Software Tools, 2012).
2.4 Adaptation task
Immediately following the baseline task, the adaptation task exposed listeners to stimuli in canonical and reversed F0/VOT correlation blocks and monitored their reliance on F0 in categorizing test stimuli, embedded in each block, that had an ambiguous VOT value. As shown in Fig. 1(a), exposure stimuli (gray cells) always had perceptually unambiguous VOT values signaling the voicing categories. However, the relationship between the VOT and F0 dimensions changed across the three experiment blocks, exposing listeners to short-term deviations in the F0/VOT correlation in the reverse correlation block [Fig. 1(b)].
Fig. 1. (a) F0/VOT correlation in stimuli across Canonical and Reverse blocks. Gray cells are exposure stimuli and black cells are test stimuli. A single asterisk indicates the intended stimuli, and double asterisks indicate the stimuli actually used, due to a programming error. (b) F0/VOT correlation patterns across the two experimental conditions.
Due to a programming error, the 50-ms VOT stimuli for the canonical block and the 10-ms VOT stimuli for the reversed blocks had 240-Hz F0 [marked with double asterisks in Fig. 1(a)] instead of the intended 250-Hz F0 [marked with a single asterisk in Fig. 1(a)]. After consideration of this error, we chose to proceed with the analysis and interpretation of the results for several reasons. First, inspection of responses to exposure stimuli indicated that responses to the stimuli in question [240 Hz rather than 250 Hz, marked with double asterisks in Fig. 1(a)] did not differ noticeably from responses to other high-F0 stimuli. Second, any influence of this error would have undermined the signal strength of the high-F0 stimuli, since these stimuli averaged 248 Hz instead of the intended 250 Hz. This lowered signal strength would, in turn, have reduced changes in the influence of F0 across Canonical and Reverse blocks. Nevertheless, we obtained reliable modulation of F0 influence across blocks (see Sec. 3.2).
Critically for this experiment, listeners heard opposing F0/VOT statistics for bear-pear and beer-pier stimuli within each of the three exposure blocks [Fig. 1(b)]. The pattern of F0/VOT correlation across word pairs and blocks was counterbalanced in two conditions, Bear-CRC/Beer-RCR and Bear-RCR/Beer-CRC. In each condition, 20 unique exposure stimuli with perceptually unambiguous VOTs were presented 10 times per block in random order. To assess listeners' sensitivity to changes in the F0/VOT correlation, test stimuli with perceptually ambiguous VOT values [black cells in Fig. 1(a)] were interspersed randomly among the exposure stimuli throughout the experiment. The VOT-neutral test stimuli for each word pair, at each of the two F0 values (180 and 250 Hz), were presented 10 times per block.
This design resulted in a total of 600 exposure trials and 120 test trials, consistent with our prior studies (Idemaru and Holt, 2011, 2014, 2020). The test stimuli were not differentiated from the exposure stimuli by task or instructions, the blocks were not described to participants, and the experiment was not visibly divided into separate blocks. Thus, 720 identical-looking trials proceeded continuously as listeners performed the word recognition task. The apparatus and procedure were identical to those of the baseline task, except that the visual icons presented on the monitor as response choices included all four pictures: a bear, a pear, a beer, and a pier.
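The trial totals follow directly from the design; a quick R check (counts taken from the text above) confirms the arithmetic:

```r
# Trial counts for the adaptation task, as described above.
exposure_per_block <- 20 * 10      # 20 unique exposure stimuli x 10 repetitions
test_per_block     <- 2 * 2 * 10   # 2 word pairs x 2 F0 levels x 10 repetitions
n_blocks <- 3
exposure_per_block * n_blocks      # 600 exposure trials
test_per_block * n_blocks          # 120 test trials
(exposure_per_block + test_per_block) * n_blocks  # 720 trials in total
```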
3. Analysis and results
All analyses presented in this study were performed using mixed-effects logistic regression as implemented in the lme4 package (Bates et al., 2015) in the R environment (R Core Team, 2014). The data and analysis code are available online (Idemaru, 2020).
3.1 Baseline task
The full model analyzing listeners' responses in the baseline task included VOT (continuous, centered), F0 (categorical, sum-coded with low F0, 180 Hz, as the reference), phonetic/lexical context (called "Context"; categorical, sum-coded with [−ɛəɹ] as the reference), counterbalance condition (called "Condition"; categorical, sum-coded with Bear-CRC/Beer-RCR as the reference), and their interactions as fixed effects. We expected effects of VOT and F0 on voicing perception, and we included Context and Condition to test the effects of VOT and F0 across words and conditions. The model included a random intercept for Listener and random slopes for VOT, F0, Context, and their interactions over Listener. Intercepts and random slopes were uncorrelated to aid model convergence. As this model indicated a significant four-way interaction (VOT × F0 × Context × Condition: β = −0.01, SE = 0.01, z = −2.63, p = 0.01), we ran models separately for the [−ɛəɹ] and [−ɪəɹ] contexts. Findings are illustrated in Fig. 2(a).
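For readers who wish to reconstruct the analysis, the baseline model corresponds to a glmer() call along the following lines. This is a sketch under our reading of the text, with hypothetical variable names; the authors' actual code is available online (Idemaru, 2020).

```r
library(lme4)
# Sketch of the baseline model; resp_p codes a voiceless ([p]-type) response.
# The double-bar syntax requests uncorrelated intercepts and slopes,
# matching the note that random effects were uncorrelated.
m_baseline <- glmer(
  resp_p ~ VOT * F0 * Context * Condition +
    (1 + VOT * F0 * Context || Listener),
  data = baseline_data,
  family = binomial
)
summary(m_baseline)
```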
Fig. 2. (a) Categorization of bear-pear (top) and beer-pier (bottom) continua by listeners in the Bear-CRC/Beer-RCR (left) and Bear-RCR/Beer-CRC (right) conditions in the baseline task. (b) Logit of [p] responses to VOT-ambiguous bear/pear (top) and beer/pier (bottom) test stimuli across three experimental blocks for the Bear-CRC/Beer-RCR (left) and Bear-RCR/Beer-CRC (right) conditions. Error bars indicate the 95% confidence interval of the mean.
The results for the [−ɛəɹ] context indicated that both VOT and F0 influenced voicing categorization (VOT: β = 0.21, SE = 0.03, z = 6.99, p < 0.001; F0: β = 0.87, SE = 0.19, z = 4.63, p < 0.001). The analysis for the [−ɪəɹ] context likewise indicated that both VOT and F0 influenced voicing categorization (VOT: β = 0.16, SE = 0.01, z = 0.52, p < 0.001; F0: β = 2.20, SE = 0.20, z = 10.73, p < 0.001), and the three-way interaction was significant (VOT × F0 × Condition: β = −0.03, SE = 0.01, z = −2.23, p = 0.03), indicating a greater F0 effect at the voiced end of the VOT continuum for Bear-RCR/Beer-CRC compared to Bear-CRC/Beer-RCR. This was likely due to the unexpected bump in voiceless responses to the high-F0, 0-ms VOT stimuli [Fig. 2(a)]. Visual inspection of these two stimuli indicated that their bursts were shorter than in the other stimuli. In spite of this problem, the overall results of the baseline task demonstrated that listeners in both groups showed an influence of F0, prior to the adaptation task, in categorizing stimuli with 25-ms VOT, the critical ambiguous VOT value used in the adaptation task.
Although the more frequent presentation of the ambiguous VOT value (20 times) compared to the other VOT values (10 times each) may have had some influence on the results, it is highly unlikely to be the sole explanation for the influence of F0 on categorization at this VOT value, given the robust effect of F0 on voicing categorization reported previously (Idemaru and Holt, 2011, 2014, 2020). Thus, as the issues with the stimuli do not have detrimental consequences for the interpretation of the overall results of the baseline or adaptation tasks, we proceeded with the study.
3.2 Adaptation task
Listeners' responses to the VOT-neutral test stimuli during exposure to the changing F0/VOT correlation in the adaptation task are illustrated in Fig. 2(b). Because the block design differed across the two conditions, separate models were run for each condition. The model for this analysis included F0 (categorical, sum-coded with low F0 as the reference), Block (categorical, treatment-coded with the second block as the reference), Context (categorical, sum-coded with [−ɛəɹ] as the reference), and their interactions as fixed effects. The model included a random intercept for Listener. Random slopes for F0, Block, and Context and their interactions were examined, and those with variances close to zero, which caused overfitting, were removed. Random effects were also uncorrelated to avoid overfitting. A significant F0 × Block interaction in this analysis would indicate a change in reliance on the F0 dimension across blocks, evidence of dimension-based statistical learning.
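A hedged sketch of the per-condition model follows; variable names are hypothetical, and the reduced random-effects structure shown is one plausible end point of the pruning described above, not necessarily the authors' final specification.

```r
# Block is treatment-coded with Block 2 (the middle block) as the reference.
test_data$Block <- relevel(factor(test_data$Block), ref = "2")

# Sketch of the adaptation model for one condition; slopes whose variance
# was near zero have been pruned, leaving an uncorrelated intercept and
# F0 slope by Listener.
m_adapt <- glmer(
  resp_p ~ F0 * Block * Context +
    (1 | Listener) + (0 + F0 | Listener),
  data = subset(test_data, Condition == "Bear-CRC/Beer-RCR"),
  family = binomial
)
summary(m_adapt)
```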
For Bear-CRC/Beer-RCR, given a significant three-way interaction (F0 × Block1 × Context: β = −2.52, SE = 0.51, z = −4.97, p < 0.001; F0 × Block3 × Context: β = −3.20, SE = 0.51, z = −6.33, p < 0.001), we examined responses to the [−ɛəɹ] and [−ɪəɹ] contexts separately to determine whether learning occurred in each context. The F0 × Block interaction was significant in both the [−ɛəɹ] and [−ɪəɹ] contexts ([−ɛəɹ]: F0 × Block1: β = 0.93, SE = 0.33, z = 2.79, p < 0.01; F0 × Block3: β = 1.17, SE = 0.33, z = 3.35, p < 0.001; [−ɪəɹ]: F0 × Block1: β = −1.61, SE = 0.39, z = −4.15, p < 0.001; F0 × Block3: β = −2.03, SE = 0.38, z = −5.28, p < 0.001). These results indicate that F0 was downweighted from Block 1 (Canonical) to Block 2 (Reversed) and then upweighted from Block 2 to Block 3 (Canonical) for [b/p] categorization in the [−ɛəɹ] context, while at the same time F0 was upweighted from Block 1 (Reversed) to Block 2 (Canonical) and downweighted from Block 2 to Block 3 (Reversed) for [b/p] categorization in the [−ɪəɹ] context. The results for Bear-RCR/Beer-CRC also confirmed separate tracking of stop voicing statistics per context. Given a significant three-way interaction (F0 × Block1 × Context: β = 2.20, SE = 0.49, z = 4.73, p < 0.001; F0 × Block3 × Context: β = 2.33, SE = 0.49, z = 4.73, p < 0.001), separate analyses examined responses in the [−ɛəɹ] and [−ɪəɹ] contexts. The F0 × Block interaction was significant in both contexts ([−ɛəɹ]: F0 × Block1: β = −0.89, SE = 0.35, z = −2.56, p = 0.011; F0 × Block3: β = −1.35, SE = 0.35, z = −3.90, p < 0.001; [−ɪəɹ]: F0 × Block1: β = 1.28, SE = 0.39, z = 3.32, p < 0.01; F0 × Block3: β = 1.00, SE = 0.35, z = 2.88, p < 0.01). Listeners in both the Bear-CRC/Beer-RCR and Bear-RCR/Beer-CRC conditions thus showed separate tracking of F0/VOT cue correlations for the same sounds, [b/p], across the two phonetic/lexical contexts, [−ɛəɹ] and [−ɪəɹ].
4. Discussion
Speech categorization adjusts dynamically in response to short-term deviations from the norm experienced temporarily in the speech signal and its context. Various sources of information can guide such perceptual adjustment, including lexical information [e.g., Norris et al. (2003)], visual information [e.g., Bertelson et al. (2003)], and distributional information in the acoustics [e.g., Idemaru and Holt (2011)]. Whereas some studies have reported generalizability of the perceptual learning they investigated (Eisner and McQueen, 2005; Kraljic and Samuel, 2005, 2006, 2007; Theodore and Miller, 2010), others have reported considerable specificity or a lack of generalization (Maye and Gerken, 2001; Reinisch et al., 2014; Idemaru and Holt, 2014, 2020).
The current study put listeners to a stringent test of perceptual learning in which the distributional statistics of VOT and F0 for the voicing contrast, [b] versus [p], showed opposing patterns across the phonetic/lexical contexts [−ɛəɹ] and [−ɪəɹ] simultaneously in the short-term input. The results indicated that listeners downweighted F0 in categorizing [b] and [p] in one context while simultaneously upweighting F0 in categorizing [b] and [p] in the other, mirroring the distinct distributional statistics of the sounds associated with each context. Instead of tracking cue correlations at the level of voicing, aggregating distributional statistics globally across contexts, listeners showed remarkable perceptual sensitivity by tracking separate statistics across two instances of [b/p] spoken by the same talker. This means that the distribution of the critical dimensions (i.e., VOT and F0) in the input signal is not the only factor that influences dimension-based statistical learning: external factors such as contextual information also play a role. Indeed, this context-specificity accords with findings in other domains of perception, described further below.
One mechanism that may enable the specific application of perceptual learning is that various contextual factors serve to parse the incoming signal. Zhang and Holt (2018) showed that information related to talker identity (i.e., voice quality and video of faces) can support such parsing. The current results show that a signal-internal factor such as phonetic/lexical context (i.e., [−ɛəɹ] and [−ɪəɹ]) can also serve to parse local statistical distributions (though see the discussion below). Taken together, these results suggest that the influence of contextual information on dimension-based statistical learning is not specific to talker adaptation: listeners are able to adapt to different acoustic regularities produced even by the same talker. After all, the perceptual system shows attunement to contextually conditioned variation prevalent in production in other domains, such as sociolinguistic constraints (Vaughn and Kendall, 2018) and compensation for coarticulation (Lotto et al., 1997). In fact, the system's ability to parse the signal (even the same categories, like [b] and [p]) based on signal-external factors is perhaps involved in cases where generalization of perceptual learning is constrained (Maye and Gerken, 2001; Reinisch et al., 2014; Idemaru and Holt, 2014, 2020).
We have described the contextual factor that influenced perceptual learning in this study as phonetic/lexical context because it is as yet unclear what the influencing factor might be. It may be that the diphone (i.e., [bɛ/pɛ] vs [bɪ/pɪ]), rather than the phone, is the unit over which dimension-based statistical learning operates. It may also be that the vowel category, or the acoustics of the vowel, influences this learning; since F0, the secondary cue manipulated in this study, is a property of the vowel, characteristics of the vowel may be tightly linked to learning about stop voicing. Alternatively, lexical context may serve to parse statistical distributions related to speech categories in the input. Finally, we cannot rule out the possibility that some listeners assumed there were two talkers in our experiment, one who spoke with the Canonical correlation and one who spoke with the Reversed pattern; such an expectation could have helped those listeners parse the two patterns of input statistics more easily. Future work is needed to determine which contextual factors can influence such parsing, allowing for separate, simultaneous tracking of multiple regularities in the speech signal.
1See supplementary material at https://www.scitation.org/doi/suppl/10.1121/10.0002762 for further details of stimulus creation, number of experimental trials, task instructions, and analysis models.