Surface-level phonetic details are used during word recognition. Yet, questions remain about how these details are encoded in lexical representations and about the role of memory and attention during this process. The current study uses lexical repetition priming to examine how a delay between hearing a word and hearing it repeated, with either the same or different coarticulatory patterns, affects lexical recognition. Listeners were faster to recognize repeated words with the same patterns of coarticulatory nasality, confirming that subphonemic information is encoded in the lexicon. Furthermore, when listeners had to adapt to more than one talker, greater coarticulatory specificity in delayed priming was observed, suggesting that word-specific encoding of subphonemic details is an active cognitive process.
1. Introduction
This study explores aspects of the representational encoding of coarticulatory nasality. An English speaker's phonemic knowledge of the word den, e.g., that the final consonant is nasal, must be stored in her mind so she is able to produce and recognize this word effortlessly. Yet, there are open questions about whether the subphonemic coarticulatory patterns of a word's pronunciation are encoded in the mental lexicon. For example, the vowel in den is produced with coarticulatory nasalization due to an overlapping velum lowering gesture from the nasal consonant. Variable pronunciations of that word, either with less or more extensive overlapping nasality, would not change the identity of the word (i.e., it is subphonemic variation); yet, is it still information that listeners routinely store with its representation? We ask whether subphonemic coarticulatory variation is encoded in lexical representations and, furthermore, whether factors such as memory (i.e., increasing time between hearing repeated words) and attentional factors (i.e., hearing one or multiple talkers) influence encoding of fine-grained coarticulatory detail.
The representational status of coarticulatory nasality was explored by Lahiri and Marslen-Wilson (1991), where truncated [CṼ] syllables with coarticulatory vowel nasality (cut from CVN contexts) were played to English listeners. The vowels were classified a majority of the time as phonemically oral (i.e., as CVC words), leading the authors to argue that word representations are “underspecified” and do not store coarticulatory details. However, Ohala and Ohala (1995), in revisiting this question, found robust evidence that [CṼ] signals a CVN word and argued that phonetic patterns are encoded in lexical representations.
Whether subtler, fine-grained differences in degree of coarticulation are encoded in lexical representations is less well-understood. Beddor et al. (2013) found that listeners are quicker to recognize CVNC words when the temporal extent of coarticulation starts earlier in the vowel, evidence that fine-grained differences in coarticulatory patterns are used during word perception as soon as they are available. But unanswered questions about how listeners encode these details remain. Findings in the realm of speech production suggest that subphonemic variations in degree of nasal coarticulation are stored with lexical representations. For instance, systematic variations in produced coarticulatory details have been found across lexical items (Scarborough, 2013), languages (Solé, 1995), and regional varieties (Tamminga and Zellou, 2015). Also, there is evidence for close phonetic imitation of coarticulatory patterns: Speakers who were exposed to CVN(C) words with varying degrees of coarticulatory nasality produced words with co-varying vowel nasality (Zellou et al., 2016). A better understanding of how coarticulatory variation is encoded in words can contribute to understanding how differences in speech patterns within and across speakers emerge and are diffused.
In the present study, we aim to explore two specific aspects of coarticulatory encoding. First, are subphonemic coarticulatory patterns stored in lexical representations? In other words, if listeners experience a speaker producing a word in a particular way, e.g., producing den with more or less coarticulatory nasality, will this influence how a variable pronunciation of that word is later processed? We hypothesize that listeners track and remember word-specific coarticulatory details, such that hearing a word repeated with different coarticulatory properties inhibits recognition, relative to hearing words with identical phonetic patterns. Second, is the process of encoding coarticulatory variation in lexical representations mediated by attentional factors? Whether speech recognition is a passive process that invariantly maps an acoustic signal directly onto linguistic representations (Lotto et al., 2009), or whether it is an active process that can be influenced by contextual and/or attentional factors is debated and not well understood (Heald and Nusbaum, 2014). We hypothesize that perceptual encoding of coarticulatory variation is attentionally mediated, such that under more challenging listening conditions, people will display more precise encoding of vowel-coarticulatory specificity heard in words. Exploring these questions can inform broader issues about the nature of representations and mechanisms underlying spoken word comprehension and how these factors might help explain observed patterns of phonetic variation within and across speakers.
1.1 Word-specific representation and memory
Listeners are faster in making a lexical decision to a heard word (the target) after prior exposure to that same word (the prime). This technique, lexical repetition priming, can reveal information about the form of a word stored in memory: when prime and target differ in acoustic characteristics, this facilitation is reduced (Church and Schacter, 1994). More relevantly, identical VOT durations on initial stops of prime-target pairs facilitate recognition of the repeated word, relative to pairs with mismatching VOT (Ju and Luce, 2006). This suggests that lexical representations store information about within-category variation. Thus, repetition priming can reveal the representational status of phonetic details for words. If prime-target pairs with different coarticulatory patterns yield the same facilitation as pairs with identical patterns, then coarticulatory detail is not stored in lexical representations. Yet, if same- and different-coarticulation word pairs yield different facilitation patterns, this would provide evidence that subphonemic coarticulatory patterns are stored with lexical forms.
1.2 Attentional factors
Another unresolved question is whether encoding of coarticulatory detail is an automatic, passive process or an active cognitive process. The perspective that the mapping of an acoustic signal onto invariant linguistic representations is an automatic, domain-general process comes from evidence that non-human animals exhibit similar auditory pattern-matching behaviors for speech (Lotto et al., 2009). Yet, humans display adaptive and cognitively active perceptual behaviors. An example of active phonetic processing is hearing speech produced by multiple talkers: listeners take longer to identify words in lists containing speech from multiple talkers than in lists with a single talker (Mullennix et al., 1989). This suggests that the act of switching between multiple talkers causes listeners to pay closer attention to the phonetic aspects of the signal. Further supporting evidence comes from Magnuson and Nusbaum (1994), who exposed listeners to a synthesized speech passage with either one or two apparent talkers. Following the passage, all listeners completed recognition trials with same- or different-pitch items. Listeners who had been exposed to two talkers responded more slowly to the different-pitch items than listeners who heard only one talker. This suggests that actively adjusting to speaker variability draws listener attention more closely to the speech signal and subsequently enhances the encoding of phonetic detail on lexical representations. If recognition of coarticulatory properties of words involves more than a pattern-matching process, then increasing the number of talkers present in the task should influence phonetic encoding. Specifically, coarticulatory specificity should be enhanced when multiple talkers are present. Alternatively, if encoding of coarticulatory variation is a passive process, there should be no difference in specificity effects for listeners exposed to one or multiple talkers.
1.3 Current study
With these specific predictions in mind, we sought to test the effect of fine-grained coarticulatory variation on recognition of repeated words presented in lists with either one or two talkers. Degree of coarticulatory vowel nasality present on repeated words (and pseudowords) containing a nasal consonant was systematically varied. These words were presented to listeners who made a lexical decision. We examined the speed with which a word repeated with either an identical or a different degree of coarticulation is recognized. We also investigated whether recognition is affected when two speakers' voices are present, a more perceptually challenging condition, relative to when only one voice is present.
2. Methods
The test items consisted of 32 words and 32 pseudowords, all monosyllabic with an equal number of CVN and NVC forms. Real words were highly familiar (familiarity ratings of 6+ on 7-point scale) and highly frequent [mean log frequency = 2.7 (SD = 0.6); SUBTLEX, Brysbaert and New, 2009] content words. Recordings were made by a phonetically trained male American English speaker using an Earthworks M30 microphone in a sound-attenuated booth.
In addition to the 64 test words, the speaker produced both a hyper-nasalized (NVN) and an oral (CVC) minimal pair matching each test item in vowel identity and onset/coda place of articulation (i.e., 128 “donor” words or nonwords), required for the manipulation of nasality degree. Using vowels taken from these “donor” items, degree of vowel nasality in the test items was manipulated to have an increased or decreased degree of coarticulation by additively combining the waveform of “donor” vowels with the waveform of the “recipient” vowels in Praat (the procedure is described in detail in Zellou et al., 2016 and Zellou et al., 2017). Vowels were first adjusted to match in amplitude, duration, and pitch (using PSOLA). With identical f0s, the vowels' harmonic structures align in the frequency dimension, and the vowels' waveforms were additively combined, sample by sample, modifying the relative amplitudes of the oral and nasal formant peaks. The resulting vowel was then adjusted to display the same intensity and pitch contour as the original “recipient” vowel and spliced back into its original context. Adding the waveforms of the two vowels contributes the spectral characteristics of the nasal or oral “donor” to the “recipient,” creating a vowel varying in degree of nasality.
For increased coarticulatory nasality, the original “recipient” vowel and corresponding “donor” vowel (from NVN forms) were combined. Decreasing the degree of nasality was done in the same manner except that vowel nasality was reduced using oral vowels (from CVC forms); the waveform of an oral “donor” vowel was additively combined with the “recipient” vowel. For example, for the test word den, the isolated vowel was combined with a vowel from a nonword nen (which would be naturally more nasal) then spliced back into the original context, resulting in a more-coarticulated den; for a less-coarticulated den, the extracted vowel was combined with a vowel from dead then spliced back into the original context. This process yielded two versions of each test word, differing in degree of nasality: a “more” item containing an increased degree of coarticulatory vowel nasality and a “less” item containing a decreased degree of vowel nasality.
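The core additive step can be sketched in Python. This is a minimal illustration, not the original Praat procedure: the function name `mix_vowels` and its `weight` parameter are our assumptions, and the amplitude, duration, and pitch matching that Praat performs with PSOLA is taken as already done.

```python
import numpy as np

def mix_vowels(recipient, donor, weight=0.5):
    """Additively combine a 'recipient' vowel waveform with a 'donor' vowel.

    Both arrays are assumed to be already matched in duration, amplitude,
    and f0 (done with PSOLA in the actual procedure). Mixing the
    time-aligned samples contributes the donor's spectral characteristics
    (e.g., nasal formant peaks) to the recipient.
    """
    recipient = np.asarray(recipient, dtype=float)
    donor = np.asarray(donor, dtype=float)
    assert recipient.shape == donor.shape, "vowels must be time-aligned"
    mixed = (1 - weight) * recipient + weight * donor
    # Rescale so the mixed vowel keeps the recipient's RMS intensity,
    # analogous to restoring the original intensity contour.
    rms_in = np.sqrt(np.mean(recipient ** 2))
    rms_out = np.sqrt(np.mean(mixed ** 2))
    return mixed * (rms_in / rms_out)
```

For a more-coarticulated *den*, the donor would be the vowel from *nen*; for a less-coarticulated *den*, the vowel from *dead*.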
Degree of vowel nasality can be calculated as the relative difference in amplitudes of the first nasal formant peak (=P0, around 250 Hz, which increases in amplitude with increasing nasality) and the first oral formant peak (=A1, which decreases with increasing nasality) (Chen, 1997). A1-P0 dB decreases as vowel nasality increases. Acoustic nasality was measured as A1-P0 dB at vowel midpoint in each stimulus item. The acoustic properties of these stimuli are displayed in Fig. 1 (black triangles). For comparison, the grey circles represent mean acoustic nasality values of the speaker's naturally coarticulated tokens (CVN/NVC), as well as oral (CVC) and hyper-nasalized (NVN) real word productions. As seen, the less- and more-coarticulated stimuli were modified to reflect non-overlapping, subphonemic variations in degree of nasalization.
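The A1-P0 measure described above can be sketched in Python as follows. The function name and the fixed peak-search bands are simplifying assumptions for illustration; in practice A1 is the amplitude of the harmonic nearest F1, located per vowel.

```python
import numpy as np

def a1_minus_p0(samples, sr, p0_range=(200, 350), a1_range=(400, 1000)):
    """Rough estimate of acoustic nasality as A1-P0 (in dB).

    P0 is the strongest spectral peak near the first nasal formant
    (~250 Hz); A1 is the strongest peak in an assumed first-oral-formant
    band. Lower A1-P0 values indicate greater vowel nasality.
    """
    windowed = samples * np.hanning(len(samples))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(samples), d=1 / sr)
    db = 20 * np.log10(spectrum + 1e-12)

    def peak_db(lo, hi):
        band = (freqs >= lo) & (freqs <= hi)
        return db[band].max()

    return peak_db(*a1_range) - peak_db(*p0_range)
```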
Mean acoustic nasality (in dB A1-P0) at vowel midpoint for “less” and “more” nasality real word stimuli used in the lexical decision task (black triangles). For comparison, mean nasality for this speaker's CVC, NVN, and CVN/NVC naturally produced real words (grey circles).
Subsequently, each of the more- and less-coarticulated items were modified to create two voices. Both f0 and formant frequencies (FF) of the stimuli were raised to a value roughly appropriate for an adult female. The result was two versions of each more- and less-coarticulated items: an apparent male (low f0/FFs) and an apparent female (high f0/FFs).
Stimuli were presented to 141 native English-speaking listeners who completed an auditory lexical decision task. During the experiment, listeners heard each of the 64 test items twice, once as prime and once as target, presented using E-Prime experimental software. For each trial, an item was played over headphones and listeners made a lexical decision (word or not a word) as quickly and accurately as possible by pressing a button on an E-Prime response box. Two factors varied across prime-target pairs within each list: Coarticulation condition (same or different) and Lag (immediate or delayed). Prime-target pairs were split equally into two coarticulation conditions based on the acoustic differences between them: same coarticulatory pattern (balanced for “more” or “less”) or different coarticulatory pattern (balanced for “less-more” or “more-less”). Prime-target pairs were separated by one of two lag conditions: immediate or delayed (separated by exactly five trials). The prime-target Coarticulation conditions were balanced across Lags. Assignment of each real word and nonword to Coarticulation/Lag conditions was randomized for each listener.
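The per-listener randomized assignment of items to the 2 x 2 Coarticulation x Lag cells can be sketched as below. This is an illustrative reconstruction, not the original E-Prime script; the function and cell labels are our assumptions.

```python
import random

def assign_conditions(items, seed=None):
    """Randomly assign items to the four Coarticulation x Lag cells.

    Each item receives one prime-target condition, with an equal
    number of items per cell, re-randomized per listener via `seed`.
    """
    cells = [(coart, lag)
             for coart in ("same", "different")
             for lag in ("immediate", "delayed")]
    assert len(items) % len(cells) == 0, "items must divide evenly across cells"
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    per_cell = len(items) // len(cells)
    # Slice the shuffled item list into equal-sized condition cells.
    return {cell: shuffled[i * per_cell:(i + 1) * per_cell]
            for i, cell in enumerate(cells)}
```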
Number of talkers was a factor that varied across lists. Participants were randomly assigned to a list containing either one voice (1-voice, n = 69) or two voices (2-voice, n = 72). In the 1-voice list, listeners heard only a single apparent talker; in the 2-voice list, listeners heard the same items and conditions, except that half of the prime-target pairs were presented in one apparent talker's voice and the other half in the other (voices were balanced across coarticulation and lag conditions).
3. Results
Facilitation was calculated by subtracting the response time (in ms) to a target word from the response time to its corresponding prime, for each subject. Larger values indicate a relatively faster response for the target. Only correct responses to real word trials were included in the analysis. Facilitation values were statistically analyzed using a linear mixed-effects model using the lmer() function in the lme4 package in R (Bates et al., 2014). Three categorical fixed effects were included: Coarticulation condition (two levels: Same, Different coarticulatory patterns present on vowels), Lag between prime and target (two levels: Immediate, Delayed), and Voice condition (two levels: 1-, 2-voice). By-subject random intercepts and by-subject random slopes for Coarticulation condition were included. Estimates for degrees of freedom, t-statistics, and p-values were computed using Satterthwaite approximation with the lmerTest package (Kuznetsova et al., 2015).
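The facilitation computation can be sketched concretely in Python (pandas) as follows, assuming a hypothetical long-format trial table; the column names and function are our illustrative assumptions, not the original data files or analysis code.

```python
import pandas as pd

def facilitation_table(trials):
    """Per-pair priming facilitation: prime RT minus target RT (ms).

    `trials` has one row per response with columns subject, item,
    role ('prime' or 'target'), rt, and correct. Pairs survive only
    if both the prime and target responses were correct.
    """
    df = pd.DataFrame(trials)
    df = df[df["correct"]]  # keep correct responses only
    wide = df.pivot_table(index=["subject", "item"],
                          columns="role", values="rt").dropna()
    # Positive facilitation = faster response to the repeated target.
    wide["facilitation"] = wide["prime"] - wide["target"]
    return wide.reset_index()
```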
The model revealed that Coarticulation condition significantly predicts priming facilitation [F(1, 964) = 59.7, p < 0.001], indicating that target items matching their primes in coarticulatory properties were recognized faster than mismatching targets. A significant effect of Lag [F(1, 3857.6) = 5.7, p < 0.05] confirmed that immediate priming facilitates target word recognition relative to delayed priming (198 ms vs 156 ms facilitation). Number of voices in the list also predicted lexical priming [F(1, 143.9) = 17.6, p < 0.001]; facilitation was stronger in the 2-voice condition (215 ms) than in the 1-voice condition (137 ms), indicating that lexical access to a repeated word is faster in lists where two voices are present than in lists with one voice.
Each of the two-way interactions was significant. Number of voices significantly interacted with both Coarticulation condition [F(1, 964) = 9.6, p < 0.001] and Lag [F(1, 3857.6) = 9.7, p < 0.001]. When two voices were present, listeners displayed greater facilitation for targets that matched their primes, and greater facilitation at delayed lags, than when one voice was present. The effects of number of voices on both coarticulatory specificity and lag can be explained if phonetic encoding is a cognitively active process. A significant interaction between Coarticulation condition and Lag [F(1, 3850.1) = 5.1, p < 0.05] reflects the fact that, for coarticulatory matching pairs, facilitation from immediate and delayed priming is nearly identical (245 ms vs 240 ms), yet for coarticulatory mismatching pairs facilitation is attenuated to a greater degree in delayed priming (74 ms) than in immediate priming (152 ms).
Critically, there was a significant three-way interaction between Coarticulation condition, Lag, and Number of voices [F(1, 3850.1) = 4.3, p < 0.05], illustrated in Fig. 2 (mean RTs for primes and targets, separately, are provided in Table 1). As seen, there is a consistent effect of coarticulatory specificity on priming: when stimuli mismatch in coarticulatory degree, facilitation is reduced in both immediate and delayed priming. However, the amount of priming observed for stimuli with matching coarticulation is influenced by the number of speakers present. With two voices present, delayed priming of identical forms shows much more robust facilitation than in 1-voice lists.
Mean facilitation of target words with matching or mismatching coarticulation primes, occurring either immediately or five trials prior, for 1- (left) and 2-voice lists (right). Error bars depict standard errors of the mean.
Mean reaction times (and standard errors of the mean) for prime and target real words with the same or different coarticulation patterns, occurring in immediate or delayed priming trials, for 1- and 2-voice lists.
| List | Lag | Coarticulation condition | Prime mean RT (SE) | Target mean RT (SE) |
|---|---|---|---|---|
| One voice | Immediate | Same | 1154 (14.3) | 924.2 (10.6) |
| One voice | Immediate | Different | 1146.4 (13.9) | 981 (11.7) |
| One voice | Delayed | Same | 1249.3 (17.7) | 1110.4 (11) |
| One voice | Delayed | Different | 1170.5 (16.9) | 1103 (12.7) |
| Two voices | Immediate | Same | 1145.5 (15.5) | 884.8 (12.5) |
| Two voices | Immediate | Different | 1175.7 (18.1) | 1044.7 (14.7) |
| Two voices | Delayed | Same | 1565.1 (59.7) | 1211.7 (17.5) |
| Two voices | Delayed | Different | 1234.2 (26.7) | 1146.1 (19.3) |
4. Discussion
We sought to test whether subphonemic vowel-coarticulatory information is stored in representations for words containing a nasal consonant. In a lexical decision task, greater facilitation in responding to repeated items was found when the prime and target contained identical coarticulatory patterns on their vowels. This supports the view that representations for spoken words contain rich phonetic detail (Goldinger, 1996), including coarticulatory patterns (Ohala and Ohala, 1995). Subphonemic coarticulatory detail, for instance, whether the word is produced with more or less coarticulatory overlap on an adjacent vowel, is also stored in lexical representations.
Another unresolved aspect of phonetic representation is how long coarticulatory detail is retained in memory. There was an overall reduced lexical facilitation effect in delayed priming, suggesting some decay of coarticulatory specificity over time. The preservation of coarticulatory specificity in lexical representations aligns with observations of systematic word-specific coarticulatory patterns in production (Scarborough, 2013), as well as with differences in produced nasal coarticulation across regional and social groups of speakers (e.g., Zellou and Tamminga, 2014). Systematic variations in produced coarticulation can be explained if these patterns are encoded in representations that are updated through experience and that speakers subsequently recruit when talking.
Our second question was whether the encoding of coarticulatory variation is a cognitively active process. To test this question, we exposed listeners to either one or two talkers during the experiment. Switching between multiple voices, or even the expectation of doing so, during word perception is more challenging than hearing one voice (Mullennix et al., 1989). There is evidence that adapting to multiple talkers triggers additional active cognitive mechanisms (Nusbaum and Schwab, 1986), and this can explain why increasing the number of talkers has been shown to enhance memory for phonetic specificity (Nusbaum and Schwab, 1986; Heald and Nusbaum, 2014). We observed that having two voices present in the task increased coarticulatory specificity for delayed priming: forms with the same coarticulatory patterns showed much greater late facilitation in two-speaker lists than in one-speaker lists. In other words, increasing the number of speakers led listeners to attend more strongly to coarticulatory surface details in words and to retain them robustly over time. These findings suggest that word-specific subphonemic coarticulatory encoding is an active cognitive process: with more time and increased talker variability, listeners encode rich coarticulatory detail about words in memory.