Given recent interest in the analysis of naturally produced spontaneous speech, a large database of speech samples from the Canadian Maritimes was collected, processed, and analyzed with the primary aim of examining vowel-inherent spectral change in formant trajectories. Although it takes few resources to collect a large sample of audio recordings, the analysis of spontaneous speech introduces a number of difficulties compared to that of laboratory citation speech: Surrounding consonants may have a large influence on vowel formant frequencies and the distribution of consonant contexts is highly unbalanced. To overcome these problems, a statistical procedure inspired by that of Broad and Clermont [(2014). J. Phon. 47, 47–80] was developed to estimate the magnitude of both onset and coda effects on vowel formant frequencies. Estimates of vowel target formant frequencies and the parameters associated with consonant-context effects were allowed to vary freely across the duration of the vocalic portion of a syllable which facilitated the examination of vowel-inherent spectral change. Thirty-five hours of recorded speech samples from 223 speakers were automatically segmented and formant-frequency values were measured for all stressed vowels in the database. Consonant effects were accounted for to produce context-normalized vowel formant frequencies that varied across time.

The goal of this study was to examine vowel-formant patterns including vowel-inherent spectral change (VISC) (Nearey and Assmann, 1986) in both elicited (read) and spontaneous speech. This was accomplished through an acoustic analysis of a large database of speech samples collected in the Canadian provinces of Nova Scotia and Prince Edward Island. In order to account for consonant-context effects on vowel formant frequencies, a numerical procedure was developed using a method related to that of Broad and Clermont (1987, 2014) to estimate expected vowel formant frequencies for a fixed syllable frame, /hVd/, from a large pool of /CVC/ syllables where the onsets and codas are arbitrary. The method adopted here, like that of Nearey (2013), is additionally able to deal with more complex vowel targets including diphthongs and VISC. Unlike the models of either Nearey or Broad and Clermont, however, this new method makes no specific assumptions about the time course of either vowel-formant changes or of consonant-context effects. The temporal evolution of those factors can be examined post hoc and we are thus able to estimate the effects of varying consonant context across the duration of the vowel as well as the vowel target formant trajectories themselves. Although this study focuses on methods to analyze large corpora of spontaneous speech, it actually represents the first stage in a larger study on regional phonetic variation in which the tools described here are used (see the discussion in Sec. V).

The focus of this analysis is on formant frequencies in vowel production data. Recent studies have confirmed the importance of spectral peaks associated with formants as the key spectral properties by which listeners identify vowels (see Kiefte et al., 2013; Rosner and Pickering, 1994, for reviews). Although there are other models of speech perception that disagree with this view (e.g., Bladon and Lindblom, 1981; Bladon, 1982), it has recently been shown that spectral peaks convey at least as much discriminant information as any other spectral property investigated (Hillenbrand et al., 2006) and that, unlike gross spectral properties such as spectral tilt, the perception of spectral peaks is relatively unaffected by distortion of long-term spectral characteristics when formant frequencies are allowed to vary across time as in naturally produced speech (Kiefte and Kluender, 2005, 2008).

Studies have also shown that the traditional view of vowel production, in which each vowel is specified by the frequencies of the first two or three spectral peaks from the (near) steady-state portion of the vowel, or vowel nucleus, is inadequate. For example, Andruski and Nearey (1992), Nearey and Assmann (1986), and Hillenbrand et al. (1995) showed that lax vowels such as /ɪ/ and /ɛ/ may be characterized by the same magnitude of spectral change as observed for acknowledged diphthongs such as /eɪ/. Later studies have shown that VISC is critical to vowel perception (e.g., Assmann and Katz, 2000) and that formant trajectories should be considered at least as a complement to vowel-nucleus formant frequencies.

In this study, we describe vowels in terms of overall formant patterns, including apparent VISC.1 Because samples were analyzed from spontaneous speech, we developed a method to account for the influence of consonant contexts, as these are known to have a profound effect on the spectral characteristics of vowels (e.g., Hillenbrand et al., 2001; Labov et al., 2006; Nearey, 2013). Preliminary to a more detailed investigation of subtler differences among dialect variants in the Canadian Maritime region, we focused on two tasks to assess the methods adopted here. Specifically, an acoustic analysis was carried out to examine patterns of speech production based on measurement of the following:

  1. The first two formant frequencies of the citation forms at several time points throughout the duration of the vowel to enable characterization of VISC patterns (cf. Assmann and Katz, 2000; Hillenbrand et al., 1995; Nearey, 1989). Averages of these patterns from the entire dataset are then compared broadly to similar data from Hillenbrand et al.'s (1995) study of Midwestern US vowels.

  2. Differences in formant-frequency patterns between elicitation techniques—specifically between read citation-form and conversational or spontaneous speech (e.g., Labov et al., 1972).

Because conversational or spontaneous speech is necessarily uncontrolled, it is impossible to ensure that specific syllables of interest will be recorded in data collection. For example, although /hVd/ syllables are frequently elicited in laboratory speech studies, several of these words such as /hʌd/ or /hɑd/ are unlikely to be produced in conversational speech, which makes it difficult or impossible to make direct comparisons between spontaneous and read speech. In addition, because we are interested in VISC, it is important to obtain estimates of target vowel formant frequencies near the onsets and offsets of the vowels, where we expect relatively large coarticulatory influences from onset and coda consonants. Because of these issues, we have developed a method, described below, to account for these consonant contexts in order to obtain an estimate of vowel formant frequencies as would be produced in /hVd/ context—i.e., we estimate consonant-context effects and then predict formant trajectories for each vowel in /hVd/ syllables as references from which to make comparisons.

This reference context was chosen for two reasons: First, it allows us to make direct comparisons with data from studies in which this syllable was explicitly elicited from speakers; second, while /d/ cannot be considered a “neutral” context, /h/ has little coarticulatory influence on the following vowel.

Audio recordings were collected from 225 adult, English-speaking, long-term residents of Nova Scotia and Prince Edward Island (223 were retained for analysis as two did not meet our inclusion criteria described below). The geographic distribution of speakers is given in Fig. 1.

FIG. 1.

(Color online) Map of Nova Scotia and Prince Edward Island with relative location within Canada inset. Symbols indicate locations of speakers in the present study using gender symbols for male and female speakers.


Prior to audio recording of speech samples, participants first completed a demographic survey which asked about speakers' age, income, years of education, language experience, locations where they had resided, etc. In all cases, only long-term residents who were monolingual speakers of English with no more than grade 12 education were analyzed (i.e., 223 speakers). This was followed by three production tasks which were recorded. In the first two tasks, speakers were asked to read a series of sentences. In the third task, speakers were allowed to speak spontaneously on any subject of their choosing. This allowed us to compare vowels in both spontaneous and elicited speech.

In the first task, speakers were asked to read a series of sentences of the form “Please say the word /hVd/ again” (cf. Assmann and Katz, 2000). The sentence-final word after the target ensured more natural stress and rhythm. Vowel targets included 14 monophthongs and diphthongs that are reported to occur in this dialect of English (Kiefte and Kay-Raining Bird, 2010). In addition, an attempt was made to elicit distinct vowels for /ɔ/ and /ɑ/, resulting in a total of 15 vowel targets. For the purposes of elicitation, the targets were written in standard English orthography—e.g., Please say the word ___________ again where the target word was one of heed, hid, hayed, head, had, who'd, hood, hode, hud, hawed, hod, heard, hide, hoyed, how'd. The /hVd/ context was chosen so that we would be able to make direct comparisons between vowels collected here and those from other datasets that use a similar frame. In addition, this context was used as the reference for the modeling of consonant-context effects as described below.

We also wished to collect elicited speech samples in a much greater variety of contexts which more closely approximate naturally produced speech. Therefore, in the second task, speakers were asked to read twenty sentences from the TIMIT database (Fisher et al., 1986). Following the procedure of Fisher et al., all speakers read the two SA (shibboleth) sentences from the database: She had your dark suit in greasy wash water all year; Do not ask me to carry an oily rag like that. In addition, speakers read 11 randomly selected SX sentences which are described as “phonetically compact” and 7 randomly selected SI sentences which are described as “natural.” These recordings were used to train the CMUSphinx3 speech recognizer (Lee, 1989) which was one of the two systems used for automatic phoneme alignment of the entire dataset described here.

The final task consisted of a spontaneous monologue in which the participants were free to talk about any subject of their choosing—e.g., politics, sports, hobbies, weather, etc. These monologues ranged in duration from 2:46 to 36:58 min and resulted in a total of 34 h 33 min of recordings across all 223 speakers. These recordings were then manually segmented into breath groups and screened to exclude items with noise (e.g., coughs, interruptions, or background sounds) or disfluencies, such as partial repetitions, incomplete words, etc. Breath groups were saved as individual files.

An orthographic transcription of each of these short recordings was produced by human listeners. Where a portion of the utterance was unclear or ambiguous, the recording was tagged as such in the filename and not analyzed further. These orthographic transcriptions were used in the automatic phoneme alignment described below.

1. Phoneme alignment

A phonemic transcription was generated by a version of the CMU Pronouncing Dictionary modified to include words produced in the database that were not already present. The recorded breath groups from the spontaneous speech database were then processed by both CMUSphinx3 (Lee, 1989) and P2FA (Penn Phonetics Lab Forced Aligner; Yuan and Liberman, 2008). Outputs from these programs were processed to produce phoneme alignments appropriate for use with phonetics software such as praat (Boersma, 2001). While P2FA requires no new training, CMUSphinx3 does. It was trained on the entire database including the TIMIT sentences, the read speech, and the spontaneous speech before being used in the segmentation process. It was thought that this speaker-dependent training of the CMUSphinx3 speech-recognition software might lead to better alignment than would P2FA which uses a database from a different dialect, vocabulary, and acoustic environment.

In all analyses described here, only primary stressed vowels as indicated by the CMU Pronouncing Dictionary were considered. Grammatical words such as pronouns, auxiliaries, prepositions, etc. (e.g., had, could, he'd, etc.) were excluded due to the likelihood of reduced stress or vowel reduction. As a further screening tool to eliminate very brief vowel productions, all vowel segments for which no voicing was detected via praat or which were shorter than 50 ms were also discarded prior to further analysis. This eliminated vowels that would have been implausibly short as well as those segmentations that were highly inaccurate. In addition, adjacent-consonant contexts were limited to obstruents2 /b, d, ɡ, p, t, k, f, θ, s, ʃ, v, ð, z, ʒ, h, tʃ, dʒ/ because of difficulties in aligning vowel boundaries near approximants or other vowels. In total, over 38 000 vowels in /CVC/ context with primary stress were extracted from the spontaneous-speech recordings.
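As a rough illustration, the screening criteria above can be expressed as a single predicate. The token fields and ARPAbet-style consonant labels below are hypothetical assumptions, not the actual pipeline code:

```python
# Sketch of the token-screening step; field names and labels are illustrative.
OBSTRUENTS = {"b", "d", "g", "p", "t", "k", "f", "th", "s", "sh",
              "v", "dh", "z", "zh", "hh", "ch", "jh"}  # ARPAbet-style labels
FUNCTION_WORDS = {"had", "could", "he'd"}  # abridged list of grammatical words

def keep_token(tok):
    """Return True if a /CVC/ vowel token passes all screening criteria."""
    return (tok["stress"] == 1                     # primary stress (CMU dict)
            and tok["word"] not in FUNCTION_WORDS  # exclude grammatical words
            and tok["duration"] >= 0.050           # at least 50 ms
            and tok["voiced"]                      # voicing detected
            and tok["onset"] in OBSTRUENTS         # obstruent contexts only
            and tok["coda"] in OBSTRUENTS)
```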

Results from the forced alignment are given in Table I. To validate the results of the forced-alignment algorithms, 1175 vowels from the spontaneous-speech database in /CVC/ context were hand segmented by trained graduate students of speech science who had taken courses in speech acoustics and phonetics. The locations of phoneme boundaries as determined by these raters were compared with those from the two systems, CMUSphinx3 and P2FA. CMUSphinx3 gave relatively poor results: Only 67% and 46% of the phoneme boundaries were within 20 ms of the hand-segmented boundaries for vowel onsets and offsets, respectively (data are not presented in Table I), despite the speaker-dependent training of the speech-recognition models. However, results for P2FA were substantially better: 77% and 81% of boundaries were within 20 ms for vowel onsets and offsets, respectively.

TABLE I.

Comparisons between automatically and hand-aligned vowel segments. The table gives the percentage of automatically aligned boundaries that are within the given error when compared to hand-aligned boundaries. Rows marked P2FA+post show the improvements gained following additional post-processing of automatically generated boundaries. The rows marked interrater indicate agreement between hand-aligned boundaries provided by two different judges.

Onsets

Error (ms)    <10     <20     <30     <40
P2FA          49.44   77.07   88.37   92.28
P2FA+post     67.20   82.96   89.39   92.39
Interrater    90.28   91.67   92.01   92.88

Codas

Error (ms)    <10     <20     <30     <40
P2FA          54.36   81.10   90.94   93.96
P2FA+post     54.66   81.24   91.00   94.11
Interrater    86.81   92.01   92.71   93.23

An additional stage of post-processing was added to the automatic segmentation: For plosives and unvoiced fricative contexts (as well as affricates), the /CV/ and /VC/ boundaries were adjusted to the nearest onset/offset of voicing within the syllable if they were found to be within 50 ms of the original segmentation. These results are indicated in Table I as P2FA+post which shows substantial gains in accuracy for vowel onsets—i.e., 83% were found to be within 20 ms. No improvement was found for vowel offsets. The P2FA+post segmentations were the ones used in the further analyses described here.
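The boundary-snapping step can be sketched as follows. The function and its data layout are illustrative assumptions; voicing onset/offset times are presumed to come from a voicing detector such as praat's:

```python
import bisect

def snap_boundary(t, voicing_edges, max_shift=0.050):
    """Move an aligned /CV/ or /VC/ boundary (seconds) to the nearest voicing
    onset/offset, but only if one lies within max_shift (50 ms) of the
    original segmentation; otherwise leave the boundary unchanged.
    voicing_edges must be a sorted list of voicing onset/offset times."""
    i = bisect.bisect_left(voicing_edges, t)
    candidates = voicing_edges[max(0, i - 1):i + 1]  # neighbors of t
    if not candidates:
        return t
    nearest = min(candidates, key=lambda e: abs(e - t))
    return nearest if abs(nearest - t) <= max_shift else t
```

In the study this adjustment was applied only for plosive, unvoiced-fricative, and affricate contexts.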

Segmentations from the two manual phoneme alignments were compared to evaluate interrater agreement and to establish a gold standard against which the automatic segmentation can be compared. Interrater agreement (±20 ms) was approximately 92% for both vowel onsets and offsets suggesting that P2FA performance was somewhat poorer than hand alignment (cf. 94%, Hosom, 2009, for citation speech).
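The within-tolerance percentages reported in Table I correspond to a computation of this sort (a sketch; the function name is ours):

```python
def pct_within(auto_bounds, manual_bounds, tol=0.020):
    """Percentage of automatically aligned boundaries that fall within
    tol seconds (default 20 ms) of the corresponding hand-aligned
    boundaries (cf. Table I)."""
    pairs = list(zip(auto_bounds, manual_bounds))
    hits = sum(abs(a - m) <= tol for a, m in pairs)
    return 100.0 * hits / len(pairs)
```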

2. Formant tracking

An unsupervised formant tracker (Nearey et al., 2002; Zhang et al., 2013a) was used to measure formant frequencies for the extracted vowels. Formant measurements were extracted at nine equally spaced points throughout the duration of each vowel—i.e., centered at 10%, 20%,…, 90% of total vowel duration. Analysis windows were centered on each of these time points; the length of the analysis windows was 25 ms. Because all vowels with duration less than 50 ms were removed from the database, there was minimal overlap between the first and last analysis frames.
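The placement of the nine analysis frames can be sketched as follows (an illustration of the sampling scheme described above, not the original code):

```python
def analysis_frames(t_start, t_end, n_points=9, win=0.025):
    """Centers and edges (in seconds) of the formant-analysis windows placed
    at 10%, 20%, ..., 90% of the vowel's duration, each 25 ms long."""
    dur = t_end - t_start
    frames = []
    for k in range(1, n_points + 1):
        center = t_start + dur * k / (n_points + 1)  # k * 10% of duration
        frames.append((center, center - win / 2.0, center + win / 2.0))
    return frames
```

For a vowel at the 50 ms minimum duration, the first and last windows (centered at 5 and 45 ms) do not overlap, consistent with the text.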

To evaluate the performance of the automated formant-frequency measurement, these estimates were validated against supervised formant measurements for the 1175 randomly selected /CVC/ tokens described above. The results of these comparisons are presented in Table II under Automatic vs Manual for the first two formants measured at 20% and 70% of the total vowel duration (hereafter referred to as onglide and offglide; Nearey and Assmann, 1986; Hillenbrand et al., 1995).

TABLE II.

rms error for automatic and manual formant measurements. Automatic vs manual gives the rms error between the automatic procedure for measurement of formant frequencies and supervised formant measurement in praat by human raters. Fit vs manual gives the rms error between the fitted formant-frequency estimates from the model described in Sec. IV and the empirically hand-measured formant frequencies.

Automatic vs manual

rms error (Hz)   F1    F2
Onset (20%)      137   195
Offset (70%)     75    152

Fit vs manual

rms error (Hz)   F1    F2
Onset (20%)      108   228
Offset (70%)     106   234

These rms errors are much higher than the interrater reliability observed by Zhang et al. (2013b), who reported 25 and 68 Hz root-mean-square (rms) error for F1 and F2, respectively. However, their results are reported from full formant tracks, and it is possible that formant frequencies from vowel centers may be more stable than those nearer vowel onsets and offsets. In addition, two sources of error contribute to the rms error reported in Table II: the unsupervised formant tracker and the automatic segmenter. The total rms error therefore includes errors originating in both the frequency and time domains. In contrast, the vowels in the database used by Zhang et al. were manually located.

3. Formant-frequency normalization

Formant-frequency measurements were log-mean normalized for each speaker in a manner similar to Labov et al. (2006) (see also Adank, 2003, for a review of normalization techniques). For each citation-form /hVd/ vowel produced by each speaker, formant frequencies were estimated at the temporal midpoint for both F1 and F2. Then, for each speaker and for each of F1 and F2, the log-mean of the formant-frequency estimates was calculated, and formant measurements for both citation-form and spontaneous speech were then normalized to match the grand log-mean from the pool of speakers. Unlike Labov et al., F1 and F2 were normalized separately.
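A minimal sketch of this log-mean normalization, assuming per-speaker reference values computed from citation /hVd/ midpoints (function names are ours):

```python
import math

def log_mean(values_hz):
    """Mean of log formant frequencies (one formant, one speaker)."""
    return sum(math.log(v) for v in values_hz) / len(values_hz)

def normalize_speaker(formants_hz, speaker_logmean, grand_logmean):
    """Shift a speaker's formant values in log space so that the speaker's
    log-mean (computed from citation /hVd/ midpoints) matches the grand
    log-mean of the speaker pool; applied separately to F1 and F2."""
    shift = grand_logmean - speaker_logmean
    return [math.exp(math.log(f) + shift) for f in formants_hz]
```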

Figure 2 shows normalized log-average vowel formant frequencies from citation-form /hVd/ data collected in the first speech task described above from the Maritimes (NS) as well as similar /hVd/ data from the American Midwest (primarily Michigan; MI) from Hillenbrand et al. (1995). For the NS data, formant-frequency estimates from all time frames between 20% and 80% vowel duration are given in 10% steps. For the MI data, only the 20%, 50%, and 80% measurements are given. Relative to the Michigan data, the speech sample collected in the Canadian Maritimes shows a low-back merger of /ɔ/ and /ɑ/ that is generally reported for all Canadian dialects, as well as the lowering and backing of /æ/ which is commonly associated with this merger and which has been referred to as the Canadian Shift (Boberg, 2010; Clarke et al., 1995; Labov et al., 2006). While the offglide portion of /æ/ in the NS dialect is similar to that of MI, the onset and vowel nucleus are very different. Conversely, it should be noted that the Michigan dialect has a raised /æ/ which is consistent with the Northern Cities Shift (Boberg, 2010; Hillenbrand, 2003). Although the low-back vowel is frequently transcribed as /ɑ/ in American dialects, [ɒ] appears to be a reasonable transcription for both NS /ɔ/ ∼ /ɑ/ and for MI /ɔ/, as this vowel is intermediate between the values of /ɔ/ and /ɑ/ reported by Peterson and Barney (1952) (see Hillenbrand et al., 1995).

FIG. 2.

Comparison of VISC in citation-form /hVd/ speech for 11 vowels in Maritime Canada (NS) and Midwestern (primarily Michigan) United States (MI; Hillenbrand et al., 1995). For each vowel, the respective curve spans the formant values from 20% to 80% of the vowel duration. The MI data are the subset of adult speakers from Hillenbrand et al.: Formant measurements were log normalized for each speaker and scaled to the geometric mean of the sample. NS data were also log normalized. The symbol identifying the vowel is plotted at formant frequencies at 20% vowel duration (onset). In the case of the MI data, only the 20%, 50%, and 80% formant frequency measurements are given, while the formant frequencies for the NS data are plotted for every 10% of the total vowel duration for a total of eight points from 20% to 80%. See Sec. II B 2 for details regarding the analysis.


With regard to apparent VISC, the movement patterns for /e/ and /o/ are generally similar between dialects, as expected from the traditional transcriptions [eɪ] and [oʊ]. The formant trajectory for NS /æ/ is much shorter than that of MI /æ/, which is sometimes transcribed as [ɛə]. However, the formant movements of all the vowels may also be influenced by the /hVd/ frame, which was chosen for historical reasons (e.g., Peterson and Barney, 1952) but which allows us to make direct comparisons to the data of Hillenbrand et al. (1995).

The highly variable consonant contexts of spontaneous speech preclude a direct comparison between spontaneous and fixed-context citation speech samples except for vowels that happen to occur in the same /hVd/ context. As shown in Table III, this is not practical given that only a few such syllables occur in the spontaneous-speech corpus and that, of those that do occur, some are relatively infrequent (e.g., /hʌd/, which occurs only in the proper name Hudson). In contrast, the syllables that occur relatively frequently are found almost exclusively in grammatical function words (had, he'd, who'd) and are expected to be short and unstressed in spoken utterances. These syllables are therefore likely subject to much greater vowel reduction.

TABLE III.

Total counts of /hVd/ syllables in the spontaneous-speech corpus by vowel. Vowels that are not indicated in the table did not occur in /hVd/ context in the database. Word gives the most frequent word in which the syllable occurs while prop. gives the proportion of /hVd/ syllables accounted for by the most frequent word.

Syllable   Total   Word     Prop. (%)
/hæd/      617     had       99.8
/hʌd/              Hudson   100.0
/hɛd/      73      head      78.0
/hɪd/              hidden    66.7
/hid/      77      he'd     100.0
/hʊd/              hood     100.0
/hud/              who'd    100.0

By contrast, there are over 38 000 /CVC/ syllables in our corpus which are indicated as having primary stress in the CMU Pronouncing Dictionary, which do not serve as function words grammatically, and where consonant contexts are obstruents. While these syllables may or may not actually carry perceptually salient stress in a given sentence, they are more likely to do so than syllables assigned lower stress levels in the dictionary or those that occur in grammatical words.

To take best advantage of vowels recorded in the spontaneous-speech database, we require a method to correct for consonant-context effects so that we can compare citation /hVd/ and spontaneous tokens more appropriately. While we are currently developing nonlinear regression techniques to estimate vowel-dependent and consonant-induced context effects simultaneously for /CVC/ formant trajectories across the full duration of vowels (along the lines of Broad and Clermont, 1987; Nearey, 2013), so far, such full-trajectory models are very difficult to use in phonetically sparse and statistically unbalanced spontaneous speech samples. The sparsity and lack of balance in the present data is extreme: Considering only obstruents as consonant contexts (16 possible onsets and 16 possible codas where /h/ is not a possible coda and /ʒ/ is not a possible onset) and fifteen vowels (including both /ɔ/ and /ɑ/), there are 3840 possible syllables that can be modelled. However, only 1353 distinct syllables are represented in the database of which 293 occur only once. Though we continue to study the problem, the imbalance in the data appears to pose severe issues for straightforward application of modeling techniques that have been proposed.

In the interim, we have developed simplified models that estimate and correct for consonant effects on vowel formant frequencies at individual temporal frames. The consonant-context effects at a given temporal frame were modelled by fitting (via nonlinear regression techniques3) the formula

F_{C_1VC_2}(t) = [a_{C_1}(t) + a_{C_2}(t) + 1] x_V(t) + b_{C_1}(t) + b_{C_2}(t),
(1)

where F_{C_1VC_2}(t) is the estimated formant measure from the unsupervised formant tracking (for either F1 or F2), t represents the time frame analyzed as a percentage of total vowel duration from 10% to 90% in 10% increments, a_{C_1}(t) and b_{C_1}(t) are initial consonant-context parameters, a_{C_2}(t) and b_{C_2}(t) are final consonant-context parameters, and x_V(t) is the estimated formant target at the given time frame. For both the onset and coda, the a coefficients represent context-dependent scale parameters while the b coefficients represent context-dependent bias—i.e., consonant contexts can both scale and shift the target formant frequency.

The formant-frequency target (x_V) is therefore the predicted vowel formant frequency for a reference consonant context—one for which all a and b parameters are zero. Ideally, this reference context would be the null context—i.e., an isolated vowel with no syllable onset or coda. However, we cannot estimate formant frequencies of isolated vowels with this model because there are very few null onsets or codas in the spontaneous-speech data. Even vowel-initial or vowel-final words are typically subject to some coarticulation with the consonants of neighboring words where there are no intervening silences. For the consonant-context parameters a and b, it was decided to assign the /hVd/ context as the “reference” context that would have zero parameters for both the scale and bias terms for syllable onset and coda, as this would allow direct comparison with previous results (e.g., Hillenbrand et al., 1995). However, the choice is completely arbitrary.

Without this constraint the models are indeterminate. If we set a_{C_1} and b_{C_1} to zero for initial /h/ and a_{C_2} and b_{C_2} to zero for final /d/, we are effectively mapping all /CVC/s to the standard /hVd/ context for which we have empirical measurements in the citation-speech database. When the coefficients are fixed in this way, the estimated vowel coefficient is an estimate of the target formant value at the selected time point in a /hVd/ syllable—not that of an isolated vowel. We note that this may leave some residual consonant-context effects, both from approximation error and from consonantal effects due to initial /h/ and final /d/. However, the same is true of the empirical /hVd/ observations, and this method should thus render spontaneous-speech measurements more nearly comparable to the /hVd/ utterances in elicited speech.
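The paper identifies the fitting procedure only as nonlinear regression. The sketch below instead uses alternating linear least squares, exploiting the fact that Eq. (1) is linear in the vowel targets when the consonant parameters are held fixed and vice versa; the function names, data layout, and consonant labels are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def fit_frame(tokens, onsets, codas, vowels, n_iter=50):
    """Fit Eq. (1) at one time frame by alternating linear least squares.

    tokens: iterable of (onset, vowel, coda, F) observations for this frame.
    The /h/ onset and /d/ coda form the reference context (a = b = 0 fixed),
    so x[v] estimates the formant target in a /hVd/ frame.
    """
    a1 = {c: 0.0 for c in onsets}; b1 = {c: 0.0 for c in onsets}
    a2 = {c: 0.0 for c in codas};  b2 = {c: 0.0 for c in codas}
    tokens = list(tokens)
    x = {v: float(np.mean([F for (_, vv, _, F) in tokens if vv == v]))
         for v in vowels}
    # One (a, b) column pair per non-reference consonant.
    cols = ([("on", c) for c in onsets if c != "h"]
            + [("co", c) for c in codas if c != "d"])
    for _ in range(n_iter):
        # Step 1: consonant effects fixed -> closed-form refit of each target.
        for v in vowels:
            num = den = 0.0
            for c1, vv, c2, F in tokens:
                if vv == v:
                    m = a1[c1] + a2[c2] + 1.0
                    num += m * (F - b1[c1] - b2[c2]); den += m * m
            if den:
                x[v] = num / den
        # Step 2: targets fixed -> linear least squares for all a, b at once.
        A = np.zeros((len(tokens), 2 * len(cols))); y = np.zeros(len(tokens))
        for i, (c1, vv, c2, F) in enumerate(tokens):
            y[i] = F - x[vv]
            for j, (pos, c) in enumerate(cols):
                if (pos == "on" and c == c1) or (pos == "co" and c == c2):
                    A[i, 2 * j] = x[vv]    # scale term a_C(t) * x_V(t)
                    A[i, 2 * j + 1] = 1.0  # bias term b_C(t)
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        for j, (pos, c) in enumerate(cols):
            (a1 if pos == "on" else a2)[c] = float(beta[2 * j])
            (b1 if pos == "on" else b2)[c] = float(beta[2 * j + 1])
    return x, (a1, b1), (a2, b2)

def predict(c1, v, c2, x, on, co):
    """Eq. (1): predicted formant value for a /C1VC2/ token at this frame."""
    (a1, b1), (a2, b2) = on, co
    return (a1[c1] + a2[c2] + 1.0) * x[v] + b1[c1] + b2[c2]
```

In the study this fit would be repeated independently for each of the nine time frames and for each of F1 and F2.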

The model in Eq. (1) can account for consonantal context effects at a single time slice that are consistent with relationships imposed by the exponential relative-time model IVa of Broad and Clermont (1987) for /CV/ or /VC/ trajectories. This model is further refined by Broad and Clermont (2014; Eq. 1) as

F_{CVC}(n) = T_V - (T_V - L_C) G_C(n) - (T_V - L_{C'}) G_{C'}(n),
(2)

where n is the time frame, T_V is the vowel formant target, L_C and L_{C'} are consonant-specific parameters for the onset and coda consonants, respectively, and G_C(n) and G_{C'}(n) are transition-shape functions related to the onset and coda, respectively. This is algebraically equivalent to

F_{CVC}(n) = [-G_C(n) - G_{C'}(n) + 1] T_V + [L_C G_C(n) + L_{C'} G_{C'}(n)],
(3)

where the relationship between the models of Broad and Clermont and the one presented here in Eq. (1) is more apparent. The primary difference is that the parameters in Eq. (1) are completely unconstrained by time frame, i.e., x_V need not be constant across values of t, which allows us to model VISC. In addition, whereas G_C and G_{C'} are parametric (viz., exponential) functions, the a and b parameters in Eq. (1) are estimated directly by fitting the equation to the data via least-squares optimization at each time frame.

The present model preserves some of the constraints of the Broad and Clermont (1987, 2014) model: namely, that there are additive effects of vowel targets and systematic deviations due to onset and coda consonants from those targets at each slice in time. However, by removing some constraints, the model in Eq. (1) also offers some real advantages in exploratory analysis of a large database: We can estimate empirically any shape (including VISC) of any formant-trajectory pattern in a baseline reference /hVd/ frame as well as any pattern of changes in the magnitude of consonant-context effects across time.

Because vowel formant targets and context effects are estimated independently for each frame, it is possible for the model to predict formant trajectories that are physiologically impossible to produce, e.g., large jumps or discontinuities in target frequency for a given vowel. This potential volatility makes the model unsuitable for anything but large data sets; given the volume of data relative to the model's degrees of freedom, however, implausibly sharp jumps across adjacent frames should be rare.

The model described by Eq. (1) is similar in form to locus equations (Sussman et al., 1998), in which a place-specific linear relationship is proposed between formant-onset and vowel-center formant frequencies (the latter roughly approximating vowel targets) for CVs, combined with analogous relationships for final VCs. With the model described in Eq. (1), we posit an underlying formant frequency at a time slice that is influenced by the effects of adjacent consonants in a manner similar to the way vowel onsets and offsets are modelled by locus equations.4 Whereas locus equations model the relationship between the observed vowel-center and onset formant frequencies, the present model posits a linear relationship between observed formant frequencies and target formant frequencies that cannot be measured directly but are subject to consonant-context effects. In addition, vowel-target formant frequencies in our model are not restricted to observed values at the vowel center. While locus equations are used primarily to model the production and perception of stop-consonant place of articulation across vowel contexts, we use the present model to describe systematic variation in formant frequency for a given vowel category across consonant contexts (see also Kluender et al., 2013).

The 38 000+ log-normalized formant tracks for F1 and F2 from stressed vowels described in Sec. II were used to estimate the parameters for the model in Eq. (1). Figure 3 shows fitted estimates of formant frequencies for the syllable /bæb/, illustrating the relative effects of onset- and coda-consonant context. For example, relative to target in Fig. 3 for the reference context /hæd/ [i.e., the time function xæ(t) from Eq. (1)], the syllable-initial /b/ (onset) lowers F2 near the onset, which is consistent with the locus-equation view of bilabial frequency targets. In addition, the /b/ onset lowers F1 towards the onset of the syllable, which is also expected. Similarly, the figure shows that the coda /b/ (coda in Fig. 3) lowers the offglide of F2 relative to /hæd/. Somewhat surprisingly, F1 is raised throughout the duration of the vowel with /b/ codas relative to the context effects for /d/ codas, as in /bæd/ or /hæd/, respectively. This suggests that context effects can extend beyond the immediate vicinity of the consonant itself.

FIG. 3.

Estimated effects of changes in consonant context on vowel formant frequencies in the syllable /bæb/. Each curve shows an estimate from the fitted model in Eq. (1). Target shows the estimated formant-frequency trajectory for the syllable /hæd/ (i.e., the reference context); onset shows estimated formants for /bæd/—i.e., the target including differential effects for the /b/ onset; coda shows estimated formants for /hæb/; and result shows estimated formants for /bæb/.

Figure 4 shows similar data for the syllable /bæk/ but also shows the empirically observed formant frequencies for productions of /bæk/ directly from the unsupervised formant tracking procedure described above. This syllable was selected for comparison between the fitted and observed formant frequencies as it is the most frequently occurring /CVC/ syllable that does not occur in a grammatical function word (459 tokens in total). In contrast, the syllable /bæb/ occurs only twice and cannot provide reliable empirical estimates. Figure 4 shows a remarkably good fit from the model when compared with the observed data. In this case, the coda /k/ shows a raising of F2 towards the end of the vowel as expected. As with /bæb/, the coda /k/ also shows an overall raising of F1 relative to /hæd/ which suggests that it may be the coda /d/ that pushes the F1 for /æ/ downward in frequency.

FIG. 4.

Estimates of consonant-context effects on vowel formant frequencies in the syllable /bæk/. Target and onset are exactly the same as those in Fig. 3. Coda shows estimated formants for /hæk/; and result shows estimated formants for /bæk/. Observed shows the means of formant frequencies observed from 459 tokens of the syllable /bæk/ from the database.

For the estimates illustrated in Fig. 3, the lack of /bæb/ tokens does not present a problem for the modeling which, in this case, depends only on having enough occurrences of /b/ onsets (4368), /b/ codas (891), and /æ/ vowels (6811). Indeed, /bæb/ could have been modeled even if there were no occurrences of the syllable in the database. There are 1353 unique syllables in the database, excluding unstressed syllables and function words, and, of those, 807 occur fewer than ten times. However, we are able to model syllables like /bæb/, which occurs only twice in the entire database, because there are no interaction parameters in the model described by Eq. (1): onset and coda effects are estimated from the entire dataset. Nonetheless, Fig. 4 gives a good indication of the model fit for a syllable that occurs frequently enough for empirical formant-frequency tracks to be meaningful.
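
Because onset, vowel, and coda parameters are pooled over the whole database, the trajectory of a rare or unattested syllable can be assembled from pieces estimated elsewhere. A minimal illustration with invented per-frame values (not the actual fitted estimates from the study):

```python
import numpy as np

# Invented per-frame estimates (log Hz) at five normalized time points,
# as if taken from a fit over the entire database (not from /baeb/ tokens).
target_ae = np.array([7.42, 7.40, 7.38, 7.36, 7.35])   # /haed/ reference target
onset_b = np.array([-0.20, -0.10, -0.04, -0.01, 0.00])  # /b/-onset deviations
coda_b = np.array([0.00, -0.01, -0.05, -0.12, -0.22])   # /b/-coda deviations

# With no interaction terms, the trajectory for a syllable that is rare or
# absent in the data can still be assembled frame by frame:
pred_baeb = target_ae + onset_b + coda_b
```

The onset effect dominates the early frames and the coda effect the late frames, mirroring the qualitative pattern in Fig. 3.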

Figure 5 compares the estimates of the vowel targets for /hVd/ from spontaneous speech with the observed formant frequencies from citation-speech samples of /hVd/ (NS from Fig. 2). The figure shows the fronting of /u/ and the backing of /ɪ/ and /ɛ/ in spontaneous speech, as well as greater formant movement of /ɑ/ which, as in citation speech, has merged with /ɔ/. In addition, we see dramatic changes in the formant trajectories for /eɪ/ and /oʊ/ in spontaneous speech relative to citation forms. This is also evident in /i/ and may be due to greater context effects from the /d/ coda in spontaneous speech.

FIG. 5.

Comparison of VISC in spontaneous and citation-form speech for 11 vowels. Here citation refers to empirically measured /hVd/ syllables from the read-speech task while spontaneous refers to fitted formant-frequency estimates from the spontaneous-speech corpus after accounting for variable consonant-context effects. Similar to Fig. 2, the symbol identifying the vowel is plotted at the formant frequencies at 20% of vowel duration and the other end of each curve gives the formant frequencies at 80% of duration. See Sec. II B 2 for details regarding the analysis.

Because we attempted to elicit both /ɑ/ and /ɔ/ from speakers who purportedly have no such phonemic distinction (see Labov et al., 2006; Kiefte and Kay-Raining Bird, 2010), we have an opportunity to evaluate the variability of the results of the formant-fitting algorithm in a manner similar to assessing split-half reliability. For example, in Fig. 5, the spontaneous-speech formant tracks for /ɑ/ and /ɔ/ are remarkably similar, although there is a slight difference between them in F2. There are a number of reasons why this difference may occur, and these are explored in the discussion in Sec. V. Nonetheless, the tracks are almost entirely parallel, suggesting that the algorithm is relatively reliable.

Above, it was suggested that the /d/-coda context may result in an overall lowering of F1 for /æ/ throughout its duration. Figure 6 shows vowel formant trajectories for both /hVd/ and /hVb/ syllables as estimated by the fitting procedure described above. The figure shows that, except for the close vowels /i/ and /u/, F1 is higher at the vowel onset for /hVb/ syllables than for /hVd/ syllables. The second formant appears to be very similar at the vowel onset for both /hVd/ and /hVb/ syllables. As expected, however, F2 decreases towards the vowel offglide for /b/ codas.

FIG. 6.

Comparison of fitted /hVd/ and /hVb/ formant trajectories in spontaneous speech. The solid lines are identical to the dotted lines in Fig. 5. Similar to Figs. 2 and 5, the symbol identifying the vowel is plotted at formant frequencies at 20% vowel duration and the endpoint is at formant frequencies at 80% duration. See Sec. II B 2 for details regarding the analysis.

Our work so far has focused on a primarily methodological question: Given only a transcript of the data, can we automatically segment a large database, track formant frequencies, and model coarticulatory effects to produce meaningful summaries of vowel patterns over a wide range of consonant contexts? The preliminary answer is yes. The method might also be used to identify potentially interesting subsets of the data: Trajectories of individual productions that deviate greatly from the expected patterns can be checked, and error patterns can be clustered to screen outliers or to identify sub-dialects. While we sketch some potential examples below, it remains to be seen whether these methods will be more helpful in addressing specific issues of interest in dialectology or sociophonetics than previous techniques.
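
One way such outlier screening might look in practice is to compute each token's RMS deviation from the model's fitted trajectory and flag extreme tokens for hand-checking or clustering. The trajectories below are invented for illustration:

```python
import numpy as np

# Invented example: a fitted trajectory for one vowel category and three
# observed tokens (log Hz at five normalized time points).
fitted = np.array([7.40, 7.35, 7.30, 7.28, 7.25])
observed = np.array([
    [7.42, 7.36, 7.31, 7.27, 7.24],   # close to the prediction
    [7.41, 7.33, 7.29, 7.29, 7.26],   # close to the prediction
    [7.80, 7.75, 7.70, 7.66, 7.60],   # strongly deviant token
])

# RMS deviation of each token from the model prediction.
rms = np.sqrt(np.mean((observed - fitted) ** 2, axis=1))

# Flag tokens whose deviation exceeds, say, three times the median; such
# tokens could be hand-checked or clustered to look for sub-dialects.
outliers = np.flatnonzero(rms > 3 * np.median(rms))
```

The threshold here is arbitrary; in a real screening pass it would be tuned to the error distribution of the corpus.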

Although the model appears to account for consonant-context effects well, it cannot accommodate certain types of allophonic variation. For example, in Canadian raising, the onglides of the diphthongs /aɪ/, /aʊ/, and /ɔɪ/ are raised before voiceless codas (Chambers, 1973). In the Canadian Maritimes, this raising produces some degree of monophthongization for /aʊ/, which is often produced as [ɔʊ] (Kiefte and Kay-Raining Bird, 2010). While the model as defined in Eq. (1) has distinct terms for each of these three diphthongs in xV, as well as terms aC2 and bC2 that model the context effects of voiceless plosives in coda position, there are no interaction terms that allow vowel targets xV to vary across consonant contexts. The model expects all productions of a given vowel to have the same underlying formant-frequency target and, therefore, the estimated targets will be distorted by this type of allophonic variation. This issue could potentially be addressed within the current framework via two routes: First, after fitting a general model without allophonic variation, lists of test words with and without the suspected allophonic variants could be prepared, and the patterns of deviation of the two lists from the model's predictions could be compared in light of the possibility of differing allophonic alternatives. Second, a revised model could be refit treating the suspect allophones as though they were separate phonemes (by labeling vowels in the two contexts with distinct vowel labels), producing an additional vowel category with a highly unbalanced distribution of consonant contexts. The decrease in overall error could then be used as a rough measure of the importance of the additional allophonic categories.
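
The second route amounts to comparing residual error with and without the split vowel label. The sketch below uses invented F1 onglide values at a single time slice; the real comparison would run over full trajectories and both formants, with the full complement of context terms:

```python
import numpy as np

def sse(y, X):
    """Residual sum of squares after an ordinary least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((y - X @ beta) ** 2))

# Invented F1 onglide values (log Hz) for /aI/ tokens; Canadian raising
# would lower F1 onglides before voiceless codas.
y = np.array([6.60, 6.62, 6.58, 6.40, 6.38, 6.42])
voiceless_coda = np.array([0, 0, 0, 1, 1, 1])

# Baseline model: a single shared /aI/ target.
X_one = np.ones((len(y), 1))
# Split model: pre-voiceless /aI/ relabeled as a separate vowel category.
X_two = np.column_stack([1 - voiceless_coda, voiceless_coda]).astype(float)

# Decrease in error when the allophonic split is allowed.
drop = sse(y, X_one) - sse(y, X_two)
```

A large drop in error relative to the added parameters would suggest that the extra allophonic category is warranted.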

One problem with the present data set is that it includes some diversity in speech patterns as the sample spans an area that includes several regions known to have speakers of dialects that are considered distinct (Kiefte and Kay-Raining Bird, 2010). This variation can be traced back to historic settlement patterns of the last three centuries. For example, residents of the island of Cape Breton speak dialects very similar to those spoken in Newfoundland owing perhaps to the large influx of settlers from there in the 19th century. In contrast, speech along the South Shore is frequently non-rhotic, similar to that of New Englanders who largely settled in this area in the 18th century. Both of these dialects are again distinct from that spoken in Halifax, the largest city in the region. This diversity is rarely recognized and many authors even fail to distinguish Maritime English from other varieties in Canada (e.g., Rogers, 2000; Wells, 1982). With respect to the modelling discussed here, this diversity may substantially increase variability in estimates of the model parameters and, by extension, fitted formant frequencies for vowel targets and consonant-context effects. One way to address this would be to partition the database into geographic groupings and determine model coefficients separately for each. In fact, this dialect region was chosen specifically for the present study so that the methods described here could be used to aid in the description of dialect variation in vowels—including any differences in patterns of VISC—and this analysis has been reported elsewhere (e.g., Kiefte and Nearey, 2015). The conclusion of that investigation, however, was that purported differences in regional dialect were substantially smaller than those found between, e.g., Maritime Canadian and other varieties of Northern American English which is why we are somewhat justified in pooling the database without dialectal subgroupings for the analysis presented here.

Another issue that is difficult to address is the influence of word frequency on acoustic characteristics of individual vowels. Although the database is screened for function words that are generally unstressed and produced rapidly, it is difficult to control for frequency effects on content/lexical words. For example, Fig. 4 shows the observed and predicted formant track for the syllable /bæk/ which occurs exclusively in the word back and its variants such as backpack. It is not inconceivable that lexical diffusion effects may alter the production of the vowel in this word but not in lower frequency words. Because the vowel /æ/ occurs frequently in this context, this may distort the predictions made by the model. However, because consonant-context and vowel parameters are fit independently, the likelihood that individual content words will have a substantial effect on fitted estimates is somewhat mitigated. Conversely, however, it may be possible to test theories of lexical diffusion by comparing model predictions directly with observed formant tracks similar to what we have done in Fig. 5. This could be accomplished in a manner similar to that described above for examining Canadian Raising or dialectal variation.

Several issues have come to our attention in the course of this work that may be useful to others planning similar studies. The phoneme segmentation algorithms depend critically on how accurately the phonemic dictionary corresponds to the dialect of English to be analyzed. In the current work, several such errors, which could have caused significant deviations in the data, were detected at an early stage of the study; they were detected because they pertained to high-frequency words. For example, the primary stressed vowel in the word because and its variant 'cause was given as /ɔ/. However, in this dialect, the most appropriate phonemic transcription would be /bɪˈkʌz/ and not /bɪˈkɔz/. Because this is a very high-frequency word, the model erroneously reported a sharp distinction between /hɔd/ and /hɑd/ in spontaneous speech, which was very unexpected given that no such distinction was made in citation speech; it was only after investigating the cause of this deviation that the inappropriate entry in the dictionary was discovered. Ultimately, it was decided to omit this word entirely from the database as it serves as a grammatical function word; if it were a lexical word, we could instead have modified the pronunciation dictionary. However, it is highly likely that many other errors exist with effects small enough to escape detection, and fixing these issues is an ongoing process.

Even in the absence of outright transcription errors, very high-frequency words may have an undue influence on model statistics, especially in the case of rarer phonemes. For example, in the present study, variants of the word talk (talking, talked, etc.) account for 20% of the total purported /ɔ/ tokens in the dataset, which has the potential to bias the estimates substantially. Small but consistent differences in F2 between /ɔ/ and /ɑ/ were shown in Figs. 5 and 6, and these may have been due to the presence of some speakers who did differentiate these two vowels in certain words. Anecdotally, we found at least three speakers from the South Shore of Nova Scotia who produced a clear /ɔ/ in words such as talk and water in spontaneous speech but not in citation speech.
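
Screening for this kind of lexical dominance is simple once tokens carry word labels. The sketch below uses invented counts chosen to mirror the 20% figure above:

```python
from collections import Counter

# Invented word labels for tokens of the purported vowel category,
# constructed so that one lemma supplies a fifth of the tokens.
tokens = (["talk"] * 8 + ["talking"] * 7 + ["talked"] * 5
          + ["caught"] * 30 + ["law"] * 25 + ["dawn"] * 25)

counts = Counter(tokens)
talk_variants = {"talk", "talking", "talked"}

# Share of all tokens contributed by variants of a single lemma; a large
# share flags a phoneme whose pooled target estimate may be lexically biased.
share = sum(counts[w] for w in talk_variants) / len(tokens)
```

In a real corpus the same tally, run per phoneme, would identify which vowel categories are most exposed to lexical-diffusion effects.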

As for the modeling itself, we hope to continue developing more strongly parametric models that introduce constraints on the independence of trajectories across time frames, such as those found in the models of Broad and Clermont (1987, 2014). Initial work on these has taken two main paths. The first follows the work of Nearey (2013), which generalizes the Broad and Clermont models to handle stylized dual-target vowels but otherwise maintains the key parametric constraint that consonant-context effects follow an exponential shape towards either the onset or coda. The second path involves constraining the vowel formant trajectories via semiparametric modeling using monotonic or penalized spline functions. While we have obtained some promising results with highly balanced laboratory speech, efforts to fit highly unbalanced spontaneous speech samples have so far not been satisfactory.

This work was supported by the Social Sciences and Humanities Research Council of Canada. The authors would like to thank David Broad, the Associate Editor, and an anonymous reviewer for their valuable comments on an earlier draft of this work.

1. We use the qualifier “apparent” because the methods we develop here require that we adopt a reference consonantal frame. Since we expect that the consonants of that frame contribute somewhat to the formant patterns at 20% and 70% of duration, we cannot claim that the movement patterns shown represent only vowel-inherent change. See Nearey (2013) for a discussion. Nonetheless, differences in movement between pairs of vowels (especially those that are near neighbors in the F1-F2 space) are strongly suggestive of differences in vowel-inherent formant movement patterns.

2. The analysis followed the convention of the CMU Pronouncing Dictionary and P2FA of treating the affricates as single speech sounds. In addition, the phoneme /h/ never occurred in coda position while /ʒ/ never occurred in onset position, leaving 16 possible onsets and 16 possible codas.

3. The nonlinear regression uses the trust-region-reflective algorithm as implemented by the lsqnonlin function in MATLAB.

4. Such a linear relation is indeed implicit in, e.g., model IVa of Broad and Clermont (1987): for /CV/ syllables via their Eq. (35) and for /VC/ syllables via their Eq. (36) on p. 162, when their time index n is fixed at a single slice. The issue of locus equations is also discussed explicitly in Broad and Clermont (2014).

1. Adank, P. (2003). “Vowel normalization: A perceptual-acoustic study of Dutch vowels,” Doctoral thesis, University of Nijmegen, Nijmegen, the Netherlands.
2. Andruski, J. E., and Nearey, T. M. (1992). “On the sufficiency of compound target specification of isolated vowels in /bVb/ syllables,” J. Acoust. Soc. Am. 91, 390–410.
3. Assmann, P. F., and Katz, W. F. (2000). “Time-varying spectral change in the vowels of children and adults,” J. Acoust. Soc. Am. 108, 1856–1866.
4. Bladon, A. (1982). “Arguments against formants in the auditory representation of speech,” in The Representation of Speech in the Peripheral Auditory System, edited by R. Carson and B. Granström (Elsevier, Amsterdam), pp. 95–102.
5. Bladon, R. A. W., and Lindblom, B. (1981). “Modeling the judgement of vowel quality differences,” J. Acoust. Soc. Am. 69, 1414–1422.
6. Boberg, C. (2010). The English Language in Canada (Cambridge University Press, Cambridge).
7. Boersma, P. (2001). “Praat, a system for doing phonetics by computer,” Glot Int. 5, 341–345.
8. Broad, D. J., and Clermont, F. (1987). “A methodology for modeling vowel formant contours in CVC context,” J. Acoust. Soc. Am. 81, 155–165.
9. Broad, D. J., and Clermont, F. (2014). “A method for analyzing the coarticulated CV and VC components of vowel-formant trajectories in CVC syllables,” J. Phon. 47, 47–80.
10. Chambers, J. K. (1973). “Canadian raising,” Can. J. Ling. 18, 113–135.
11. Clarke, S., Elms, F., and Youssef, A. (1995). “The third dialect of English: Some Canadian evidence,” Lang. Var. Change 7, 209–228.
12. Fisher, W. M., Doddington, G. R., and Goudie-Marshall, K. M. (1986). “The DARPA speech recognition research database: Specifications and status,” in Proc. DARPA Workshop on Speech Recognition, pp. 93–99.
13. Hillenbrand, J., Getty, L. A., Clark, M. J., and Wheeler, K. (1995). “Acoustic characteristics of American English vowels,” J. Acoust. Soc. Am. 97, 3099–3111.
14. Hillenbrand, J. M. (2003). “American English: Southern Michigan,” J. Int. Phon. Assoc. 33, 121–126.
15. Hillenbrand, J. M., Clark, M. J., and Nearey, T. M. (2001). “Effects of consonant environment on vowel formant patterns,” J. Acoust. Soc. Am. 109, 748–763.
16. Hillenbrand, J. M., Houde, R. A., and Gayvert, R. T. (2006). “Speech perception based on spectral peaks versus spectral shape,” J. Acoust. Soc. Am. 119, 4041–4054.
17. Hosom, J.-P. (2009). “Speaker-independent phoneme alignment using transition-dependent cues,” Speech Commun. 51, 352–368.
18. Kiefte, M., and Kay-Raining Bird, E. (2010). “Canadian Maritime English,” in The Lesser-Known Varieties of English, edited by D. Schreier, P. Trudgill, E. W. Schneider, and J. P. Williams (Cambridge University Press, Cambridge), pp. 59–71.
19. Kiefte, M., and Kluender, K. R. (2005). “The relative importance of spectral tilt in monophthongs and diphthongs,” J. Acoust. Soc. Am. 117, 1395–1404.
20. Kiefte, M., and Kluender, K. R. (2008). “Absorption of reliable spectral characteristics in auditory perception,” J. Acoust. Soc. Am. 123, 366–376.
21. Kiefte, M., and Nearey, T. M. (2015). “Modeling consonant-context effects in dialectal variation in a large database of spontaneous speech recordings,” J. Acoust. Soc. Am. 138, 1923.
22. Kiefte, M., Nearey, T. M., and Assmann, P. F. (2013). “Vowel perception in normal speakers,” in Handbook of Vowels and Vowel Disorders, edited by M. J. Ball and F. E. Gibbon (Psychology Press, New York), pp. 160–185.
23. Kluender, K. R., Stilp, C. E., and Kiefte, M. (2013). “Perception of vowel sounds within a biologically realistic model of efficient coding,” in Vowel Inherent Spectral Change, edited by G. S. Morrison and P. F. Assmann (Springer-Verlag, Berlin), pp. 117–151.
24. Labov, W., Ash, S., and Boberg, C. (2006). Atlas of North American English: Phonetics, Phonology and Sound Change (Mouton de Gruyter, Berlin).
25. Labov, W., Yaeger, M., and Steiner, R. (1972). A Quantitative Study of Sound Change in Progress (US Regional Survey, Philadelphia), Vol. 1.
26. Lee, K.-F. (1989). Automatic Speech Recognition: The Development of the SPHINX System (Kluwer, Boston).
27. Nearey, T. M. (1989). “Static, dynamic, and relational properties in vowel perception,” J. Acoust. Soc. Am. 85, 2088–2113.
28. Nearey, T. M. (2013). “Vowel inherent spectral change in the vowels of North American English,” in Vowel Inherent Spectral Change, edited by G. S. Morrison and P. F. Assmann (Springer, Heidelberg), pp. 49–85.
29. Nearey, T. M., and Assmann, P. F. (1986). “Modeling the role of inherent spectral change in vowel identification,” J. Acoust. Soc. Am. 80, 1297–1308.
30. Nearey, T. M., Assmann, P. F., and Hillenbrand, J. M. (2002). “Evaluation of a strategy for automatic formant tracking,” J. Acoust. Soc. Am. 112, 2323.
31. Peterson, G. E., and Barney, H. L. (1952). “Control methods used in a study of the vowels,” J. Acoust. Soc. Am. 24, 175–184.
32. Rogers, H. (2000). The Sounds of Language (Longman, London).
33. Rosner, B. S., and Pickering, J. B. (1994). Vowel Perception and Production (Oxford University Press, Oxford).
34. Sussman, H. M., Fruchter, D., Hilbert, J., and Sirosh, J. (1998). “Linear correlates in the speech signal: The orderly output constraint,” Behav. Brain Sci. 21, 241–299.
35. Wells, J. C. (1982). Accents of English: Beyond the British Isles (Cambridge University Press, Cambridge), Vol. 3.
36. Yuan, J., and Liberman, M. (2008). “Speaker identification on the SCOTUS corpus,” J. Acoust. Soc. Am. 123, 3878.
37. Zhang, C., Morrison, G. S., Enzinger, E., and Ochoa, F. (2013a). “Effects of telephone transmission on the performance of formant-trajectory-based forensic voice comparison—female voices,” Speech Commun. 55, 796–813.
38. Zhang, C., Morrison, G. S., Ochoa, F., and Enzinger, E. (2013b). “Reliability of human-supervised formant-trajectory measurement for forensic voice comparison,” J. Acoust. Soc. Am. 133, EL54–EL60.