Regional variation in American English speech is often described in terms of shifts, indicating which vowel sounds are converging or diverging. In the U.S. South, the Southern vowel shift (SVS) and African American vowel shift (AAVS) affect not only vowels' relative positions but also their formant dynamics. Static characterizations of shifting, with a single pair of first and second formant values taken near vowels' midpoint, fail to capture this vowel-inherent spectral change, which can indicate dialect-specific diphthongization or monophthongization. Vowel-inherent spectral change is directly modeled to investigate how trajectories of front vowels /i eɪ ɪ ɛ/ differ across social groups in the 64-speaker Digital Archive of Southern Speech. Generalized additive mixed models are used to test the effects of two social factors, sex and ethnicity, on trajectory shape. All vowels studied show significant differences between men, women, African American and European American speakers. Results show strong overlap between the trajectories of /eɪ, ɛ/ particularly among European American women, consistent with the SVS, and greater vowel-inherent raising of /ɪ/ among African American speakers, indicating how that lax vowel is affected by the AAVS. Model predictions of duration additionally indicate that across groups, trajectories become more peripheral as vowel duration increases.
I. FRONT VOWEL SHIFTING IN SOUTHERN U.S. ENGLISH
Dialects of the Southern United States have a variety of distinguishing features, among them the Southern vowel shift (SVS) (Labov et al., 1972; Labov et al., 2006; Fridland, 2001; Thomas, 2003), and the African American vowel shift (AAVS) (Thomas, 2007; Kohn, 2013). The SVS is a chain shift, argued to begin with monophthongization of /aɪ/ to [a:], that affects front vowels /i eɪ ɪ ɛ æ/; additionally, back vowels /u, oʊ/ are fronted in many Southern varieties. The AAVS shares some features with the SVS, but is distinct: the lax vowels /æ ɛ ɪ/ are raised and fronted, but the tense vowels are not, and back /ɑ/ is fronted; while fronting of /u oʊ/ can occur, it is rarer or less extensive than in the SVS (Thomas, 2007). The front vowels in particular realize their shifting through diphthongization and vowel-inherent spectral change (VISC) [see Morrison and Assmann (2013) for recent discussion]. However, the formant dynamics of Southern varieties are insufficiently understood, and little quantitative work has directly compared characteristics of vowels influenced by the SVS and AAVS.
In this paper, we test the hypothesis that the acoustics of front vowels spoken by Southern European Americans (EAs) and African Americans (AAs) differ significantly in their formant dynamics. While previous models of VISC in Southern speech rely on straight-line summaries of formant contours, we innovate by directly modeling their nonlinear properties using generalized additive mixed models (GAMMs) (Wood, 2017a; Winter and Wieling, 2016). We focus on a large corpus recorded in the mid-20th century, the Digital Archive of Southern Speech (DASS) (Kretzschmar et al., 2013), where shifting is expected to be widespread due to both the corpus' age (the SVS and AAVS strengthened over the first half of the 20th century; Thomas, 2005, 2007) and the geographic area it represents. As language change took place via the spread of these shifts, it is possible that men and women adopted their characteristics at different rates; thus, we also test the relationship of speaker sex to vowel acoustics. We identify cross-demographic speech trends that arose as the SVS and AAVS reached their peak, providing a basis of comparison for newer analyses, because change is ongoing: the SVS is known to be weakening in the 21st century, especially in certain urban areas (Thomas, 1997; Dodsworth and Kohn, 2012; Dodsworth and Benton, 2017). In the remainder of this section, we motivate our hypotheses by describing the SVS, AAVS, and previous work on vowel dynamics, particularly within Southern U.S. speech.
A. Characteristics of front vowels in the SVS
Descriptions of white Southern speech stoke expectations of diphthongal, dynamic vowels. The high tense vowel /i/ (fleece, in terms of lexical sets, Wells, 1982) is slightly diphthongal, as in [i], with a retracted onset gliding to a high front vowel, or even [ɪi] (Thomas, 2005). Although geographic variation occurs, /eɪ/ (face) can lower in its initial portion, to [ɛi] or an even more retracted position, particularly in areas of the South where diphthongal /aɪ/ (prize/price) is entirely eliminated (Thomas, 2005). The lax vowels shift upwards to a tensed position, and can also break into diphthongs or triphthongs: high lax /ɪ/ (kit) can appear as [i] or [iə], though this is unlikely in unstressed syllables, where [ɪ] is retained (Thomas, 2005). Mid lax [ɛ] (dress) can also raise towards tense [e] or [eə], or to the triphthong [eiə] in certain emphatic contexts (Thomas, 2005).
The loss of monophthongal /e/ was an early change in white Southern speech, occurring since the late 19th century (Thomas, 2005). Since weakening of the glide in /aɪ/ also began at that time, and is argued to trigger the SVS (Labov et al., 2006), it is reasonable to assume that the majority of these changes have gained strength since the late 19th and through the 20th centuries (for more on /aɪ/ in DASS data see Renwick and Stanley, 2017; Olsen et al., 2018).
Descriptions of front-vowel reversals are borne out by static acoustic measurements, while more recent research has investigated vowel dynamics. Static, single-point measurements are common despite widespread knowledge that the “Southern drawl” (Bailey, 1968; Thomas, 2001, 2003; Wells, 1982) can include multiple acoustic targets and “involves the addition of a second or even a third vowel” (Montgomery, 1989, p. 761). Many studies rely on a single pair of formant measurements taken at the vowel midpoint or indicating the central tendency, “the main trajectory of the tongue during its articulation” (Labov et al., 2006, p. 38). In such analyses, summary statistics are calculated to quantify SVS participation. Studies measuring EA Southern speech at a different time point, such as one-third of vowel duration, suggest strong overlap of /e ɪ/ instead of /e ɛ/ (Clopper et al., 2005, p. 1672). These conflicting findings indicate the inadequacy of static analysis methods, because the vowels concerned do not consist of just one steady state.
Measures of VISC and analyses of vowel timing have shown that the dynamics of Southern speech are distinct from other regions of the US. Fox and Jacewicz (2009) compare speech from North Carolina to samples from Ohio and Wisconsin. The measures of trajectory length and spectral rate of change, calculated across five formant samples per vowel, are much longer for NC speakers particularly in /ɪ ɛ eɪ/, but they are shorter in monophthongized /aɪ/ (see Sec. I C for further discussion). Farrington et al. (2018) build on this work using three-point trajectories, which they argue are a “more accessible measure for typical sociophonetic analyses that extract formant measures only at the onset, midpoint, and offset points” (Farrington et al., 2018, p. 197). They use /eɪ ɛ/ Euclidean distance as a proxy for Southern shifting, as a smaller Euclidean distance between /eɪ ɛ/ is found to index Southern speech in production and perception (Fridland et al., 2014; Fridland and Kendall, 2012; Kendall and Fridland, 2012). They show that measures of trajectory length and spectral rate of change correlate well with individual speakers' participation in the SVS. However, Farrington et al. (2018) find few distinctions between their Southern group (from Tennessee, North Carolina, and Virginia) and Western speakers from Nevada.
Durational properties of vowels also vary across regions. Fox and Jacewicz (2009) show that NC vowels are longer than OH or WI vowels. Fridland et al. (2013, 2014) find that Southern (Memphis, TN) speakers have a smaller duration difference between /eɪ, ɛ/ than speakers from other regions. Speakers with the smallest /eɪ ɛ/ Euclidean distance tend to have the smallest duration differences; the opposite pattern is found in Western and Northern speakers, where these are negatively correlated. This suggests that “southern tense and lax vowels would appear to be heading toward F1/F2 and durational merger” (Fridland et al., 2014, p. 347), but the authors acknowledge that formant dynamics help to maintain the contrast. They conclude that “duration is simply not a particularly useful cue in the Southern high front system […]. Instead of being a primary cue for these vowels, duration, likely along with spectral change over time, may be part of a package of acoustic distinctions that signals both dialect and vowel category information” (Fridland et al., 2014, p. 348). Indeed, timing appears to be treated differently by Southerners than elsewhere in the U.S.; controlled studies show longer durations of SVS vowels (Clopper et al., 2005; Fox and Jacewicz, 2009) alongside a slower articulation rate and long pauses (Clopper and Smiljanic, 2015).
As Farrington et al. (2018, p. 187) acknowledge, “very little linguistic work on Southern speech has focused on dynamics.” In particular, we lack an understanding of Southern vowels as curves, with nonlinear qualities, using methods that integrate formant movement with duration. A similar lacuna affects our understanding of the AAVS.
B. Characteristics of the AAVS
While the extent of the SVS is geographically restricted, patterns in AA speech may reflect both regional and pan-regional characteristics. The AAVS shares some properties with the SVS, but maintains important distinctions. For instance, “swapping” of the high front vowels is argued to be quite rare (Thomas, 2007, p. 462), while the mid front vowels may undergo this change relatively more frequently (Fridland, 2003; Thomas, 2001), particularly via lowering of /eɪ/ to [ɛɪ] in the face vowel but also raising of /ɛ/. This variable, but nonetheless possible, swapping of non-low front vowels leads to a different formulation of the AAVS by Kohn (2013), which includes /ɛ/ raising alongside /ɪ/ raising, and the centralization of /i, eɪ/.
Comparing AA speech to EA patterns, AA speakers have been found to have more monophthongal pronunciations of certain vowels, including the face vowel, that is, [e:] rather than [eɪ] (Dorrill, 1986). This is particularly true of speakers born in the 19th century, or at least prior to World War I; monophthongal realizations of this vowel, as well as /o/ in goat, are much more rare in younger speakers (Thomas, 2007, p. 458). In recorded interviews with formerly enslaved individuals, Thomas and Bailey (1998) find that the onsets and offsets of /e/ vowels overlap in acoustic space, rather than gliding to another quality. They propose that the origins of these monophthongs lie in the West African languages spoken by enslaved individuals brought to the United States (Thomas and Bailey, 1998). The dynamics of lax vowels also vary ethnolectically, for instance, among younger speakers in Piedmont, North Carolina, where AA speakers have less-diphthongal front lax vowels than EAs (Risdal and Kohn, 2014).
Acoustic studies demonstrate that the AA vowel space is anything but uniform or monolithic. Some researchers have shown that AA speakers are consistent with one another, having patterns distinct from the local EA variety; others have illustrated multiple speaker strategies within a community; and a third line of investigation finds that AA speakers accommodate to local, non-Black norms. Fridland (2003) shows that individual speakers implement their vowel system in a variety of ways. One has a “traditional” placement of /i ɪ/, with /eɪ ɛ/ reversed; a second has front /i/, centralized /ɪ/, and /eɪ ɛ/ lie close to one another; in a third, /i/ is front and /eɪ/ is centralized, while /ɪ ɛ/ have the same intermediate frontness and differ only in F1; and in a fourth speaker's system, /i ɪ ɛ/ all have the same F2, differ only in F1, and /eɪ/ is relatively centralized. Fridland argues that the details of these patterns are linked to the strength of speakers' social connections to AA versus EA communities in Memphis, Tennessee (2003). Kohn and Farrington (2013) find that AA speakers' front vowels vary geographically between Durham, North Carolina and nearby Chapel Hill, but also according to speakers' education level; there is a tendency toward increased raising of /ɛ/ particularly among Durham speakers who do not attend a four-year university, with limited evidence of /ɪ/ raising. Both Fridland and Kohn and Farrington show that the systems of Black speakers are different from white speakers, and Andres and Votta (2009) similarly find racial distinctions among speakers from Roswell, Georgia. Among their speakers who overlap in age with those in our study, AA speakers' /i ɪ eɪ ɛ/ are more peripheral than EA speakers', and AA speakers' /eɪ ɛ/ do not reverse, while the EA speakers' do (Andres and Votta, 2009).
Variation can also follow local norms. Data from four small, rural North Carolina communities, some coastal and some Appalachian, revealed similarities in the front vowel systems of AA and EA speakers at the local level (Childs et al., 2009). Holt (2018) uses time-varying formant data to compare AA and EA speakers in western vs eastern North Carolina. Results show that AA speakers from western NC participate partially in the SVS while maintaining some distinct AA features, but eastern speakers participate partially in the AAVS only (Holt, 2018). In Louisiana, where AA English is in contact with Cajun French, Creole French, and Cajun Vernacular English, both EA and AA speakers may have monophthongal /i, eɪ/ vowels, where /eɪ/ can raise past /ɪ/, while /ɛ/ lowers toward /æ/ (Wroblewski et al., 2009). This is argued to be a “local pattern presently restricted to southern Louisiana” (Wroblewski et al., 2009, p. 60), where two of DASS's AA speakers hailed from.
In addition to synchronic variation, AA speech has undergone considerable change during the span of time represented by speakers in the DASS corpus. For instance, “happY-tensing” (Wells, 1982) affected AA speech. The final vowel in words like happy was formerly [ɪ] across dialects of English, and it remained relatively centralized in the speech of AAs born in the late 19th and early 20th centuries, but has become increasingly high and front among speakers born after World War I and in the mid-20th century (Denning, 1989). Although we do not study /æ/ here, it raises within the AAVS, among speakers born since the late 19th century (Thomas and Bailey, 1998).
We contribute new knowledge of AA speech via a considerable injection of new data. Historical recordings of AA speech are comparatively rare, and those analyzed here have not yet been explored in acoustic detail, and are truly historical: of the 16 AA speakers in DASS, 10 were born prior to World War I, which is Thomas's (2007, p. 458) cutoff for the presence of monophthongal face and goat vowels. The speakers represent geographic areas of the South that are understudied especially with respect to North Carolina and Memphis, on which much recent work has focused. The variety's vowel dynamics also lack quantitative examination.
Based on the descriptive and phonetic evidence reviewed here, we expect to find differences in the vowel-inherent formant dynamics, and relative vowel placements, of AA and EA speakers. We test for effects of ethnicity and sex on the shape of a vowel's formant trajectory, but first we situate our work with respect to other studies of vowel formant dynamics.
C. Quantifying the dynamics of vowel trajectories
While traditional descriptions of the SVS and AAVS are based on single-point measurements, researchers are increasingly addressing the dynamic characteristics of vowels, often referred to as VISC (Nearey and Assmann, 1986; Morrison and Assmann, 2013), in varieties of English including those spoken in the Southern U.S. We review two families of models used to address spectral change in general, and Southern vowel dynamics in particular. One element that these methods share is the measurement of formant values—minimally F1 and F2—at multiple points in a vowel's trajectory.
For one set of models, the impetus for treating vowels as dynamic entities stems from the idea that VISC is perceptually relevant; that is, listeners use acoustic change over time to decode speech, rather than attending to a single auditory target near the vowel midpoint (Hillenbrand et al., 1995). This has motivated models quantifying the amount of spectral change over time using the cumulative measures of vector length (VL) (Ferguson and Kewley-Port, 2002) or trajectory length (TL) (Fox and Jacewicz, 2009). TL in particular was found to vary systematically across regional varieties of American English, increasing in cases of diphthongization (for instance, Southern versus Midwestern /eɪ, ɪ/) or decreasing in cases of monophthongization, as in Southern /aɪ/ (Fox and Jacewicz, 2009; Jacewicz et al., 2011; Jacewicz and Fox, 2013; Holt, 2018; Olsen et al., 2018; cf. Farrington et al., 2018 with a three-point trajectory). Although VL and TL, alongside measures of Euclidean distance and spectral rate of change (Fox and Jacewicz, 2009; Farrington et al., 2018), are derived from multiple pairs of F1,F2 measurements, they reveal only how much a vowel's formants are changing, without characterizing where they are going. TL and VL can also be disrupted by changes in formant direction (such as a reversal in F1,F2 trajectory), leading to inappropriately short measurements. Perhaps more problematically for models of vowel dynamics, the specification of a trajectory using three points (as in Farrington et al., 2018) does not uniquely describe a single curve, meaning that their method fails to distinguish between curves that have different shapes within F1,F2 space. We illustrate this in Fig. 1, below, with three hypothetical formant trajectories for the vowel /eɪ/. These two parabolas, and one nonlinear curve, all have differently shaped trajectories and nearly identical lengths. 
They pass through the same F1,F2 coordinates at their 20%, 50%, and 80% points (the rate of spectral change is not assumed to be uniform). The shaded region, identical across panels, outlines a triangle whose short sides represent VL Onset and VL Offset, summing to produce TL, while the long side's length is the vowel's VL. We argue that reduction to these three points obscures phonetic differences that may be perceptually relevant. We expect that formant movements between the 20% and 50% marks, and between the 50% and 80% marks, carry pertinent dialectal and social information.
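The limitation of three-point summaries can be illustrated numerically. The following minimal sketch assumes that VL is the straight-line distance between the first and last samples and that TL is the sum of distances between consecutive samples; the formant values are hypothetical and are not drawn from DASS. Two five-point tracks that share their 20%, 50%, and 80% samples receive identical three-point TL and VL values, although their full trajectories differ.

```python
import math

def segment_length(p, q):
    """Euclidean distance between two (F1, F2) points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def vector_length(track):
    """VL: straight-line distance from the first to the last sample."""
    return segment_length(track[0], track[-1])

def trajectory_length(track):
    """TL: sum of the distances between consecutive samples."""
    return sum(segment_length(p, q) for p, q in zip(track, track[1:]))

def three_point(track):
    """Keep only the 20%, 50%, and 80% samples of a five-point track."""
    return [track[0], track[2], track[4]]

# Hypothetical (F1, F2) values in Hz at 20%, 35%, 50%, 65%, 80% of duration.
# The two tracks agree at 20%, 50%, and 80% but differ at 35% and 65%.
curve_a = [(600, 1800), (540, 1860), (450, 1950), (430, 2080), (400, 2200)]
curve_b = [(600, 1800), (500, 1850), (450, 1950), (460, 2100), (400, 2200)]

tl3_a = trajectory_length(three_point(curve_a))
tl3_b = trajectory_length(three_point(curve_b))
tl5_a = trajectory_length(curve_a)
tl5_b = trajectory_length(curve_b)
```

Here `tl3_a` and `tl3_b` are equal by construction, whereas `tl5_a` and `tl5_b` differ, so only the denser sampling registers the difference in shape.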
An alternative family of VISC models treats formant trajectories as curves. Discrete cosine transform coefficients, based on dense formant sampling, have been used to improve models distinguishing vowels in American English (Zahorian and Jagharghi, 1993), tense from lax vowels in Australian English (Cox and Palethorpe, 2019; Watson and Harrington, 1999; Williams et al., 2018), and to account for patterns in vowel perception across varying consonantal environments (Hillenbrand et al., 2001). Studies of regional variation have also benefited from functional data analysis, in which a polynomial function is fitted to formant data from multiple time points, and the resulting coefficients are compared and modeled across varieties or contexts (Koops, 2014; Risdal and Kohn, 2014; Renwick and Olsen, 2017).
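As an illustration of the curve-based approach, the sketch below computes the first three discrete cosine transform (DCT-II) coefficients of a single formant track. The scaling used here (division by the number of samples, so that the zeroth coefficient equals the mean) is one of several conventions and is an assumption, as are the five-point sampling and the hypothetical Bark values; the studies cited above use denser sampling.

```python
import math

def dct_coefficients(track, n_coefs=3):
    """First n_coefs DCT-II coefficients of a formant track.

    Scaled by 1/N so that coefficient 0 is the track's mean (assumed
    convention). Coefficient 1 reflects overall slope and coefficient 2
    reflects curvature.
    """
    n = len(track)
    return [
        sum(x * math.cos(math.pi * k * (i + 0.5) / n)
            for i, x in enumerate(track)) / n
        for k in range(n_coefs)
    ]

# Hypothetical F2 values in Barks at 20%, 35%, 50%, 65%, 80% of duration.
f2_bark = [12.1, 12.6, 13.0, 13.3, 13.5]
c0, c1, c2 = dct_coefficients(f2_bark)

# A flat track and a straight ramp, for comparison.
flat = dct_coefficients([5.0] * 5)
ramp = dct_coefficients([1.0, 2.0, 3.0, 4.0, 5.0])
```

Under this sign convention a rising track yields a negative first coefficient, a flat track yields zero for every coefficient above the zeroth, and a perfectly straight ramp yields a second coefficient of zero because it has no curvature.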
Statistical modeling of curves is also possible with smoothing spline analysis of variance (SS-ANOVA) which has been applied to ultrasound data (Davidson, 2006), nasalization measurements (Carignan, 2017), and formant trajectories (De Decker and Nycz, 2006; Docherty et al., 2015; Strycharczuk and Scobbie, 2016). Now a special case of SS-ANOVA, generalized additive mixed modeling (GAMM), is gaining ground. GAMMs straightforwardly incorporate nonlinear predictors, and they are used to directly compare two curves, to determine whether they are statistically distinct in height, shape, or both. The models fitted to these curves have applications in phonetics for the study of pitch contours (Kösling et al., 2013) and articulatory movements (Wieling et al., 2016; Wieling, 2018; Tomaschek et al., 2018a, 2018b) including tongue shape (Strycharczuk and Scobbie, 2017; Noiray et al., 2019). They are applied to vowel trajectories (Sóskuthy et al., 2018; Stanley, 2020), apparent-time changes in Philadelphia English (Fruehwald, 2017), and the relationship of vowel space size to talker age (Gahl and Baayen, 2019). Increasingly, GAMMs are applied to time-varying sociolinguistic data consisting of multiple measurements per token, particularly to show change in acoustics or articulation across generations (Cole and Strycharczuk, 2019; Sóskuthy et al., 2019). Strycharczuk and Scobbie (2017) demonstrate /u/-fronting in Scottish English by comparing tongue shapes with acoustic data, while Sóskuthy et al. (2018) demonstrate that /u/-fronting in Derby English, measured via vowel trajectories, is affected by speaker age, phonological context, and word frequency. Because GAMMs permit the direct comparison of two curves, and are suitable for analyzing time course data, they are applied here to vowel trajectories from Southern U.S. speech.
D. Research questions
Our study models front vowel trajectories in DASS, a corpus described in Sec. II. By sampling formant values at multiple points, and modeling them as curves, we can test claims that the SVS and AAVS have characteristically different degrees of diphthongization and front vowel shifting.
We hypothesize that there is an effect of speaker ethnicity on vowel characteristics, related to the adherence of EA speakers to the SVS, as opposed to the development of the AAVS among AA speakers. We predict that among EA speakers, the trajectories of /i ɪ/ occupy positions consistent with centralization of /i/ and peripheralization of /ɪ/, and that the same is true of /eɪ ɛ/. We predict that for AA speakers, the trajectories of /ɪ, ɛ/ occupy positions consistent with their peripheralization. Although all four of these front vowels are diphthongal in most Southern varieties, it is an open question how or whether the social groups in our sample implement the vowels differently from a dynamic perspective. The SVS triggers front-lax diphthongization alongside diphthongal tense vowels, while vowels in AA speech were historically monophthongal (Thomas, 2007), and we predict that GAMMs will detect these differences in formant dynamics.
The two systems also have different degrees of shifting in their tense and lax vowels. This may affect whether the trajectories of two vowels overlap, or cross, in F1,F2 space. In the SVS, tense /i eɪ/ lower and centralize, but lax /ɪ ɛ/ raise and front, and some speakers may exhibit “reversals” of high /i ɪ/ and mid /eɪ ɛ/ (Labov et al., 2006). For AAs, two formulations of the AAVS exist. In the version of Thomas (2007), the lax vowels /ɪ ɛ/ are raised and fronted (peripheralized), while the tense vowels do not shift (and may remain monophthongal). For Kohn (2013, p. 62), lax /ɛ/ is peripheralized, while lax /ɪ/ and the tense vowels shift more rarely. Because the tense and lax vowels are shifting towards one another in these varieties, both the SVS and AAVS may lead to overlap in tense/lax formant trajectories, which will be larger than would occur in dialects lacking these shifts. We predict that overlapping trajectories are most likely in EA speech, because both tense and lax vowels are expected to shift in the SVS. In the AAVS we expect /ɛ/ to shift, and overlap with [eɪ], while other front vowels either do not shift (following Thomas, 2007), or they may shift less frequently, or to a smaller extent (following Kohn, 2013). Overall we predict that tense vowels in particular have different absolute positions and dynamics in EA versus AA speech, and that the formant trajectories of tense and lax vowels overlap less in AA than EA speech, meaning that they occupy more distinct regions of the vowel space.
From an analysis with GAMMs, we expect significant effects of both ethnicity and sex on the F1,F2 values of front vowels. If F1,F2 of each social group differ not only in overall formant values but also in trajectory shape, then the inclusion of nonlinear statistical predictors will improve the model fit of each GAMM, as determined by model comparison. By combining the results of separate models, we visually evaluate whether the trajectories of different vowels overlap in F1,F2 space. A static analysis could interpret the crossing of F1 and F2 as overlap or reversal, but by incorporating multiple F1,F2 measurements we expect to show that their distinction is maintained.
II. THE DIGITAL ARCHIVE OF SOUTHERN SPEECH
DASS is an audio corpus of semi-spontaneous linguistic atlas interviews (Kretzschmar et al., 2012) of 64 American speakers native to eight US states. These interviews, a subset of the Linguistic Atlas of the Gulf States (LAGS) (Pederson et al., 1986), include 34 men (10 AA) and 30 women (6 AA), born 1886–1965. The speakers represent a mixture of ethnicities, social classes, education levels, and ages. Data from all speakers are available, though varying in quality. Figure 2 illustrates the sex, ethnicity,1 and geographic origin of each interviewee. For details on the recent transcription and forced alignment of the corpus (Kretzschmar et al., 2019), see Olsen et al. (2017). The acoustic data may be plotted and viewed, alongside more detailed speaker demographic information, in the Gazetteer of Southern Vowels (Stanley et al., 2017).
The dynamic nature of Southern vowels was not lost on transcribers who worked with these recordings, generating detailed phonetic transcriptions for a subset of target lexical items. For example, in the limited transcriptions of the LAGS and DASS that accompanied the recordings, many vowels that are canonically monophthongal in most varieties of English are transcribed with diphthong-like offglides: [tɹæɛk] “track,” [ˈmæɛtrɨs]2 “mattress” (Speaker 195), [ʃɹɪəmp] “shrimp,” [ˈbɪəznɪs] “business” (Speaker 197), [ˈfɔətɨ] “forty” (Speaker 200), [fɹɑəg] “frog” (Speaker 198). These sequences are crucial to capturing the vowel patterns of these Southern speakers.
A. Data and data preparation
We analyze the very large dataset available from DASS, which contains 878 660 total vowel tokens that were aligned and measured with Dartmouth Linguistic Automation (DARLA) (Reddy and Stanford, 2015a, 2015b). This Internet-based pipeline carries out forced alignment, creating a Praat TextGrid (Boersma and Weenink, 2017) labeled at the segment and word levels and aligned in time to each .wav file. Prior to forced alignment, manual orthographic transcriptions were checked three times for typos and out-of-dictionary words. Files that caused errors during forced alignment were checked a fourth time, which often resulted in a manual adjustment of TextGrid intervals available to the forced aligner, resolving aligner crashes. With a corpus this size, manual correction of all phoneme-level alignment is not feasible, and manual validation of the force-aligned boundaries has not yet been undertaken. Nevertheless, we rely on the output of automatic methods because studies that validate these approaches find that the difference between formant measurements based on automatically aligned boundaries and those extracted from manually aligned TextGrids is minimal (Evanini, 2009, pp. 50–94). Such work also shows that means at the group level—which is how they are presented in this paper—are largely unaffected by the presence of random error (Strelluf, 2019).
DARLA extracts acoustic characteristics using the FAVE software suite (Rosenfelder et al., 2014). Values for the first two formants (F1, F2) were extracted by DARLA at five time points: 20%, 35%, 50%, 65%, and 80% of vowel duration. While most studies using GAMMs rely on a greater number of measurements per vowel, typically 10 or more, it is not currently feasible for our research team to extract acoustics at time points other than those listed here. Our data analysis includes each token's duration, but data points are referred to by the percentage at which they were measured. Thus, our models explore the extent to which formant curves are geometrically similar, given a normalized, homogeneous scaling of the time dimension.
The data from 62 speakers were analyzed3; measurements from speaker 165, an AA female, and speaker 850, an EA male, were excluded because visual inspection of an F1,F2 plot of their vowel spaces revealed many very unlikely formant values, probably due to poor audio quality. The dataset was limited to stressed vowels (stress is assigned by DARLA based on its lexicon), and was acoustically filtered to exclude outliers, which occur despite FAVE's methods for selecting an optimal formant track. Most commonly, outliers are measurements from /u/ or /o/ in which both F1 and F2 are unusually high, likely due to formant tracking errors. Filtering by Mahalanobis distance (Mahalanobis, 1936; cf. Labov et al., 2013) was carried out at each of the five measurement points. This value is akin to a scale-free Euclidean distance quantifying the straight-line distance between a data point and the center of the ellipse to which it belongs. Distances were calculated for each measurement point, for each formant, for each token, relative to a speaker- and vowel-specific centroid. Tokens with high Mahalanobis distance (based on the 95% quantile of a χ2 distribution) were excluded as outliers (159 864 measurements in this case, or 5.1% of 3 152 390 measurements subjected to filtering). This removes the grossest errors that result from the automatic alignment and formant extraction procedures.
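The filtering step can be sketched as follows. This is a minimal illustration rather than the procedure as implemented for DASS: it assumes that F1 and F2 are treated jointly as a two-dimensional point, that the covariance is estimated from the same measurements being filtered, and that the cutoff is the 95% quantile of a χ2 distribution with two degrees of freedom (which has the closed form −2 ln 0.05). The function names and formant values are hypothetical, and in practice the filter would be applied separately for each speaker, vowel, and measurement point.

```python
import math

def centroid_and_covariance(points):
    """Centroid and sample covariance (n - 1 denominator) of (F1, F2) pairs."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    s11 = sum((p[0] - m1) ** 2 for p in points) / (n - 1)
    s22 = sum((p[1] - m2) ** 2 for p in points) / (n - 1)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in points) / (n - 1)
    return (m1, m2), (s11, s12, s22)

def mahalanobis_sq(point, centroid, cov):
    """Squared Mahalanobis distance using the inverse of a 2 x 2 covariance."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 ** 2
    d1 = point[0] - centroid[0]
    d2 = point[1] - centroid[1]
    return (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det

def chi2_quantile_2df(p):
    """Quantile of a chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - p)

def filter_outliers(points, p=0.95):
    """Keep points whose squared distance does not exceed the chi-square cutoff."""
    centroid, cov = centroid_and_covariance(points)
    threshold = chi2_quantile_2df(p)
    return [pt for pt in points if mahalanobis_sq(pt, centroid, cov) <= threshold]

# Hypothetical (F1, F2) measurements in Hz for one speaker, vowel, and time point.
outlier = (900, 3400)  # implausible value standing in for a tracking error
tokens = [
    (390, 2080), (400, 2100), (410, 2120), (395, 2110), (405, 2090),
    (400, 2070), (400, 2130), (385, 2100), (415, 2100), (392, 2095),
    (408, 2105), (402, 2098), outlier,
]
kept = filter_outliers(tokens)
threshold = chi2_quantile_2df(0.95)
```

Because the distance is scaled by the group's own covariance, the cutoff adapts to each speaker and vowel rather than imposing a fixed range in Hz.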
After outlier removal, all formant values were scaled to Barks (Traunmüller, 1990), using the normVowels() function in the phonR package (McCloy, 2016) and following the methods for GAMMs in Gahl and Baayen (2019). This scaling method was chosen because it transforms F1 and F2 values from a logarithmic Hz-based scale to a linear scale, which is ideal for the dependent variable in a regression model, unlike the Lobanov normalization procedure (Lobanov, 1971).4 Bark scaling puts F1 and F2 on the same scale, meaning differences in magnitude (for example, 0.25 Barks) are perceptually comparable for F1 and F2, which allows them to be pooled as a single dependent variable (see Sec. III B). Bark transformation is not intended to eliminate formant range differences, or biologically motivated formant differences, but simply to place all values on a single scale that approximates acoustic distances in human perception (cf. discussion in Clopper, 2009).5
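For reference, the Hz-to-Bark conversion of Traunmüller (1990) can be sketched as below. The sketch assumes the base formula z = 26.81f/(1960 + f) − 0.53 and omits the corrections proposed for very low and very high Bark values; whether this matches every detail of the normVowels() implementation is an assumption, and the function name is hypothetical.

```python
def hz_to_bark(f_hz):
    """Convert a frequency in Hz to Barks (Traunmüller, 1990, base formula)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

# The same 100 Hz step covers a larger Bark interval at low frequencies
# (the F1 range) than at high frequencies (the F2 range).
step_low = hz_to_bark(600) - hz_to_bark(500)
step_high = hz_to_bark(2100) - hz_to_bark(2000)
```

The comparison of `step_low` and `step_high` shows why a fixed difference in Barks is more comparable across F1 and F2 than a fixed difference in Hz.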
The dataset was limited to stressed front vowels that do not precede nasals or liquids, avoiding anticipatory vowel-sonorant coarticulatory effects and certain vowel mergers conditioned by a following sonorant, like the pin-pen merger (Labov et al., 2006). The size of each vowel's resulting dataset, which includes vowels that either precede obstruents or are in final position, is shown in Table I.
|Vowel|Symbol|Tokens|Unique words|
GAMMs are useful for dynamically variable data (Sóskuthy, 2017; Winter and Wieling, 2016; Wood, 2017a), especially when multiple measurements are taken from a single token and vary across time. Generalized additive models (GAMs) model linear predictors as a function of fixed parametric terms and one or more smooths, permitting the inclusion of nonlinear effects (and interactions), plus Gaussian noise (Wood, 2017a). When random effects are incorporated, the resulting mixed GAM can also include smooth terms for those effects, which are modeled using ridge penalties; these can participate in interactions, typically modeled with factor smooths. The models can account for temporal autocorrelation (Baayen et al., 2017), acknowledging that measurements are not independent; and they can detect significant differences in smoothed curve shapes.
GAMMs were applied to the DASS vowel trajectory data. A separate model was fitted to each of /i ɪ eɪ ɛ/, but the same model specification was applied to each vowel. In each model, the dependent variable was formant frequency in Barks (F1 and F2, 5 values each per vowel token). Predictor types included fixed parametric terms (factors or continuous parameters); smooths (in all cases with four knots, the maximum available when five data points are used), which model the shape of trajectories across normalized time (Percent) based on a grouping factor; and random effects. Since autocorrelation is possible with time-series data (Sóskuthy, 2017), in an exploratory modeling stage we evaluated the presence of autocorrelation in our data via the acf_n_plots() function in the itsadug package (van Rij et al., 2017). We found that a minority of tokens showed autocorrelation, likely because our measurements are relatively far apart in time, and thus we do not further model autocorrelation.
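To make the autocorrelation check more concrete, the following sketch computes a lag-1 autocorrelation for a single five-point series using the standard sample estimator. It is only an illustration of the quantity being inspected, not a reproduction of acf_n_plots(); the function name and the residual values are hypothetical.

```python
def autocorr(series, lag=1):
    """Sample autocorrelation of a sequence at the given lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    if denom == 0:
        # A constant series has no variance, so report no autocorrelation
        return 0.0
    num = sum(dev[t] * dev[t + lag] for t in range(n - lag))
    return num / denom


# Hypothetical model residuals (Barks) at the five measurement points of one token
residuals = [0.12, 0.05, -0.03, -0.08, -0.06]
print(round(autocorr(residuals, lag=1), 3))
```

With only five widely spaced measurements per token, such estimates are based on very few pairs of points, which is one reason a per-token check can look unremarkable.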
With each model we tested the effect of two social factors, speaker sex and ethnicity, on formant values. These factors had two levels each, respectively, female (default level) versus male, and Black (default level) versus non-Black (labels inherited from the DASS dataset). All factors were treatment-coded, with a reference level dummy-coded as 0, to which estimates for each factor correspond in the model output, and to which other factor levels are compared. While we suspect that the effects of vowel shifting should increase with apparent time in this dataset, from older to younger speakers, initial modeling found no significant effects for age and that predictor is not further investigated here to avoid overspecification in our models. Similarly, the effects of social class, education level, and speakers' state of origin are left for future research. This is necessary because DASS does not include speakers representing all levels of those variables (for instance, there are no upper-class Black females), making it intractable to understand their effects and interactions statistically.
Variation at the phonological and speaker level was accounted for by four predictors. The first was the factor formant (default level = F1), which is necessary since all formant values are pooled into a single dependent variable (Faraway, 2006, Sec. 9.3; with articulatory data, Wieling et al., 2016; and for application to formant values, Gahl and Baayen, 2019). We included vowel duration, because we expect that vowels become (a) more peripheral and (b) more dynamic with increased length. This predictor was log-transformed, and was used as a parametric term only, to limit model complexity. We explored models that treated log duration as a smooth term, but some failed to converge. Additionally, the model included a random effect for each word. Our dataset includes 5187 unique words, and we argue that since the characteristics of each vowel depend on the lexically specific consonants (or lack thereof) that precede and follow it, this factor accounts for a great deal of vowel variation according to phonological context (see also Gahl and Baayen, 2019). We acknowledge that this method does not consider the systematicity of contextual effects. Finally, the model included a random effect for each speaker, which adjusts for differences in the characteristics of individuals' vowel spaces and datasets; the inclusion of this term justifies our avoidance of explicit data normalization.
While we expect that the social factors of ethnicity and sex do have separate, systematic effects on the formant values of front vowels, we hypothesize that these factors interact in a significant way: vowel dynamics of all four speaker groups may differ. To accommodate interactions in a GAMM context, sex and ethnicity were combined into a single four-level factor, which furthermore was crossed with the predictor Formant to create a three-way interaction term with eight levels, one for each combination of ethnicity, sex, and formant (default level: female, Black, F1). Model specifications included a further interaction between log duration and this three-way term, allowing the effects of duration to vary across all eight of its levels (see the supplementary material6).
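The construction of the combined predictor can be sketched as follows. The example pools F1 and F2 into one long-format column and labels each row with one of eight sex-by-ethnicity-by-formant levels. Column names, label format, and data values are all hypothetical and are not taken from the supplementary material.

```python
from itertools import product

# Hypothetical wide-format measurements: one row per token and time point
tokens = [
    {"speaker": "s01", "sex": "female", "ethnicity": "Black",
     "percent": 20, "F1": 4.1, "F2": 13.2},
    {"speaker": "s02", "sex": "male", "ethnicity": "non-Black",
     "percent": 20, "F1": 3.8, "F2": 12.4},
]


def to_long(rows):
    """Pool F1 and F2 into a single 'bark' column with a combined group label."""
    long_rows = []
    for row in rows:
        for formant in ("F1", "F2"):
            long_rows.append({
                "speaker": row["speaker"],
                "percent": row["percent"],
                "bark": row[formant],
                "group": f"{row['sex']}.{row['ethnicity']}.{formant}",
            })
    return long_rows


# All eight levels; the first one corresponds to the stated default level
levels = [f"{s}.{e}.{f}" for s, e, f in
          product(("female", "male"), ("Black", "non-Black"), ("F1", "F2"))]

long_data = to_long(tokens)
print(levels)
print(long_data[0])
```

Each wide row yields two long rows, so a token measured at five time points contributes ten observations to the pooled dependent variable.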
Models were fitted using the bam() function within the mgcv package in R (Wood, 2017b). Significance of each factor was tested via model comparison with dropped factors, using the compareML() command within the itsadug R package (van Rij et al., 2017). In this technique, the output of a full model, with all factors, is compared to the output of a model with one factor dropped. If the full model receives a significantly lower ML score (p < 0.05, evaluated via χ2 test), then the inclusion of the factor under consideration improves model fit and should be retained. Due to the inclusion of the three-way interaction between formant, sex, and ethnicity, three models were fitted in addition to the full model described above (see the supplementary material6):
(a) A baseline, in which sex and ethnicity were not crossed together with formant, but were each crossed with formant separately, and treated as parametric predictors only.
(b) A GAMM in which sex was entirely omitted, but ethnicity was crossed with formant and was treated as a parametric and a smooth term.
(c) A GAMM in which ethnicity was entirely omitted, but sex was crossed with formant and was treated as a parametric and a smooth term.
We expect that for each vowel, the less-complex models (a)–(c) will underperform in model comparisons with the full model that includes two social predictors in a three-way interaction with formant.
We provide a descriptive overview of front vowel trajectories in DASS before presenting the results of modeling via GAMMs.
A. Vowel trajectories in F1,F2 space
Plotting F1,F2 over normalized time shows where formant trajectories begin and end in the vowel space. To exemplify this descriptively, four individual speakers' data are shown in Figs. 3 and 4. They are speakers who are Black or non-Black, and male or female (they are not balanced for age, social class, education level, or home state). Speaker 647 is an AA female, age 77 (b. 1897), from Boothville, LA; she is from a lower-class background and has 0–7 years of education. Speaker 596 is an African American male, age 78 (b. 1895) from Mississippi; he is from a lower-middle class background and has 8–10 years of education. Speaker 662 is a European American female, age 35 (b. 1939), from Schriever, LA; she is from an upper-middle class background and has 11–12 years of education. Speaker 579 is an EA male, age 87 (b. 1886), from Vicksburg, MS; he is from an upper-class background and has 13–16+ years of education.
The front vowel trajectories in Figs. 3 and 4 show that the non-Black speakers exhibit more centralized /ɪ/ and /eɪ/ in the last portions of the vowels than the Black speakers do. In addition, /ɪ/ is lower in non-Black than Black speakers. These differences are consistent with the contrast in diphthongization between the SVS and AAVS, and statistical modeling will test for the presence of similar effects throughout the dataset.
B. Statistical models of front vowel trajectories
Data from each vowel was evaluated with a GAMM. The output of all fitted models appears in the supplementary material.6 We expect that implementation of the SVS versus AAVS will result in significant differences in vowel position (seen in model output for parametric terms) and dynamics (manifested in smooth terms) across ethnicities. Additionally, uneven implementation of those shifts is expected to result in significant differences across sexes, for instance, because women have shifted a particular vowel relatively more than men of the same ethnicity, or because one sex tends to diphthongize a vowel more than the other.
For each vowel, model comparison is used to evaluate the relative fit of models with and without each social factor. The results of model comparisons are summarized in Table II: in all cases, the full model significantly improved over one without the social factor under consideration, as well as over the baseline model lacking a three-way interaction. This indicates that speaker sex and ethnicity do affect vowel trajectories, with regard to the vowels' height and backness, as well as the shape of the F1,F2 trajectory over normalized time. All models accounted for approximately 97% of variance in the data (see adjusted R2 values in supplementary material6).
|Models compared .||df .||/i/ .||/ɪ/ .||/eɪ/ .||/ɛ/ .|
|Ethnicity dropped vs full||16||769.504||367.832||621.600||118.527|
|Sex dropped vs full||16||226.789||363.861||431.471||142.563|
|Baseline vs full||20||858.830||650.510||1068.360||274.598|
|Baseline vs ethnicity-only||4||89.326||282.678||446.760||156.072|
|Baseline vs sex-only||4||632.041||286.649||636.890||132.036|
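To give a sense of scale for the values in Table II, the sketch below computes upper-tail χ2 probabilities for each entry at its listed degrees of freedom. It assumes, for illustration only, that each tabulated value can be read directly as a χ2 statistic; the exact convention used by compareML() is not restated here. The closed-form tail used is valid only for even degrees of freedom, which covers all rows of the table, and the function name is hypothetical.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable; even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum of (x/2)^i / i! for i = 0 .. df/2 - 1
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Values copied from Table II, in the order /i/, /ɪ/, /eɪ/, /ɛ/
table_ii = {
    "Ethnicity dropped vs full": (16, [769.504, 367.832, 621.600, 118.527]),
    "Sex dropped vs full": (16, [226.789, 363.861, 431.471, 142.563]),
    "Baseline vs full": (20, [858.830, 650.510, 1068.360, 274.598]),
    "Baseline vs ethnicity-only": (4, [89.326, 282.678, 446.760, 156.072]),
    "Baseline vs sex-only": (4, [632.041, 286.649, 636.890, 132.036]),
}

p_values = {
    name: [chi2_sf_even_df(x, df) for x in scores]
    for name, (df, scores) in table_ii.items()
}

for name, ps in p_values.items():
    print(name, all(p < 0.05 for p in ps))
```

Under this reading, every comparison falls far below the 0.05 threshold, in line with the statement that the full model improves on each reduced model.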
In each full model's summary, the output for parametric terms indicates that in almost all cases, F1 terms are not significantly different with respect to the reference level (Black female speakers), while F2 terms are. Coefficient values for F2 are much higher than those for F1, and the same is true of Bark-scaled formant measurements; so it is trivial, and expected, that F2 coefficients are significantly different from the reference level. However, our models include a three-way interaction between formant, sex, and ethnicity, which is fitted as a smooth, and the smooths portion of each model summary shows the approximate significance for all levels of smooth terms, including F1 (the default). This output indicates, in all cases and for all vowels, that there are significant differences in trajectory shape across formants, sexes, and ethnicities. The significance of the smooth term for speaker sex, in particular, indicates that men and women have different-shaped trajectories; so, sex differences in DASS front vowels cannot be reduced to vocal tract-based variation.
For each vowel, the parametric output also summarizes the interaction between the combined social predictor and log duration. For all vowels, these interactions are significant, indicating that vowels' formant values vary according to their duration, and that the degree of variation changes across formants and social groups.
For /i/, where parametric terms are concerned, estimates are generally higher for female speakers, consistent with their expanded formant value range with respect to males. Female speakers of both ethnicities have relatively higher F2 than men (10.38 for Black women, 10.84 for non-Black women, versus 9.82 for Black men and 9.89 for non-Black men; p < 0.001 in all cases). However, log duration also interacts significantly with all levels of the combined social predictor, with the exception of F1 for non-Black male speakers. F2 appears to increase the most with duration for Black female speakers (β = 0.69, p < 0.001). Where the smooth terms, which detect differences in trajectory shape, are concerned, all eight levels of the three-way interaction are significant.
Turning to the model of /ɪ/, in the parametric output, non-Black females have an F1 significantly different from the reference level (β = 0.61, p < 0.001). For F2, estimates show again that women have higher second formant values for /ɪ/ than men; Black men have the lowest estimate (β = 8.16, p < 0.001), suggesting relatively greater centralization of that vowel. Log duration significantly interacts (p < 0.001) with all levels of F2 of the combined social predictor, but particularly for F2 in Black female speakers (β = 0.44, p < 0.001); and all smooth terms are significant (p < 0.001).
The model for /eɪ/ reveals variation in F1 for non-Black female speakers, whose estimate is more than 1 Bark higher than the reference level (β = 1.16, p < 0.001). Meanwhile, F1 for Black male speakers was not significantly different from that of the reference level, Black female speakers (β = −0.05, p > 0.05). Log duration interacts significantly (p < 0.001) with all levels of the combined social predictor, except for F1 for Black males (β = 0.01, p > 0.05), suggesting that Black speakers of both sexes have similar relationships between vowel timing and height. All smooth terms are significant (p < 0.001).
Finally, the model of /ɛ/ shows moderate variation in F1 across speaker groups. Male speakers have somewhat lower F1 values (higher vowels) than female speakers (estimates: β = −0.22 for Black, β = −0.19 for non-Black male speakers, p < 0.001), while females have higher F1 values (lower vowels) for /ɛ/ (β = 0.41, p < 0.01). A striking result is that non-Black female speakers' estimate for F2 is higher than all other groups (β = 7.59, p < 0.001), indicating that their /ɛ/ is relatively fronted. Log duration interacts significantly with all levels of the combined social predictor (p < 0.001), though in F2, this effect is smaller for males than for females, whose adjustments are less negative. The main effect's estimate is β = 0.21, p < 0.001, with F2 adjustment for Black females β = −0.103, p < 0.001; for non-Black females, β = −0.069, p < 0.001; for Black males, β = −0.186, p < 0.001; for non-Black males, β = −0.127, p < 0.001. As with the other models, all smooth terms are significant (p < 0.001).
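The duration estimates for /ɛ/ can be combined in a short worked example. The sketch adds each group's reported F2 adjustment to the reported main effect of log duration, treating the adjustment as a simple offset to the main effect, as the paragraph above presents them. Variable names are hypothetical, and any further terms in the fitted model are ignored.

```python
# Estimates reported in the text for the /ɛ/ model
main_effect = 0.21
f2_adjustment = {
    "Black female": -0.103,
    "non-Black female": -0.069,
    "Black male": -0.186,
    "non-Black male": -0.127,
}

# Net change in F2 (Barks) per unit of log duration, under the offset reading
net_slope = {
    group: round(main_effect + adj, 3)
    for group, adj in f2_adjustment.items()
}

for group, slope in net_slope.items():
    print(f"{group}: {slope}")
```

Under this reading, all four net slopes remain positive, so F2 rises with duration in every group, but by different amounts.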
C. Modeled vowel trajectories
Since GAMMs are used to model nonlinear effects, visualization is an effective method of summarizing their output, alongside statistical model summaries. The output of each fitted GAMM was used to generate predictions (van Rij et al., 2017) for each of the three tested variables (formant, sex, and ethnicity), and these predicted values were plotted in F1, F2 space to illustrate the differences across sex and ethnicity groups. Factors not specified in the predictions were set to the default “median” or “mode” levels (depending on the data type of the predictor) assumed by the itsadug package. The value for log duration was −2.30259 (corresponding to 0.100 s) and the random effects were for speaker 330 (Male, non-Black) and the F1 of the words be, if, a, or yes for /i ɪ eɪ ɛ/, respectively.
The modeled trajectories are shown in Fig. 5, whose four panels each plot the predicted values of one full GAMM, permitting visual comparison of trajectories across sexes and ethnicities. Figure 5 reveals that non-Black female speakers consistently occupy different ranges of the vowel space than other groups, with higher F1 and F2 values. Male speakers' trajectories are lowest in Bark-scaled F1,F2 values, consistent with the persistence of biologically based sex differences; but Black females' vowels lie closer to men's than to other women's trajectories. Although there are similarities within sexes or ethnicities, across the panels each trajectory is unique in its location, length, and shape. All four vowels are characterized by an initial movement towards a more fronted and extreme position (higher, for /i ɪ eɪ/; lower for /ɛ/), followed by a swing towards a more centralized position with lower F2. This trend applies to all four social groups, with the exception of /i/ among female AA speakers, whose trajectory from 50% to 80% is more fronted than it is from 20% to 50%. This may be an anomaly that results from the relative scarcity of data for Black female speakers in DASS, since out of six total speakers, only five produced usable data. We argue below (see Fig. 8) that this is not an artefact of the analysis with the generalized additive model; given the enormous amount of variance the model explains, it must be the case that predicted curves are very close to observed curves.
Model comparisons and Fig. 5 confirm significant differences in the trajectories of individual vowels across speaker groups, but to evaluate the strength of speaker participation in the SVS or AAVS it is necessary to consider the four vowels as a system. GAMMs show that different groups' formant values occupy different ranges, but shifts are evaluated by the positions of vowels within a single speaker or group, relative to one another. We next consider the front-vowel systems of the four social groups separately, plotting the same model predictions with one panel per speaker group. Figure 6 shows Black female and male speakers, and Fig. 7 plots Non-Black female and male speakers' predictions. These show clear differences in the relative positioning of vowels across ethnic and sex groups, indicating which vowels' trajectories overlap in formant space and whether they cross early or late in their time course. The use of GAMMs highlights inter-group differences in the extent of diphthongization and VISC that affect each front vowel. We argue in Sec. V that these differences are consistent with the implementation of the SVS versus the AAVS among DASS speakers.
As described in Sec. IV B above, the model summary of each GAMM reported a fit of approximately 97%, indicating that they captured a great deal of the variation inherent in each vowel's data. We reiterate that the model fit is indeed quite close: compare the original Bark-scaled data, shown in Figs. 8 and 9, with the predicted values plotted in Figs. 6 and 7. The original data reflect the same relative placements and trajectory shapes as the predicted values, though they have not been smoothed. Thus, while the prediction plots demonstrate the range of dynamics occurring in these already dynamic vowels, the original data highlight the models' close fit to acoustic trends observable in raw data. Taken together, they reinforce the finding that social factors of sex and ethnicity do have significant and consistent effects on vowels' spectral characteristics.
Figure 8 shows that the trajectory of Black females' /i/, in the aggregate, does finish with a higher F2 than its first half; thus, this unusual shape is not an artefact of statistical modeling, but is a characteristic of the data. We again acknowledge that only five Black female speakers are represented here, out of six available in DASS; future research will survey a larger number of Black speakers from the 235 recorded in LAGS (Pederson et al., 1986), which may reveal this trend to be sample-specific.
D. The effect of duration on formant trajectories
Vowels' positions within the vowel space are also affected by duration, which was log-transformed and included as a parametric term in each GAMM. For all four vowels, duration interacted significantly with the three-way predictor combining formant, sex, and ethnicity (discussed in Sec. IV B), but was also significant as a main effect. We reiterate that in the interest of interpretability the models do not account for how duration interacts with the shape or length of trajectories, but only with their position within F1,F2 space. It is expected that trajectory shape also varies with duration (Fridland et al., 2013), but that is beyond the scope of our investigation.
Figure 10 summarizes the effect of duration on vowels' position in the formant space, by plotting trajectories based on model predictions assuming the 25th percentile duration (0.070 s), the median duration (0.100 s), and the 75th percentile duration (0.141 s). For all four vowels, a greater duration coincides with a more extreme vowel space placement:7 /i eɪ/ both raise and front with increased duration, while /ɪ/ becomes fronter but does not change in F1, and /ɛ/ becomes lower and fronter at longer durations. With the possible exception of /ɪ/, none of these correlations directly represent an increase in SVS or AAVS strength, but suggest that speakers reach more extreme acoustic and articulatory targets under lengthening conditions. Our results do not straightforwardly shed light on the relationship between vowel duration and the acoustic distance between vowels (cf. Fridland et al., 2014). This is, first, because our models are fitted to individual vowels, not multiple vowels at once; because we do not quantify shiftedness via some measure like Euclidean distance; and because our models do not directly explore the relationship between trajectory shape (or length) and duration. The results in Fig. 10 suggest that certain vowel pairs, like /eɪ ɛ/, may become more distinct at longer durations, since they move in different directions. This is probably the product of emphasis, i.e., vowels that are prosodically prominent (in addition to carrying stress) are longer and more extreme.
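Because duration enters the models on a natural-log scale, the three prediction durations correspond to the log values computed below. This is only a unit-conversion sketch; the dictionary keys are hypothetical labels for the three durations reported in the text.

```python
import math

# Durations (seconds) used for the predictions, as reported in the text
durations_s = {
    "25th percentile": 0.070,
    "median": 0.100,
    "75th percentile": 0.141,
}

# Natural-log values, the scale on which duration is a model predictor
log_durations = {label: math.log(sec) for label, sec in durations_s.items()}

for label, value in log_durations.items():
    print(f"{label}: {value:.5f}")
```

The median value reproduces the log duration of −2.30259 used for the other model predictions, and the two quartiles lie at roughly equal distances from it on the log scale.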
This paper has investigated the dynamics of front vowels using a very large new acoustic dataset from historical audio recordings. We argue that its 64 speakers represent a population within which the SVS and AAVS flourished and expanded during the 20th century. Both of these shifts (and vowels in many other varieties of English) are best described not only in terms of vowels' relative positions in acoustic and articulatory space, but also as the result of VISC. This may be greater or lesser in magnitude, or affect vowel height or backness differently, among different populations. The relevance of dynamics to these dialects' characteristics motivates the application of GAMMs to nonlinear relationships in the data, in particular the effects of speaker ethnicity and sex on the first and second formants, over the normalized time course of /i ɪ eɪ ɛ/.
A. Evidence for the SVS and AAVS
Applied to formant measurements taken at five time points in each vowel token, GAMMs indeed show that formants' trajectories differ between Black and non-Black, female and male speakers, and also by vowel duration. Our models provide support for the presence of the SVS among EA speakers, and for the AAVS among AAs, by showing differences in VISC across those groups. We discuss these with reference to Figs. 6 and 7.
The trajectories of mid vowels /eɪ ɛ/ cross one another, indicating overlap, for both ethnicities and sexes. For EA females, the degree of overlap is strong, because the onset of /eɪ/ is more centralized than all other front vowels, while /ɛ/ is raised and fronted, so that /ɛ/ crosses /eɪ/'s trajectory around its midpoint. In the other three speaker groups, the mid vowels overlap near their onsets; /eɪ/ is less centralized and /ɛ/ is less peripheralized than in EA women. Based on the relative positions of /eɪ ɛ/, it appears that EA females have heavily shifted mid vowels, while EA males and all AA speakers also exhibit some mid-vowel shifting but do not approach a reversal. Considering more recent evidence that Southern speech is indexed in production and perception by a decreased Euclidean distance between /eɪ ɛ/ (Fridland et al., 2014; Fridland and Kendall, 2012; Kendall and Fridland, 2012), our result suggests that EA women participate most strongly in this portion of the SVS. The AA pattern is consistent with both formulations of the AAVS.
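One way to quantify the overlap described here is the Euclidean distance between two vowels at each normalized time point. The sketch below does this for two five-point trajectories in Bark-scaled (F2, F1) space. The trajectory values are invented for illustration and are not model predictions from this study, and the function name is hypothetical.

```python
import math


def pointwise_distance(traj_a, traj_b):
    """Euclidean distance between two trajectories at each shared time point."""
    if len(traj_a) != len(traj_b):
        raise ValueError("trajectories must have the same number of points")
    return [math.hypot(a[0] - b[0], a[1] - b[1]) for a, b in zip(traj_a, traj_b)]


# Hypothetical (F2, F1) values in Barks at five normalized time points
ey = [(12.6, 5.0), (12.9, 4.8), (13.1, 4.6), (13.2, 4.4), (13.1, 4.3)]
eh = [(12.9, 4.9), (13.0, 4.9), (13.0, 5.0), (12.9, 5.0), (12.8, 5.0)]

distances = pointwise_distance(ey, eh)
closest = min(range(len(distances)), key=distances.__getitem__)

print([round(d, 2) for d in distances])
print("closest approach at point index", closest)
```

A small minimum distance early in the time course would correspond to overlap near the onsets, while a minimum near the middle would correspond to trajectories that meet around the midpoint.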
Among the high vowels /i ɪ/, several comparisons are worth highlighting. The high vowels /i ɪ/ have different trajectories across ethnicities and across sexes (cf. sparse evidence of variation across high vowels in Farrington et al., 2018, Table VI). /ɪ/ becomes fronter over its normalized time course, by approximately 0.25 Barks for men and slightly more for women overall. However, in AA speakers (Fig. 6) the change in F1 is much greater, approximately 0.6 Barks for AA women (versus <0.25 Barks in EA women) and 0.3 Barks for AA men (versus negligible change in EA men; cf. Fig. 7). Additionally, /ɪ/ ends higher in the vowel space for AA speakers than for EA speakers. This suggests that if the AAVS is active for DASS speakers, the raising of /ɪ/ involves increased change over the vowel's time course, rather than raising of the entire nucleus; in other words, /ɪ/ is more dynamic for AA speakers. Although /i/ raises more in AA speech than among EA speakers, EA speakers show greater fronting in /i/.
The trajectory of lax /ɪ/ crosses /eɪ/ for female speakers of both ethnicities, in both cases near the offset of /eɪ/, when both vowels are nearing their most peripheral points. Among male speakers, the trajectories do not appear to cross, because /ɪ/ is relatively less fronted than /eɪ/. This suggests that peripheralization of /ɪ/ is most active for women. Overall, Figs. 6 and 7 are consistent with the centralization of /i/ and peripheralization of /ɪ/ predicted for EA SVS speakers. For AA speakers, it is not clear that both lax vowels have advanced; both /ɪ, ɛ/ remain lower, and more centralized, than their tense counterparts. This may reflect an earlier stage of shifting with respect to synchronic descriptions.
Women appear to be at the forefront of shifting towards both the SVS and AAVS. This is seen among EA speakers (Fig. 7, top) in the realization of /eɪ/, which begins in a highly centralized position for women (consistent with the SVS) and has the longest trajectory of any vowel in Figs. 6 and 7, indicating its highly diphthongal nature. EA women's /eɪ ɛ/ trajectories also show evidence of reversal for those two vowels, in their first halves, although male speakers' /eɪ/ is also initially lower than /ɛ/. Figure 6 shows that among AA women in particular, /ɪ/ is raised, in the sense that it overlaps with /i/ in F1 space (though an alternative possibility is that /i/ has lowered for these speakers). The trajectory of AA females' /ɪ/ also ends higher than their /eɪ/, indicating raising consistent with descriptions of lax vowel peripheralization in the AAVS (Thomas, 2007).
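Trajectory length, used above as an indicator of how diphthongal a vowel is, can be approximated by summing the straight-line segments between successive measurement points in Bark-scaled formant space. The following sketch assumes five (F2, F1) points per trajectory; the values are invented for illustration and the function name is hypothetical.

```python
import math


def trajectory_length(points):
    """Sum of Euclidean segment lengths along a sequence of (F2, F1) points."""
    return sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical (F2, F1) trajectories in Barks at five normalized time points
diphthongal = [(12.4, 5.2), (12.8, 4.9), (13.1, 4.6), (13.3, 4.3), (13.2, 4.2)]
steady = [(13.0, 3.6), (13.1, 3.6), (13.1, 3.5), (13.1, 3.5), (13.0, 3.5)]

print(round(trajectory_length(diphthongal), 2))
print(round(trajectory_length(steady), 2))
```

A vowel with substantial formant movement yields a larger summed length than one that stays near a single point, although this measure by itself does not indicate the direction of movement.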
B. Duration and vowel position
Vowels tend to be at their most acoustically extreme when they are long, a situation that favors hyperarticulation (Lindblom, 1990; Gendrot and Adda-Decker, 2005). In keeping with cross-linguistic findings, our models show that all four vowels have lower F2 at shorter durations, and the highest F2 at long durations. Where F1 is concerned, the tense vowels /i eɪ/ are higher in the vowel space at long durations, while the F1 of /ɪ/ varies little, and /ɛ/ lowers phonetically as its duration increases. The tense vowels are described as lowering and retracting within the SVS and for some AA speakers, but these results hint that the effect of tense-vowel centralization is actually less at long durations. Similarly, the tendency for /ɛ/ to lower as it fronts at longer durations also indicates that overlap among front vowels (as per both the SVS and AAVS) decreases in those contexts, which are expected to include perceptually salient instances of primary stress, emphasis, or monosyllabic words.
However, further research is necessary to tease apart the combined effects of duration, ethnicity, and sex on vowels' position; our models indicate complex interactions between those predictors, but they are not separately considered in all visualizations. Vowel duration is argued to be systematically longer in Southern AAs' speech compared to EAs; and among front vowels the duration distinction between /i, ɪ/ is lessened for AAs, but increased for /eɪ, ɛ/ (Holt et al., 2015). Duration is an integral dimension of speech dynamics, and it is likely that baseline duration differences across ethnicities interact with differences in vowel quality, contributing to perceptually distinct speech varieties [see Thomas and Reaser (2004) for a review].
Future research will also flesh out the effect of duration on vowel trajectory shape, which is not fully explored here—note that in Fig. 10, the sets of lines for each vowel are parallel, when it is likely that the effect of duration on vowel articulation and acoustics is nonlinear. Duration affects the extent of vowel shifting among Southern speakers, who generally exhibit longer vowels than speakers of other American varieties (Clopper et al., 2005; Jacewicz et al., 2007). In some cases, this lengthening leads to longer lax than tense front vowels. Fridland et al. (2013) find a correlation between the relative durations of /eɪ ɛ/ and the Euclidean distance between them: the longer /ɛ/ becomes, the more it overlaps with /eɪ/ in the vowel space of speakers from Memphis, TN. The results of Fridland et al. (2013) suggest this may be a strategy at the level of individual speakers, used as a correlate of Southern vowel shifting, a possibility that our findings do not contradict. On the other hand, in an investigation of speakers from Raleigh, NC using cubic polynomial coefficients, Risdal and Kohn (2014) do find that among EA speakers, front lax vowels' F2 contours become more diphthongal as duration increases. A GAMM analysis focusing on this relationship could indicate the extent to which these tendencies hold in DASS.
C. Dynamic variation in the South
Dynamic formant movement is extensive in the front vowels of DASS, and considerable overlap occurs between tense and lax vowel trajectories. Based on our findings we support the position that “there is something beyond F1 and F2 nucleus position and duration that maintains these vowels' distinctiveness,” and that “the distinctions might be tied to vowel trajectory” (Fridland et al., 2014, p. 347). Our visualizations, resulting from statistical modeling, show that the trajectories of /i ɪ eɪ ɛ/ are highly distinct in their relative positions, length, and directions of VISC. While our analysis has also shown that DASS front vowels vary as a function of speaker sex and ethnicity, a full account of these vowels' dynamics will include the other social factors known for each speaker, including birth year, state or sub-region of origin, social class, and education level. Even with a corpus the size of DASS, social effects can be elusive, partly because the corpus is not fully balanced for them, and does not include a large number of speakers from each demographic category. For instance, data from only 16 AA speakers are available, across eight states, both sexes, and several social classes and education levels. Thus, further investigations may require a larger sample of speakers from LAGS. The role of phonological predictors in the SVS and AAVS could be evaluated more thoroughly, though there are no known conditioning contexts specifically for the front-vowel shifts discussed here. Finally, although GAMMs help us characterize divergent phonological systems in terms of VISC, the links to speech perception that motivated earlier models of VISC (Ferguson and Kewley-Port, 2002) remain to be tested with this dataset. Careful experimental studies and modeling will be used to investigate these effects in more detail, to pinpoint how vowel trajectories are most saliently linked to a speaker's social characteristics.
The authors thank the Editor, Ewa Jacewicz, R. Harald Baayen, and two anonymous reviewers for generous feedback on this manuscript. The analysis also benefited from helpful comments during the 174th Meeting of the Acoustical Society of America, and from training via the Analysing Curves workshop hosted at the University of Manchester in 2018. This research was supported by NSF BCS Grant No. 1625680. DASS is provided by the Linguistic Atlas Project at the University of Georgia.
While many researchers refer to speaker ethnicities with the labels “European American” (EA) and “African American” (AA), the DASS dataset labels speakers as “non-Black” and “Black.” We use these pairs of terms interchangeably in text and figures.
Future research will cover other vowels involved in the SVS, especially the dynamics of /æ/. While it is outside the scope of this paper, /æ/ in Southern speech is highly dynamic (Koops, 2014): it may have an upglide towards [ɛ], and older speakers (including many represented in DASS) may distinguish the BATH class from the DANCE class of words (Thomas, 2005). An analysis of /æ/ including these speakers appears in Renwick and Olsen (2017).
Unless otherwise stated, data preparation was done using various packages within the Tidyverse. Data visualizations were created using ggplot2.
Exploratory modeling using Lobanov-normalized data produced visualizations similar to those from Bark-transformed data, but the models all fit the dataset much worse (33%–74% R2, versus >90% R2 using Barks). In general, taking the log transformation of a predictor results in a better model fit (Gelman and Hill, 2007).
This has implications for statistical modeling, because our models treat formant as a predictor. If F1 and F2 are not scaled identically (as when measured in Hz), then since F2 occupies a wider range of values, it may receive greater weight in statistical modeling, and variation in F1 (though perhaps considerable) may be found insignificant.
See supplementary material at https://doi.org/10.1121/10.0000549 for (a) model specifications and summaries for full models; (b) additional model specifications and summaries; (c) model comparisons.
In their GAMM-based evaluation of the relationship between age and phonetic variation, Gahl and Baayen (2019: Fig. 3) similarly find that the size of speakers' vowel space increases with vowel duration.