An acoustic analysis was made of the speech characteristics of individuals recorded before and during a prolonged stay in Antarctica. A computational model was used to predict the expected changes due to close contact and isolation, which were then compared with the actual recorded productions. The individuals were found to develop the first stages of a common accent in Antarctica whose phonetic characteristics were in some respects predicted by the computational model. These findings suggest that the phonetic attributes of a spoken accent in its initial stages emerge through interactions between individuals causing speech production to be incrementally updated.
Every person has a spoken accent that depends on the community of speakers with which the individual is in regular contact (Milroy, 1987). A spoken accent, which can be socially and/or geographically conditioned (Labov, 1994), can be defined as shared spoken idiosyncrasies across a community of speakers. A local spoken accent is readily acquired by children but hard to fake accurately by adults and outsiders (Jensen et al., 2015). This is because the phonetic characteristics of a spoken accent are nuanced (Alam and Stuart-Smith, 2014) and depend for their accurate communication on a very high degree of precision in timing and coordination of the vocal organs. Some have proposed that because an accent is hard to imitate it has the evolutionary function of a tag in order to identify its members and to exclude imposters (Cohen, 2012). There is also evidence that a common spoken accent facilitates cooperation (Doise et al., 1976) or at least coordination (Jensen et al., 2015) between individuals. However, the mechanisms by which a common spoken accent comes into being and above all why a given spoken accent sounds the way that it does are poorly understood.
The catalyst for change in the very earliest stages of accent formation is thought to be communication density (Bloomfield, 1933), i.e., who talks to whom and how often. As Labov (2001, pp. 19–20) notes, changes in these earliest stages can be mechanical and inevitable, with social evaluation playing only a very minor role. Trudgill (2008) reasons that the earliest stages of the New Zealand English accent that formed as a consequence of settlement depended on imitation, in particular between children, and that there was no social force behind the development of this variety, such as the need to aspire to a common New Zealand identity (but see Baxter et al., 2009). In episodic models of speech, imitation is a consequence of a feedback loop by which production can be updated by perception (Todd et al., 2019). Thus imitation is inevitable because whenever a speaker produces a word its acoustic characteristics will necessarily be shaped by the persons that the speaker has conversed with in both the recent and the more distant past (Hay and Foulkes, 2016).
In order to constrain the many variables that can lead to new accent formation, we recorded a small number of individuals before and while they spent several months of an Antarctic winter as part of the British Antarctic Survey (BAS). We sought to predict the changes in Antarctica using an agent-based computational model applied to the same individuals' speech data recorded before they had left for Antarctica. The situation in which Antarctic “winterers” are together for several months is the closest present-day microcosm of former colonial settlement: there is no access to or from Antarctica in winter and the winterers are in regular (spoken) contact with each other. Moreover, the conditions for the development of spoken accent may be enhanced precisely because they must regularly coordinate their actions not only for scientific purposes but also in order to survive.
The focus of the study was on vowels in English in which the quality has been shown to shift both synchronically and diachronically: these were /ɪ:, u, ou/ exemplified by the vowels in the lexical sets (Wells, 1982) happy, goose, goat, respectively, and /ju/, a combination of the /j/ glide and the vowel in goose which we denote by few. Vowel fronting was predicted for few and goose vowels because synchronically there is a greater tendency for /u/ to front than to retract (Harrington et al., 2011) and because there is evidence that /u/ has fronted diachronically in the last 50 years in several varieties of English [e.g., Fridland (2008) for American English; Hawkins and Midgley (2005) for British English]. The prediction of fronting could extend to goat given that the second component of this diphthong fronts at faster rates of speech (Gay, 1968, Table II) and other evidence showing recent diachronic fronting of goat in some English varieties (e.g., Thomas, 1989). It is more difficult to make predictions about the final, lexically unstressed vowel exemplified by happy which in some British English varieties shares phonetic characteristics of both the lax vowel in kit and tense vowel in fleece: it could centralize toward kit given the tendency for unstressed vowels to shift toward the center of the vowel space; but on the other hand, it could front and raise toward fleece commensurate with the sound change by which happy has become more tense in British English in the latter part of the 20th century (Harrington, 2006).
A. Participants, materials, and recording procedure
The participants were 11 winterers (age range 21–46 yr; median age, 30 yr; 4 females) recruited from the BAS. They came from a cohort of 26 winterers who were at the same BAS station over the same period of time. Only these 11 were analyzed because they were the only participants who were available for one recording session before and multiple recording sessions during a period of several months in Antarctica. The winterers were aware of the purpose of the present study, i.e., to measure changes to speech as a consequence of spending time in Antarctica. Their accent backgrounds were mixed: eight were born and raised in England (five in the south/southeast and three in the north/northwest), one was from the northwestern U.S., one winterer's first language was German, and another's was Icelandic. The winterers underwent extensive pre-deployment training as a group and were expected to collaborate on daily tasks while in Antarctica. The winterers had diverse roles in Antarctica including a chef, a doctor, an electrician, an IT engineer, a plumber, and a mechanic as well as several scientists and support staff. They ate meals in a communal dining room and were encouraged to share provisions and organize communal activities in their spare time.
The first (baseline) recording was made from each winterer in September 2017 before they went to Antarctica, at which time the winterers had had minimal contact with each other. Four re-recordings were made at approximately 6 weekly intervals in Antarctica from the same winterers between March and August 2018. Each recording session took about 10 min. Baseline recordings were made with one of the experimenters present either in a quiet room in Cambridge, U.K. or in a private home elsewhere in England, depending on the winterers' availability. Recording sessions in Antarctica took place in a quiet room and were coordinated by one of the winterers who had volunteered to assist with the project. Winterers read a randomized list of five repetitions of 29 individual words (see Table I) as they appeared one at a time on a computer screen. For the baseline, they read an additional four words (head, had, hard, hoard) of which the last three were used only for the purposes of speaker normalization (head was not used at all in the analysis). They were asked to talk normally as if they were reading the words for a friend. All recordings were made onto a laptop computer with a Sennheiser USB-headset microphone (Sennheiser, Wedemark, Germany). Acoustic phonetic boundaries were marked semi-automatically and if necessary manually corrected using established procedures. One of the 29 words was subsequently discarded because it had been incorrectly produced by more than one winterer. The remaining 28 words that were analyzed are shown in Table I.
| Vowel | /f(l)/ | /s/ | /k/ | /h/ |
|---|---|---|---|---|
| /ou/ | flow, airflow | sew, torso | code, disco | hoed, backhoe |
| /ju/ | feud, curfew | | queued, rescue | hewed |
B. Acoustic analysis¹
The first two formant frequencies (F1, F2) were calculated for each vowel at intervals of 5 ms. The formants were z-score normalized in order to reduce as far as possible the influences of anatomical differences in the size and shape of different speakers' vocal tracts. More specifically, this speaker normalization was carried out using Eq. (1),

$$F'_{j,i,k}(t) = \frac{F_{j,i,k}(t) - \mu_{j,i}}{\sigma_{j,i}} \qquad (1)$$

in which $F'_{j,i,k}(t)$ and $F_{j,i,k}(t)$ are, respectively, the normalized and raw formant frequency values of formant number j (j = 1, 2) produced by winterer i in utterance k at time t and where $\mu_{j,i}$ and $\sigma_{j,i}$ are the mean and standard deviation of all formant values between the acoustic onset and offset for formant number j in the vowels of the words heed, hoard, hard, had produced by the same winterer in the baseline.
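The z-score normalization described above can be illustrated with a minimal sketch. The formant values below are invented for illustration, the function names `speaker_stats` and `normalize` are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption, since the text does not specify which was used.

```python
from statistics import mean, stdev

def speaker_stats(reference_tracks):
    """Mean and SD over all frames of one formant in one speaker's reference vowels."""
    frames = [value for track in reference_tracks for value in track]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return mean(frames), stdev(frames)

def normalize(track, mu, sigma):
    """Z-score each frame of a formant track with the speaker's own statistics."""
    return [(value - mu) / sigma for value in track]

# Invented F2 tracks (Hz) standing in for one speaker's heed, hoard, hard, had.
ref_f2 = [
    [2300.0, 2350.0, 2320.0],
    [900.0, 880.0, 860.0],
    [1200.0, 1210.0, 1190.0],
    [1700.0, 1680.0, 1650.0],
]
mu, sigma = speaker_stats(ref_f2)

# Invented F2 track (Hz) for a test vowel by the same speaker.
test_f2 = [1500.0, 1450.0, 1400.0]
print(normalize(test_f2, mu, sigma))
```

Because the statistics are computed per speaker and per formant, the same raw frequency can map to different normalized values for different speakers, which is the intended effect.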
Each speaker-normalized formant trajectory between the vowel's acoustic onset and offset was decomposed into a set of 1/2 cycle cosine waves using the discrete cosine transform [DCT; see Harrington and Schiel (2017) for formulas and details]. The acoustic modeling in the present study was based on the first three of these (at frequencies k = 0, 0.5, 1 cycle) whose amplitudes are proportional to the mean, linear slope, and curvature of the trajectory, respectively. The DCT models the shape of a trajectory but does not explicitly encode duration. The main reason for representing formants using the DCT is because this transformation provides an effective way of encoding signal dynamics that are appropriate for modeling vowels that have inherent change both synchronically (such as the diphthong /ou/ or the glide + vowel /ju/ in the present study) and diachronically (e.g., the emergence of diphthongs from monophthongs as in the Great English Vowel Shift).
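As a rough illustration of how three DCT coefficients summarize a trajectory, the sketch below computes the first three coefficients of a DCT-II directly from its definition. The scaling is an assumption chosen so that the coefficient at k = 0 equals the trajectory mean; the formulas in Harrington and Schiel (2017) may scale the coefficients differently, and the function name `dct_coefficients` is hypothetical.

```python
import math

def dct_coefficients(track, n_coef=3):
    """First n_coef DCT-II coefficients of a trajectory.

    Index k = 0, 1, 2 corresponds to cosines of 0, 0.5 and 1 cycle
    over the duration of the trajectory.
    """
    n = len(track)
    coefs = []
    for k in range(n_coef):
        total = sum(
            x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, x in enumerate(track)
        )
        # Assumed scaling: k = 0 returns the mean of the trajectory.
        scale = 1.0 / n if k == 0 else 2.0 / n
        coefs.append(scale * total)
    return coefs

flat = [1.0] * 10                       # no change over time
rising = [float(i) for i in range(10)]  # linear rise

print(dct_coefficients(flat))    # only the k = 0 term is non-zero
print(dct_coefficients(rising))  # k = 1 is non-zero, k = 2 is close to zero
```

With this sign convention a rising trajectory gives a negative coefficient at k = 1, because the half-cycle cosine itself falls from onset to offset. The trajectory length does not appear in the output, which is consistent with the remark that the DCT does not explicitly encode duration.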
C. Agent-based modeling
An agent-based model (ABM) applies principles from statistical physics to social dynamics (Castellano et al., 2009). Some ABMs have been used to explain evolutionary aspects of language change (de Boer, 2015). The present ABM is by contrast concerned with modeling the relationship between speech communication and regular sound changes. The ABM applied to the Antarctic winterers' speech data is based on the idea that sound change comes about because passive listening updates the same speech sound categories that form part of speech production and perception (Ettlinger, 2007; Todd et al., 2019). In contrast to all other computational models of sound change, however, the present ABM takes as input real speech (rather than artificially generated data) from actual speakers. Another unique aspect of the present ABM is that it models signal dynamics, i.e., speech signals that change in time rather than categorical data or speech represented by values at a single point in time.
In the present ABM, there were 11 computational agents, one per winterer. Each agent was initialized with the 28 word classes shown in Table I. Each word class in the ABM was associated with a single vowel class (those associations shown in Table I) and also with up to five signals (the five repetitions in the baseline) that each consisted of a vector of 6 DCT coefficients (3 for F1, 3 for F2). In contrast to, e.g., some models in which an agent talks to itself (Blevins and Wedel, 2009; Todd et al., 2019), interactions were always pairwise in the present model between one agent (designated as the agent talker in that interaction) and a different agent (designated as the agent listener). During an interaction, an agent talker randomly selected a word class, W, from the set of 28 possible word classes. A signal, S, a vector of 6 DCT coefficients, was generated for the agent talker using a Gaussian distribution calculated in a six-dimensional space whose dimensions were the DCT coefficients. The parameters of the Gaussian model, the mean and covariance matrix, were calculated from the 5 signals associated with the agent talker's selected word W augmented to a total of 20 observations using the SMOTE resampling method (Chawla et al., 2002). This augmentation was necessary in order to increase the robustness of the Gaussian model. Once the signal, S, that was to be transmitted to the agent-listener had been generated in this way, the additional observations derived using SMOTE were discarded. The transmitted signal S was absorbed into the agent listener's same word class W but only if it was probabilistically closer to the vowel, V, with which W was associated, than to any other vowel (Harrington and Schiel, 2017). 
This type of selective updating which is found in other models (e.g., Blevins and Wedel, 2009; Todd et al., 2019) was necessary to prevent an agent-listener from adding signals to memory that were not representative of the vowel class (to prevent, for example, V = /i/ being augmented with DCT-coefficients that were probabilistically closer to the agent listener's /u/). If memory updating took place, then one of the signals that was associated with the listener's W was removed. In this way, the number of signals per word class stored in each agent's memory remained constant following interaction. Whereas in Ettlinger (2007), memory decay comes about by decrementing the strength of each exemplar exponentially over time, the signal that was removed in the present ABM was the one with the lowest probability of class membership to the listener's W. This random pairwise communication between agents was repeated 5000 times beyond which the change to the acoustic vowel positions in the agents' memories consisted only of stochastic fluctuations around the mean. Given that the output of each run is stochastic, the final analysis of the model's output reported below was based on 100 independent runs (each of 5000 interactions).
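The interaction cycle described above (signal generation, selective acceptance, and removal of one stored signal) can be sketched as follows. This is a simplified illustration under several assumptions, not the authors' implementation: the augmentation interpolates between random token pairs rather than using the nearest-neighbour scheme of SMOTE proper, a small ridge term is added to each covariance matrix so that it is invertible with so few tokens, the agents hold invented toy data for four words only, and all function names are hypothetical.

```python
import numpy as np

def augment(tokens, n_total, rng):
    """SMOTE-like augmentation (simplified): interpolate between random token pairs."""
    tokens = np.asarray(tokens, dtype=float)
    extra = []
    while len(tokens) + len(extra) < n_total:
        i, j = rng.choice(len(tokens), size=2, replace=False)
        extra.append(tokens[i] + rng.random() * (tokens[j] - tokens[i]))
    return np.vstack([tokens] + extra)

def fit(data, ridge=1e-3):
    """Mean and covariance; the ridge term is an assumption of this sketch."""
    data = np.asarray(data, dtype=float)
    cov = np.cov(data, rowvar=False) + ridge * np.eye(data.shape[1])
    return data.mean(axis=0), cov

def log_density(x, mean, cov):
    """Gaussian log density up to an additive constant."""
    diff = x - mean
    return -0.5 * (np.linalg.slogdet(cov)[1] + diff @ np.linalg.solve(cov, diff))

def interact(talker, listener, vowel_of, rng):
    """One talker-to-listener interaction; True if the listener's memory changed."""
    words = sorted(talker)
    word = words[rng.integers(len(words))]
    # Talker: draw one signal from a Gaussian fitted to the augmented tokens;
    # the augmented tokens are local to this call and are then discarded.
    signal = rng.multivariate_normal(*fit(augment(talker[word], 20, rng)))
    # Listener: accept only if the signal is most probable for the word's own vowel.
    scores = {}
    for vowel in set(vowel_of.values()):
        pooled = np.vstack([listener[w] for w in listener if vowel_of[w] == vowel])
        scores[vowel] = log_density(signal, *fit(pooled))
    if max(scores, key=scores.get) != vowel_of[word]:
        return False
    # Absorb the signal, then drop the token least probable for this word class.
    pool = np.vstack([listener[word], signal])
    mean, cov = fit(pool)
    worst = min(range(len(pool)), key=lambda k: log_density(pool[k], mean, cov))
    listener[word] = np.delete(pool, worst, axis=0)
    return True

# Toy setup: 3 agents, 4 words, 5 tokens per word, 6 DCT coefficients per token.
rng = np.random.default_rng(0)
vowel_of = {"flow": "ou", "sew": "ou", "feud": "ju", "queued": "ju"}
centre = {"ou": 0.0, "ju": 3.0}  # invented, well-separated vowel positions
agents = [
    {w: centre[v] + rng.normal(size=(5, 6)) for w, v in vowel_of.items()}
    for _ in range(3)
]
updates = 0
for _ in range(200):
    t, l = rng.choice(len(agents), size=2, replace=False)
    updates += interact(agents[t], agents[l], vowel_of, rng)
print(updates, "memory updates in 200 interactions")
```

Because one token is removed whenever one is absorbed, every word class keeps five tokens throughout, which mirrors the constant memory size described above.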
D. Statistical analysis
Shifts toward or away from the center of the vowel space were tested by calculating the Euclidean distances to the /i/-centroid. The distances and centroids were calculated in a space formed from the three DCT coefficients derived from F2. Vowel change in Antarctica relative to the baseline was quantified with the lmer() function of the lme4 package in the R programming environment using four mixed models, one for each vowel, of the form shown in Eq. (2),

$$\mathrm{dist}_i \sim \mathrm{session} + (\mathrm{session} \mid \mathrm{winterer}) + (1 \mid \mathrm{word}) \qquad (2)$$
in which dist_i was the distance to the /i/-centroid, session was a fixed factor (baseline vs Antarctica) and in which word (between 3 and 8 levels depending on the vowel, one per word) and winterer (11 levels: one per winterer) were random factors. An analogous 400 models (4 vowels × 100 ABM runs) were run to test for vowel change in the ABM relative to the baseline, where the only difference from Eq. (2) was the random factor agent (11 levels: one per agent) in place of winterer. The terms session|winterer in Eq. (2) and analogously session|agent in analyzing the ABM output were used to provide information about the magnitude of change in the separate winterers and agents, respectively.
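The dependent variable of the mixed models, the Euclidean distance to the /i/-centroid in the space of the three F2 DCT coefficients, can be sketched as below. The coefficient values are invented and the function names are hypothetical. Which /i/ tokens define the centroid is an assumption here (a handful of tokens from one speaker).

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length coefficient tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def distance_to_centroid(point, centre):
    """Euclidean distance between one token and a centroid."""
    return math.dist(point, centre)

# Invented F2 DCT triplets (k = 0, 1, 2) for one speaker's /i/ tokens.
i_tokens = [(2.1, -0.1, 0.05), (2.0, -0.2, 0.00), (2.2, 0.0, 0.10)]
i_centre = centroid(i_tokens)

# Invented triplets for the same vowel at baseline and in Antarctica.
baseline_token = (-0.5, 0.6, -0.2)
antarctica_token = (0.1, 0.5, -0.1)

print(distance_to_centroid(baseline_token, i_centre))
print(distance_to_centroid(antarctica_token, i_centre))
```

In this toy case the second distance is smaller than the first, which would correspond to a vowel that has moved toward /i/ along the F2-based dimensions.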
The aggregated trajectories with associated confidence bands of F2 as a function of time in Fig. 1 suggest group-level changes only for /ou/, which was produced with a more fronted position in Antarctica than prior to it. The same figure shows that the computational model predicted the direction of change for /ou/ as well as the lack of change in /ɪ:/, although it clearly exaggerated the degree of F2-raising for the group in the other two vowels. The statistical analysis in Eq. (2) showed that there was a significant change in Antarctica relative to the baseline in /ou/ ( = 10.6, p < 0.01) but in none of the other three vowels. The mean durations aggregated across winterers in the baseline and in Antarctica, respectively, were as follows: /ɪ:/: 119 ms, 149 ms; /ju/: 210 ms, 244 ms; /ou/: 185 ms, 209 ms; /u/: 235 ms, 263 ms. This increased duration, i.e., a slowing down of the speech production rate in Antarctica, suggests that any of the observed vowel changes in Fig. 1 between the baseline and Antarctica were unlikely to have been brought about by a more hypo-articulated speaking style (which tends to cause a decrease in vowel duration).
We then tested for convergence between the winterers by determining whether there was a correlation between the by-winterer slope and intercept in the mixed model [both obtained from session|winterer in Eq. (2)]. Here the intercept was the mean by-winterer vowel position prior to Antarctica minus the group mean prior to Antarctica (i.e., at baseline); and the slope was the change in vowel position in Antarctica relative to the baseline. If there is convergence, then these parameters should be negatively correlated, i.e., those winterers with the largest positive/negative slopes in the mixed model (indicative of a large change due to being in Antarctica) should also be those whose baseline positions are furthest from the group mean. Figure 2 suggests just such a correlation for all vowels. These parameters were shown to be significantly negatively correlated at an alpha-adjusted level² of 0.013 for /u/ (r = −0.82), for /ju/ (r = −0.84) and for /ou/ (r = −0.70) but not for /ɪ:/ (r = −0.65). The intercept and slope were also significantly negatively correlated at the same level in all 100 ABM runs for all vowels, except in four cases for /u/. The median correlations between the intercept and slope in the ABM were −0.97, −0.99, −0.99, and −0.92 for /u/, /ju/, /ou/, and /ɪ:/, respectively.
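Two ingredients of the convergence test can be illustrated with a short sketch: the Dunn–Šidák adjustment of the significance level for four tests, and the Pearson correlation between by-winterer intercepts and slopes. The intercept and slope values below are invented to show perfect convergence and are not the study's estimates. The function names are hypothetical, and the computation of p-values is omitted.

```python
import math

def sidak_alpha(alpha, n_tests):
    """Per-test significance level under the Dunn-Sidak adjustment."""
    return 1.0 - (1.0 - alpha) ** (1.0 / n_tests)

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Four tests (one per vowel) at a nominal level of 0.05.
print(round(sidak_alpha(0.05, 4), 3))

# Invented example: speakers far from the group mean at baseline (intercept)
# shift back toward it in proportion (slope), giving a negative correlation.
intercepts = [-2.0, -1.0, 0.0, 1.0, 2.0]
slopes = [1.0, 0.5, 0.0, -0.5, -1.0]
print(pearson_r(intercepts, slopes))
```

The adjusted level rounds to the value of 0.013 quoted above. A strongly negative correlation is the signature of convergence because speakers above the group mean move down and speakers below it move up.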
Finally, the correlations between the by-winterer and by-agent slopes were computed in order to determine whether the actual changes in Antarctica were predicted by the changes due to the ABM. A positive correlation means that there was an association between the changes in Antarctica and those produced by the ABM relative to the baseline. Figure 3 suggests quite a high correlation between these two sets of slopes for three out of four vowels. A further test was made of whether the changes in the ABM and in Antarctica relative to the baseline were in the same direction. The binary dependent variable for this purpose was a logical value that was True whenever the sign of the change in distance to the /i/-centroid with respect to baseline was the same for both the ABM and for Antarctica (i.e., both positive or both negative), otherwise False. For this purpose, the binary dependent variable, $\mathrm{Agreement}_{ij}$, was calculated from Eq. (3),

$$\mathrm{Agreement}_{ij} = \big[(\mathrm{ABM}_{ij} - \mathrm{Baseline}_{ij})\,(\mathrm{Antarctica}_{ij} - \mathrm{Baseline}_{ij}) > 0\big] \qquad (3)$$

where for agent or participant i and for word j, $\mathrm{ABM}_{ij}$ was the aggregated value of dist_i in the ABM, $\mathrm{Antarctica}_{ij}$ was the aggregated value of dist_i in Antarctica, and $\mathrm{Baseline}_{ij}$ was the aggregated value of dist_i in the baseline, i.e., prior to leaving for Antarctica. A mixed model was fitted with $\mathrm{Agreement}_{ij}$ as the dependent variable and with two random factors: the agent (or winterer) and the word. The results of this test showed that the overall agreement between the ABM and Antarctica in the direction of change relative to the baseline was significantly greater than chance (z = 2.8, p < 0.01).
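The direction-of-change agreement described above can be sketched as a simple sign comparison. The distances below are invented and the function name `agreement` is hypothetical. Treating a change of exactly zero as disagreement is an assumption of this sketch.

```python
def agreement(abm, antarctica, baseline):
    """True when the ABM and Antarctica both moved the same way from baseline.

    The product of the two changes is positive only if both are positive
    or both are negative; a zero change counts as no agreement (assumption).
    """
    return (abm - baseline) * (antarctica - baseline) > 0

# Invented aggregated distances to the /i/-centroid:
# (ABM, Antarctica, baseline) for a few speaker-word combinations.
cases = [
    (1.2, 1.1, 1.0),  # both increased
    (0.7, 0.9, 1.0),  # both decreased
    (0.8, 1.1, 1.0),  # opposite directions
    (1.3, 0.9, 1.0),  # opposite directions
    (0.6, 0.8, 1.0),  # both decreased
]
flags = [agreement(*case) for case in cases]
print(flags)
print(sum(flags) / len(flags))  # proportion agreeing; 0.5 would be chance
```

Only the direction of each change enters this measure, so an ABM that exaggerates the magnitude of a shift can still agree with the observed data.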
The study has shown that there were two types of phonetic changes among the group of winterers due to spending time together in Antarctica for several months. The first was that the group developed an innovation and produced a phonetically more fronted /ou/ in Antarctica than prior to it. The second was that there was convergence among the winterers such that the between-winterer differences for each of the other three vowels were smaller in Antarctica than prior to it.
Such changes when a group cooperates over a long period of time are predicted by exemplar-based models (Pierrehumbert, 2003) and their computational implementation (Harrington and Schiel, 2017; Todd et al., 2019), in which there is a feedback loop such that speech production is incrementally updated by speech perception. Compatible with other findings, the present study shows that there can be phonetic convergence between adults (Pardo et al., 2012) and that spoken accent is labile in adulthood (Harrington et al., 2000). The study also shows that new accent development does not necessarily just involve convergence toward a group average and that there can be shifts, as in the case of /ou/ in the present study, that are not so straightforwardly related to individuals' initial phonetic positions before they communicate together in an isolated community. The reason for such innovative shifts currently remains unclear. Following Stevens et al. (2019), such innovations may derive not just from the position of phonological categories but also from how, for each individual, their distributions are oriented with respect to each other in an acoustic phonetic space. The distance between phonological categories (Kim et al., 2011) may be another factor that conditions whether or not innovation takes place.
The winterers' vowel changes due to being in Antarctica were quite well (although by no means entirely accurately) predicted by the ABM that had been applied to the same vowel data before the winterers left for Antarctica. The group-level similarity between the actual and model output lies in the direction of the change. Thus at the group level (Fig. 1), there was a similar pattern in Antarctica and in the ABM of F2-raising in /ju, u, ou/, and negligible change in /ɪ:/ relative to the baseline: the main difference is that the magnitude of group-level change was much greater in the ABM than in the real data. At the level of the individual, there was a close and significant correspondence between the actual and computationally modeled data (Fig. 3) in the direction and magnitude of the change for all four vowels relative to the baseline. Thus, the computational model predicted some (but by no means all) of the observed changes in Antarctica.
Given that the ABM had no knowledge of social factors such as gender, prestige, or likeability, then some of the observed changes in these very earliest stages of accent development are—compatibly with the predictions of both exemplar-based models (Stevens et al., 2019; Todd et al., 2019) and some ideas from sociolinguistics (Labov, 2001; Trudgill, 2008)—likely to be a stochastic function of population dynamics combined with the distribution, orientation, and position of phonological categories in an acoustic-phonetic space at the level of the individual and of the group. The discrepancy between the actual and computationally modeled changes in Antarctica could have come about because there is no predictable link between the actual time spent in Antarctica and the number of interactions in the model. Thus, one of the reasons why the model might be exaggerating the magnitude of the shift in /ju, u/ relative to the actual group-level change is because the model's changes could be those that would happen after isolation for a period of time considerably longer than the several months actually spent by the winterers in Antarctica.
Many other factors could have contributed to the observed changes to the winterers' speech in Antarctica that have not been modeled here. One of these is language learning. Consider in this regard that one of the speakers who shifted the most between the baseline and Antarctica for three vowels (speaker J) was a first language speaker of German: she may therefore have fronted /ou, ju, u/ because her L2-English was becoming more native-like with practice rather than primarily as a consequence of updating speech production via the aforementioned perception-production feedback loop.
We also emphasize that neither the changes to the winterers' vowels nor the associated predictions made by the ABM are necessarily representative of accent development in the entire community in Antarctica, nor indeed of communities such as those due to colonization in former centuries that were isolated for a long period of time. This is because the analyzed sample was of only 11/26 participants that spent the winter in Antarctica (and there is no evidence that these 11 interacted with each other any more than with the remaining, unanalyzed 15 participants). In addition, the changes in Antarctica as well as those predicted by the ABM are obviously strongly influenced by two outlier speakers, J (a first language speaker of German) and O (a speaker of General American). We therefore caution against extrapolating general conclusions from this small and indeed skewed sample of speakers.
Finally, both computational models (e.g., Baxter et al., 2009) and recent analyses of phonetic change in individuals isolated together for a period of time (Sonderegger et al., 2017) suggest that population dynamics combined with updating speech sounds through passive listening may be insufficient to explain sound change, which may well also be driven by social factors, even in these very early stages of new accent formation.
This research was supported by European Research Council Grant No. 742289 “Human interaction and the evolution of spoken accent” (2017–2022). Our thanks to the editor and four reviewers for their helpful and incisive comments.
¹ The vowel formant track data created and analyzed for the current study as well as the code to run the ABM can be found at ftp://ftp.phonetik.uni-muenchen.de/pub/BAS/ABM/Antarctica.zip.
² Adjusted for four tests using the Dunn–Šidák method (Šidák, 1967).