Since their introduction to sociolinguistics by Hay, Warren, and Drager [(2006). J. Phon. (Modell. Sociophon. Var.) **34**(4), 458–484], Pillai scores have become a standard metric for quantifying vowel overlap. However, there is no established threshold value for determining whether two vowels are merged, leading to conflicting *ad hoc* thresholds. Furthermore, as a parametric measure, Pillai scores are sensitive to sample size. In this paper, we use generated data from a simulated pair of underlyingly merged vowels to demonstrate that (1) larger sample sizes yield reliably more accurate Pillai scores, (2) unequal group sizes across the two vowel classes are irrelevant to the calculation of Pillai scores, and (3) it takes more data than many sociolinguistic studies typically analyze to return a reliably low Pillai score for underlyingly merged data. We provide some recommendations for maximizing reliability in the use of Pillai scores and provide a formula to assist researchers in determining a reasonable threshold to use as an indicator of merged status given their sample size. We demonstrate these recommendations in action with a case study.

## I. INTRODUCTION

Quantifying vowel overlap is an important component of many linguistic studies, ranging from sociolinguistics to laboratory phonology. While various methods of quantifying vowel overlap have been proposed [see, e.g., Nycz and Hall-Lew (2013)], Pillai scores have emerged in recent years as the most commonly used method, particularly within sociolinguistics (Hay *et al.*, 2006; Nycz and Hall-Lew, 2013), because of their ability to measure a distinction in multivariate space while also accounting for fixed effects like phonological context. However, there is no standard value of Pillai score that is broadly accepted to be a threshold for “merged” or “distinct”; indeed, individual studies make this determination primarily by comparison between individuals in a single data set, to show that some speakers are “more merged” while others are “less merged.” In this paper, we provide a critical look into how Pillai scores are calculated and reported.

We focus on demonstrating how sample size plays a major role in the resulting Pillai score. We also show that the common approach of reporting Pillai scores alone is incomplete without also reporting sample sizes and *p*-values. Using simulation data drawn from an underlyingly merged data set, we show how larger samples produce lower Pillai scores (in other words, a more “merged” score). We further demonstrate that within Pillai, it is the total *n* across both samples that matters and provide a formula that researchers can use to determine a threshold for “merged” given their own sample size. We highlight some important takeaways about using Pillai scores to measure vowel overlap and the potential risks for across- and within-study comparisons of speakers. We end with a case study that demonstrates how researchers can implement the recommendations in this paper when analyzing real data from sociolinguistic interviews.

### A. Pillai as a measure of vowel overlap

#### 1. Pillai score overview

A multivariate analysis of variance (MANOVA) is an extension of the (univariate) analysis of variance (ANOVA). The difference is that while an ANOVA evaluates whether the difference between two or more groups in a single numeric variable can be predicted by some number of categorical independent variables (such as the F2 of /u/ across older and younger participants), a MANOVA can evaluate two or more dependent numeric variables simultaneously (such as F1, F2, F3, and duration of /u/ by older and younger participants). So, a researcher analyzing American English vowels (which typically are differentiated primarily on vowel quality in F1-F2 space) could use a MANOVA to see whether a speaker pronounces two historically distinct vowel classes differently, while including the effects of duration and place of articulation as independent variables. In this case, F1 and F2 would be the dependent variables, and vowel class, duration, and place of articulation would be the independent variables.

Simplified somewhat, the null hypothesis of a MANOVA is that category membership (for instance, two historically distinct vowel classes) offers no explanatory power for any of the dependent variables. In the case of a MANOVA that is fit to vowel data, the null hypothesis is that the two vowels are merged. In other words, there would be no way to guess which historic vowel class a particular token came from by its acoustic measurements alone. Typically, the researcher's aim is to find evidence to reject that hypothesis. We note that high *p*-values associated with MANOVAs only indicate a lack of evidence to reject the null hypothesis that the two vowel classes are the same, rather than evidence *for* the null hypothesis. MANOVA cannot prove that two vowels are merged, only that there is little evidence to suggest that they are distinct.

There are four main test statistics associated with a MANOVA for comparing the observed data to the null hypothesis: Wilks' lambda, the Lawley-Hotelling trace, Roy's largest root, and the Pillai-Bartlett trace. Of these four, the lattermost is the most robust to non-normally distributed data and other violations of the assumptions of a MANOVA for tests that compare more than two groups (Olson, 1976). In sociophonetic data, where it is more common to compare only two groups, the Pillai-Bartlett trace has less advantage over the other three in statistical validity but does benefit from being both easy to run in commonly used statistical environments and relatively easy to interpret. For introductory overviews of these four test statistics, including their mathematical definitions and conceptual explanations, see Bray and Maxwell (1985) (pp. 27–29), Johnson and Wichern (2012) (p. 336), Rencher and Christensen (2012) (pp. 169–188), and Upton and Cook (2014) [“multivariate analysis of variance (MANOVA)”].

The Pillai-Bartlett trace, often called the *Pillai score* or occasionally just *Pillai* in linguistics studies (a convention that we adopt here), ultimately comes from Pillai (1955) and Bartlett (1939). In the simplest model, which predicts two dependent variables (e.g., F1 and F2) using a single two-level categorical variable (e.g., /ɑ/ and /ɔ/), it returns a value that ranges between 0 and 1, with smaller values occurring when there is greater overlap between the two groups in multivariate space, and larger values when there is less overlap. In other words, small Pillai scores suggest a vowel merger. In reality, determining whether a merger is present is not quite as simple as merely observing overlap. Two vowels may occupy the same F1-F2 space, but the distinction between phonemes may be maintained through some other cue like voice quality (Di Paolo and Faber, 1990), duration (Labov and Baranowski, 2006), or vowel trajectory (Stanley, 2020). We adopt a simplified approach here since our focus is to explore the effects of sample size, but we acknowledge that there is more to merger than overlapping midpoints in the F1-F2 space.
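
To make the definition concrete, the following sketch computes the Pillai-Bartlett trace for two groups directly from the between-group (hypothesis) and within-group (error) sums-of-squares-and-cross-products matrices. This paper's analyses use r's manova(); the Python function below and its name are our own illustration of the two-group, two-formant case.

```python
import numpy as np

def pillai_trace(group_a, group_b):
    """Pillai-Bartlett trace for two groups of bivariate observations:
    V = trace(H @ inv(H + E)), where H is the between-group (hypothesis)
    SSCP matrix and E is the within-group (error) SSCP matrix."""
    X = np.vstack([group_a, group_b])
    grand_mean = X.mean(axis=0)
    H = np.zeros((X.shape[1], X.shape[1]))
    E = np.zeros_like(H)
    for g in (group_a, group_b):
        mean_g = g.mean(axis=0)
        d = (mean_g - grand_mean)[:, None]
        H += len(g) * (d @ d.T)       # between-group contribution
        centered = g - mean_g
        E += centered.T @ centered    # within-group contribution
    return float(np.trace(H @ np.linalg.inv(H + E)))

rng = np.random.default_rng(1)
# Two samples from one distribution (overlapping) vs. well-separated clusters.
overlapping = pillai_trace(rng.normal(0, 1, (50, 2)), rng.normal(0, 1, (50, 2)))
separated = pillai_trace(rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2)))
print(overlapping, separated)
```

Two samples drawn from the same distribution return a value near 0, while two widely separated clusters return a value near 1.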

#### 2. Meta-analyses of Pillai scores vs other metrics

Prior to the introduction of Pillai scores to sociophonetics, perhaps the most common way to assess merger was through auditory coding. In its most basic form, the phonetically trained researcher would listen and evaluate whether there was a difference between the two sounds. However, as shown below, Pillai scores are most commonly used on low vowels, which are notoriously difficult for fieldworkers to accurately transcribe [Johnson (2010), pp. 28–29]. In fact, Moulton (1968) (p. 464) rather strongly states that early fieldworkers for the Linguistic Atlas Projects were “hopelessly and humanly incompetent at transcribing phonetically the low and low back vowels that they heard from their informants.” Fortunately, formulaically quantifying overlap provides a less subjective measure for vowel merger.

While Pillai scores are currently the most common metric for quantifying mergers, especially within sociolinguistic work, there are also other approaches available. In addition to the auditory coding mentioned above, much early work used the Euclidean distance between the point representing the mean F1 and mean F2 of one vowel class and the point representing the mean F1 and mean F2 of a second vowel class as the primary metric for vowel merger (Hay *et al.*, 2006; Nycz and Hall-Lew, 2013; Han and Kang, 2013; Hall-Lew, 2013). This early approach is not particularly satisfying, as it fails to take into account the distributional properties of the data, including the degree of overlap and the distribution of tokens within a vowel class [Kelley and Tucker (2020), p. 137]. The spectral overlap assessment metric (SOAM) (Wassink, 2006), which fits ellipses or ellipsoids to the vowel distributions [see Wassink (2006) for details on the fitting] and calculates the area or volume of their overlap, is one method adopted and recommended in other sociophonetic work [Di Paolo *et al.* (2011), p. 103 and Kendall and Fridland (2021), p. 56]. We refer interested readers to Nycz and Hall-Lew (2013) and Kelley and Tucker (2020) for in-depth assessments of these measures and several others, compared with Pillai scores. Kelley and Tucker assess four different metrics of vowel overlap, using a Monte Carlo simulation to test accuracy, and find that Pillai scores produced the most accurate and precise values when compared to ground truth values in their simulated data. They also recommend Pillai scores when sample sizes are “small,” which they define as 30 observations per group.

For sociophonetic data, and especially for naturalistic sociophonetic data, obtaining more than 30 observations per group is not always feasible, making Pillai an especially valuable tool for sociophonetics. As foregrounded by the title of this paper, sample size plays an important role in the resulting Pillai score, which we highlight in detail in Secs. II and III. We note, however, that this is generally true of all measures of vowel overlap compared in the overview papers mentioned above (Nycz and Hall-Lew, 2013; Kelley and Tucker, 2020). Indeed, any measure that uses means, standard deviations, and/or variance as inputs will be impacted by sample size. Likewise, sample size impacts statistical significance, with larger samples leading to smaller *p*-values. Our aim here is not to simply recommend that researchers obtain larger sample sizes, which in many cases is either not possible or may be at odds with other important data collection considerations, but rather to elucidate *how* sample size impacts resulting scores, so that researchers can be best informed when using Pillai as a measure of vowel overlap.

Perhaps an even bigger concern than total sample size with naturalistic data is the near impossibility of obtaining balanced token counts across two categories from naturalistic data. In wordlist data, researchers can carefully craft their wordlist to obtain balanced data: this typically means obtaining the same number of observations in each vowel class and ensuring that the wordlist is balanced for additional factors, such as phonological context. In naturalistic (i.e., conversational) speech, there is no way to ensure that a speaker produces balanced token counts across vowel category and across phonological contexts. As a result, researchers investigating something like vowel overlap need to understand precisely *how* Pillai scores may be affected by unbalanced data.

Finally, while Pillai has emerged as the best option so far for most sociophonetic data, there is one additional concern to address: MANOVAs assume the data within each group follow a multivariate normal distribution (Bray and Maxwell, 1985). Formant measurements for a given vowel from a given speaker in naturalistic vowel data, being non-normally distributed, do not fit this assumption [though see Whalen and Chen (2019) for some evidence that vowel formant data, even with coarticulation, can be normally distributed]. Recent work has suggested Bhattacharyya's affinity^{1} as a better measure for non-normally distributed data (Fieberg and Kochanny, 2005; Johnson, 2015), and it has been taken up in some subsequent work (Strelluf, 2018; Warren, 2018; Jensen and Braber, 2021). While the benefits of Bhattacharyya's affinity being a nonparametric measure make it particularly appealing for the kinds of distributions found in naturalistic speech, there is not yet a mechanism for easily incorporating fixed effects like phonological context or speaker age.^{2} Furthermore, Bhattacharyya's affinity also works best on a relatively large data set of over 30 observations per category (Seaman *et al.*, 1999). While future work may make nonparametric methods like Bhattacharyya's affinity easier to integrate with the kinds of statistical models linguists typically use, for now we focus on Pillai scores and the considerations necessary to make their use maximally standardized.

#### 3. How Pillai scores are used to measure overlap in sociophonetics

Pillai scores have been used to analyze a variety of phenomena in several languages. Perhaps the most common application of Pillai scores is to measure overlap between /ɑ/ and /ɔ/ in North American English (Hall-Lew, 2013; Kendall and Fridland, 2017; Havenhill, 2015; Stanford *et al.*, 2019; *inter alia*). In fact, using Pillai scores to measure this merger was explicitly recommended in Becker (2019), a volume of different studies all analyzing the spread of the /ɑ-ɔ/ merger in North American English. Many conditioned mergers in English have been quantified with Pillai scores as well (Schmidt *et al.*, 2021; Austin, 2020; Freeman, 2021; Newbert, 2021; *inter alia*). To a lesser extent, Pillai scores have been used to analyze vowels that are marginally contrastive [Galician: Amengual and Chamorro (2016); Italian: Nadeu and Renwick (2016); Bangla: Islam and Ahmed (2020); Hawai'ian: Kettig (2021); Swiss German: Joo *et al.* (2018); Austrian German: Sloos (2014)]. Some more innovative uses of Pillai scores include using them to analyze tones in varieties of Cantonese (Fung and Lee, 2019; Tse, 2018) and in Spanish fricative mergers (Regan, 2020).

Pillai scores have also been used to quantify splits, chiefly among phonological low vowels. Fisher *et al.* (2015) assess the degree to which the Philadelphia short-*a* split is found in their sample of Philadelphians. Relatedly, Hall-Lew *et al.* (2017) look at the BATH-TRAP split in Scottish Parliament data. Brozovsky (2020) uses Pillai scores to measure the raising (and separation) of prenasal /æ/, using Pillai scores to measure the overlap between prenasal /æ/ and preobstruent /æ/ in Taiwanese Texans. In a study looking at the possible effect of salience on a given lexical item compared to the rest of its canonical vowel class, Bray (2021) analyzes the lowered realization of the vowel in *hockey* compared to other relatively more raised /ɑ/ tokens in professional American hockey players.

As highlighted by the selection of citations above, since being introduced to the field, Pillai scores have become widespread in sociophonetic studies. With the support of meta-analyses that compare other competing measures, Pillai has become a useful go-to for measuring both the overlap and the distinction of speakers' pronunciation of two phonological categories, especially in sociolinguistic studies that need to compare across individual speakers. While Pillai scores are clearly a valuable tool in measuring vowel overlap, there remain some outstanding issues with using them. In the following sections, we highlight some of these issues.

### B. Issues with Pillai scores

#### 1. What is considered merged?

As useful as Pillai scores are for quantifying the degree of overlap, they do not necessarily answer researchers' underlying question of whether two vowels are merged. Pillai scores range from 0 to 1, but there is no agreed-upon cutoff value or threshold for determining whether the two groups are merged or not. As a result, many studies rely on an *ad hoc* threshold to interpret the merged status of their speakers. Some work has suggested specific thresholds for mergers. For instance, Jibson (2021) suggested a Pillai threshold of 0.3 as an indicator of “merged” status, after a shuffling procedure identified 0.3 as the 95th percentile of Pillai scores computed between 20-token samples of two vowel classes from his speakers. Wassink (2006) likewise suggests some provisional thresholds for SOAM, where 0%–20% overlap represents “distinct,” 20%–40% represents “partially merged,” and >40% represents “merged.” Relying on provisional or *ad hoc* thresholds, however, is risky because sample size is likely not comparable across studies or even between speakers.

One solution for determining whether a given Pillai score should be interpreted as an indication of “merger” is to examine the *p*-values that are associated with the MANOVA model from which the Pillai scores are generated. The model's null hypothesis is that the vowel variable contributes no information to differentiating between two groups. In other words, it assumes the two vowels are underlyingly merged. A small *p*-value associated with the vowel variable would provide evidence against that null hypothesis, allowing the researcher to conclude that the difference between the two groups likely reflects a real distinction (i.e., that the speaker is not merged). We note that Pillai scores and *p*-values are inversely correlated: lower Pillai scores typically accompany higher *p*-values. In fact, since Pillai scores are test statistics, Pillai scores and *p*-values are functions of each other for a given sample size. Nevertheless, *p*-values are not typically reported in sociophonetic studies that use Pillai [some exceptions include Wong and Hall-Lew (2014), Nadeu and Renwick (2016), Amengual and Chamorro (2016), Berry (2018), and Sloos (2014)].
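
Because Pillai scores and *p*-values are functions of each other at a fixed sample size, the conversion can be written down explicitly. The sketch below is our own illustration (not a formula from this paper), using the standard exact F relation for Pillai's trace in the two-group, two-formant case (where s = 1):

```python
from scipy.stats import f

def pillai_p_value(v, n_total, p=2, groups=2):
    """p-value for a Pillai score v in the two-group, two-formant case,
    where the F approximation to Pillai's trace is exact (s = 1)."""
    df2 = n_total - groups - p + 1        # N - 3 for two groups, two formants
    F = (v / (1.0 - v)) * (df2 / p)       # convert trace to an F statistic
    return float(f.sf(F, p, df2))

# The same Pillai score of 0.10 at two different total sample sizes:
print(pillai_p_value(0.10, 30), pillai_p_value(0.10, 200))
```

The same Pillai score of 0.10 is unremarkable with 30 total tokens but highly significant with 200, previewing the sample-size effects explored in Sec. III.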

There are a few points of caution to make about using and interpreting *p*-values, as we discuss throughout this paper. One of these concerns the potential distinction between a statistically significant difference (as defined by the model) and a ground truth difference for speakers. For a speech community that has a ground truth merger in two vowel categories, there will be no difference in their perception of these two vowel categories. However, as sample size increases, so does the likelihood of a model returning a *p*-value below a given significance threshold (typically 0.05); even for pairs of sounds that are truly merged, a model fit to a sufficiently large dataset can flag random variance in the data as a meaningful difference, as shown in the experiments in Sec. III. An additional caution regarding interpreting *p*-values alone as an indicator of merger is that a statistically significant difference in formant values may not map onto a perceptible difference for the human auditory system [see, e.g., Kewley-Port and Watson (1994)]. *P*-values alone can likewise be misleading in the opposite direction: previous work has shown that speakers and listeners can produce and perceive reliable but small differences, including sub-phonemic differences, in what may otherwise appear to be merged sounds [for instance, with cases of incomplete neutralization; Warner *et al.* (2004) and Pfiffner (2021)]. Vowel distinctions that are maintained by small effect sizes, or by sub-phonemic distinctions not captured by the measurements in the model, may artificially appear to be merged according to a *p*-value because it takes more data for smaller differences to be detected by the statistical model. For these reasons, additional information such as Pillai scores can aid in the interpretation of *p*-values, and vice versa.

#### 2. Sample size

As suggested by the parametric nature of Pillai, and the discussion of *ad hoc* thresholds above, a major component of deciding which threshold should be used to determine merger status is the number of tokens being analyzed. Previous work on Pillai scores and sample size has expressed concern over too-small sample sizes (Gorman and Johnson, 2013) and over unbalanced sample sizes across the two vowel classes being analyzed (Nycz and Hall-Lew, 2013; Johnson, 2015).

Despite sample size having a major impact on Pillai scores, most studies in sociophonetics that use Pillai as a measure of vowel overlap do not also clearly report sample size [exceptions include Wong and Hall-Lew (2014), Holland and Brandenburg (2017), Berry (2018), and Berry and Ernestus (2018)]. This in turn makes it difficult both to assess the findings in an individual paper and to compare speakers within and across studies. In the simulation experiments below, we show just how important sample size is to the resulting Pillai score and provide a formula to calculate a recommended Pillai score threshold for merger status, given a particular sample size.

## II. METHODS

In this section, we present the results of Monte Carlo simulation experiments designed to test the effect of different sample sizes on resulting Pillai score. In these simulations, we create two vowel classes that are perfectly merged underlyingly, and alter (1) the sample sizes between the two vowel classes to test the effect of unbalanced samples across vowel classes, (2) the overall sample size, considering both vowel classes together, to test the effect of unbalanced total sample size across speakers, and (3) the correlation between the simulated F1 and F2 formant frequencies, to test whether correlations (like those typically found in naturalistic vowel data) influence the results.

### A. Data generation

We generated data using a Monte Carlo simulation (Metropolis and Ulam, 1949). This is a procedure where random draws are taken from an underlying probability distribution or existing dataset and analyzed. This process is repeated independently many times and the information about each iteration is aggregated. To begin the simulation, a bivariate normal distribution^{3} was generated in r to simulate a single theoretical underlying vowel in the F1-F2 dimension for a single theoretical speaker. For the sake of simplicity, the means for F1 and F2 were both set to zero and the standard deviations to 1. In experiments 1 and 2, the correlation coefficient between the formants was zero, producing a circular (rather than elliptical) distribution. In naturalistic speech, however, vowel data typically has some degree of correlation between F1 and F2, resulting in elliptical distributions, so in experiment 3, we manipulated the correlation coefficient between F1 and F2 to test whether the results from experiments 1 and 2 hold for data with different degrees of correlation. To simulate vowel data, random draws were taken from that single bivariate normal distribution and assigned to one of two “vowel class” labels. We note that it is somewhat nonsensical to refer to these generated numbers as “vowels,” especially since Pillai scores can be calculated on non-vowel data. However, since the majority of Pillai scores in sociophonetic analyses are based on vowels, for clarity, we will refer to these simulated data points as “vowels” and their arbitrary groups as “vowel classes.” These random draws represent a linguist sampling data from our theoretical speaker. For the sake of illustration, we will say 30 such observations were generated. These 30 observations were treated as tokens from a single underlying vowel class.^{4} Another 30 random draws were then taken from the same bivariate normal distribution.
These 30 observations were treated as tokens from a different underlying vowel class. Generating two groups from the same underlying distribution therefore creates a simulated pair of merged vowels. In theory, the two simulated vowel classes should not be statistically different from each other in any way because they were drawn from the same underlying distribution.
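
This sampling procedure can be sketched in a few lines. For illustration we use Python's numpy rather than r, and the variable names are our own:

```python
import numpy as np

rng = np.random.default_rng(2025)

# One underlying "vowel": a bivariate standard normal (means 0, sds 1,
# no F1-F2 correlation). Thirty draws are labeled class A and another
# thirty class B, giving an underlyingly merged pair of "vowel classes."
class_a = rng.standard_normal((30, 2))
class_b = rng.standard_normal((30, 2))

# Any difference between the two class means is pure sampling noise.
print(class_a.mean(axis=0) - class_b.mean(axis=0))
```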

### B. Three experiments

For this study, we ran three experiments. In experiment 1, the two simulated vowel classes for each “speaker” were of equal size. We began with a sample size of 5 observations per vowel class. We then moved on to 6 observations per vowel class, and so on, until we reached 100 observations per vowel class. For each of these 96 sample sizes, we repeated the simulation 1000 times, each representing a different instance of a linguist sampling data from that one underlyingly merged speaker. This produced 96 000 pairs of simulated vowel data, where each pair consisted of equal-sized vowel classes, enabling us to test the effect of overall sample size on resulting Pillai score.
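
A minimal analogue of experiment 1 at a single sample size can be sketched as follows (our own Python illustration; the paper's simulations were run in r, and the helper function computes the trace directly rather than via manova()):

```python
import numpy as np

rng = np.random.default_rng(7)

def pillai(a, b):
    # Pillai-Bartlett trace from the between-group (H) and within-group (E)
    # sums-of-squares-and-cross-products matrices.
    X = np.vstack([a, b])
    gm = X.mean(axis=0)
    H = np.zeros((2, 2))
    E = np.zeros((2, 2))
    for g in (a, b):
        d = (g.mean(axis=0) - gm)[:, None]
        H += len(g) * (d @ d.T)
        E += (g - g.mean(axis=0)).T @ (g - g.mean(axis=0))
    return float(np.trace(H @ np.linalg.inv(H + E)))

# 1000 simulated "speakers," each with 10 tokens per (merged) vowel class.
n = 10
scores = [pillai(rng.standard_normal((n, 2)), rng.standard_normal((n, 2)))
          for _ in range(1000)]
print(np.percentile(scores, 95))
```

At 10 tokens per class, the simulated 95th percentile comes out near 0.3 even though every pair is underlyingly merged.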

In experiment 2, we varied the sample size between the two vowel classes for each “speaker.” We began with 5 tokens from one vowel and 6 from another. We then took 5 tokens of one and 7 from the other. We increased the size of the second group by steps of 1 until it contained 100 tokens. We then repeated this process with the first group having 6 tokens and increased the second group from 5 to 100 in steps of 1. We iterated over these steps, increasing the sample size of the first group up to 100, thereby generating pairs of vowel data where every combination of sample sizes from 5 to 100 was represented. We repeated this simulation 100 times per combination. This produced 921 600 pairs of simulated vowel data, where each pair consisted of different-sized vowel classes, enabling us to test the effect of unbalanced vowel class size on resulting Pillai score.

In experiment 3, we then varied the correlation between the simulated formant data, modifying the shape of the underlying bivariate normal distribution from circular to elliptical using the mvrnorm() function in the MASS package (Venables *et al.*, 2002). As with experiments 1 and 2, this produced an underlyingly merged pool, which we could then sample from to generate our two “vowel classes.” Following the methods described in experiment 2, we generated data with sample sizes ranging from 5 to 100 tokens per vowel class, though for the sake of processing time, we only used sample sizes that were multiples of 5. For each combination of sample sizes, we then generated datasets with correlation coefficients ranging from 0 to 0.9, in intervals of 0.1. We produced 100 vowel pairs per combination of sample sizes and correlation coefficients, resulting in 361 000 new sets of simulated vowel data, enabling us to test the effect of correlation on Pillai scores, in conjunction with sample sizes and unbalanced vowel class sizes.
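
The elliptical distributions of experiment 3 can be illustrated with numpy's counterpart of mvrnorm(). This is a sketch with our own function name, assuming unit standard deviations so the covariance matrix equals the correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(11)

def merged_pair_with_correlation(n_a, n_b, rho):
    """Draw both 'vowel classes' from one bivariate normal with
    correlation rho between the simulated F1 and F2."""
    cov = np.array([[1.0, rho], [rho, 1.0]])  # sds of 1: cov == correlation
    pool = rng.multivariate_normal([0.0, 0.0], cov, size=n_a + n_b)
    return pool[:n_a], pool[n_a:]

a, b = merged_pair_with_correlation(50, 50, rho=0.9)
print(np.corrcoef(a[:, 0], a[:, 1])[0, 1])  # near 0.9: an elliptical cloud
```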

Across all experiments, Pillai scores were calculated for each pair of simulated vowel data. Pillai scores were calculated by fitting a MANOVA model to the data using the manova() function in r. The simulated F1-F2 measurements were the dependent variables, and the vowel class was the only independent variable. While it would be possible to incorporate additional simulated independent variables such as place of articulation and duration into the MANOVA, we consider this to be beyond the scope of the current paper, which focuses on the effect of sample size on resulting Pillai scores. We therefore include only historical vowel class in our models and leave more complex MANOVA models to future work. The Pillai scores and *p*-values associated with the vowel class variable were then extracted from that MANOVA model. To reiterate, the Pillai scores for all of these distributions should be very close to zero (indicating complete overlap) because every vowel pair was generated from the same underlying bivariate normal distribution. Because the data is randomly generated, some Pillai scores will be higher than others, but by rerunning the simulation many times per sample size, we can begin to see patterns that may emerge at a given sample size.

The coding for this study was done in the r programming language (R Core Team, 2021) with the help of the tidyverse suite of packages (Wickham *et al.*, 2019) and joeyr (Stanley, 2021). Visualizations were generated using ggplot2 (Wickham, 2015; Lüdecke *et al.*, 2021) with color palettes from ggthemes (Arnold, 2018) and scico (Pedersen and Crameri, 2020).

## III. RESULTS

### A. Experiment 1: Equal sample sizes

To address how sample size affects Pillai scores, we first present the results from experiment 1, where the two simulated vowel classes were the same size. Before inspecting the results of all sample sizes though, it is important to understand how the 1000 Pillai scores were distributed within a given sample size. Figure 1 shows two different views of the distribution of Pillai scores when the sample size for both groups was 10. We see that the distribution of resulting Pillai scores is rather wide, ranging from less than 0.001 to 0.568, a consequence of using such a small sample size for inferential statistics. Much of the data is clustered near the bottom of the plot but there is a long “tail” extending upwards. This is not a haphazard pattern, but rather follows a distribution that can be transformed into an *F* distribution and reflects the underlying mathematical properties of how Pillai scores are calculated [cf. Rencher and Christensen (2012), p. 182]. For this particular sample size, the mean Pillai score was 0.104, the median was 0.077, and the 95th percentile was 0.294. As seen below, these numbers change depending on the sample size, but the underlying distribution of the Pillai scores is consistent across sample sizes. We show this distribution to dispel any misconceptions that Pillai scores are uniformly distributed within a particular range, and to highlight that generally they fall near the lower end of the distribution.

With that distribution in mind, we can now zoom out to view all samples at once. Figure 2 shows all 96 000 Pillai scores by their sample size. Though present at all sample sizes, the “bottom-heavy” distribution shown in Fig. 1 is not displayed in Fig. 2, in order to make the general trend across samples easier to see. It is immediately apparent that larger groups more consistently produced lower Pillai scores. With very small sample sizes (groups of fewer than 10 observations per vowel class), Pillai scores were quite high. For these small sample sizes, Pillai scores were sometimes closer to 1 than 0, even for these underlyingly merged vowel classes which should, in principle, return a Pillai score of 0. In other words, these small sample sizes sometimes resulted in very misleading Pillai scores that may cause a researcher to interpret two vowel classes as distinct even if the true underlying distribution was perfectly merged. As the sample size increases, Pillai scores were more consistently low, as we would expect for underlyingly merged vowels.

The black line overlaid on Fig. 2 indicates the 95th percentile for each sample size. This line also very closely corresponds to the threshold for vowel class being statistically significant in the MANOVA models: almost all points above that line had *p*-values less than 0.05, while almost all points below it had greater *p*-values. Because we are modeling the null hypothesis (i.e., *underlyingly* merged vowel classes), the distribution of *p*-values is uniform. It is therefore unsurprising, and in fact expected, that the highest 5% of Pillai scores within a given sample size also return *p*-values less than 0.05. This line shows that if there are just 10 observations per vowel class, 95% of the Pillai scores were under 0.3. However, as is evident in Fig. 2, a threshold of 0.3 is only applicable to a sample size of 10 per group since Pillai scores decrease with larger samples. For example, the 95th percentile of returned Pillai scores does not drop to 0.1 until there are 30 observations per group.
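
Because the Pillai score in this two-group, two-formant setting has an exact F relation under the null hypothesis, the 95th-percentile line can also be computed analytically. The sketch below is our own reconstruction (not necessarily the formula this paper provides), but it reproduces the simulated percentiles: about 0.30 at 10 tokens per class and about 0.10 at 30 tokens per class.

```python
from scipy.stats import f

def pillai_threshold(n_total, percentile=0.95, p=2, groups=2):
    """Pillai score at the given percentile under a true merger, using the
    exact F relation for the two-group (s = 1), two-formant case."""
    df2 = n_total - groups - p + 1
    F_crit = f.ppf(percentile, p, df2)   # critical F at this percentile
    return float(p * F_crit / (p * F_crit + df2))

print(round(pillai_threshold(20), 2))  # 10 + 10 tokens: about 0.30
print(round(pillai_threshold(60), 2))  # 30 + 30 tokens: about 0.10
```

Scores above this threshold for a given total sample size would occur less than 5% of the time if the two classes were truly merged.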

### B. Experiment 2: Unequal sample sizes

While the previous section found that sample size affects Pillai scores in a predictable way, with larger samples returning more reliable Pillai scores, in this section we conduct further simulations to explore what effect, if any, an unbalanced sample has on Pillai scores. Unless data collection is carefully controlled to include a fixed number of tokens per vowel class, Pillai scores are calculated on vowel classes that do not contain the same number of tokens. Here, we ask: what effect do grossly unbalanced groups have on Pillai scores?

Figure 3 shows the mean Pillai scores from the 100 simulations for each combination of sample sizes between the two vowel classes. Going from the bottom left corner (two small samples, one from each vowel class) to the top right corner (two large samples), we see a general trend of decreasing Pillai scores as sample sizes increase. Echoing Fig. 2, the decrease is sharpest when sample sizes are small (under around 30 tokens per vowel class).

A surprising pattern emerges when we look closely at the resulting Pillai scores from unequal sample sizes: namely, that unequal samples across the two vowel classes do not impact Pillai scores; it is only the *total* sample size, taking both vowel classes together, that matters. For example, the mean Pillai score for a pair of 10 tokens and 50 tokens drawn from this underlyingly merged distribution is 0.0358, and the mean Pillai score for a pair of 20 and 40 tokens is 0.0355. A pair of 30 tokens and 30 tokens drawn from an underlyingly merged distribution is nearly identical: 0.0381. We see this pattern visually reflected along the diagonal between the top left corner and the bottom right corner of Fig. 3, which shows symmetrical resulting Pillai scores for all unequal pairs of samples that sum to the same total. In other words, we find that neither the existence of an unequal sample size between vowel classes nor the degree of imbalance impacts Pillai scores.
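This pattern is easy to verify directly. The sketch below is our own code (with hypothetical vowel labels): it compares an unequal 10 + 50 split against an equal 30 + 30 split of the same total sample, both drawn from one merged distribution.

```r
# Sketch: Pillai scores depend on total sample size, not on how it is split
# across the two vowel classes. Both classes are drawn from the same
# bivariate standard normal distribution (i.e., underlyingly merged).
pillai_split <- function(n1, n2) {
  d <- data.frame(
    F1    = rnorm(n1 + n2),
    F2    = rnorm(n1 + n2),
    vowel = rep(c("a", "o"), times = c(n1, n2))
  )
  summary(manova(cbind(F1, F2) ~ vowel, data = d))$stats["vowel", "Pillai"]
}

set.seed(1)
m_unequal <- mean(replicate(1000, pillai_split(10, 50)))  # 60 tokens total
m_equal   <- mean(replicate(1000, pillai_split(30, 30)))  # 60 tokens total
c(m_unequal, m_equal)  # the two means come out nearly identical
```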

These findings lead us to recommend simply using as many tokens as researchers have available for an individual speaker, to bring the *total* sample size as high as possible. When comparing across speakers, however, it is important to recall that total sample size impacts the resulting Pillai score; as seen in Sec. V below, we recommend normalizing and reporting total sample sizes across speakers in a study to make the resulting Pillai scores maximally comparable.

### C. Experiment 3: Correlated F1 and F2 values

One potential caveat for experiments 1 and 2 is that they are based on uncorrelated F1 and F2 values (producing a circular distribution in F1-F2 space), while in naturalistic speech vowel formant data are typically correlated (elliptical in F1-F2 space). To evaluate whether correlation between the formants affects Pillai scores, our final experiment explicitly tests the effect of correlated dependent variables by generating bivariate normal distributions with various degrees of correlation. As in the previous two experiments, Pillai scores were calculated, this time exploring the effects of sample size, unequal group sizes, and correlation together.

Since this experiment modified three variables (group 1 size, group 2 size, and correlation), visualization of all the data is less straightforward. Instead, we ran a linear regression model on these Pillai scores, with the log-transformed Pillai score as the dependent variable and the log-transformed total sample size and the correlation as predictors (Table I). We find that correlation was not a significant predictor in these 361 000 simulations, meaning there was no significant difference between the Pillai scores of uncorrelated data and correlated data, given a particular total sample size. We also find that unbalanced data did not affect Pillai scores, even in these correlated datasets. Thus, we can be confident that our results, and the implications and recommendations drawn from them, are applicable even to correlated, unbalanced data like what is typically found in real vowel formant measurements.

TABLE I. Linear regression of log-transformed Pillai scores, with log-transformed total sample size and correlation as predictors.

| | Estimate | Std. Error | *t*-value | *p*-value |
|---|---|---|---|---|
| Intercept | 0.243 484 | 0.023 980 | 10.153 | <0.001 |
| log(n) | –1.022 667 | 0.005 114 | –199.965 | <0.001 |
| Correlation | –0.003 328 | 0.007 372 | –0.451 | 0.652 |
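The correlated formant data used in this experiment can be generated with MASS::mvrnorm, a standard tool for drawing from a multivariate normal distribution with a chosen covariance matrix. The sketch below is ours, not the original simulation code, and uses a hypothetical correlation of 0.8:

```r
library(MASS)  # for mvrnorm()

# Draw two underlyingly merged vowel classes whose F1 and F2 values are
# correlated at level rho, then extract the Pillai score from a MANOVA.
pillai_correlated <- function(n_per_class, rho) {
  Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)  # unit variances, corr = rho
  xy    <- mvrnorm(2 * n_per_class, mu = c(0, 0), Sigma = Sigma)
  d     <- data.frame(F1 = xy[, 1], F2 = xy[, 2],
                      vowel = rep(c("a", "o"), each = n_per_class))
  summary(manova(cbind(F1, F2) ~ vowel, data = d))$stats["vowel", "Pillai"]
}

set.seed(1)
m_round    <- mean(replicate(500, pillai_correlated(50, rho = 0)))
m_elliptic <- mean(replicate(500, pillai_correlated(50, rho = 0.8)))
c(m_round, m_elliptic)  # correlation leaves the Pillai distribution unchanged
```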

## IV. IMPLICATIONS

### A. Choosing a threshold based on sample size

As a result of these simulations, we are in a position to recommend a formula that researchers can use as a guide to determine whether a given resulting Pillai score indicates a merger. We sought a formula that was a function of sample size and could provide the 95th percentile for Pillai scores given that two groups come from the same underlying distribution. In other words, we wanted to find the function for the black line in Fig. 2. We chose the 95th percentile for this distribution to align with the common benchmark of *p* < 0.05.

The formula is based on the observation that, by taking the natural log of the Pillai score and the natural log of half of the total sample size of both groups (essentially log-transforming the axes in Fig. 2), the 95th percentile per sample size follows a straight line with an intercept of 1 and a slope of –1. Because “half of the total sample size” can be confusing, we begin by simplifying this to a variable *m*, which represents *n*/2, where *n* is the total sample size across both groups. In other words, *m* is the average sample size per group. These parameters were determined by rerunning Experiment 1 a hundred more times and fitting a simple linear regression to the 95th percentile of the Pillai scores in each iteration. We found that a line with an intercept of 1 and a slope of –1 fell within the 95% confidence intervals of the fitted parameters approximately 95% of the time, and we therefore adopt these parameters for the line we seek to model. On the log-transformed axes, that line is log(*pillai*) = 1 – log(*m*). Solving for *pillai* by exponentiating both sides gives $p_{95} = e^{1 - \log(m)}$, which reduces^{5} to the formula we recommend for determining a Pillai score cutoff:

$p_{95} = e/m$, (1)

where *p*_{95} is the 95th percentile of Pillai scores given the average sample size per group *m*. The curve generated by Eq. (1) very nearly follows the black line in Fig. 2, which represents the 95th percentile of Pillai scores given the average sample size (the same as the sample size for each group in that figure, since both groups were the same size). We believe this formula can be fruitfully used to determine a reasonable cutoff for “merged” status for a given total sample size.

Using Eq. (1), we see that, for example, a sample of 20 total observations would produce a Pillai score less than or equal to 0.2718 95% of the time if the two vowel classes were underlyingly merged [we note that this is a close approximation to the cutoff of 0.3 that Jibson (2021) chose for his total sample size of 20]. For illustration, Table II presents threshold suggestions drawn from this formula for different total sample sizes, highlighting that it takes a great deal of data to reliably return Pillai scores that are close to zero for underlyingly merged data (recall that Pillai scores closer to zero reflect a more “merged” production). We recommend that researchers use Eq. (1) in conjunction with *p*-values to make a more informed decision about the merged status of two vowel classes rather than an *ad hoc* or arbitrary cutoff.
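In R, Eq. (1) reduces to a one-line function (the function name is ours):

```r
# Eq. (1): the suggested 95th-percentile "merged" threshold, where n is the
# total sample size across both vowel classes and m = n/2 is the mean group
# size.
pillai_threshold <- function(n) {
  m <- n / 2
  exp(1) / m
}

pillai_threshold(20)   # 0.2718, the first row of Table II
pillai_threshold(120)  # 0.0453, the last row of Table II
```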

TABLE II. Suggested Pillai score thresholds for “merged” status given total sample size, computed with Eq. (1).

| Total sample size (*n*) | Mean group size (*m*) | Pillai threshold for “merged” |
|---|---|---|
| 20 | 10 | 0.2718 |
| 40 | 20 | 0.1359 |
| 60 | 30 | 0.0906 |
| 80 | 40 | 0.0680 |
| 100 | 50 | 0.0544 |
| 120 | 60 | 0.0453 |

### B. Sample size matters across speakers but not across vowels

One important takeaway from these simulations is that it takes a relatively large amount of data to reliably (meaning 95% of the time) return a low Pillai score such as 0.1 from two underlyingly merged vowels. In an analysis of English /ɑ/ and /ɔ/, for example, conversational data can typically provide a sufficient number of observations for a robust analysis of overlap. However, few studies that analyze wordlist data contain many more than 10 tokens of these two vowels. Even within long-form conversational data, if the research question focuses on an infrequent phonological variable, the total sample size may never be large. Because total sample size has a major impact on the resulting Pillai score, we recommend researchers choose a relevant threshold for “merger” status, based on their total sample size, and use it in conjunction with the resulting *p*-value to make a determination about merger status for their data.

Perhaps the most surprising takeaway from these simulations, for both authors, was that although the Pillai score is a parametric measure, it does not actually matter if the token counts across the two categories being investigated are unequal. Instead, the most critical consideration is the total number of tokens, summed across both categories. This is particularly important for naturalistic sociolinguistic work, which relies on casual conversation rather than carefully constructed word lists for data, meaning that it is often not possible to obtain balanced token counts across categories. Following the results of experiment 2, we can reassure researchers that unbalanced tokens across vowel classes will not impact the resulting Pillai score. Instead, and following the results of experiment 1, we recommend using as many total tokens as possible for an analysis of a single speaker, regardless of unbalanced samples across vowel classes for that speaker.

At the same time, any study aiming to compare the “merger” status of multiple individual speakers should take into account the total sample sizes for each speaker, and especially consider the fact that sample sizes (and therefore, the interpretation of resulting Pillai scores) may be different across speakers. One of the primary goals for a robust measure of vowel overlap in sociolinguistics has been to track the development of vowel mergers and splits across speakers in a given corpus (Nycz and Hall-Lew, 2013; Johnson, 2010; Strelluf, 2018; Labov *et al.*, 2016) to analyze the trajectory of large-scale language change over time. Because distributions with lower token counts produce inflated Pillai scores, speakers who are less talkative will artificially appear to have more distinct vowels than speakers who are very talkative.^{6} We recommend one of two ways to account for unequal sample sizes across speakers.

One option is for researchers to conduct an analysis of individual speakers, incorporating all the relevant pieces of evidence (the recommended Pillai threshold given an individual speaker's sample size, Pillai scores, *p*-values, and visualizations), to make a determination about the merged status of each speaker. Section V A presents an example of this method in action, where we demonstrate how we leveraged all these pieces of evidence together to try to understand the merged status of each speaker. This method allows researchers to obtain a fairly robust understanding of individuals as they compare to each other and across styles. However, since this method requires researchers to synthesize a number of gradient measures into discrete categories (such as “merged,” “unmerged,” and perhaps “partially merged”), it makes it more difficult to track fine-grained changes in the development of a merger across a large speech community over time.

For researchers aiming to analyze fine-grained differences over real or apparent time across many speakers, however, it may be more beneficial to keep Pillai as a gradient measure. To analyze Pillai across many speakers at once, we recommend analyzing the same number of tokens per speaker. Our recommendations for how best to do this are discussed in detail in Sec. V B, and an example of the R code used to apply this recommendation is provided in the supplementary file.^{7}

A similar issue arises when comparing merger status from a single individual speaker across multiple speech styles. In particular, sociophonetic work often compares casual speech styles like conversation to more formal speech styles, such as a minimal pair list or a reading task. Critically for Pillai scores, since smaller samples artificially inflate the apparent distinction, speech styles with lower token counts (such as word lists and reading tasks) will artificially appear more distinct than speech styles with higher token counts (such as casual speech). In sum, there is a strong risk that a Pillai score difference between speech styles may be interpreted as a stylistic difference (whereby speakers appear to be maintaining a distinction in wordlists that they do not maintain in casual speech) when the effect is entirely driven by differences in total sample sizes across the two styles. It is not uncommon, for instance, to hear of speakers apparently undoing mergers in read speech in comparison to their casual speech [cf. Labov (1994), p. 80; Berry (2018); and Berry and Ernestus (2018)]; while Labov (1994) correctly points out that apparent unmergers in careful speech may be speakers hypercorrecting in response to orthographic differences across historically distinct vowel classes, we can add that distinctions measured by Pillai score will additionally be impacted by sample size. We urge researchers who use Pillai as a metric of overlap to pay close attention to differences in total sample sizes across speech styles,^{8} and either control for total sample size across styles or adjust their threshold of “merger” in each style accordingly, following the same recommendations provided above and elaborated on in Sec. V.

### C. What to report when using Pillai scores

Finally, we end with some general recommendations for what researchers should include in their results when using Pillai scores as a measure of overlap, in addition to the model specification (i.e., the independent and dependent variables). First, because total sample size impacts Pillai score so strongly, we recommend always reporting total sample size per speaker (or per style, for studies that compare merger across speech styles). This practice will have the added benefit of enabling better comparisons across sociolinguistic studies, in turn enabling researchers to gain a clearer picture of the large-scale spread of some ongoing mergers such as the /ɑ/-/ɔ/ merger.

Second, in addition to reporting and controlling for total sample sizes, we recommend reporting *p*-values where possible. In other words, where studies report individual speakers' (or styles') Pillai scores, those should always include the total sample size and the *p*-value as well. We are not necessarily advocating for an increased reliance upon *p*-values as a binary meaningful threshold, particularly since some statisticians are urging quantitative researchers to abandon declarations of “statistical significance” and to instead understand *p*-values as one gradient measure in concert with additional evidence (Wasserstein *et al.*, 2019). However, reporting a Pillai score without its accompanying *p*-value is akin to reporting a regression estimate without a *p*-value, a *p-*value without an effect size, or a mean without a standard deviation. A Pillai score is a test statistic and should be reported as such. These two numbers in conjunction paint a better picture of the merged status of a pair of vowels than either one in isolation.

## V. A CASE STUDY

The following is a case study to illustrate how one might conduct an analysis of vowel merger using the recommendations in this paper. As illustrative data, we draw from sociolinguistic interviews conducted and analyzed by author J.A.S. with residents of southwest Washington State, a region where the low back vowels (/ɑ-ɔ/) are often merged, though with some indication of separation in some speakers [see Stanley (2020) for more details]. On average across the 52 participants analyzed here, interviews lasted 46 min and yielded 143 tokens containing either /ɑ/ or /ɔ/ in preobstruent position. After the interviews, 30 of those participants then read a wordlist containing another 20 tokens in a more careful style.

### A. Analysis of individual speakers

We performed a MANOVA on each speaker's F1 and F2 measurements, separately for each style, with historic vowel class as the only predictor. Given the number of observations produced by each speaker, Eq. (1) was implemented to establish a potential cutoff value for each style. The *p*-values and the Pillai scores were extracted from the MANOVA model and the latter were compared to the cutoff values.
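This per-speaker procedure can be sketched in R as follows. The function and the column names (F1, F2, vowel) are our assumptions about the dataset's structure, not the original analysis code:

```r
# For one speaker and style: fit the MANOVA, extract the Pillai score and
# p-value, and compare the score against the Eq. (1) threshold for that
# speaker's token count.
analyze_speaker <- function(d) {
  fit       <- summary(manova(cbind(F1, F2) ~ vowel, data = d))
  stats     <- fit$stats["vowel", ]
  n         <- nrow(d)
  threshold <- exp(1) / (n / 2)  # Eq. (1)
  data.frame(n         = n,
             pillai    = unname(stats["Pillai"]),
             p_value   = unname(stats["Pr(>F)"]),
             threshold = threshold,
             below     = unname(stats["Pillai"]) < threshold)
}

# Demo on simulated merged data (real use would subset one speaker's tokens):
set.seed(1)
demo <- data.frame(F1 = rnorm(100), F2 = rnorm(100),
                   vowel = rep(c("a", "o"), each = 50))
analyze_speaker(demo)
```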

In this sample, there are some cases where the evidence overwhelmingly points towards a merger. For example, 48-year-old Donna produced 179 low back vowels in the conversational portion of her interview. With that many tokens, Eq. (1) suggests that if her vowels were underlyingly merged, her Pillai score should be less than 0.0304 about 95% of the time. In other words, anything less than 0.0304 would be evidence for a merger. As it turns out, the MANOVA model performed on her data yielded a Pillai score of 0.0289 with a *p*-value of 0.0756. The fact that her Pillai score is less than the threshold for her token count and that the *p*-value is above 0.05 means that historical vowel class category does not predict Donna's acoustic realization, suggesting that Donna's two vowel classes are likely underlyingly merged. In the wordlist data, there were only 20 tokens, so the suggested cutoff value determined by Eq. (1) is much higher, at 0.2718, because of the smaller sample size. The MANOVA performed on Donna's wordlist data produced a Pillai score of 0.1648 (*p* = 0.216). We reiterate that sample sizes also impact *p*-values, with smaller sample sizes producing higher *p*-values. Taken together, these data make a strong case that /ɑ/ and /ɔ/ are underlyingly merged in Donna's speech in both speech styles.

On the other hand, some speakers' data are indicative of a distinction. Kim, another 48-year-old woman, produced 137 low back tokens in the conversational portion of her interview. Equation (1) suggests that a Pillai score less than 0.0397 would be evidence for a merger, but the MANOVA performed on her data returned a Pillai score of 0.0658 (*p* = 0.010). Though the Pillai score is relatively close to zero, we do not interpret her data as underlyingly merged, since with that many observations from a truly merged distribution we would expect an even lower Pillai score (less than 0.0397). Based on the 20 tokens from her wordlist, the cutoff would be 0.2718, but the MANOVA on those 20 tokens yielded a Pillai score of 0.3700 (*p* = 0.020), further suggesting a distinction. Because Kim's Pillai scores were higher than the thresholds and were accompanied by low *p*-values for both speech styles, we conclude that Kim's low back vowels, while close in acoustic space, are not fully merged.

However, even when considering *p*-values alongside recommended thresholds, not all cases are as straightforwardly interpretable as Donna's and Kim's. Scott is a 28-year-old man whose interview contained 195 low back tokens. The Pillai score based on his data was 0.1975, far higher than the threshold (0.0279) produced by Eq. (1), and was accompanied by a low *p*-value (*p* < 0.001), suggesting two distinct underlying vowel distributions. However, the Pillai score based on the 20 tokens he produced in his wordlist (0.0397) was much lower than the threshold (again, 0.2718) and had a high *p*-value (*p* = 0.7090). With the Pillai score lower than the threshold and accompanied by a high *p*-value in his wordlist data, it is tempting to interpret Scott as a rare case of producing a merger in the wordlist that he does not produce in the conversational portion of the interview. However, because the sample size is so small in the wordlist, the Pillai score (and indeed, any result of an inferential statistical test) should be taken with a large grain of salt. We include Scott's data here in part to demonstrate that even when leveraging Pillai scores alongside a recommended threshold and a *p*-value, data with low token counts can still be difficult to interpret.

We can add one additional tool to the suite of evidence we consider when diagnosing individual speakers: a visual inspection of a plot. Figure 4 shows the distributions of /ɑ/ and /ɔ/ tokens in F1-F2 space for conversational data (left) and wordlist data (right) for Donna (top), Kim (middle), and Scott (bottom). Ellipses represent one standard deviation for each vowel class. A visual inspection of these plots shows that both vowel classes exhibit a fair amount of overlap in both styles, with what appears to be more overlap in Donna's, as her /ɔ/ vowel class actually encompasses /ɑ/ in both speech styles (a distributional property that indicates merger). Adding the measures of Pillai scores, using the recommended thresholds for “merged” given the specific token counts, along with *p*-values for these speakers in these two styles, enables us to more confidently state that Donna's two vowels are underlyingly merged, while Kim's are distinct in both styles and Scott's are distinct at least in the conversational style. For both Donna and Kim, the Pillai scores were higher in the wordlist style compared to the conversation. Seeing these differences in Pillai scores alone, a researcher may be tempted to conclude that there is style shifting occurring for both speakers, such that the merger undoes itself in more careful speech. Leveraging all of the evidence together (Pillai scores alongside the recommended threshold for a given sample size, as well as *p*-values and a visual inspection of the data) allows us to reject this interpretation and instead see important differences between speakers in the sample.

We note, in fact, that interpreting Pillai score alone without the additional evidence of threshold and *p*-value provides a misleading interpretation of the entire dataset. Across the 30 speakers in the sample who completed both tasks, a one-sided paired *t*-test comparing Pillai scores in conversational data to Pillai scores in wordlist data suggests that there is a statistically significant trend towards higher Pillai scores (i.e., a “less merged” pronunciation) in the wordlist data (*t* = –3.3296, df = 29, *p* = 0.001). This pattern obtains across most speakers, suggesting on the surface that almost every speaker in the sample “unmerges” their vowels in wordlist style data or that they only merge the vowels in casual conversation. Given that the low-back merger is a change in progress [cf. Labov *et al.* (1972) for other examples], this pattern obtaining across this many speakers of all ages is suspiciously regular, much more regular than we would expect given patterns of variation and change in sociophonetic work. In fact, it was this apparently regular unmerging found in these participants' wordlists that led us to investigate the effect of sample size in the first place. Incorporating all of the relevant pieces of evidence (recommended threshold given sample size, along with Pillai scores and *p*-values) allows us to understand the suspiciously regular finding as an artifact of wordlists having far smaller sample sizes than conversational speech. Likewise, leveraging all of the evidence, including sample size differences across styles, allows a clearer understanding of the individual speakers in the sample and whether they are likely to be truly merged or not.
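The style comparison above used a one-sided paired *t*-test. In R it looks like the sketch below, run here on hypothetical per-speaker Pillai scores: the two vectors are fabricated for illustration only, with wordlist scores drawn higher to mimic the small-sample inflation discussed above.

```r
# Hypothetical per-speaker Pillai scores for 30 speakers (illustration only).
set.seed(1)
pillai_conv <- rexp(30, rate = 40)  # conversational: many tokens, low scores
pillai_word <- rexp(30, rate = 8)   # wordlist: few tokens, inflated scores

# One-sided paired t-test: are conversational scores lower than wordlist?
res <- t.test(pillai_conv, pillai_word, paired = TRUE, alternative = "less")
res$p.value
```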

### B. Analysis of many speakers

While the level of scrutiny in Sec. V A is appropriate and encouraged for analyzing individual speakers, we acknowledge that researchers may have a different goal in mind. For instance, researchers tracking the development of a merger will want to understand how the two vowel classes become less distinct in phonetic space across generations, even before a categorical merger has taken place. Analyzing data speaker-by-speaker requires us to take gradient measures (Pillai scores, threshold, and *p*-value) and interpret them as discrete categories for each speaker and style (“merged,” “not merged,” or “partially merged,” depending on the researcher's interpretation); this discrete interpretation in turn makes it difficult to track how large-scale change proceeds across generations. In this section we explore how Pillai scores have changed over time in this sample by fitting a linear model to the data after implementing a bootstrapping procedure to remove differences in sample size across speakers.

To ensure that Pillai scores are comparable across a large number of individual speakers, we recommend reducing the size of each speaker's dataset down to the sample size of the speaker who contributed the fewest tokens. Downsampling in this way allows us to obtain Pillai scores that are comparable across all speakers, and therefore to track fine-grained changes in merger status across apparent time.

To begin our downsampling example, we first restrict our analysis to the more prolific conversational portion of the interview, to maximize the possible total sample size per speaker. The least talkative speaker in this sample produced only 82 tokens in this style. So, even though some speakers produced many more (as many as 327 in one case), we took a random sample (with replacement^{9}) of just 82 tokens from each speaker. However, we found that Pillai scores from a random sample of one speaker's data varied considerably from a different random sample from that same speaker. So, we implemented a bootstrapping procedure and took 1000 random samples (with replacement) of 82 tokens from each speaker's dataset, calculated the Pillai scores and other summary statistics from each sample, and aggregated them by speaker. The correlation between speakers' Pillai scores on the full dataset and speakers' mean Pillai scores across the 1000 samples was 0.9984, suggesting that the aggregated bootstrapped values are a very good approximation of the full dataset's values. The difference is that they are based on equal sample sizes rather than different sample sizes, which means that the resulting Pillai scores will be comparable across speakers.
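The bootstrapping procedure just described can be sketched as follows. This is our own code, not the original scripts: 82 is this sample's minimum token count, and the column names are assumptions about the dataset's structure.

```r
# Resample each speaker's tokens down to a common size (with replacement),
# compute a Pillai score per resample, and summarize across resamples.
pillai_once <- function(d) {
  summary(manova(cbind(F1, F2) ~ vowel, data = d))$stats["vowel", "Pillai"]
}

bootstrap_pillai <- function(d, n_target = 82, reps = 1000) {
  scores <- replicate(reps, {
    resampled <- d[sample(nrow(d), n_target, replace = TRUE), ]
    pillai_once(resampled)
  })
  c(mean = mean(scores), sd = sd(scores))
}
```

Running bootstrap_pillai() on each speaker's conversational tokens yields the aggregated per-speaker means that, in our data, correlated at 0.9984 with the Pillai scores from the full datasets.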

To identify whether there were patterns across gender or time, we fit a linear model to the log-transformed Pillai scores, with gender, birth year, and their interaction as predictors (Fig. 5). Since the data were downsampled, Pillai scores can be directly compared, and the influence of particularly talkative or reticent speakers is minimized. The results of this model show that Pillai scores for older women are low (indicating a merger) while those for older men are high (indicating a distinction). The difference between men and women decreases over time, with Pillai scores converging among the youngest speakers.

## VI. CONCLUSION

In this paper, we present a close view into the effect of sample size on resulting Pillai scores, a common measure for quantifying vowel overlap. We use a series of simulation experiments drawing from an underlyingly merged pair of vowels to demonstrate (1) that larger sample sizes yield reliably lower Pillai scores, (2) that unequal group sizes across the two vowel classes are irrelevant in the calculation of Pillai scores, and (3) that it takes more data than many sociolinguistic studies collect to reliably return a low Pillai score (e.g., under 0.1) even for underlyingly merged data. These results have implications for how Pillai scores are compared across studies and between speakers or speech styles within the same study. We provide some recommendations for maximizing reliability in the use of Pillai scores and provide a formula to assist researchers in determining a reasonable Pillai score threshold to use as an indicator of merged status given their sample size. We recommend the use of Eq. (1) in conjunction with the Pillai scores' accompanying *p*-values to make informed decisions about the merged status of two vowels in a given speaker. By properly using and reporting aspects of Pillai score, researchers can come closer to accurately quantifying vowel overlap, identifying vowel mergers, and ultimately understanding broad patterns of variation and change in vowel merger.

## ACKNOWLEDGMENTS

Thanks to the Linguistics Discussion Group at BYU and the audiences at ASA2021 in Seattle and NWAV50 in San Jose for their useful comments. We thank William Christensen and Joe Fruehwald for their help in understanding underlying distributions and transformations. We also thank the two anonymous reviewers for their thoughtful and encouraging feedback, without which this paper would have been substantially less clear and less impactful. Any remaining faults are our own. J.A.S. also graciously acknowledges the University of Georgia Graduate School Dean's Award for funding the fieldwork that produced the data used in the case study.

^{1}

See also Kelley and Tucker (2020) for details on the overlapping coefficient, another nonparametric method for calculating overlap.

^{2}

Preliminary work for Sneller (2018) attempted, with mixed results, a modified model approach of Bhattacharyya's affinity by extracting model values for fixed effects and manually adjusting the data. However, the onerousness of this approach makes it not widely implementable in comparison with Pillai, which has the advantage of being able to easily integrate commonly used mixed-effects models.

^{3}

For the uncorrelated data, the bivariate normal distribution was most easily generated by combining two independent univariate normal distributions, because the product of their probability densities is equal to their joint probability density [Johnson and Wichern (2012), Chap. 4, p. 151]. We generated F1 in R by running rnorm(x, mean = 0, sd = 1), where *x* is the number of tokens generated. F2 was generated using the same code, and the two sets of numbers were combined to create the bivariate normal distribution.

^{4}

In this paper, we focus on the effect of sample size for underlyingly merged speakers. Future work may fruitfully investigate how sample size interacts with Pillai score for underlyingly unmerged speakers as well.

^{5}

Let $p_{95} = e^{1 - \log(m)}$. Since $e^{x-y} = e^x e^{-y} = e^x / e^y$, and since $e^1 = e$, then $p_{95} = e / e^{\log(m)}$. Since $e^{\log(x)} = x$, then $p_{95} = e/m$. (We are grateful to the anonymous reviewer who pointed out these simplification steps for this formula.) This is most easily implemented in R as exp(1)/m.

^{6}

We also reiterate the caution about small sample sizes overall and point out that very small samples may increase the likelihood of a type II error (i.e., a false negative or failing to reject the null hypothesis when it is actually false), leading the researcher to conclude that two vowels are merged when in reality there is just not enough data to detect a distinction.

^{7}

See supplementary material at https://www.scitation.org/doi/suppl/10.1121/10.0016757 for a brief tutorial on how to implement these recommendations in R.

^{8}

One reviewer asked why we did not recommend accounting for the difference in sample sizes across speech styles by simply adding a term for style in the linear model. The reasoning is fairly straightforward: researchers cannot control how many tokens of interest are produced in the conversational portion of the interview, allowing the difference in token count across styles to vary wildly by speaker. Since sample size influences Pillai scores in a predictable way but style influences Pillai scores in an unpredictable way (as it is dependent on the differences in sample sizes), it is inadvisable to use style as a stand-in rather than controlling for or incorporating sample size directly.

^{9}

Note that we sample with replacement (replace = TRUE), in order to make our resampling comparable across all speakers including our least talkative speaker. If we were to resample *without* replacement, this introduces a confound related to sample size: the amount of error introduced per speaker is proportional to their sample size. Resampling without replacement 1000 times would produce 1000 identical distributions for the least talkative speaker, but 1000 different distributions for every other speaker. Resampling *with* replacement allows us to obtain a standard deviation for each speaker with similar confidence and introduces a similar amount of uncertainty across speakers in the means of their distributions. For more information on bootstrapping with replacement, we refer the reader to Ismay and Kim (2020).
