Fricatives have noise sources that are filtered by the vocal tract and that typically possess energy over a much broader range of frequencies than observed for vowels and sonorant consonants. This paper introduces and refines fricative measurements that were designed to reflect underlying articulatory and aerodynamic conditions. These show differences in the pattern of high-frequency energy for sibilants vs non-sibilants, voiced vs voiceless fricatives, and non-sibilants differing in place of articulation. The results confirm the utility of a spectral peak measure (FM) and a low–mid frequency amplitude difference (AmpD) for sibilants. Using a higher-frequency range for defining FM for female voices for alveolars is justified; a still higher range was considered and rejected. High-frequency maximum amplitude (Fh) and the amplitude difference between low- and higher-frequency regions (AmpRange) capture /f-θ/ differences in English and the dynamic amplitude range over the entire spectrum. For this dataset, with spectral information up to 15 kHz, a new measure, HighLevelD, was more effective than the previously used LevelD and Slope in showing changes over time within the frication. Finally, measures of fricatives in isolated words differ from those in connected speech. This work contributes improved measures of fricative spectra and demonstrates the necessity of including high-frequency energy in those measures.

Fricative production involves the generation of a noise source and filtering of that source by the vocal tract. The spectrum of this noise reflects the underlying articulation and aerodynamics of the sounds [see Shadle (2023) and citations therein], carries phonemic information for listeners (e.g., Harris, 1958; Whalen, 1991), and can signal speaker-specific information, such as the size of the person's vocal tract (e.g., Fox and Nissen, 2005; Shadle et al., 2017; cf. also McGowan and Nittrouer, 1988) along with sex and sexual orientation (Fox and Nissen, 2005; Fuchs and Toda, 2010; May, 1976; Munson et al., 2006; Shadle et al., 2017).

It has long been known that some fricatives, such as /s ʃ ç x/, possess considerable high-frequency energy (e.g., Strevens, 1960). Further, frication noise covers not only a wide frequency range but also a wide dynamic (amplitude) range. These frequency and amplitude characteristics can also vary considerably over the time course of the sounds. The complexity of this aperiodic source has led researchers to develop many measures to differentiate phonemically different fricatives across languages, capture coarticulatory effects, and reveal speaker differences. As discussed in Sec. I B 6, previous studies employed various spectral slope measures that were designed to take high-frequency energy into account. Nevertheless, few extant measures have been designed specifically to quantify high-frequency fricative energy. In this work, we use theory-driven measures to explore how information above 7 kHz can reveal changes over time within the fricative noise and differentiate between sibilants and non-sibilants as well as fricatives varying in voicing and place of articulation.

The definition of what is considered a “high” frequency in the speech literature has varied widely. Jacewicz and Fox (2020) used 3.5 kHz as the lower boundary of this region, following historical work on the bandwidth required to transmit speech with reasonable intelligibility. For example, French and Steinberg (1947) reported that about 75% intelligibility was reached with high-frequency cutoffs between 3000 and 4000 Hz. Adding the frequency band from 4860 to 5720 Hz raised intelligibility from 90% to 95%, and frequencies of 5720–7000 Hz raised it from 95% to 100%. Thus, energy above 5.7 kHz was traditionally deemed to be non-critical for speech perception (Monson et al., 2014). For purposes of this paper, we will consider high frequencies as being those 7 kHz and above, following contemporary standards for wideband telephony (International Telecommunication Union, 2007). The fundamental aim of our research program is to understand fricative production mechanisms across languages and speaker types (adults vs children, typical vs clinical populations) and develop better models. That is, high-frequency energy can inform models of production even if it is not perceptually salient to human listeners.1

Spectral moments were proposed as a method of quantifying fricative noises in some early work (e.g., Strevens, 1960; Jassem, 1965, 1979), and their use expanded subsequent to the publication of Forrest et al. (1988). In adult speakers, the first moment [referred to as M1, center of gravity (COG)] and sometimes the third moment (M3 or L3) have been found to separate sibilants by place of articulation (Avery and Liss, 1996; Jongman et al., 2000; Shadle and Mair, 1996; Tjaden and Turner, 1997), and the second moment (M2) can separate sibilants and non-sibilants (Jongman et al., 2000; Shadle and Mair, 1996). However, as reviewed in detail by Shadle (2023), results of moments analyses can vary greatly as a function of recording and analysis methods, and moments values do not always allow clear inferences about production mechanisms. Thus, Shadle and coauthors (see Shadle, 2023; cf. also Koenig et al., 2013; Shadle et al., 2014) have proposed fricative measures that are more clearly related to articulatory and aerodynamic conditions than spectral moments. The measures were developed based on mechanical modeling work (Shadle, 1985), which evaluated how parameters such as constriction shape and size, downstream obstacles, and airflow rates affect noise generation. Later studies have sought to assess how conclusions based on mechanical modeling might be observed in human speech. Several studies in this vein incorporated measures of sustained fricatives, partly to facilitate description of “pure” frication noise, without contextual effects or complications of short signals (e.g., Badin et al., 1994; Jesus and Shadle, 2002; Shadle and Mair, 1996; Shadle et al., 2016; see also Strevens, 1960), leaving open the question of how well the measures would extend to fricatives in isolated words and in connected speech. In Sec. I B, we review how these measures can be employed to characterize specific fricative contrasts, pointing to the relevance of high-frequency information.

1. Sibilants: Revisiting the frequency ranges for defining FM

Compared to alveolar sibilants /s z/, the retracted tongue position and lip rounding of the post-alveolar sibilants /ʃ ʒ/ combine to lower the frequency of the front-cavity resonance (e.g., Gordon et al., 2002; Jesus and Shadle, 2002; Shadle et al., 1991). Our previously introduced measure FM was designed to obtain that resonant frequency in an automated fashion. For American English /ʃ/ produced by men, that front-cavity resonance is in the region of 2–3 kHz (Narayanan, 1995;2 Shadle, 1991); it is higher for /s/ than /ʃ/ and also higher in women compared to men (e.g., Shadle et al., 2017; Shadle et al., 2020a,b). Here, we revisit the frequency range used to establish FM in women's alveolar fricatives. Specifically, we ask whether raising the high-frequency cutoff from 7 to 8 kHz could improve our estimates of this resonance. Current definitions for FM and other measures are shown in Table I.

TABLE I.

Parameter definitions. Main outcome measures, which appear in the figures and statistical analyses, are in boldface; non-bolded items were calculated as intermediate variables.

Measurement Definition
FM  The frequency of the spectral peak for the main resonance, formed by the critical constriction. Range for /ʃ/: 2–4 kHz. Range for /s z/: 3–7 kHz for men; 3–8 kHz for women. The higher cutoff for women was implemented as a change from past work. Sibilants only. 
AmpD  The amplitude at FM minus the minimum amplitude of the low-frequency spectral valley formed by acoustic cancellation (dB). The low-frequency range was defined as 1–2 kHz for /ʃ ʒ/ and 1–3 kHz for all other sounds. Sibilants only. 
LevelM  Average of all squared magnitudes of the multitaper spectrum in the middle frequency range, converted to dB. The mid-frequency range was 2–4 kHz for /ʃ ʒ/ for men, 3–7 kHz for all other sounds; for women, 3–8 kHz for /s z/ and 3–7 kHz for all other sounds. 
LevelH  Average of all squared magnitudes of the multitaper spectrum in the high-frequency range (for men, 7–11 kHz for all sounds; for women, 8–11 kHz for /s, z/ and 7–11 kHz for all other sounds), converted to dB. 
LevelHH  Average of all squared magnitudes of the multitaper spectrum from 11 to 15 kHz, converted to dB. 
LevelD  Difference between LevelM and LevelH (= LevelM – LevelH). 
HighLevelD  Difference between LevelM and the sum of LevelH and LevelHH (= LevelM – LevelH – LevelHH). 
Fh  The frequency of the maximum amplitude from 5 to 12 kHz. Non-sibilants only. 
FmaxA  The frequency of the maximum amplitude from 2 to 13 kHz. Non-sibilants only. 
FminA  The frequency of the minimum amplitude from 1 to 7 kHz. Non-sibilants only. 
AmpRange  The amplitude at FmaxA minus the amplitude at FminA. Non-sibilants only. 
Slope  The slope of the regression line fitted to the multitaper spectrum from FM to 14 kHz for sibilants and from Fh to 14 kHz for non-sibilants. 
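To make the Level definitions in Table I concrete, here is a minimal Python sketch of the band-level computations; the helper names are ours (hypothetical, not from the paper), the spectrum is assumed to be a list of (frequency in Hz, squared magnitude) pairs, and the band edges follow the men's definitions for /s z/.

```python
import math

def band_level_db(spectrum, f_lo, f_hi):
    """Average the squared magnitudes in [f_lo, f_hi) and convert to dB."""
    vals = [p for f, p in spectrum if f_lo <= f < f_hi]
    return 10.0 * math.log10(sum(vals) / len(vals))

def level_measures(spectrum):
    """LevelM, LevelH, LevelHH, and their differences (men's /s z/ bands)."""
    level_m = band_level_db(spectrum, 3000.0, 7000.0)     # mid band
    level_h = band_level_db(spectrum, 7000.0, 11000.0)    # high band
    level_hh = band_level_db(spectrum, 11000.0, 15000.0)  # highest band
    return {
        "LevelM": level_m,
        "LevelH": level_h,
        "LevelHH": level_hh,
        "LevelD": level_m - level_h,
        "HighLevelD": level_m - level_h - level_hh,
    }
```

Averaging the squared magnitudes before converting to dB (rather than averaging dB values) follows the table's wording "average of all squared magnitudes ... converted to dB."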

Maniwa et al. (2009) also included a spectral peak frequency as one of their 14 parameters. They defined it differently, however, by first pre-emphasizing the ensemble-averaged spectrum and then finding the maximum amplitude over the 15 kHz range. This would mean that in spectra with significant high-frequency energy, the peak frequency would not necessarily be the lowest uncanceled front-cavity resonance, which is what our parameter is designed to estimate.

2. Voiced vs voiceless fricatives

The different laryngeal settings for voiced vs voiceless fricatives have predictable acoustic effects. In particular, the periodically interrupted airflow that occurs in adducted (voiced) conditions leads to a reduction in the pressure drop across the supraglottal fricative constriction. Conversely, abduction for voicelessness allows for a larger pressure drop across the supraglottal constriction. A greater pressure decrease, all else being equal, corresponds to a greater source strength (Stevens, 1971, 1998). The effect is to increase noise amplitudes, particularly at higher frequencies (Jesus and Shadle, 2002; Krane, 2005; Shadle and Mair, 1996). Accordingly, we expect that voiced fricatives will have lower noise amplitudes above 7 kHz than voiceless ones.

The higher noise source strength in voiceless fricatives should also lead to a larger amplitude difference (AmpD, cf. Sec. I B 3) between the main fricative peak and the lower-frequency spectral minimum in voiceless fricatives than voiced ones. Jesus and Shadle (2002) predicted this in their study of European Portuguese and found it to hold for the sibilants (but not the non-sibilants).

3. AmpD and AmpRange: Evaluating changes over time

We previously proposed a measure, AmpD,3 to differentiate sibilants and non-sibilants (Jesus and Shadle, 2002; Koenig et al., 2013; Shadle, 1985; Shadle and Mair, 1996; Shadle et al., 2020a,b). AmpD represents the amplitude difference between the low-frequency spectral minimum and the main spectral peak usually occurring at the front-cavity resonance. In sibilants, the lingual constriction directs a turbulent jet at the teeth, yielding a strong and localized noise source. Together these conditions generate a low-frequency anti-resonance and strongly excited resonances at mid and higher frequencies, giving this class of fricatives a large AmpD (ca. 20–35 dB).

AmpD also varies over the time course of a fricative, which may relate in part to a decrease in the constriction size over time (cf. Koenig et al., 2013; Shadle et al., 2017). When the cross-sectional area of the constriction gets small enough, the front and back cavities are acoustically decoupled, which causes back cavity resonances to be cancelled, deepening the low-frequency trough. Simultaneously, higher air particle velocity within a smaller constriction generates more turbulence noise and, therefore, raises the amplitude of the main spectral peak. Both factors, the deeper trough and higher amplitude peak, increase AmpD over the time course of the fricative. Thus, in general, AmpD shows an inverted V-shaped trajectory over time (Jesus and Shadle, 2002; Koenig et al., 2013). Jesus and Shadle (2002) further found that the temporal differences were largest for sibilants, where the turbulent jet impinging on the incisors provides more efficient noise generation than for the non-sibilants. In the current work, we compare AmpD values for sibilants to a new measure, AmpRange, designed to capture changes over time in non-sibilants. AmpRange, like AmpD, quantifies the amplitude difference between a band of lower vs higher frequencies but differs in how those low and high frequencies are defined. Specifically, the low-frequency range within which the minimum for AmpRange is found is considerably larger (1–7 kHz), as is the frequency range within which the maximum is found (2–13 kHz). Together, these ranges recognize (a) the relative flatness of non-sibilant spectra and (b) the fact that spectral peaks for these sounds have been observed to exceed 8 kHz (e.g., Fox and Nissen, 2005; Maniwa et al., 2009). In short, AmpRange was intended to be a valid non-sibilant analogue of AmpD.
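As a rough illustration, AmpD amounts to a peak-minus-valley difference on a dB-scaled spectrum. The sketch below uses the men's /s z/ ranges from Table I; the function name and the (frequency in Hz, dB amplitude) pair representation are our assumptions, not the authors' code.

```python
def amp_d(spectrum, fm_range=(3000.0, 7000.0), valley_range=(1000.0, 3000.0)):
    """AmpD (dB): amplitude at the main peak FM minus the minimum of the
    low-frequency spectral valley. Default ranges follow men's /s z/."""
    peak = max(a for f, a in spectrum if fm_range[0] <= f <= fm_range[1])
    valley = min(a for f, a in spectrum if valley_range[0] <= f <= valley_range[1])
    return peak - valley
```

For /ʃ ʒ/, the same function would be called with fm_range=(2000, 4000) and valley_range=(1000, 2000), per Table I.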

4. Fh: Place differences in non-sibilants

In contrast to the sibilants, non-sibilants remain thinly described in the literature. Studies that have evaluated non-sibilants, such as /f/ and /θ/, often suggest that these may be difficult to differentiate based on noise characteristics (e.g., Behrens and Blumstein, 1988; Forrest et al., 1988; Fox and Nissen, 2005; Harris, 1958; Tabain, 1998), and authors have pointed to an important role of formant transitions for perceptual differentiation (Harris, 1958; Shadle et al., 1996; cf. also Jongman et al., 2000; McMurray and Jongman, 2011).4

One issue here is that the sibilants, produced with a lingual constriction in the oral cavity, can clearly be described as having a cavity downstream of the constriction that produces the turbulent jet. It is more difficult to ascertain whether the lips serve to provide a downstream cavity for the anterior constrictions of /f/ and /θ/, and models have varied on this point (see next paragraph). The spectra of both /f/ and /θ/ have been described as being rather flat (e.g., Flanagan, 1972; Gordon et al., 2002; Hughes and Halle, 1956), and their spectral peak locations vary widely (Jesus and Shadle, 2002; Nirgianaki, 2014). Both of these observations are consistent with the absence of the filtering effects of an anterior cavity.

Fant (1970) and Stevens (1998) developed models for /f/ incorporating a short front cavity; Flanagan (1972) used a simple two-tube model, with the anterior tube, which had a smaller cross-sectional area, modeling the labiodental constriction. The decisions about the size and number of cavities in their models affected their decisions about source type and location, which in turn affected the filter functions that the noise source(s) excited. Fant and Stevens both placed the noise source in the short front cavity; Flanagan placed it at the junction between the back cavity and front constricted section. As a result, Flanagan's model predicts a series of pole-zero pairs that cancel each other out, yielding a nearly flat spectrum up to 10 kHz. Fant's model instead predicts a peak, the “main fricative formant” related to “the small cavity in front of the upper front teeth in combination with the radiation impedance,” to be at 8500 Hz for [f] (Fant, 1970, p. 183).5 However, subsequent work indicated that these models, and their assumptions about source locations, were overly simplified (Shadle, 1985; see Sec. IV A for more details).

As indicated in Sec. I B 3 (in the discussion of AmpRange), measures developed for sibilants may not yield optimal results for non-sibilants. Pronounced peaks are a typical feature of sibilants, but much less so for non-sibilants. This has implications not only for measures of the peak value, but also for measures that depend on it (e.g., slopes below and above the peak; cf., e.g., Jesus and Shadle, 2002; Maniwa et al., 2009). The new measure Fh, like FM, is designed to capture a frequency peak but is defined over a much broader range than FM (cf. Table I). This measure may serve to capture place differences among the non-sibilants.

5. Isolated words vs connected speech

As in Koenig et al. (2013), we collected data from multiple speech tasks (refer to Sec. II for details). Specifically, speakers produced some isolated words and some connected speech. This allowed us to evaluate whether measures that previously differentiated sustained fricatives produced at different effort levels (e.g., Shadle and Mair, 1996; Jesus and Shadle, 2002) can also characterize more naturally produced speech. We also assessed whether speaking style, here defined as words in isolation vs words produced rapidly or in sentence or paragraph form, affects any of our dependent measures.

6. Overall spectral shape

The overall spectral balance of fricatives has been quantified in various ways. Jesus and Shadle (2002) captured high-frequency content by means of a slope fitted to the spectrum from a mid-frequency reference value (the peak frequency, averaged across all tokens of that place for that speaker) to higher-frequency regions of the spectrum (up to 20 kHz).6 The expectation was that the line so fit would average across resonances above FM and would, therefore, serve as an estimate of the noise source spectrum that excited the resonances. For the sustained fricatives of that study, this expectation was realized. However, with the lower sampling rate in the Koenig et al. (2013) study, there were fewer points in the spectrum to average across (from FM to 11 kHz), so the LevelD measure was devised, which found the difference in spectrum density levels between mid and high frequencies. LevelD should be less influenced than a slope would be by the loss of very high-frequency information. Level measures may also be more appropriate for data, like ours, that lack amplitude calibration (with calibration, both the slope and the intercept of a fitted line are interpretable; cf. Jesus and Shadle, 2002). We also explore the usefulness of a HighLevelD measure, which combines the amplitudes from 11 to 15 kHz with LevelD.
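For comparison with the Level measures, the Slope measure amounts to an ordinary least-squares line fitted to the dB spectrum over a frequency span. A minimal sketch (our naming and spectrum representation, not the authors' code):

```python
def spectral_slope(spectrum, f_start, f_end=14000.0):
    """Least-squares slope (dB per Hz) of a dB spectrum, fitted over
    [f_start, f_end]; f_start would be FM for sibilants, Fh for
    non-sibilants (cf. Table I)."""
    pts = [(f, a) for f, a in spectrum if f_start <= f <= f_end]
    n = len(pts)
    mean_f = sum(f for f, _ in pts) / n
    mean_a = sum(a for _, a in pts) / n
    num = sum((f - mean_f) * (a - mean_a) for f, a in pts)
    den = sum((f - mean_f) ** 2 for f, _ in pts)
    return num / den
```

Because the fit pools all points up to 14 kHz, losing very high-frequency bins shifts the slope estimate; a band-level difference such as LevelD is less sensitive to that loss, which motivates the measure.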

This work refines and extends the acoustic measures for fricatives, drawing heavily on our previous research. The new data include fricatives in words and in connected speech, which represent a more challenging situation than the sustained fricatives used in some past work. We explore whether high-frequency information can be used to capture changes over time, particularly in non-sibilants, differentiate between the non-sibilants /f/ and /θ/, and reflect source strength differences as a function of voicing. We also revisit the frequency ranges used to define FM in /s/.

We obtained data from seven adult native speakers of American English (four women, three men; ages 21–58). All had self-reported typical speech and hearing and were perceived by the authors to have speech patterns within normal limits.

Our recordings used a higher sampling rate than in Koenig et al. (2013). The resulting data provide acoustic information up to 15 kHz, allowing us to explore higher-frequency regions than was possible in much past work. As with the data presented in Koenig et al. (2013), the current signals did not have amplitude calibration; thus, we evaluate relative differences within speakers.

Recordings were made in an anechoic chamber with equipment chosen to be comparable to that of Preston and Edwards (2007, 2009), who provided the signals analyzed in Koenig (2013). A Shure (Niles, IL) headset microphone (WH30) was used, connected to an Icicle amplifier (Blue Microphones, Westlake Village, CA). This equipment is typical of that available to clinicians. Data were recorded directly into Praat at a sampling rate of 44.1 kHz, with a Nyquist frequency of 22.05 kHz. However, the frequency range analyzed here is limited to 15 kHz by the Icicle amplifier.

To allow comparison with our past work on adolescents (Koenig et al., 2013; see also Preston and Edwards, 2007, 2009), speakers were recorded on four tasks, described below and abbreviated henceforth as picture naming (PN), rapid picture naming (RPN), grandfather passage (GF), and recalling sentences (RS). These tasks were originally chosen because they have previously been used for clinical assessment of children's speech and language abilities (e.g., Preston and Edwards, 2007, 2009). The plots in Sec. III show fricatives for all tasks combined.

PN: These were 114 single words produced in response to pictured stimuli.

RPN: A stimulus set designed by Preston and Edwards (2009) to elicit multiple productions of multisyllabic, familiar words: elephant, umbrella, strawberries, thermometer, helicopter, spaghetti. Participants said the words in response to a series of pictures, first slowly and deliberately and then rapidly with the items randomly ordered, resulting in five or six tokens of each word.

GF: A frequently used reading passage (see Reilly and Fisher, 2012) that is both syntactically and phonologically complex; it includes a number of fricatives in singleton form and in clusters.

RS: A subtest of the Clinical Evaluation of Language Fundamentals–4 (Semel et al., 2003). Speakers repeat sentences of increasing length and complexity following an adult model.

The current paper presents results for the English fricatives /f v θ ð s z ʃ/. We analyzed /ʒ/ but did not include it in the results presented here because there were too few examples in the dataset. We show spectra for /ð/ and parameters derived for them to allow comparison with /θ/ but must be wary of over-generalization because most instances of /ð/ were found in function words; comparisons of voicing (i.e., /θ-ð/) are, thus, confounded with stress differences. Further, informal observations indicated that /ð/ productions in connected speech were often realized as stops or approximants. These productions were included in the analysis as long as the fricative duration exceeded 50 ms. The final dataset includes 2164 fricative productions (270–407 tokens per speaker).

For all tasks, an orthographic transcription was prepared as input to the Montreal Forced Aligner (MFA; McAuliffe et al., 2017). This system draws on the Kaldi Automatic Speech Recognition toolkit to generate a sequence of phone boundaries, represented in Praat (Boersma and Weenink, 2001) TextGrid format. Fricative boundaries were adjusted/corrected by one experienced researcher (the first author) and/or a graduate student who was first trained on corrections made by the experienced researcher. This manual correction process relied mainly on visual inspections of the spectrogram in Praat, with some reference to the waveform, to determine the onset and offset of fricative noise regions. Labeling was based on frication noise, not voicing characteristics (e.g., voicing bleed from a preceding vowel into a following fricative). The manual adjustment allowed boundary placement at a finer temporal resolution than the aligner's 10 ms frame step.

To assess reliability, both annotators corrected a subset of the data consisting of the AT task7 for three speakers (N = 267 tokens). In this subset, the two annotators kept 23.8% of MFA-generated boundaries unadjusted (20.6% for onsets, 27% for offsets). They showed high agreement for boundary locations: The maximum absolute between-annotator difference in boundary locations was 88 ms for frication onsets and offsets. The mean differences were quite small and well within the analysis windows [mean = 1 ms, standard deviation (SD) = 16 ms, for onsets; mean = 2 ms, SD = 8 ms, for offsets]. It is likely that agreement for connected speech (e.g., GF) would be somewhat worse than what we obtained for this single-word naming task. Nevertheless, these results indicate that the combination of the MFA followed by human correction provides a solid basis for the analysis.

Mid-fricative spectra of all recordings were reviewed to determine appropriate heuristics for the final measurement set. Examples for all speakers are shown in Figs. 2 and 5 in Sec. III. Where justified by past work and theoretical considerations, some heuristic frequency bands differ for men and women, as shown in Table I and discussed below.

Speech signals were downsampled to 30 kHz to match the Nyquist frequency to the maximum frequency of the amplifier's passband. Each labeled fricative token was analyzed in 30 ms windows at three equally spaced temporal locations. The windows overlapped when necessary. Fricatives shorter than 50 ms were excluded from analysis. A spectral density estimate was generated using the multitaper method (Thomson, 1982; Blacklock, 2004) for each analysis window, with seven orthogonal taper functions.
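The placement of the three analysis windows can be sketched as follows. The exact scheme is our assumption (first window starting at frication onset, last ending at offset, equally spaced starts in between); the paper states only that the windows are equally spaced and overlap when necessary.

```python
def window_starts(duration_ms, win_ms=30.0, n_win=3):
    """Start times (ms) of n_win equally spaced analysis windows spanning
    a token: the first starts at 0 and the last ends at duration_ms.
    Windows overlap whenever duration_ms < n_win * win_ms."""
    span = duration_ms - win_ms        # range available for start times
    step = span / (n_win - 1)
    return [i * step for i in range(n_win)]
```

For a 50 ms token (the exclusion threshold), the windows start at 0, 10, and 20 ms and thus overlap heavily; a 90 ms token yields exactly abutting windows at 0, 30, and 60 ms.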

For sibilants, we computed the parameters FM, AmpD, LevelM, LevelH, LevelHH, LevelD, HighLevelD, and (spectral) Slope for each multitaper spectrum. For non-sibilants, the measurements were FmaxA, FminA, Fh, AmpRange, LevelD, HighLevelD, and (spectral) Slope. The definitions are given in Table I. Note that some values were intermediate variables required to calculate the main measurements (boldfaced in Table I) for characterizing fricatives. Figure 1 illustrates these definitions with examples of a sibilant [Fig. 1(a)] and non-sibilant [Fig. 1(b)]. The frequency ranges for these parameters were determined heuristically. FmaxA was the frequency of the maximum amplitude over 2–13 kHz, a range chosen because of the wide variation in spectral peak locations for non-sibilants. FminA was the frequency of the minimum amplitude over 1–7 kHz, with the upper limit designed to exclude attenuation at higher frequencies due to the amplifier. AmpRange was the difference in amplitudes between those at FmaxA and FminA and was designed to measure the dynamic range of the entire noise spectrum. Fh was the frequency of the maximum amplitude over a somewhat narrower, high-frequency range, 5–12 kHz, because the broad peak within this range appeared in many of the non-sibilants and might indicate the self-noise of the jet, that is, be related to something other than a resonance of the front cavity.
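The non-sibilant frequency measures defined above can be sketched in a few lines; the spectrum is assumed to be a list of (frequency in Hz, dB amplitude) pairs, and the helper names are hypothetical.

```python
def argmax_freq(spectrum, f_lo, f_hi):
    """Frequency of the maximum amplitude within [f_lo, f_hi]."""
    return max(((f, a) for f, a in spectrum if f_lo <= f <= f_hi),
               key=lambda p: p[1])[0]

def argmin_freq(spectrum, f_lo, f_hi):
    """Frequency of the minimum amplitude within [f_lo, f_hi]."""
    return min(((f, a) for f, a in spectrum if f_lo <= f <= f_hi),
               key=lambda p: p[1])[0]

def nonsibilant_measures(spectrum):
    """FmaxA (2-13 kHz), FminA (1-7 kHz), Fh (5-12 kHz), and
    AmpRange (dB), per the heuristic ranges in Table I."""
    amp = dict(spectrum)
    fmax_a = argmax_freq(spectrum, 2000.0, 13000.0)
    fmin_a = argmin_freq(spectrum, 1000.0, 7000.0)
    fh = argmax_freq(spectrum, 5000.0, 12000.0)
    return {"FmaxA": fmax_a, "FminA": fmin_a, "Fh": fh,
            "AmpRange": amp[fmax_a] - amp[fmin_a]}
```

Note that Fh and FmaxA can differ when the global maximum lies outside 5–12 kHz, which is precisely why the narrower Fh range was defined.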

FIG. 1.

(Color online) Illustrations of the definitions of acoustic measurements, annotated on the mid-fricative multitaper spectrum of an /s/ (left) and /f/ (right) (female speaker). Uppercase letters (L, M, H, HH) near the top boundary indicate the heuristically defined frequency ranges (low, middle, high, highest).


To demonstrate the usefulness of the proposed acoustic measurements, we fitted linear mixed effects (LME) models by using the “lme4” (Bates et al., 2015) and “lmerTest” (Kuznetsova et al., 2017) packages in R (R Core Team, 2020) to characterize the contrasts in place of articulation for /s/ vs /ʃ/ and /f/ vs /θ/ and in speaking styles (i.e., connected speech vs isolated words). We labeled samples in PN as “isolated words” and those in RPN, GF, and RS as “connected speech.” The dependent variables for sibilants were FM, AmpD, LevelD, and HighLevelD, and those for non-sibilants were Fh, AmpRange, LevelD, and HighLevelD. The fixed effects included Place of Articulation (two levels), Style (two levels: connected speech and isolated words), and Gender; the random effects included by-speaker random intercepts as well as by-speaker random slopes for the effect of Place of Articulation. The formula for the final models, determined by likelihood ratio tests, was DepVar ∼ Place of Articulation + Style + Gender + (Place of Articulation | Speaker), where DepVar is one of the four dependent variables. Interactions of the fixed effects did not improve the model for most of the dependent variables and, thus, are not included. Corrections for p-values in multiple hypothesis testing were performed by using the false discovery rate (FDR; Benjamini and Yekutieli, 2005), with the significance level set at 0.05. Marginal means of the main effects based on the predictions of the optimal model were estimated by using the “emmeans” (Lenth, 2022) package in R.
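The FDR step can be illustrated with a pure-Python computation of Benjamini–Yekutieli adjusted p-values (a standard step-up formulation, valid under arbitrary dependence; the paper's actual R workflow is not reproduced here):

```python
def fdr_by_adjust(pvals):
    """Benjamini-Yekutieli adjusted p-values. A hypothesis is rejected
    when its adjusted p-value is <= the chosen significance level."""
    m = len(pvals)
    c_m = sum(1.0 / i for i in range(1, m + 1))  # harmonic-number correction
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    running_min = 1.0
    # step-up pass: walk from the largest p-value down to the smallest
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m * c_m / rank)
        adj[i] = running_min
    return adj
```

Compared to the Benjamini–Hochberg procedure, the extra factor c_m makes the correction more conservative, at the benefit of controlling the FDR without independence assumptions across the tests.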

Figure 2 shows mid-fricative spectra for the sibilants /s, ʃ, z/ for all subjects. The central bold line in each subplot is the ensemble average (i.e., averaging across tokens/repetitions) of all multitaper spectra of a given phoneme produced by that speaker. The shaded region is ±1 SD away from the ensemble average; its thickness is a function of both the variability of that fricative and the number of tokens, given by the value of n shown in each subplot. The multitaper spectra are shown by thin lines for each token. All spectra extend to 15 kHz.
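The ensemble average and ±1 SD band described above can be computed per frequency bin; a minimal sketch, assuming all token spectra share a common frequency grid (function name is ours):

```python
import statistics

def ensemble_stats(spectra):
    """Per-bin mean and sample SD across token spectra. `spectra` is a
    list of equal-length dB-amplitude lists, one list per token."""
    mean = [statistics.fmean(bin_vals) for bin_vals in zip(*spectra)]
    sd = [statistics.stdev(bin_vals) for bin_vals in zip(*spectra)]
    return mean, sd
```

The shaded region in the figures would then span mean[i] − sd[i] to mean[i] + sd[i] at each frequency bin.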

FIG. 2.

(Color online) Mid-fricative spectra for the sibilants /s, ʃ, z/ for all speakers (women speakers in left four columns, men speakers in right three columns). The central bold line in each plot is the ensemble average of all multitaper spectra by that speaker for that phoneme. The shaded region is ±1 SD around the bold line. Number of examples analyzed is given by the n value at top left in each subplot. Thin lines are the individual multitaper spectra.


The minima and maxima apparent in these plots have been well studied over the years, and they demonstrate past observations concerning sibilant spectra: The frequencies of the maxima vary by speaker, but those for /s, z/ are higher than for /ʃ/ within a given speaker. Phonetic context may affect the frequency of the maxima to different degrees depending on the speaker; see, for example, the /s/ in /usu/ context for a French speaker (Shadle and Scully, 1995; Shadle and Mair, 1996). Frequencies of the maxima also tend to be lower for men than for women, consistent with earlier studies (Fox and Nissen, 2005; Jongman et al., 2000). Some patterns vary across speakers: for example, many speakers show two clear spectral peaks for /ʃ/ (W2, W3, possibly W4; M1, M2, M3), but W1 does not. Finally, we see that peak amplitudes are lower in /z/ than /s/, as expected, because of the effects that laryngeal adduction and voicing have on the noise source.

Figure 3 displays average values of the parameter FM (top row), AmpD (second row), LevelD (third row), and HighLevelD (bottom row), for the sibilants /s, ʃ, z/, at three multitaper-analysis frames located at the beginning (B), middle (M), and end (E) of each fricative. Values for a given subject are connected by thin dashed lines and, online, are shown with the same circle color. Starting from the top row of Fig. 3, we observe a general trend that women have higher FM values overall, especially for /s/, while displaying the same relative patterns as the men across places of articulation. The pattern of FM variations through the fricatives varies across speakers.

FIG. 3.

(Color online) Average values of the parameters FM, AmpD, LevelD, and HighLevelD (top to bottom) for beginning (B), middle (M), end (E) timesteps in the sibilants /s, ʃ, z/. Each speaker's values are connected by dotted lines and, online, are shown by the same color circle. Colors correspond to the spectra shown in Fig. 2.


The parameter FM, the frequency of the main peak, was defined heuristically to allow for automatic computation. As shown in Table I, in this study we explored extending the upper frequency limit from 7 to 8 kHz for women's /s, z/. Figure 2 shows that some /s/ peaks for W4 are at or slightly above 8 kHz; for W1, some /s/ peaks are closer to 10 kHz. This could suggest that the upper boundary for capturing FM should be extended to even higher frequencies. However, FM was designed to capture the lowest-frequency uncanceled resonance. For both W1 and W4, in some tokens of /s/ there is a lower-amplitude but still significant peak at approximately 5–6 kHz that is likely to be the lowest uncanceled front-cavity resonance. Visual inspection of the data suggests that the prominence of this peak and its frequency vary with phonetic context, in particular, whether the context is labialized (Koenig et al., 2013). Our simple algorithm may, thus, sometimes identify the “wrong” peak, whether the upper limit is 7 or 8 kHz. An algorithm that tracked resonances from adjacent phones and picked the lowest-frequency peak that remained uncanceled could be developed to generate more accurate estimates of FM. The source of the higher-frequency peak in /s, z/ for some speakers calls for further study.
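A bounded peak-pick of this kind can be sketched as follows. The search-band edges and the synthetic two-peak spectrum are illustrative choices of ours, not the paper's Table I values.

```python
import numpy as np

def peak_in_band(freqs, spec_db, f_lo, f_hi):
    """Frequency (Hz) of the maximum amplitude within [f_lo, f_hi]."""
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return float(freqs[band][np.argmax(spec_db[band])])

# Synthetic /s/-like spectrum: main peak near 6 kHz, smaller peak near 9.5 kHz.
freqs = np.linspace(0, 15000, 1501)  # 10 Hz bins
spec = (-50
        + 25 * np.exp(-((freqs - 6000) / 800) ** 2)
        + 20 * np.exp(-((freqs - 9500) / 600) ** 2))

fm = peak_in_band(freqs, spec, 2000, 8000)  # upper limit at 8 kHz
```

Note that the bounded search returns the 6 kHz peak even though a second peak exists at 9.5 kHz, which is the intended behavior when the lower peak is the lowest uncanceled resonance.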

The parameter AmpD (the amplitude difference), shown in the second row of Fig. 3, ranged from 20 to 35 dB for sibilants and, in general, peaked at mid-fricative, as expected. This measure encodes information from both lower- and higher-frequency parts of the spectrum; the difference between the minimum at low frequencies (approximately 1 kHz) and the maximum at FM captures both the strength of the noise source and the shape of the filter function due to the type and localized nature of that source. Together these result in large AmpD values.
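As a sketch, AmpD can be computed from a spectrum and its FM as below; the low-frequency search band used here is our illustrative assumption, not a value from the paper.

```python
import numpy as np

def amp_d(freqs, spec_db, fm_hz, low_band=(500, 2000)):
    """Amplitude at FM minus the low-frequency minimum (dB)."""
    a_fm = spec_db[np.argmin(np.abs(freqs - fm_hz))]
    low = (freqs >= low_band[0]) & (freqs <= low_band[1])
    return float(a_fm - spec_db[low].min())

# Synthetic spectrum: trough near 1 kHz, peak at FM = 6 kHz.
freqs = np.linspace(0, 15000, 1501)
spec = (-30
        - 15 * np.exp(-((freqs - 1000) / 400) ** 2)
        + 25 * np.exp(-((freqs - 6000) / 800) ** 2))

ampd = amp_d(freqs, spec, 6000.0)  # large for this sibilant-like shape
```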

The third row of Fig. 3 shows the parameter LevelD, the difference in spectral density levels between the mid- and high-frequency ranges (cf. Table I), for the sibilants. First used by Koenig et al. (2013) to study /s/, it was defined to quantify an increase in noise-source energy at higher frequencies for spectra that extended up to 11 kHz and for which computing a slope seemed not to be justified (see Sec. I B 6). LevelD is predicted to drop mid-fricative, when amplitudes have increased over the entire range but more so at higher frequencies; in Fig. 3 (third row), it behaves as predicted in most cases. A similarly computed parameter, HighLevelD (bottom row), made use of the highest-frequency parts of the spectrum by subtracting the density level for frequencies from 11 to 15 kHz from LevelD. The drop mid-fricative predicted for LevelD is more pronounced for HighLevelD and occurs in all cases. Note that this is consistent with results shown for /s/ in the XRMB dataset (Shadle, 2023, p. 1423), in which FM is approximately constant throughout the nine timesteps of the fricative but M1 rises and then falls. The change in energy above FM explains the pattern of M1 changes and is quantified in this study by the parameters LevelD and HighLevelD.
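These level-difference measures can be sketched as below, taking the description above literally. The mid- and high-band edges are our assumptions (the paper's Table I gives the actual definitions), and band levels are averaged in dB here for simplicity; averaging power before converting to dB is an alternative.

```python
import numpy as np

def band_level(freqs, spec_db, f_lo, f_hi):
    """Mean spectral density level (dB) over [f_lo, f_hi]."""
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return float(spec_db[band].mean())

freqs = np.linspace(0, 15000, 1501)
# Piecewise-flat synthetic spectrum: -30 dB below 8 kHz, -40 dB from 8 to 11 kHz,
# rising to -35 dB above 11 kHz (extra high-frequency source energy).
spec = np.where(freqs < 8000, -30.0, np.where(freqs < 11000, -40.0, -35.0))

level_m = band_level(freqs, spec, 3000, 8000)    # mid band (assumed edges)
level_h = band_level(freqs, spec, 8000, 11000)   # high band (assumed edges)
level_d = level_m - level_h                      # LevelD
# HighLevelD, per the description above: LevelD minus the 11-15 kHz density level.
high_level_d = level_d - band_level(freqs, spec, 11000, 15000)
```

With this construction, a mid-fricative rise in energy above 11 kHz raises the 11–15 kHz level and therefore lowers HighLevelD, matching the predicted mid-fricative drop.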

The parameter Slope, the slope of a line fit from FM to 14 kHz, was also computed (not shown here). It is predicted to rise mid-fricative; that is, the slope should become less negative as energy at high frequencies increases more than at mid-frequencies. In some cases, that prediction is borne out, but not for all speakers or all sibilants. It may be that, when computing a slope, even a spectrum extending to 15 kHz is not broad enough to average out the resonance peaks adequately; the other differences noted earlier, in the type of corpus and whether amplitude calibration was done, may also affect the slope parameter. Results indicate that HighLevelD (Fig. 3, bottom row) is the parameter that comes closest to capturing the change in source characteristics that occurs during a fricative.
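Slope as described is a least-squares line fit over the band from FM up to 14 kHz; a minimal sketch follows (a noiseless linear spectrum is used so the recovered slope is exact, whereas real spectra would scatter about the fit).

```python
import numpy as np

def spectral_slope(freqs, spec_db, f_lo, f_hi=14000.0):
    """Slope (dB/kHz) of a least-squares line fit to the spectrum on [f_lo, f_hi]."""
    band = (freqs >= f_lo) & (freqs <= f_hi)
    slope, _intercept = np.polyfit(freqs[band] / 1000.0, spec_db[band], deg=1)
    return float(slope)

freqs = np.linspace(0, 15000, 1501)
spec = -10.0 - 2.0 * (freqs / 1000.0)         # exactly -2 dB/kHz everywhere

s = spectral_slope(freqs, spec, f_lo=6000.0)  # fit from an FM of 6 kHz to 14 kHz
```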

Figure 4 summarizes the results of four separately fitted LME models for characterizing the contrasts between /s/ and /ʃ/ (top panels) and between connected speech and isolated words (bottom panels) at mid-fricative, for each of four dependent variables: FM, AmpD, LevelD, and HighLevelD. The p-values shown in Fig. 4 were FDR-corrected. As expected, /s/ and /ʃ/ differed significantly in all four main measurements. The effect of Gender did not meet the level of significance for any of the four models. The effect of Style was significant only for HighLevelD; sibilants in isolated words were estimated to have higher HighLevelD than those in connected speech by 1.9 dB on average. The variances explained by the main effects (marginal R2) were 44%, 6%, 12%, and 22% for the four models with the dependent variables FM, AmpD, LevelD, and HighLevelD, respectively.

FIG. 4.

(Color online) LME model predicted marginal means and 95% confidence intervals for the contrasts between /s/ and /ʃ/, across both Styles (top), and between connected speech and isolated words, across both Places of Articulation (bottom) in the four main measurements. The p-values shown in the figure were FDR-corrected. ****, p < 0.0001; ***, p < 0.001; **, p < 0.01; *, p < 0.05.


A fifth model was also tested using Slope as a dependent variable. Results were not significant, for either the contrast between /s/ and /ʃ/ (p = 0.81) or that between connected speech and isolated words (p = 0.23).

Figure 5 shows mid-fricative spectra for the non-sibilants, in the same format as Fig. 2. The shaded region representing ±1 SD from the mean is larger than for the sibilants. Also, as with /s, z/, the voiced fricatives /ð, v/ are lower in amplitude than the voiceless /f, θ/. For the interdentals, this amplitude difference may reflect not only the voicing difference but also the context; as noted earlier, most occurrences of /ð/ are in function words, which are often brief and reduced in amplitude. In this dataset, we also observe that /θ/ has a lower overall amplitude than /f/, in contrast to Fox and Nissen (2005), who observed no amplitude differences between these fricatives for either absolute or normalized measures.

FIG. 5.

(Color online) Mid-fricative spectra for the non-sibilants /f, θ, v, ð/ for all speakers. Format and colors are the same as for Fig. 2.


The voiceless fricatives exhibit an increase in amplitude at high frequencies, and the maximum occurs at a higher frequency for /f/ than /θ/ for most speakers. Although modest compared to the amplitude excursions in sibilants, these differences may be captured by the new measures of Fh and AmpRange, both of which are defined with reference to high-frequency energy (up to 12 kHz for Fh, and up to 13 kHz for AmpRange). The behavior of this high-frequency range may also shed light on the best way to model the production of the non-sibilants.

In the same format as Fig. 3, Fig. 6 displays changes over time in Fh, AmpRange, LevelD, and HighLevelD for the non-sibilants /f, θ, ð/. The parameter Fh, shown in the first row, was defined as the frequency of the maximum amplitude point from 5 to 12 kHz (cf. Table I). There is no clear pattern with regard to change through the fricative; Fh is sometimes but not always highest mid-fricative. However, it is higher for /f/ than /θ/ for all three male and two of the four female speakers. In other words, energy in this high-frequency range may serve to differentiate non-sibilants differing in place of articulation.

FIG. 6.

(Color online) Average values of the parameters Fh, AmpRange, LevelD, and HighLevelD (top to bottom) for beginning, middle, and end timesteps (labeled B, M, and E) in the non-sibilants /f, θ, ð/. Each speaker's values are connected by dotted lines and, online, are shown by the same color circle. Speaker colors are as in previous figures.


The second row shows AmpRange, a parameter analogous to AmpD; it is defined as the amplitude difference between the low-frequency minimum, FminA, and the higher-frequency maximum, FmaxA. The average values range from 10 to 20 dB, which is smaller than AmpD (see Fig. 3, second row) but still appreciable. That is, the non-sibilants, like the sibilants, have an increase in acoustic energy from low frequencies to higher ones. However, unlike the AmpD measure used to characterize sibilants, AmpRange does not in general peak mid-fricative.
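Fh and AmpRange can be sketched together. The search bands for Fh (5–12 kHz) and for the AmpRange maximum (up to 13 kHz) follow the ranges quoted above; the low-frequency band used to find FminA is our assumption, as is the synthetic spectrum.

```python
import numpy as np

def fh_and_amp_range(freqs, spec_db):
    """Illustrative Fh (max-amplitude frequency, 5-12 kHz) and AmpRange (dB)."""
    hi = (freqs >= 5000) & (freqs <= 12000)       # Fh search band
    fh = float(freqs[hi][np.argmax(spec_db[hi])])
    hi2 = (freqs >= 5000) & (freqs <= 13000)      # FmaxA search band (to 13 kHz)
    lo = (freqs >= 500) & (freqs <= 3000)         # FminA band (assumed)
    amp_range = float(spec_db[hi2].max() - spec_db[lo].min())
    return fh, amp_range

# Synthetic non-sibilant-like spectrum: shallow trough near 1.5 kHz,
# broad, modest peak near 9 kHz.
freqs = np.linspace(0, 15000, 1501)
spec = (-45
        - 10 * np.exp(-((freqs - 1500) / 500) ** 2)
        + 12 * np.exp(-((freqs - 9000) / 1500) ** 2))

fh, amp_range = fh_and_amp_range(freqs, spec)
```

For this shape, AmpRange comes out in the 10–20 dB region reported above for non-sibilants, well below typical sibilant AmpD values.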

Similar to the pattern seen in the sibilants, HighLevelD (bottom row) for non-sibilants shows more noticeable and more consistent patterns than LevelD (third row), with values dropping mid-fricative in nearly all cases. As was done for sibilants, a slope was computed, in this case from Fh up to 14 kHz; results were inconsistent and are not shown.

The results of LME models for characterizing the contrasts between /f/ and /θ/ (top panels) and between connected speech and isolated words (bottom panels) are given in Fig. 7. Significant effects of Gender were observed for LevelD and HighLevelD, with male speakers showing lower values in both cases. The FDR-corrected p-values in Fig. 7 indicated that /θ/ was characterized by marginally lower Fh, marginally higher AmpRange, and significantly higher HighLevelD than /f/. The effects of Style were significant for AmpRange and HighLevelD; non-sibilants in isolated words were estimated to have higher AmpRange and higher HighLevelD than those in connected speech by 1.2 and 3.2 dB, respectively. The variances explained by the main effects (marginal R2) were 6.6%, 7.7%, 7.4%, and 25.5% for the four models with the dependent variables Fh, AmpRange, LevelD, and HighLevelD, respectively. As a comparison of interest, we fitted two additional models with Slope and the first spectral moment (i.e., M1 or COG), a commonly used acoustic measure for fricatives in the literature, as dependent variables. The results showed that /f/ and /θ/ did not significantly differ in either Slope (p = 0.82) or M1 (p = 0.57). The Style contrast showed a slight trend in Slope: Connected speech had a higher (less negative) slope than isolated words by 0.29 dB/kHz (p = 0.18), but M1 did not differ significantly with Style (p = 0.85).

FIG. 7.

(Color online) LME model predicted marginal means and 95% confidence intervals for the contrasts between /f/ and /θ/, across both Styles (top), and between connected speech and isolated words, across both Places of Articulation (bottom) in the four main measurements. The p-values shown in the figure were FDR-corrected. ***, p < 0.001; *, p < 0.05; †, p < 0.1; n.s., non-significant.


This set of recordings, with a frequency range extending up to 15 kHz, was used to shed light on the types of fricative information that might be present at higher frequencies and to explore contrasts in fricatives and changes within fricative noises. One might expect inclusion of high frequencies to be especially pertinent for the non-sibilants, which are more perceptually confusable than the sibilants and lack good descriptive parameters.

The sibilants have a set of parameters that have been used in many studies. FM, an estimate of the frequency of the lowest uncanceled resonance, distinguishes place, with /s, z/ having higher FM than /ʃ, ʒ/. (Although there were too few examples of /ʒ/ to include in the statistical analyses, their spectra were analyzed in the same way as shown in Fig. 2, and spectral features such as FM were comparable to those of /ʃ/.) Men have lower FM than women for a given place. For /s/ spoken by women, FM can range up to 8 kHz or more. Past work also indicates that FM is lower in labial contexts (Koenig et al., 2013; Shadle, 2023; Shadle and Mair, 1996).

The amplitude at FM, and that of the minimum at low frequencies, define AmpD. In its original form, FM was defined as the frequency of the maximum amplitude over the entire spectrum [e.g., from 500 Hz to 10.2 kHz in Shadle (1985), from 500 Hz to 17 kHz in Shadle and Mair (1996), and from 500 Hz to 20 kHz in Jesus and Shadle (2002)]. In those studies, AD (as it was designated), computed over the entire spectrum of sustained fricatives, was clearly larger for sibilants than non-sibilants. However, in working on words beginning with /s/ in the XRMB corpus (Shadle et al., 2016, 2017), it was found that the point having the maximum amplitude jumped about in frequency throughout a fricative. Often, the jumps were caused by a very small change in amplitude of one spectral peak relative to another. This realization led us to impose a band of frequencies within which to find the maximum, with the intent of causing the automatically determined FM to match the manually determined “frequency of the lowest uncanceled resonance.” It is not clear whether the difference was due to fricative context, audio quality of the various datasets, or other factors, but our studies from Koenig et al. (2013) onward have used limited frequency bands to define FM and, therefore, AmpD. In addition to the sibilant–non-sibilant differences, AmpD has also been used with clinical populations to identify “sibilant goodness” (Shadle et al., 2018; Cox et al., 2019). AmpD also indicates the changes that occur during a fricative, rising and falling as the turbulence noise begins, grows, and then ends. If FM were defined using a frequency range restricted to, say, 3–6 kHz, AmpD for sibilants would be smaller but might still exhibit similar patterns.

In this paper, three additional parameters were used to estimate the changes over time in the noise source spectrum. LevelD and Slope from FM on up have been used before; HighLevelD, which uses a third spectral density level from 11 to 15 kHz, makes use of the extended frequency range recorded in this study. Of the three, Slope was least effective. It has been useful before (e.g., Shadle and Mair, 1996; Jesus and Shadle, 2002), but those studies employed amplitude calibration, had spectral information to 17 kHz or more, and made the estimate on several tokens of sustained fricatives. In this dataset, HighLevelD was the most sensitive indicator of increased high-frequency energy mid-fricative. For all phonemes but /ð/, every speaker showed the expected drop at mid-fricative relative to the beginning or end. Some exceptions for /ð/ could be due to the fact that it was sometimes realized as a stop or approximant. Thus, researchers seeking to characterize fricative noise differences over time should consider whether slope or density level difference measures are more appropriate for their data.

The non-sibilants were more varied in general, particularly in the low frequencies, with a trough at low frequencies visible in some speakers' mid-fricative spectra (Fig. 5). All speakers, however, showed a broad peak at mid to high frequencies in /f/ and /θ/. The parameter Fh was defined to quantify that peak as the frequency of the maximum amplitude in the range 5–12 kHz. Values of Fh were marginally lower for /θ/ than /f/ for five of the seven speakers. Thus, Fh is somewhat analogous to FM in differentiating place of articulation in this fricative class. AmpRange was defined analogously to AmpD, as the difference in amplitude between the minimum at low frequencies and the maximum at FmaxA. AmpRange did rise and fall during the fricatives, similarly to AmpD, but its maximum value was less than that of AmpD, consistent with the premise that AmpD reflects degree of sibilance. Changes over time in the non-sibilants were estimated using measures analogous to those for the sibilants. As with the sibilants, HighLevelD was the most sensitive; Slope showed inconsistent results both during the course of a fricative within speaker and across speakers.

The results for Fh are broadly consistent with Fant's (1970) description of [f] spectra for his one Russian subject, who had an amplitude peak at 8.5 kHz. Fant's models were based on articulatory data available at the time and on tests to determine which source placement best matched speech spectra. Shadle (1985, 1991) experimented with different constriction shapes placed in a tube just behind a short (1 cm) front cavity. While some of the resulting far-field spectra were similar to human speech spectra, the data did not demonstrate that the models captured the essence of /f/ or /θ/. Rather, they showed that small changes in the cross-sectional shape of the constriction can have a sizable effect on the far-field spectrum when the front cavity is so short. Perhaps the variability observed in /f/ and /θ/ spectra reflects this, and different models of source type and location are appropriate depending on factors that affect the cavity behind the constriction, the shape of the lip horn, and airflow. In speech, these factors could include phonetic context, labialization, and stress or emphasis. Certainly, Fant's mention of a spectral peak at 8.5 kHz, Shadle's model-generated spectra showing that small differences in the shape of the anterior constriction caused noticeable acoustic differences up to 10 kHz, and the spectral differences noted here all indicate that high frequencies are key to understanding what produces the differences between labiodentals and interdentals.

Our assessment of speaking style contrasted fricatives in isolated words with those in connected speech. Our expectation was that the fricatives in isolated words would be longer and, typically, receive greater emphasis, which would allow for greater development of the turbulence noise. This we expected would be apparent in the measures AmpD, LevelD, and HighLevelD for sibilants, and AmpRange, LevelD, and HighLevelD for non-sibilants. For sibilants, only HighLevelD showed a significant difference with Style (Fig. 4, bottom row); for non-sibilants, AmpRange as well as HighLevelD were significantly different with Style (Fig. 7, bottom row). We conclude that speaking style does affect some of our dependent measures; the difference between its effect on sibilants and non-sibilants points to the importance of spectral content at high frequencies and is worth investigating more closely.

In the current work, we used the same cutoff values for males and females for all measures except FM, LevelM, and LevelH for /s, z/. If, in fact, the non-sibilants do show filtering effects, i.e., if their spectra show effects of downstream cavity characteristics, it may be appropriate to develop separate cutoffs for males vs females, as has become common with the sibilants, particularly /s/. At the same time, sex-based morphological differences in this very anterior region of the vocal tract may be limited. Consistent with this, Fox and Nissen (2005) found no significant effects of speaker gender (or age, for that matter) for /f/ and /θ/, whereas there were significant effects of gender (and age) for both /s/ and /ʃ/. Their data extended only to 11 kHz, however, so it is possible that some differences may arise with data to 15 kHz or more. This is an area for further study. As noted in the Introduction, our state of knowledge is, in general, at a more preliminary stage for the non-sibilants than for the sibilants.

The current dataset included fricatives in multiple speaking tasks and across a wide range of phonetic contexts. This can be useful in developing measures that are robust to the contextual and stylistic variations that occur in speech. Of course, future work will be needed to explore and refine these measures, considering a wider range of speaker types and ages. A better balance of phonetic contexts across fricatives will also allow us to explore task differences and assess coarticulatory effects in more detail. Extending the measures to languages other than English [as in Gordon et al. (2002)] is desirable as well.

Fricatives make greater use of the high-frequency range than most other linguistic sounds do. The present work explored how measures incorporating high-frequency energy could reveal phonological differences between American English fricatives and effects of speaking style. Along with refining previously defined measures for sibilants, we have proposed new measures of non-sibilants, in all cases seeking to ensure that these parameters, previously defined to describe the acoustic effects of articulatory and aerodynamic aspects of mechanical models, can be adapted to describing the acoustic effects of analogous speech production parameters. Along with capturing place of articulation differences for both sibilants and non-sibilants, the measures showed expected effects of voicing and speaking style on the fricative spectra. In the future, measures incorporating these high frequencies can be used to better define the acoustic realization of fricatives, particularly non-sibilants.

We express our thanks to D. H. Whalen for comments on preliminary versions of this work and to Thomas Moran, supported by a graduate student research assistantship at Adelphi University, for assistance with preliminary data processing. Thanks to the two anonymous reviewers and to Ewa Jacewicz for her help as Associate Editor of this special issue. Work was supported by National Institutes of Health Grant No. DC-002717 to Haskins Laboratories. The authors have no conflicts of interest regarding this work. We obtained ethics approval for the recordings and their analysis from the Yale University Institutional Review Board. Informed consent was obtained from all participants. The data are available from the corresponding author on reasonable request. Data types are audio files in .wav format and text files in .txt format.

1. It is also the case that recent work has called into question the long-standing view that high frequencies are essentially irrelevant in perception [see Hunter et al. (2020) and citations therein]. As examples, these frequencies could be important in phonological development and contribute to sound localization and speech recognition in degraded listening conditions (which may not be fully captured by traditional experimentally created speech-in-speech tests; Monson et al., 2014).

2. As shown in Narayanan (1995), some but not all speakers regularly have two major spectral peaks for /ʃ/. To our knowledge, the source of such double peaks has never been clearly explained. Here we refer to the lower of the two peaks on the frequency axis.

3. It has been referred to as AD in some earlier works (e.g., Shadle and Mair, 1996).

4. Jongman et al. (2000) did obtain significant differences between adult-produced /f/ and /θ/ for the third and fourth moments as well as the spectral peak frequency. However, that study did not use spectral averaging. Further, the difference between spectral peaks was only 263 Hz (7733 and 7470 Hz for /f/ and /θ/, respectively). This small difference, coupled with the poor reliability of peak measures taken from unaveraged spectra [see discussion in Shadle (2023)], calls the generalizability of those data into question.

5. Fant also modeled [fʲ], or “hard [f],” for his Russian speaker. For this sound, he added a laryngeal noise source to model the palatalization noise. The model predicted a fricative formant at 7500 Hz, again related to the front cavity and radiation impedance.

6. Evers et al. (1998) obtained spectral slopes below and above 2.5 kHz, fitting regression lines up to 8 kHz. They observed that, together, the two slope measures differentiated /s/ and /ʃ/, with /ʃ/ having a steeper increase in the low-frequency range and /s/ being steeper in the higher-frequency range. This measure effectively captures the high-frequency content of /s/, but only up to 8 kHz; i.e., it takes into account only the lower end of what we have called the “high-frequency range.”

7. PN combines two picture-naming tasks described in Koenig et al. (2013): a rhotic picture-naming task and an articulation test (AT). These were designed by Preston and Edwards (2007) to target, respectively, rhotic consonants and a range of consonant clusters and syllable structures. Rhotics and clusters were of interest in the original work, which sought to characterize children with typical and atypical development, as clusters and rhotics are rather susceptible to misarticulation in children with speech sound disorders. Koenig et al. (2013) only analyzed words containing /s/. Here we made use of all words in the two tasks having supraglottal fricatives.

1.
Avery
,
J. D.
, and
Liss
,
J. M.
(
1996
). “
Acoustic characteristics of less-masculine-sounding male speech
,”
J. Acoust. Soc. Am.
99
(
6
),
3738
3748
.
2.
Badin
,
P.
,
Shadle
,
C. H.
,
Pham Thi Ngoc
,
Y.
,
Carter
,
J. N.
,
Chiu
,
W. S. C.
,
Scully
,
C.
, and
Stromberg
,
K.
(
1994
). “
Frication and aspiration noise sources: Contribution of experimental data to articulatory synthesis
,” in
Proceedings of the 3rd International Conference on Spoken Language Processing (ICSLP 1994)
, September 18–20, Yokohama, Japan (
International Speech Communication Association
,
Baixas, France
), Vol.
1
, pp.
163
166
.
3.
Bates
,
D.
,
Maechler
,
M.
,
Bolker
,
B.
, and
Walker
,
S.
(
2015
). “
Fitting linear mixed-effects models using lme4
,”
J. Stat. Softw.
67
,
1
48
.
4.
Behrens
,
S. J.
, and
Blumstein
,
S. E.
(
1988
). “
Acoustic characteristics of English voiceless fricatives: A descriptive analysis
,”
J. Phon.
16
,
295
298
.
5.
Benjamini
,
Y.
, and
Yekutieli
,
D.
(
2005
). “
False discovery rate–adjusted multiple confidence intervals for selected parameters
,”
J. Am. Stat. Assoc.
100
,
71
81
.
6.
Blacklock
,
O. S.
(
2004
). “
Characteristics of variation in production of normal and disordered fricatives, using reduced-variance spectral methods
,” Ph.D. dissertation,
University of Southampton
,
Southampton, UK
.
7.
Boersma
,
P.
, and
Weenink
,
D.
(
2001
). “
Praat, a system for doing phonetics by computer
,”
Glot. Int.
5
(
9/10
),
341
345
.
8.
Cox
,
S. R.
,
Shadle
,
C. H.
, and
Chen
,
W.-R.
(
2019
). “
Acoustic variability in electrolaryngeal speech
,”
J. Acoust. Soc. Am.
146
(
4
),
2921
.
9.
Evers
,
V.
,
Reetz
,
H.
, and
Lahiri
,
A.
(
1998
). “
Cross-linguistic acoustic categorization of sibilants independent of phonological status
,”
J. Phon.
26
,
345
370
.
10.
Fant
,
C. G. M.
(
1970
).
Acoustic Theory of Speech Production
(
Mouton de Gruyter
,
The Hague, Netherlands
).
11.
Flanagan
,
J. L.
(
1972
).
Speech Analysis, Synthesis, and Perception
,
2nd ed.
(
Springer Verlag
,
Berlin
).
12.
Forrest
,
K.
,
Weismer
,
G.
,
Milenkovic
,
P.
, and
Dougall
,
R. N.
(
1988
). “
Statistical analysis of word-initial voiceless obstruents: Preliminary data
,”
J. Acoust. Soc. Am.
84
,
115
123
.
13.
Fox
,
R. A.
, and
Nissen
,
S. L.
(
2005
). “
Sex-related acoustic changes in voiceless English fricatives
,”
J. Speech Lang. Hear. Res.
48
(
4
),
753
765
.
14.
French
,
N. R.
, and
Steinberg
,
J. C.
(
1947
). “
Factors governing the intelligibility of speech sounds
,”
J. Acoust. Soc. Am.
19
(
1
),
90
119
.
15.
Fuchs
,
S.
, and
Toda
,
M.
(
2010
). “
Do differences in male versus female /s/ reflect biological or sociophonetic factors
,” in
Turbulent Sounds: An Interdisciplinary Guide
, edited by
S.
Fuchs
,
M.
Toda
, and
M.
Żygis
(
Mouton de Gruyter
,
Berlin
), pp.
281
302
.
16.
Gordon
,
M.
,
Barthmaier
,
P.
, and
Sands
,
K.
(
2002
). “
A cross-linguistic acoustic study of voiceless fricatives
,”
J. Int. Phon. Assoc.
32
,
141
173
.
17.
Harris
,
K. S.
(
1958
). “
Cues for the discrimination of American English fricatives in spoken syllables
,”
Lang. Speech
1
,
1
7
.
18.
Hughes
,
G. W.
, and
Halle
,
M.
(
1956
). “
Spectral properties of fricative consonants
,”
J. Acoust. Soc. Am.
28
,
303
310
.
19.
Hunter
,
L. L.
,
Monson
,
B. B.
,
Moore
,
D. R.
,
Dhar
,
S.
,
Wright
,
B. A.
,
Munro
,
K. J.
,
Zadeh
,
L. M.
,
Blankenship
,
C. M.
,
Stiepan
,
S. M.
, and
Siegel
,
J. H.
(
2020
). “
Extended high frequency hearing and speech perception implications in adults and children
,”
Hear. Res.
397
,
107922
.
20.
International Telecommunication Union
(
2007
). “
P.862.2: Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs
” (
International Telecommunication Union
,
Geneva, Switzerland
).
21.
Jacewicz
,
E.
, and
Fox
,
R. A.
(
2020
). “
Summary of ‘Reintroducing the high-frequency region to speech perception research,’ Special Session
,”
Proc. Mtgs. Acoust.
42
(
1
),
002001
.
22.
Jassem
,
W.
(
1965
). “
The formants of fricative consonants
,”
Lang. Speech
8
(
1
),
1
16
.
23.
Jassem
,
W.
(
1979
). “
Classification of fricative spectra using statistical discriminant functions
,” in
Frontiers of Speech Communication Research
, edited by
B.
Lindblom
and
S.
Öhman
(
Academic
,
London
), pp.
77
91
.
24.
Jesus
,
L. M. T.
, and
Shadle
,
C. H.
(
2002
). “
A parametric study of the spectral characteristics of European Portuguese fricatives
,”
J. Phon.
30
,
437
464
.
25.
Jongman
,
A.
,
Wayland
,
R.
, and
Wong
,
S.
(
2000
). “
Acoustic characteristics of English fricatives
,”
J. Acoust. Soc. Am.
108
,
1252
1263
.
26.
Koenig
,
L. L.
,
Shadle
,
C. H.
,
Preston
,
J. L.
, and
Mooshammer
,
C. R.
(
2013
). “
Toward improved spectral measures of /s/: Results from adolescents
,”
J. Speech Lang. Hear. Res.
56
(
4
),
1175
1189
.
27.
Krane
,
M. H.
(
2005
). “
Aeroacoustic production of low-frequency unvoiced speech sounds
,”
J. Acoust. Soc. Am.
118
(
1
),
410
427
.
28.
Kuznetsova
,
A.
,
Brockhoff
,
P. B.
, and
Christensen
,
R. H. B.
(
2017
). “
lmerTest package: Tests in linear mixed effects models
,”
J. Stat. Softw.
82
,
1
26
.
29. Lenth, R. V. (2022). “emmeans: Estimated marginal means, aka least-squares means (R package version 1.8.1-1),” https://CRAN.R-project.org/package=emmeans (Last viewed September 13, 2022).
30. Maniwa, K., Jongman, A., and Wade, T. (2009). “Acoustic characteristics of clearly spoken English fricatives,” J. Acoust. Soc. Am. 125, 3962–3973.
31. May, J. (1976). “Vocal tract normalization for /s/ and /ʃ/,” in Haskins Laboratories Status Report on Speech Research SR-48 (Haskins Laboratories, New Haven, CT), pp. 67–73.
32. McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., and Sonderegger, M. (2017). “Montreal Forced Aligner: Trainable text-speech alignment using Kaldi,” in Proceedings of Interspeech 2017, August 20–24, Stockholm, Sweden (International Speech Communication Association, Stockholm, Sweden), pp. 498–502.
33. McGowan, R. S., and Nittrouer, S. (1988). “Differences in fricative production between children and adults: Evidence from an acoustic analysis of /ʃ/ and /s/,” J. Acoust. Soc. Am. 83, 229–236.
34. McMurray, B., and Jongman, A. (2011). “What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations,” Psych. Rev. 118(2), 219–246.
35. Monson, B. B., Hunter, E. J., Lotto, A. J., and Story, B. H. (2014). “The perceptual significance of high-frequency energy in the human voice,” Front. Psych. 5, 587.
36. Munson, B., Jefferson, S. V., and McDonald, E. C. (2006). “The influence of perceived sexual orientation on fricative identification,” J. Acoust. Soc. Am. 119, 2427–2437.
37. Narayanan, S. S. (1995). “Fricative consonants: An articulatory, acoustic, and systems study,” Ph.D. dissertation, University of California, Los Angeles, Los Angeles, CA.
38. Nirgianaki, E. (2014). “Acoustic characteristics of Greek fricatives,” J. Acoust. Soc. Am. 135(5), 2964–2976.
39. Preston, J. L., and Edwards, M. L. (2007). “Phonological processing skills of adolescents with residual speech sound errors,” Lang. Speech Hear. Serv. Sch. 38, 297–308.
40. Preston, J. L., and Edwards, M. L. (2009). “Speed and accuracy of rapid speech output by adolescents with residual speech sound errors including rhotics,” Clin. Linguist. Phon. 23(4), 301–318.
41. R Core Team (2020). R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, Austria).
42. Reilly, J., and Fisher, J. L. (2012). “Sherlock Holmes and the strange case of the missing attribution: A historical note on ‘The Grandfather Passage’ [Letter to the Editor],” J. Speech Lang. Hear. Res. 55, 84–88.
43. Semel, E., Wiig, E. H., and Secord, W. A. (2003). Clinical Evaluation of Language Fundamentals, 4th ed. (Harcourt Assessment, San Antonio, TX).
44. Shadle, C. H. (1985). “The acoustics of fricative consonants,” Ph.D. dissertation, Massachusetts Institute of Technology, Cambridge, MA.
45. Shadle, C. H. (1991). “The effect of geometry on source mechanisms of fricative consonants,” J. Phon. 19, 409–424.
46. Shadle, C. H. (2023). “Alternatives to moments for characterizing fricatives: Reconsidering Forrest et al. (1988),” J. Acoust. Soc. Am. 153(2), 1412–1426.
47. Shadle, C. H., Badin, P., and Moulinier, A. (1991). “Towards the spectral characteristics of fricative consonants,” in Proceedings of the XIIth International Congress of Phonetic Sciences, August 19–24, Aix-en-Provence, France (University of Provence Aix-Marseille I, Aix-en-Provence, France), Vol. 3, pp. 42–45.
48. Shadle, C. H., Chen, W.-R., Koenig, L. L., and Preston, J. L. (2020a). “Fricative variability in normal adult speakers,” J. Acoust. Soc. Am. 148, 2579.
49. Shadle, C. H., Chen, W.-R., Koenig, L. L., and Preston, J. L. (2020b). “Acoustic variability of fricatives in normal adults,” in Proceedings of the 12th International Seminar on Speech Production (ISSP) 2020, December 14–18 (Haskins Press, New Haven, CT) (Virtual).
50. Shadle, C. H., Chen, W.-R., and Whalen, D. H. (2016). “Stability of the main resonance frequency of fricatives despite changes in the first spectral moment,” J. Acoust. Soc. Am. 140, 3219–3220.
51. Shadle, C. H., Chen, W.-R., and Whalen, D. H. (2017). “Articulatory-acoustic relationships for [s] in the XRMB database,” in Proceedings of the International Seminar on Speech Production, October 16–19, Tianjin, China (University of Tianjin, Tianjin, China).
52. Shadle, C. H., Koenig, L. L., and Preston, J. L. (2014). “Acoustic characterization of /s/ spectra of adolescents: Moving beyond moments,” Proc. Mtgs. Acoust. 12(1), 060006.
53. Shadle, C. H., and Mair, S. J. (1996). “Quantifying spectral characteristics of fricatives,” in Proceedings of the Fourth International Conference on Spoken Language Processing, October 3–6, Philadelphia, PA (IEEE, New York), pp. 1521–1524.
54. Shadle, C. H., Mair, S., and Carter, J. N. (1996). “Acoustic characteristics of the front fricatives [f, v, θ, ð],” in Proceedings of the 1st European Speech Communication Association Tutorial and Research Workshop on Speech Production Modeling and 4th Speech Production Seminar, May 21–24, Autrans, France, pp. 193–196.
55. Shadle, C. H., and Scully, C. (1995). “An articulatory-acoustic-aerodynamic analysis of [s] in VCV sequences,” J. Phon. 23, 53–66.
56. Shadle, C. H., Stone, M. L., and Chen, W.-R. (2018). “The acoustic consequences for sibilants produced by glossectomies compared to healthy controls,” J. Acoust. Soc. Am. 144, 1967.
57. Stevens, K. N. (1971). “Airflow and turbulence noise for fricative and stop consonants: Static considerations,” J. Acoust. Soc. Am. 50, 1180–1192.
58. Stevens, K. N. (1998). Acoustic Phonetics (MIT, Cambridge, MA).
59. Strevens, P. (1960). “Spectra of fricative noise in human speech,” Lang. Speech 3, 32–49.
60. Tabain, M. (1998). “Non-sibilant fricatives in English: Spectral information above 10 kHz,” Phonetica 55, 107–130.
61. Thomson, D. J. (1982). “Spectrum estimation and harmonic analysis,” Proc. IEEE 70, 1055–1096.
62. Tjaden, K., and Turner, G. S. (1997). “Spectral properties of fricatives in amyotrophic lateral sclerosis,” J. Speech Lang. Hear. Res. 40, 1358–1372.
63. Whalen, D. H. (1991). “Perception of the English /s/-/ʃ/ distinction relies on fricative noises and transitions, not on brief spectral slices,” J. Acoust. Soc. Am. 90(4), 1776–1785.