The perceived level of femininity and masculinity is a prominent property by which a speaker's voice is indexed, and a vocal expression incongruent with the speaker's gender identity can contribute greatly to gender dysphoria. Our understanding of the acoustic cues underlying the levels of masculinity and femininity that listeners perceive in voices is, however, not well developed, and improving it would benefit the communication of therapy goals and the evaluation of outcomes in gender-affirming voice training. We developed a voice bank of 132 voices expressing a range of levels of femininity and masculinity, as rated by 121 listeners in independent, individually randomized perceptual evaluations. Acoustic models were developed from measures identified as markers of femininity or masculinity in the literature, using penalized regression and tenfold cross-validation procedures. The 223 most important acoustic cues explained 89% and 87% of the variance in the perceived level of femininity and masculinity in the evaluation set, respectively. The median fo was confirmed to provide the primary cue, but other acoustic properties must be considered in accurate models of femininity and masculinity perception. The developed models are proposed to support the communication and evaluation of gender-affirming voice training goals and to improve voice synthesis efforts.

While femininity and masculinity are central to our perception of voices, there is no comprehensive model that acoustically describes how feminine or masculine a speaking voice is likely to be perceived by a listener. This lack of a reliable description of femininity and masculinity in a voice has direct implications for all applied acoustic efforts related to how a voice is perceived, but we put forward here specifically the field of clinical gender-affirming voice training as a frame for illustrating the benefit offered by a reliable model of these properties of voices.

An increasing number of transgender and gender-diverse persons have been seeking health care services due to gender dysphoria, meaning that their gender identity is not congruent with their sex assigned at birth, which is associated with significant distress and impaired social functioning.1 Pursuing a gender expression that is more congruent with the person's identity can be a way to lessen gender dysphoria, and voice modification toward better alignment with the gender identity has been put forward as a critical factor in being perceived by others in accordance with one's gender identity and in increasing well-being.2–4 The speech-language pathologist (SLP) taking on a new client cannot assume that the decision on a particular speech pattern as the target for training will lead to beneficial outcomes in the client's well-being. Some transgender and gender-diverse clients aim for an unequivocally female or male voice expression, that is, a voice that conforms with the voices of cisgender speakers, whose gender identity aligns with the biologically and anatomically defined sex (man or woman) they were assigned at birth. For other clients, the gender identity does not align with a binary gender presentation (non-binary), and their preferred gender expression may relate more freely to the perceptual scales of femininity and masculinity.5,6 SLPs providing gender-affirming voice training therefore need solid knowledge about which aspects of voice and speech are most salient for clients' desired vocal expression in order to deliver the best match between the desired training target and the outcome. Thus, while we envision that a thorough understanding of the acoustic cues listeners may use to determine the femininity or masculinity of voices could be applied in many domains, gender-affirming voice training is put forward here as one domain where a reliable model of perception could be directly applied to yield a more reliable practice.

To provide the foundation for answering the question of what aspects of voice (and speech) are most salient in listener perceptions of speaker gender, Leung et al.7 provided a systematic review of 38 peer-reviewed published articles and theses concerned with acoustic cues to speaker gender, or to the level of femininity or masculinity perceived through the voice. The review identified the most salient acoustic and auditory-perceptual vocal parameters guiding listeners' perceptions, grouped into three domains: voice (pitch, resonance, loudness, and vocal quality), speech (articulation), and prosody (intonation, tempo, and stress). The notation used for acoustic quantities follows the suggestions by Titze et al.8 The fundamental frequency of vocal fold oscillations (fo), the center frequencies of the first four vowel formants (F1–F4), and the intonation patterns used were identified as key features in the perception of a speaker as male/masculine or female/feminine. In addition, aspects of voice quality (e.g., breathiness), prosodic variation (of speech rate and vowel duration), and variation in loudness were auditory-perceptual features that contributed to the perception of a female or feminine voice. Examples of acoustic measures influencing the perception of femininity in voice were related to articulation (spectral properties of /s/ and of /ʃ/, which have recently been highlighted as gender markers9), voice quality (differences between the levels of the second and fourth harmonics, L2–L4, and between the level of the first harmonic and that of the third formant, L1–LF3), semitone range, and dynamic range. The perception of a voice as male or masculine was related to acoustic measures of formant spacing.7

The acoustic predictors have primarily been studied in terms of their ability to separate speakers of different sexes or genders in isolation (main effects) and have only more recently been evaluated in terms of the relative strength of the acoustic cues. Leung et al.10 recently evaluated the relative contributions and cross-correlations of the vowel formant frequencies (F1–F3) and the fundamental frequency (fo) to the perception of a speaker's gender and vocal expression on a femininity–masculinity scale using principal component analysis. They found significant contributions of primarily F1 and fo to the model of femaleness, and of the overall levels of fo and F1–F3 to the perception of a voice on a femininity–masculinity scale. This finding was recently replicated in part by Merritt et al.,11 who found that the rated similarity of voices in tasks related to the perception of the level of femininity and masculinity was closely connected with fo, speaker-indexed F1, and vowel space area, and by Södersten et al.,12 who showed that fo, average formant frequency, and Leq most strongly predicted listener ratings of a female-sounding voice in trans women before and after gender-affirming voice training and in cisgender voices. Leung et al.10 further found that the listener's initial determination that a voice belonged to a female or male speaker influenced how the voice was perceived in terms of femininity–masculinity.
The combined effect of modifying both fo and F1–F4 has been shown to be more likely to change listeners' perceptions than shifting one measure alone.10,13 The analysis of Leung et al.10 did, however, show substantial unexplained variation, further emphasizing that listeners' classification of a speaker in terms of gender, or in terms of level of femininity and masculinity, is multidimensional in nature.7,13 Previous studies have, with a few exceptions,11,12,14,15 used synthesized speech, recordings of the speaker reading a standard text,16 or other forms of controlled speech tasks in the data collection.7 When spontaneous speech has been analyzed, it has been confirmed that fo is not a sufficient predictor of the perceived level of femininity and masculinity in voices and that shifted formant frequencies provide important additional cues.15 A more extensive evaluation of which acoustic properties serve as cues to the perceived level of femininity and masculinity in spontaneous speech has not been attempted.

The current investigation aimed to determine which of 312 acoustic properties identified in the literature7 most accurately predict the level of femininity and masculinity in spontaneous speech. While speaker sex, gender, and the level of perceived femininity and masculinity have not always been treated separately in the previous research on which this study builds, the dimensions of concern here are the continua of femininity and masculinity perceived in the voice, independent of the sex assigned at birth or the gender identity of the speaker. A secondary aim was to provide information on how the acoustic predictors co-occur and provide supplementary or supporting cues to the perception of a voice. A penalized regression procedure, the least absolute shrinkage and selection operator (LASSO), was used to provide linear models with a high likelihood of generalizing beyond the training data. We also wanted to take advantage of the feature selection effect of L1 penalization to build the models from minimally correlated acoustic predictors. The perceived levels of femininity and masculinity in voices were assessed separately, with stimuli presented in randomized order, to minimize the potential order effect of stimuli17 and provide the most reliable and unbiased model.

The study aimed to develop and perceptually evaluate a bank of voices with a full range of levels of femininity and masculinity as perceived by cisgender men and women, transgender and gender-diverse listeners, and professional speech-language pathologists.

1. Speech material

Voice samples from 132 adult speakers were recorded in the speakers' homes or another suitable recording environment. The ambient sound level was verified to be less than 38 dB(A)18 before recording started. The speech recordings were collected using the application AVR X (Newkline Co., Ltd.) on Android mobile phones. A 44.1 kHz sampling rate and a 128 kbps recording rate were used, with automatic gain control disabled.19 To capture the recordings, an omnidirectional RØDE SmartLav+ microphone with a frequency range of 20 Hz to 20 kHz and a signal-to-noise ratio of 67 dB was used. The microphone was head-mounted 5 cm from the speaker's mouth.20 Calibration was performed at a 5 cm distance using an SPL meter and a reference tone generated by the Sopran software.21 No reference tone was recorded with the speech of the participant. The participants were recruited to cover a wide age range of speakers and to collect speech recordings likely to cover the full range of perceived femininity and masculinity levels in voices. The recruitment effort was therefore directed both toward cisgender speakers, using a combination of convenience and snowball sampling by the authors, associated students, and research assistants, and toward transgender and gender-diverse clients starting or undergoing gender-affirming voice training. Speakers were asked to self-report their gender identity by choosing from given options (“man,” “woman,” “non-binary,” “unsure/do not know”) or “other,” in which case the speakers were asked to define their gender identity in their own words.

As a result of the recruitment, voice samples were collected from 54 cisgender women (36.1 ± 12.3 years), 41 cisgender men (35.8 ± 12.3 years), and 37 transgender and gender-diverse persons with varying gender identities (28.3 ± 9.9 years). Among the transgender and gender-diverse group, 11 persons were undergoing voice training in a voice masculinizing direction, 11 persons were undergoing voice training toward a more feminine sounding voice, and 15 persons had not taken part in gender-affirming voice training. All speakers were native speakers of Swedish. Information about speakers' hearing, smoking history, gender-affirming hormone therapy, and preexisting disorders that could potentially affect voice or speech, including medically treated asthma, neurological, and neuropsychiatric disorders, was collected. Mild dysphonia was not judged as a reason for exclusion. A few speakers reported a previous or present voice disorder; however, no speaker was judged by the authors to be influenced by any of these factors to a degree that their voice did not reflect typical voices perceived in everyday life.

The speakers were asked to describe an activity or hobby they enjoyed. The sample assessed in the perceptual rating consisted of spontaneous speech judged by the authors to be free of distracting noise and not to contain gendered information. For each speaker, the aim was to extract a sample of approximately 10 s containing at least three phrases. The resulting spontaneous speech samples used in the listening experiment contained 3–5 phrases and had an average duration, with standard deviation, of 11.4 ± 3.8 s.

2. Perceptual rating procedure

Listeners for the perceptual evaluation of masculinity and femininity in the collected bank of voices were recruited using a combination of broad and targeted strategies. The primary recruitment approach was based on the networks of the researchers and student research assistants associated with the study. In addition, interest groups for transgender and gender-diverse people were approached in the recruitment effort to support a broad representation of gender identities in the sample of listeners. Finally, SLPs providing gender-affirming voice training nationally were approached and asked to participate in the evaluation. All listeners were asked to self-report their gender identity by choosing from given options (“man,” “woman,” “transmasculine,” “transfeminine,” “non-binary,” “unsure/do not know”) or “other,” in which case the listeners were asked to define their gender identity in their own words. All listeners were native speakers of Swedish and reported that their hearing was sufficient for verbal communication. Information about preexisting neurologic and neuropsychiatric disorders that could affect voice perception was collected.22 An overview of biographical information for the 121 listeners who took part in the perceptual evaluation and the listening context of their participation is presented in Table I.

TABLE I.

The average ages (with standard deviation and range) for listener participants grouped by their sex assigned at birth, reported gender identity, and the listening context in which they participated in the perceptual evaluation. The number of participants in each subgroup of listeners (N) is also indicated.

Listener group Assigned sex at birth Reported gender identity Listening context N Age distribution in the listener group (years)
Cisgender  Male  Man  In person  27  41.4 ± 13.2 (19–63) 
    Man  Online  50.3 ± 7.9 (42–65) 
  Female  Woman  In person  27  36.9 ± 14.3 (18–59) 
      Woman  Online  17  39.4 ± 12.3 (19–55) 
Transgender and gender-diverse  Male  Woman  In person  43 
  Woman  Online  37.9 ± 16.1 (19–63) 
    Transfeminine  Online  33.3 ± 19.7 (21–56) 
    Non-binary  In person  27 
Unsure  In person  28 
  Female  Man  Online  22.5 ± 5.1 (19–30) 
    Transmasculine  In person  25  
    Transmasculine  Online  25.2 ± 6.5 (19–34) 
    Non-binary  In person  28.5 ± 4.9 (25–32) 
    Non-binary  Online  32.7 ± 10.4 (21–41) 
Speech and language pathologist  Female  Woman  Online  14  49.4 ± 10.9 (35–65) 

All 132 extracted voice samples were rated by each listener twice, in two blocks, with stimulus presentation order randomized separately for each block and each listener. The block order was counterbalanced so that every other participant was asked to rate masculinity first, either by the researcher (in person) or by the experiment link provided to them (online). In each trial, the listener was asked to rate either the level of perceived masculinity or the level of perceived femininity in the speaker's voice on a visual analog scale presented in the center of the screen once the stimulus presentation had been completed. The total listening time for each block was 25 min, and the listeners were encouraged to take a short break between the two blocks. The perceptual ratings of femininity and masculinity in the voice samples were managed using the PsychoPy software23 in in-person sessions or, due to the onset of the Covid-19 pandemic, online on the Pavlovia24,25 experiment presentation platform. The distribution of listeners across online and in-person listening situations is presented in Table I.
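The randomization scheme described above, independently shuffled stimulus orders per block and per listener with counterbalanced block order, can be sketched as follows. This is an illustrative assumption of how such logic might be implemented; the function name, seeding scheme, and return format are hypothetical and not the study's actual PsychoPy/Pavlovia code.

```python
import random

def make_presentation_orders(n_stimuli, listener_index, seed=2022):
    """Hypothetical sketch of the counterbalanced, per-listener
    randomization described in the text (not the authors' code).

    Returns the block order for one listener plus an independently
    shuffled stimulus order for each of the two rating blocks."""
    # One deterministic random stream per listener.
    rng = random.Random(seed * 1_000_003 + listener_index)
    # Counterbalance: every other listener rates masculinity first.
    if listener_index % 2 == 0:
        blocks = ("masculinity", "femininity")
    else:
        blocks = ("femininity", "masculinity")
    orders = {}
    for block in blocks:
        order = list(range(n_stimuli))
        rng.shuffle(order)  # separate randomization for each block
        orders[block] = order
    return blocks, orders
```

Seeding per listener keeps each listener's two presentation orders reproducible while remaining independent across listeners and blocks.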

In addition to the auditory-perceptual evaluations of the voices, the listeners were asked to report their own age, sex assigned at birth, and gender identity in an online form. Listeners who were speech and language pathologists providing gender-affirming voice training in their profession were asked to indicate this in the form.

3. Statistical procedure

The agreement of raters' perception of voices regarding the level of femininity and masculinity was evaluated using the intraclass correlation coefficient (two-way mixed effects model, absolute agreement).26 The strength of the association between listeners' ratings of femininity and masculinity for a particular voice was assessed using Pearson's correlation coefficient.
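As a concrete sketch of the agreement statistic, the two-way, absolute-agreement, single-rater ICC can be computed from the classical two-way ANOVA mean squares. The minimal NumPy implementation below is illustrative only and is not the authors' analysis code; the function name is an assumption.

```python
import numpy as np

def icc_a1(x):
    """Two-way, absolute-agreement, single-rater ICC (ICC(A,1) in
    McGraw and Wong's terminology); illustrative sketch.

    `x` is an (n_targets, k_raters) matrix of ratings."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-voice means
    col_means = x.mean(axis=0)   # per-rater means
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # rows (targets)
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # columns (raters)
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))         # residual error
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Note that, because the statistic measures absolute agreement, a constant offset between raters lowers the coefficient even when the raters rank the voices identically.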

The study aimed to establish the relative strengths of acoustic measures contributing to the perception of femininity and masculinity in speakers' speech.

1. Materials

Spontaneous speech samples produced by the 132 speakers were submitted to a manual markup procedure to identify all intervals of the speech signal required to compute the 312 acoustic measures inferred from previous reports of acoustic properties influencing the perception of femininity or masculinity in the voice.7 The manual markup procedure identified the start and end of all individual speech utterances, fricatives, plosives, and vowels in stressed syllables (Table II). Further, the start and end points of the plosive release burst and aspiration intervals were marked manually from a waveform display of the speech signal. The fo analysis floor and ceiling settings suggested by the speaker's reported sex assigned at birth were evaluated visually against the harmonic structure in a narrow-band spectrogram and adjusted to provide the best agreement between the computed fo and the narrow-band spectrogram display. Similarly, the formant frequency ceiling and analysis window size were adjusted from the settings suggested by the speaker's sex assigned at birth to fit the individual speaker. The individualized fo and formant frequency extraction settings were noted for subsequent acoustic analyses and are listed in supplementary material A for each participant. All analyses were performed using the Praat software package (version 6.3.10); settings not specified in supplementary material A were left at default values.

TABLE II.

The number of intervals of each type of interest extracted for acoustic analysis of spontaneous speech samples. The frequencies are summarized as the average number of occurrences (with standard deviation) and range across speakers.

Speech intervals Number of analyzed intervals per speaker
Utterances  14.4 ± 4.9 (7–35) 
Fricatives  26.7 ± 10.6 (8–68) 
Plosives  50.1 ± 17.8 (10–103) 
Vowels  42.8 ± 17.7 (15–105) 

2. Acoustic analysis

Acoustic properties related to measures identified in previous research as relevant to the perception of either masculinity or femininity in the voice7 were extracted using the PraatSauce27 collection of procedures or custom scripts for the Praat speech signal processing software.28 The settings identified as appropriate for extracting fo and formant frequency information for each speaker were used when applying the signal processing procedures. The results were stored in signal tracks compatible with the Emu Speech Database Management system,29 which was subsequently used to consistently extract measurements from the tracks for statistical processing. Since the planned statistical procedure (LASSO) performs feature selection as part of model fitting through hyper-parameter tuning, the set of extracted acoustic properties was expanded somewhat relative to the features discussed in the review of Leung et al.7 to guard against the risk that previous works had identified a general property for which a closely related alternative quantification exists. For instance, since the review of Leung et al.7 suggested that a simple spectral difference measure, such as L1–L4 or L1–LF3 (in the notation suggested by Titze et al.8), is indicative of the perceived level of femininity or masculinity, all combinations of similar distances in the indicated acoustic domains up to the fourth harmonic or formant were also included. Further, harmonic amplitudes were considered both in an uncorrected form and in a form corrected for the influence of adjacent formants.30 An overview of the extracted acoustic quantities is presented in Table III. All formant frequency values were converted from the Hz values extracted from the software to the Bark scale using the analytical approximation
F_Bark = 26.81 / (1 + 1960 / F) − 0.53,
where F is the formant frequency in Hz, as proposed by Traunmüller.31 Similarly, the fundamental frequency fo was expressed on a semitone scale (St) relative to a 16.35160 Hz reference.32
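Both frequency transformations are simple closed forms. As a minimal sketch (the helper names are hypothetical, not from the study's scripts):

```python
import math

def hz_to_bark(f_hz):
    """Traunmüller's analytical Bark approximation, as used for the
    formant frequencies: 26.81 / (1 + 1960/F) - 0.53."""
    return 26.81 / (1.0 + 1960.0 / f_hz) - 0.53

def hz_to_st(f_hz, ref_hz=16.35160):
    """Hz to semitones (St) relative to the 16.35160 Hz (C0) reference:
    12 * log2(F / F_ref)."""
    return 12.0 * math.log2(f_hz / ref_hz)
```

For example, hz_to_st doubles by 12 St per octave, so the reference frequency itself maps to 0 St and one octave above it to 12 St.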
TABLE III.

Descriptions of the extracted acoustic properties, the interval type they were extracted from, the per utterance summary statistics computed, and how utterance-level statistics were transformed to a summary of the speakers' spontaneous speech properties.

Interval type Interval Acoustic base property Utterance-level acoustic measures Speaker-level summary statistics
Suprasegmental  Utterance  Fundamental frequency ( fo)a,b  Utterance average  The median across all utterances 
Standard deviation across an utterance 
Coefficient of variability across an utterance  Standard deviation across all utterances 
Range in an utterance 
    Fundamental frequency changea at the end of an utterance  Change in fo in the utterance final tonal movement  The median across all utterances 
Standard deviation across all utterances 
    Speech signal intensityb (dB)  Utterance average  The median across all utterances 
Standard deviation across an utterance 
Coefficient of variability across an utterance  Standard deviation across all utterances 
Range in an utterance 
    Harmonic-to-noise in 0–500 Hz, 0–1500 Hz, 0–2500 Hz, and 0–3500 Hz frequency regionsc  Utterance average  The median across all utterances 
Standard deviation across an utterance 
Range in an utterance  Standard deviation across all utterances 
    Smoothed cepstral peak prominence (CPP)d  Utterance average  The median across all utterances 
Standard deviation across an utterance 
Coefficient of variability across an utterance  Standard deviation across all utterances 
Range in an utterance 
Segmental  Vowels in stressed syllables  Mid vowel formant frequencies, in Bark (F1–F4)  Median across 5 measurement points centered in the vowel  Median across vowels produced in all utterances 
Inter-formant differences in vowel center median frequencies (F2–F1, F3–F1, …, F5–F4)  Vowel space area36  
    Mid vowel formant bandwidths (B1–B5)e  Median across 5 measurement points centered in the vowel  Median across vowels produced in all utterances 
Inter-formant differences in vowel center median bandwidths (B2–B1, B3–B1, …, B5–B4)  Standard deviation across vowels produced in all utterances 
    Mid vowel harmonic frequenciesb,f directly measured (f1, f2, f3) and with correction for the impact of formantsg  Median across 5 measurement points centered in the vowel (f1, f2, f3, f1c, f2c, f3c)  The median across vowels produced in all utterances 
Inter-harmonic differences of vowel center median frequencies (f2–f1, f3–f1, f3–f2) and (f2c–f1c, f3c–f1c, f3c–f2c)  Standard deviation across vowels produced in all utterances 
    Mid vowel amplitude of harmonics closest to the first four formantsh (LH(F1), LH(F2), LH(F3), LH(F4)), directly measured and with correction for the impact of formantsg  Median across 5 measurement points centered in the vowel (LH(F1) to LH(F3)) and corresponding corrected differences  The median across vowels produced in all utterances 
Inter-harmonic differences of harmonic amplitudes (LH(F2)–LH(F1), LH(F3)–LH(F1), LH(F3)–LH(F2), LH(F4)–LH(F2))  Standard deviation across vowels produced in all utterances 
    Harmonic-to-noise in 0–500 Hz, 0–1500 Hz, 0–2500 Hz, and 0–3500 Hz frequency regionsi  Median across 5 measurement points centered in the vowel  The median across all vowels produced in all utterances 
Standard deviation across all vowels produced in all utterances 
    Smoothed cepstral peak prominence (CPP)d  Median across 5 measurement points centered in the vowel  The median across all vowels produced in all utterances 
Standard deviation across all vowels produced in all utterances 
  Fricatives  Center of gravity, skewness, kurtosis, and standard deviation of the spectrum  Median across 10 measurement points centered in the fricative  The median within all fricatives of a type produced in all utterances 
The median within all fricatives of a type produced in all utterances 
  Plosives  Spectrum center of gravity, skewness, kurtosis, and standard deviation  Median across 5 centered measurement points of the plosive release transient, release burst, and aspiration intervals  The median within all plosives of a type produced in all utterances 
The median within all plosives of a type produced in all utterances 
    RMS amplitude  Median across 5 centered measurement points of the plosive release transient, release burst, and aspiration intervals  The median within all plosives of a type produced in all utterances 
The median within all plosives of a type produced in all utterances 
a In semitones relative to the 16.35160 Hz reference used in phonogram construction (Ref. 32).

b Computed with the participant-specific pitch floor (supplementary material A) set as the minimum pitch level.

c Spectral magnitudes corrected using the methods of Iseli et al. (Ref. 30).

d As implemented in Praat. See Fraile and Godino-Llorente (Ref. 38) for an analysis of implementation and algorithmic differences.

e Bandwidths calculated using the method proposed by Hawks and Miller (Ref. 37).

f The property given the notation nfo by Titze et al. (Ref. 8).

g Spectral magnitudes corrected using the methods of Iseli et al. (Ref. 30).

h A notation was not provided by Titze et al. (Ref. 8). We use LH(Fn) to distinguish the level of the harmonic close to the formant with ordinal number n from the level of the formant (LFn) and the level of a numbered harmonic (Ln).

i Implemented using the comb-liftering in the cepstral domain described by de Krom (Ref. 39).

In addition, each utterance was labeled as prosodically rising, falling, or produced with a sustained tone according to its phrase-final fo contour. A rising or falling phrase-final intonation was defined as an increase or decrease of at least 2 St in the final tonal movement of a phrase; an even phrase-final intonation was defined as a change of no more than 2 St.33,34 The final intonation shift35 was calculated as the difference between the fo levels in the vowel of the last stressed syllable and the subsequent unstressed syllable. If the last syllable in the utterance was stressed, the fo change within that syllable was calculated instead. In rare cases, the change from preceding unstressed words to the last stressed vowel was used as the reference for the fo measurement, to align better with the perceptual impression of the tonal shift at the end of the phrase. Perceptual considerations were resolved by consensus between two listeners who were advanced-level SLP students and native speakers of Swedish.
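The 2 St decision rule and the shift computation described above can be sketched as a small classifier. This is an illustrative reconstruction under stated assumptions, not the authors' implementation; the function names and the two-point fo input are invented for the sketch.

```python
import math

def fo_shift_st(fo_start_hz, fo_end_hz):
    """Final intonation shift in semitones between two fo measurement points,
    e.g., the last stressed vowel and the subsequent unstressed syllable."""
    return 12.0 * math.log2(fo_end_hz / fo_start_hz)

def classify_final_intonation(fo_start_hz, fo_end_hz, threshold_st=2.0):
    """Label the phrase-final contour as rising, falling, or even,
    using the 2-semitone criterion described in the text."""
    shift = fo_shift_st(fo_start_hz, fo_end_hz)
    if shift >= threshold_st:
        return "rising"
    if shift <= -threshold_st:
        return "falling"
    return "even"

print(classify_final_intonation(200.0, 250.0))  # ~+3.9 St: rising
print(classify_final_intonation(200.0, 190.0))  # ~-0.9 St: even
```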

3. Statistical procedure

The acoustic description of each speaker's spontaneous speech (Table III) was combined with the femininity and masculinity ratings of the speaker provided by each of the 121 raters and normalized to a unit scale. The 15 972 perceptual evaluations of speakers' level of femininity or masculinity were split into one dataset used for training and tuning of models (75% training data) and a second set used only for the final evaluation of models (25% evaluation data). Two L1-penalized (LASSO) regression models were developed on the training dataset using separate cross-validation procedures: one for the perceived level of femininity and one for the perceived level of masculinity. The penalization of the linear models was employed to scale unimportant predictors, and predictors with low levels of unique contribution, to zero (feature selection), and to guard against overfitting to the training data, thereby ensuring the generalizability of the models. The model penalization parameters (ℓf and ℓm for the femininity and masculinity models, respectively) were tuned in a tenfold cross-validation procedure. The minimized root mean squared (RMS) error of predictions of masculinity or femininity, compared with the true perceptual ratings of speakers in the held-out 1/10th of the data, was used as the criterion for the selection of ℓf and ℓm. The ten computed models of femininity and masculinity perception were then averaged to derive the final models, which were evaluated on the 25% evaluation data not used in model training. The variable importance of acoustic predictors in the final models was computed using the Feature Importance Ranking Measure approach.40
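The penalization-and-tuning step can be sketched with scikit-learn on synthetic data. The predictor counts and effect sizes below are invented for illustration, and LassoCV's single refit at the selected penalty stands in for the study's averaging of the ten fold models.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 500 ratings, 50 acoustic predictors, 5 truly informative
X = rng.normal(size=(500, 50))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, 1.0, -0.5]) + rng.normal(scale=0.5, size=500)

# 75% training / 25% held-out evaluation split, as in the text
X_tr, X_ev, y_tr, y_ev = train_test_split(X, y, train_size=0.75, random_state=0)

# Standardize so the L1 penalty treats all predictors on a common scale
scaler = StandardScaler().fit(X_tr)

# Tenfold cross-validation tunes the penalty that minimizes prediction error
model = LassoCV(cv=10, random_state=0).fit(scaler.transform(X_tr), y_tr)

# The L1 penalty scales unimportant predictors to exactly zero (feature selection)
n_selected = int(np.sum(model.coef_ != 0))
r2_eval = model.score(scaler.transform(X_ev), y_ev)
print(n_selected, round(r2_eval, 2))
```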

The average ratings of femininity and masculinity for the 132 voice samples as rated by the 121 listeners are presented in Fig. 1. Perceptual ratings of voices showed an ICC(A,1) = 0.88 (CI95% = [0.86, 0.91]) for level of femininity and ICC(A,1) = 0.87 (CI95% = [0.84, 0.90]) for level of masculinity. A Pearson correlation analysis showed that the independently rated levels of femininity and masculinity in the voice samples were significantly correlated (r = –0.93, CI95% = [–0.933, –0.924], p < 0.001). The R2 of a linear model of the femininity rating as a function of the masculinity rating was 0.863.
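The reported R2 of 0.863 is consistent with the correlation, since for a one-predictor linear model R2 equals the squared Pearson r ((–0.93)2 ≈ 0.865, matching up to rounding). A quick numeric check of that identity on simulated ratings (the data here are toy values, not the study's):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated ratings with a strong inverse relationship, standing in for the real data
fem = rng.uniform(0.0, 1.0, 1000)
mas = 1.0 - fem + rng.normal(scale=0.115, size=1000)

r = np.corrcoef(fem, mas)[0, 1]

# Fit the one-predictor linear model and compute its R^2
slope, intercept = np.polyfit(mas, fem, 1)
resid = fem - (slope * mas + intercept)
r2 = 1.0 - resid.var() / fem.var()

# For simple linear regression, R^2 equals the squared Pearson correlation
print(round(r, 2), round(r2 - r**2, 6))
```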

FIG. 1.

The average independent ratings of the levels of femininity and masculinity in the 132 voices as rated by 121 listeners. Horizontal and vertical error bars indicate the 95% Gaussian confidence intervals. The dotted line indicates where the femininity and masculinity ratings sum to 1000.


The best models were constructed using 223 acoustic predictors. The acoustic models of femininity and masculinity using the acoustic predictors in Table III, together with perceptual predictors of utterance-final rising, falling, and level tones, showed R2f = 0.89 and R2m = 0.87 when evaluated in the 25% validation dataset. The models' prediction errors showed no clear dependence on the region of the masculinity or femininity scale being predicted (Fig. 2). Supplementary material B presents the variable importance of the acoustic properties that contributed to the models' ability to predict the perceived level of femininity and masculinity in voice samples not used in model training. The top six predictors, "mean fo of utterances," "variability in smoothed CPP of utterances," "median RMS amplitude of plosive release bursts," "variability in LH(F2) in vowels in stressed syllables," "variability in F4 and F1 differences in vowels in stressed syllables," and "variability of F1 of vowels in stressed syllables," were consistent across the femininity and masculinity models, with minor differences in relative importance. When evaluated separately in the validation set, these predictors together provided linear models of femininity and masculinity with R2f = 0.77 and R2m = 0.76, respectively (87% of the total explained variance of the models). Figure 3 indicates a strong association between the signed variable importance of predictors with non-zero coefficients obtained for the independently trained models of masculinity and femininity. One acoustic property ("variability of smoothed cepstral peak prominence in vowels in stressed syllables") was indicated to increase both the perceived level of femininity and the perceived level of masculinity in the independently trained models. Two predictors ("median difference between LF5 and LF3 in vowels in stressed syllables" and "variability in median RMS amplitude of utterances") provided small negative contributions to both independently trained acoustic models of femininity and masculinity. The level of correlation between each pair of predictors is provided in supplementary material C. A total of 0.7% of the measure comparisons in supplementary material C indicated an absolute correlation above 0.90, and 3% of comparisons showed an absolute correlation above 0.50.
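The Feature Importance Ranking Measure used for the reported rankings is more involved; a common simpler proxy for the signed importance of a predictor in a linear model is its coefficient scaled by the predictor's standard deviation. A sketch of that proxy on synthetic data (not the study's predictors, and not the FIRM measure itself):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)

# Synthetic predictors with unequal scales; only columns 0 and 1 drive the outcome
scales = np.array([1.0, 5.0, 0.5, 2.0, 1.0, 1.0, 3.0, 1.0])
X = rng.normal(size=(400, 8)) * scales
y = 2.0 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.3, size=400)

model = Lasso(alpha=0.05).fit(X, y)

# Signed importance proxy: coefficient scaled by the predictor's spread
# (a simpler stand-in for the FIRM approach of Ref. 40, not the same measure)
importance = model.coef_ * X.std(axis=0)

ranked = np.argsort(-np.abs(importance))
print(ranked[:2])  # the two truly informative predictors rank first
```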

FIG. 2.

(Color online) Errors in predictions (residuals) for the computed models of femininity (A) and masculinity (B) depending on the part of the respective scale at which the sample was rated.

FIG. 3.

The variable importance of each predictor with non-zero coefficients in the regularized LASSO models of femininity and masculinity. Dotted line indicates where the variable importance in the model of femininity was the negative of the variable importance in the model of masculinity. The femininity and masculinity perception models were based on separate ratings of the same stimuli and were trained separately.


The aim of this investigation was to determine which acoustic properties discussed in the literature make the most substantial individual contributions to a prediction of the perceived level of femininity and masculinity in voices. A voice bank expected to cover the full range of levels of femininity and masculinity in voices was developed and assessed in a blinded and randomized perceptual evaluation procedure. A comprehensive set of acoustic properties was extracted with speaker-specific settings for identifying fo and formants before applying them in derived measures. Two models of femininity and masculinity perception were developed from separate perceptual evaluations and in separate training procedures to ensure that the findings generalize well to a broader set of voices.

While acoustic properties observed in the literature to contribute to the perception of increased levels of femininity or masculinity in voices have received attention in previous reports,7 this study is the first to evaluate which predictors make the most substantial independent contributions to statistical models. The overall explained variance of femininity and masculinity in voices was 89% and 87%, respectively. Thus, the results support the conclusion that the feature elimination resulting from LASSO regularization retained the acoustic features important to modeling femininity and masculinity in voices not included in the training set. The separate modeling efforts for femininity and masculinity predominantly identified the same predictors, but with opposite signs of variable importance. The most prominent acoustic features identified are well attested in the literature, and fo on a semitone scale was confirmed to be the main acoustic property by which listeners determined the perceived level of femininity and masculinity in a voice. Complementary acoustic predictors were found in other properties of the voice source, in the interaction between the frequencies and amplitudes of the harmonics and formants in vowels, and in the strength and dynamics of plosive releases and aperiodic portions. Some additional cues to the perceived level of femininity and masculinity were found in the spectral distribution of fricatives. Unfortunately, not all fricative types were well represented in the spontaneous speech samples that the participants chose to produce. As a result, the importance of the spectral properties of fricatives may have been undervalued here.
The high levels of explained variance in both models do, however, indicate that the models can be viewed as acceptable representations of what primarily affects the perception of femininity and masculinity in voices, even though the importance of the spectral distribution of fricatives may have to be revisited in later research with more control over the spoken content and an ensured sufficient presence of fricatives. Finally, it is important to note that what was developed here was an optimal acoustic model of the predictors of the level of femininity and masculinity in the voice, with competing or duplicating information substituting for the most informative predictor. Clear examples of this are the "proportion of utterances with a falling final tonal movement" and the "proportion of utterances with a rising final tonal movement." The proportion of utterances with a falling final tonal movement depends on the proportions of rising and flat tonal movements, since the three proportions sum to 1. Therefore, the falling and rising proportions are highly interchangeable once the small perceptual contribution of the "proportion of utterances with a flat final tonal movement" has been established. There were further high levels of correlation between a small proportion of measures. Measures that are highly linked in these ways may therefore be interchanged if required by the context in which the model is to be applied.
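The interchangeability argument can be made concrete: because the three proportions sum to 1, a linear model using the falling proportion and one using the rising proportion (each alongside the flat proportion) span the same predictor space and fit identically. A toy check with simulated proportions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated per-speaker proportions of rising / flat / falling final contours
rising = rng.uniform(0.0, 0.6, 200)
flat = rng.uniform(0.0, 0.2, 200)
falling = 1.0 - rising - flat  # the three proportions sum to 1

# Toy perceptual score driven by the falling proportion
y = 5.0 * falling + rng.normal(scale=0.1, size=200)

def r2(design, y):
    """R^2 of an ordinary least-squares fit of y on the design matrix."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()

ones = np.ones_like(y)
a = r2(np.column_stack([ones, falling, flat]), y)  # model with falling
b = r2(np.column_stack([ones, rising, flat]), y)   # falling swapped for rising
print(round(abs(a - b), 10))  # identical fits: the difference is zero
```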

In this study, listeners were asked to rate either the level of femininity or the level of masculinity of a voice in independent, individually randomized procedures. Despite the design affording independent judgments of the femininity and masculinity levels in a voice, the perceptual assessments showed strong interdependence and an inverse relationship, and a simple linear model of the masculinity score as a function of the femininity score showed 86.3% explained variance across all perceptual ratings. The ratings did not, however, show a strongly dichotomous pattern overall. The data show, instead, that listeners can place voices on continuous femininity and masculinity scales with reasonable reliability and use the entire scale if provided with the appropriate voice samples.

The results presented here provide the best available acoustic model of the auditory-perceptual cues to perceived femininity and masculinity in voices and suggest that this information can be used in gender-affirming voice training to guide which acoustic properties to select as the basis for training evaluation, based on a client-selected target voice. What is not offered, however, is guidance on how to achieve a specific property of a voice through voice and speech modification within the boundaries set by the physical properties of the speech apparatus that a voice client is born with. Continued work on the evaluation of goal achievement in gender-affirming voice training would greatly benefit from, and provide additional external validation of, the results and analyses presented here. Further, it should be kept in mind that, while acoustic data are important to researchers and clinicians, patient-reported outcomes are, as has recently been highlighted,12,17 the most important when evaluating gender-affirming voice training. Merritt et al.11 have further suggested that gender prototypicality may be a dimension to consider in relation to perceptual data such as ours. We did not ask the listeners how prototypical they perceived the voices to be, collected no data (such as reaction times) of sufficient reliability to serve as a proxy for this determination, and therefore cannot elucidate this aspect of femininity and masculinity perception. We further note that our survey of acoustic properties guiding femininity and masculinity perception includes many properties not investigated by Merritt et al.11 It is important to recognize that acoustic parameters may never fully explain listener ratings related to gender because of its complex nature; speakers' associations with different genders are also socially constructed and therefore subject to change.41

While the goal of this study was to establish which acoustic features of the speech signal contribute most to our perception of femininity and masculinity, for use in guiding gender-affirming voice training, we suggest that the results can also be used in other domains. Speech and voice synthesis is a continually evolving field, but when a particular level of femininity or masculinity is desired, past and recent efforts focus predominantly on adjusting the fundamental frequency of the voice source,13,15,42–45 the speaking amplitude,42 or the simulated vocal tract length13,15,45,46 to achieve the desired result. How we perceive a computer-generated voice affects how it may interact with us,46–49 and we propose, in line with previous work,50 that incorporating more of the acoustic markers of femininity and masculinity in the voice into speech synthesis systems can lead to more acceptable, desirable, or trustworthy communication partners. Using a broader spectrum of acoustic properties may alleviate the need to exaggerate the salience of a few acoustic markers, such as an exaggerated fo or formant frequency spacing to clearly signal femininity. More fine-tuned synthesis may further reduce the risk of inadvertently encoding unbalanced gender stereotypes in human–computer interactions.51,52 Methods for incorporating the acoustic properties identified here into speech synthesis are, however, beyond the scope of the present work and should be the focus of further research.

While the acoustic analysis presented here was based on the collected literature on the perception of femininity and masculinity internationally, we acknowledge that some aspects of the acoustic models may be language-specific or only generalizable to some grouping of languages close to spoken Swedish. More research is needed to elucidate this issue.

This study identified 223 acoustic properties that together described 89% and 87% of the variance in perceived levels of femininity and masculinity in voices, respectively. While the mean fundamental frequency of oscillation (fo) was the most prominent cue to the perceived level of femininity and masculinity in voices, with the top six properties accounting for 87% of the total explained variance, this study confirmed that listeners' perceptions are described more accurately when complementary information from other acoustic cues is added to the model. The developed models are proposed to afford communication and evaluation of gender-affirming voice training goals and to improve voice synthesis efforts.

See the supplementary material A for a description of parameters used for each speaker when extracting acoustic properties using Praat, supplementary material B for variable importance of the acoustic properties identified as important predictors of perceived level of femininity and masculinity, and supplementary material C for the level of pairwise (Pearson) correlations between acoustic predictors.

The authors received funding from internal grants given by the Department of Clinical Sciences, Umeå University, and partial funding for a Ph.D. education (to J.H.) from the Umeå Center for Gender Studies for this study. The authors thank the participants for donating their time for the listening experiment and speech-language pathology students and research assistants Emelie Granlund, Amanda Nyberg, Elenor Anderson Stål, Julia Häger, Felicia Gregory, Cecilia Mellberg, and Anna-Karin Sparre for their assistance in making speech recordings and performing the manual markup. We thank Svante Granqvist for suggesting a notation for harmonics near a formant of interest. The technical support of the Visible Speech platform, developed as part of the Swedish national research infrastructure Språkbanken and Swe-Clarin, funded jointly by the Swedish Research Council (2018-2024, Contract 2017-00626) and the 10 participating partner institutions, is gratefully acknowledged.

The authors declare no conflict of interest.

This study was approved by the National Ethical Review Authority (Case No. 2019-05374). Informed consent was obtained from all subjects involved in the study before being audio recorded.

Under national law, speech recordings are considered personally identifiable information and cannot be distributed to other research groups or the public. Derived signals are made available upon request.

1. E. Coleman, A. E. Radix, W. P. Bouman, G. R. Brown, A. L. C. de Vries, M. B. Deutsch, R. Ettner, L. Fraser, M. Goodman, J. Green, A. B. Hancock, T. W. Johnson, D. H. Karasic, G. A. Knudson, S. F. Leibowitz, H. F. L. Meyer-Bahlburg, S. J. Monstrey, J. Motmans, L. Nahata, T. O. Nieder, S. L. Reisner, C. Richards, L. S. Schechter, V. Tangpricha, A. C. Tishelman, M. A. A. Van Trotsenburg, S. Winter, K. Ducheny, N. J. Adams, T. M. Adrián, L. R. Allen, D. Azul, H. Bagga, K. Başar, D. S. Bathory, J. J. Belinky, D. R. Berg, J. U. Berli, R. O. Bluebond-Langner, M.-B. Bouman, M. L. Bowers, P. J. Brassard, J. Byrne, L. Capitán, C. J. Cargill, J. M. Carswell, S. C. Chang, G. Chelvakumar, T. Corneil, K. B. Dalke, G. De Cuypere, E. de Vries, M. Den Heijer, A. H. Devor, C. Dhejne, A. D'Marco, E. K. Edmiston, L. Edwards-Leeper, R. Ehrbar, D. Ehrensaft, J. Eisfeld, E. Elaut, L. Erickson-Schroth, J. L. Feldman, A. D. Fisher, M. M. Garcia, L. Gijs, S. E. Green, B. P. Hall, T. L. D. Hardy, M. S. Irwig, L. A. Jacobs, A. C. Janssen, K. Johnson, D. T. Klink, B. P. C. Kreukels, L. E. Kuper, E. J. Kvach, M. A. Malouf, R. Massey, T. Mazur, C. McLachlan, S. D. Morrison, S. W. Mosser, P. M. Neira, U. Nygren, J. M. Oates, J. Obedin-Maliver, G. Pagkalos, J. Patton, N. Phanuphak, K. Rachlin, T. Reed, G. N. Rider, J. Ristori, S. Robbins-Cherry, S. A. Roberts, K. A. Rodriguez-Wallberg, S. M. Rosenthal, K. Sabir, J. D. Safer, A. I. Scheim, L. J. Seal, T. J. Sehoole, K. Spencer, C. St. Amand, T. D. Steensma, J. F. Strang, G. B. Taylor, K. Tilleman, G. G. T'Sjoen, L. N. Vala, N. M. Van Mello, J. F. Veale, J. A. Vencill, B. Vincent, L. M. Wesp, M. A. West, and J. Arcelus, "Standards of care for the health of transgender and gender diverse people, version 8," Int. J. Transgender Health 23, S1–S259 (2022).
2. C. Leyns, T. Papeleu, P. Tomassen, G. T'Sjoen, and E. D'Haeseleer, "Effects of speech therapy for transgender women: A systematic review," Int. J. Transgender Health 22, 360–380 (2021).
3. J. Oates and G. Dacakis, "Transgender voice and communication: Research evidence underpinning voice intervention for male-to-female transsexual women," SIG 3 Perspect. Voice Voice Dis. 25, 48–58 (2015).
4. J. Holmberg, I. Linnander, M. Södersten, and F. Karlsson, "Exploring motives and perceived barriers for voice modification: The views of transgender and gender diverse voice clients," J. Speech Hear. Res. 66(7), 2246–2259 (2023).
5. L. Zimman, "Voices in transition: Testosterone, transmasculinity, and the gendered voice among female-to-male transgender people," Ph.D. dissertation (University of Colorado, Boulder, CO, 2012).
6. V. Papp, The Female-to-Male Transsexual Voice: Physiology vs. Performance in Production (Rice University, Houston, TX, 2011).
7. Y. Leung, J. Oates, and S. P. Chan, "Voice, articulation, and prosody contribute to listener perceptions of speaker gender: A systematic review and meta-analysis," J. Speech Hear. Res. 61, 266–297 (2018).
8. I. R. Titze, R. J. Baken, K. W. Bozeman, S. Granqvist, N. Henrich, C. T. Herbst, D. M. Howard, E. J. Hunter, D. Kaelin, R. D. Kent, J. Kreiman, M. Kob, A. Löfqvist, S. McCoy, D. G. Miller, H. Noé, R. C. Scherer, J. R. Smith, B. H. Story, J. G. Švec, S. Ternström, and J. Wolfe, "Toward a consensus on symbolic notation of harmonics, resonances, and formants in vocalization," J. Acoust. Soc. Am. 137, 3005–3007 (2015).
9. N. Houle, M. P. Lerario, and S. V. Levi, "Spectral analysis of strident fricatives in cisgender and transfeminine speakers," J. Acoust. Soc. Am. 154, 3089–3100 (2023).
10. Y. Leung, J. Oates, S.-P. Chan, and V. Papp, "Associations between speaking fundamental frequency, vowel formant frequencies, and listener perceptions of speaker gender and vocal femininity–masculinity," J. Speech Hear. Res. 64, 2600–2622 (2021).
11. B. Merritt, T. Bent, R. Kilgore, and C. Eads, "Auditory free classification of gender diverse speakers," J. Acoust. Soc. Am. 155, 1422–1436 (2024).
12. M. Södersten, J. Oates, A. Sand, S. Granqvist, S. Quinn, G. Dacakis, and U. Nygren, "Gender-affirming voice training for trans women: Acoustic outcomes and their associations with listener perceptions related to gender," J. Voice (published online 2024).
13. M. Hope and J. Lilley, "Gender expansive listeners utilize a non-binary, multidimensional conception of gender to inform voice gender perception," Brain Lang. 224, 105049 (2022).
14. V. I. Wolfe, D. L. Ratusnik, F. H. Smith, and G. Northrop, "Intonation and fundamental frequency in male-to-female transsexuals," J. Speech Hear. Disord. 55, 43–50 (1990).
15. B. Merritt and T. Bent, "Revisiting the acoustics of speaker gender perception: A gender expansive perspective," J. Acoust. Soc. Am. 151, 484–499 (2022).
16. D. H. da C. Martinho and A. C. Constantini, "Auditory-perceptual assessment and acoustic analysis of gender expression in the voice," J. Voice (published online 2024).
17. J. Oates, M. Södersten, S. Quinn, U. Nygren, G. Dacakis, V. Kelly, G. Smith, and A. Sand, "Gender-affirming voice training for trans women: Effectiveness of training on patient-reported outcomes and listener perceptions of voice," J. Speech Hear. Res. 66(11), 4206–4235 (2023).
18. R. R. Patel, S. N. Awan, J. Barkmeier-Kraemer, M. Courey, D. Deliyski, T. Eadie, D. Paul, J. G. Švec, and R. Hillman, "Recommended protocols for instrumental assessment of voice: American Speech-Language-Hearing Association expert panel to develop a protocol for instrumental assessment of vocal function," Am. J. Speech Lang. Pathol. 27, 887–905 (2018).
19. J. G. Švec and S. Granqvist, "Tutorial and guidelines on measurement of sound pressure level in voice and speech," J. Speech Hear. Res. 61(3), 441–461 (2018).
20. J. G. Švec and S. Granqvist, "Guidelines for selecting microphones for human voice production research," Am. J. Speech Lang. Pathol. 19, 356–368 (2010).
21. S. Granqvist, "Sopran [computer program]," http://www.tolvan.com.
22. S. Schelinski and K. von Kriegstein, "The relation between vocal pitch and vocal emotion recognition abilities in people with autism spectrum disorder and typical development," J. Autism Dev. Disord. 49, 68–82 (2019).
23. J. Peirce, J. R. Gray, S. Simpson, M. MacAskill, R. Höchenberger, H. Sogo, E. Kastman, and J. K. Lindeløv, "PsychoPy2: Experiments in behavior made easy," Behav. Res. Methods 51, 195–203 (2019).
24. D. Bridges, A. Pitiot, M. R. MacAskill, and J. W. Peirce, "The timing mega-study: Comparing a range of experiment generators, both lab-based and online," PeerJ 8, e9414 (2020).
25. A. Anwyl-Irvine, E. S. Dalmaijer, N. Hodges, and J. K. Evershed, "Realistic precision and accuracy of online experiment platforms, web browsers, and devices," Behav. Res. 53, 1407–1425 (2021).
26. K. O. McGraw and S. P. Wong, "Forming inferences about some intraclass correlation coefficients," Psychol. Methods 1, 30–46 (1996).
27. J. Kirby, "Praatsauce: Praat-based tools for spectral analysis" (2018), https://github.com/kirbyj/praatsauce.
28. P. Boersma and D. Weenink, "Praat: Doing phonetics by computer [computer program]," http://www.praat.org/.
29. R. Winkelmann, J. Harrington, and K. Jänsch, "EMU-SDMS: Advanced speech database management and analysis in R," Comput. Speech Lang. 45, 392–410 (2017).
30. M. Iseli, Y.-L. Shue, and A. Alwan, "Age, sex, and vowel dependencies of acoustic measures related to the voice source," J. Acoust. Soc. Am. 121, 2283–2295 (2007).
31. H. Traunmüller, "Analytical expressions for the tonotopic sensory scale," J. Acoust. Soc. Am. 88, 97–100 (1990).
32. H. K. Schutte and W. Seidner, "Recommendation by the Union of European Phoniatricians (UEP): Standardizing voice area measurement/phonetography," Folia Phoniatr. Logop. 35, 286–288 (1983).
33. A. Hancock, L. Colton, and F. Douglas, "Intonation and gender perception: Applications for transgender speakers," J. Voice 28, 203–209 (2014).
34. M. P. Gelfer and K. J. Schofield, "Comparison of acoustic and perceptual measures of voice in male-to-female transsexuals perceived as female versus those perceived as male," J. Voice 14, 22–33 (2000).
35. C. Leyns, T. Papeleu, T. Feryn, S. D. Baer, K. Bettens, P. Corthals, and E. D'Haeseleer, "Age and gender differences in Belgian Dutch intonation," J. Voice 38, 801.e1–801.e26 (2022).
36. W. Klein, R. Plomp, and L. C. W. Pols, "Vowel spectra, vowel spaces, and vowel identification," J. Acoust. Soc. Am. 48, 999–1009 (1970).
37. J. W. Hawks and J. D. Miller, "A formant bandwidth estimation procedure for vowel synthesis," J. Acoust. Soc. Am. 97, 1343–1344 (1995).
38. R. Fraile and J. I. Godino-Llorente, "Cepstral peak prominence: A comprehensive analysis," Biomed. Signal Process. Control 14, 42–54 (2014).
39. G. de Krom, "A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals," J. Speech Hear. Res. 36, 254–266 (1993).
40. A. Zien, N. Krämer, S. Sonnenburg, and G. Rätsch, The Feature Importance Ranking Measure (Springer, Berlin, 2009), pp. 694–709.
41. D. Azul, "Transmasculine people's vocal situations: A critical review of gender-related discourses and empirical data," Int. J. Lang. Commun. Disord. 50, 31–47 (2015).
42. I. Karlsson, "Female voices in speech synthesis," J. Phon. 19, 111–120 (1991).
43. A. Powers, A. D. I. Kramer, S. Lim, J. Kuo, S.-L. Lee, and S. Kiesler, "Eliciting information from people with a gendered humanoid robot," in ROMAN 2005, IEEE International Workshop on Robot and Human Interactive Communication (2005).
44. M. Tremaine, E. J. Lee, C. Nass, and S. Brave, "Can computer-generated speech have gender?," in CHI '00 Extended Abstracts on Human Factors in Computing Systems (2000).
45. X. Chen, R. Wang, A. Khalilian-Gourtani, L. Yu, P. Dugan, D. Friedman, W. Doyle, O. Devinsky, Y. Wang, and A. Flinker, "A neural speech decoding framework leveraging deep learning and speech synthesis," Nat. Mach. Intell. 6, 467–480 (2024).
46. F. Efthymiou, C. Hildebrand, E. de Bellis, and W. H. Hampton, "The power of AI-generated voices: How digital vocal tract length shapes product congruency and ad performance," J. Interact. Mark. 59, 117–134 (2024).
47. P. Mirenda, D. Eicher, and D. R. Beukelman, "Synthetic and natural speech preferences of male and female listeners in four age groups," J. Speech Hear. Res. 32, 175–183 (1989).
48. J. W. Mullennix, S. E. Stern, S. J. Wilson, and C. Dyson, "Social perception of male and female computer synthesized speech," Comput. Hum. Behav. 19, 407–424 (2003).
49. D. R. Feinberg, L. M. DeBruine, B. C. Jones, and D. I. Perrett, "The role of femininity and averageness of voice pitch in aesthetic judgments of women's voices," Perception 37, 615–623 (2008).
50. I. Karlsson, "Modelling voice variations in female speech synthesis," Speech Commun. 11, 491–495 (1992).
51. C. Nass, Y. Moon, and N. Green, "Are machines gender neutral? Gender-stereotypic responses to computers with voices," J. Appl. Soc. Psychol. 27, 864–876 (1997).
52. E. Lee, "Gender stereotyping of computers: Resource depletion or reduced attention?," J. Commun. 58, 301–320 (2008).

Supplementary Material