There has been consistent interest among speech signal processing researchers in the accurate estimation of the fundamental frequency ($F0$) of speech signals. This study examines ten $F0$ estimation algorithms (some well-established and some proposed more recently) to determine which of these algorithms is, on average, better able to estimate $F0$ in the sustained vowel /a/. Moreover, a robust method for adaptively weighting the estimates of individual $F0$ estimation algorithms based on quality and performance measures is proposed, using an adaptive Kalman filter (KF) framework. The accuracy of the algorithms is validated using (a) a database of 117 synthetic realistic phonations obtained using a sophisticated physiological model of speech production and (b) a database of 65 recordings of human phonations where the glottal cycles are calculated from electroglottograph signals. On average, the sawtooth waveform inspired pitch estimator and the nearly defect-free algorithms provided the best individual $F0$ estimates, and the proposed KF approach resulted in a ∼16% improvement in accuracy over the best single $F0$ estimation algorithm. These findings may be useful in speech signal processing applications where sustained vowels are used to assess vocal quality, when very accurate $F0$ estimation is required.

## I. INTRODUCTION

The estimation of the fundamental frequency ($F0$) is a critical problem in the acoustic characterization of speech signals.^{1} For example, it arises in speech coding in communications, automatic speaker recognition, the analysis of speech perception, and the assessment of speech disorders.^{2} Typically, $F0$ is evaluated over short-term time intervals, and the time course of the $F0$ values over the entire speech signal is known as the $F0$ time series (or $F0$ *contour*). The existing difficulties in accurate $F0$ estimation are well reported in the speech signal processing literature, with an excellent summary found in a study authored by Talkin.^{3} According to Talkin, these difficulties include:^{3} (1) $F0$ is time-varying, and may change between vocal cycles, (2) sub-harmonics (components of a waveform whose frequency is an integer fraction of the $F0$) appear frequently, (3) $F0$ may vary widely over successive vocal cycles, although large $F0$ variations are often assumed to be artifacts of the estimation algorithm because such abrupt changes seem fairly rare, (4) vocal tract resonances affect the vocal folds (that is, there is feedback from the vocal tract to the vocal folds^{2}), resulting in harmonics which are multiples of the actual $F0$, (5) it is difficult to estimate $F0$ at voice onset and offset (due to transient effects), (6) there is considerable inter-observer variability in the actual values of $F0$, and (7) periodic background noise might be challenging to differentiate from breathy voiced speech (their spectra may be similar). Additional problems include differentiating between voiced and unvoiced segments of speech, and specific cases which are very hard to deal with (e.g., where the signal is of extremely short duration).^{3}

A task related to $F0$ estimation is the determination of *pitch*, the psycho-acoustic equivalent of $F0$. We emphasize that the focus of this study is $F0$ estimation. Researchers often use the terms *pitch detection algorithm* (PDA) and $F0$ *estimation algorithm* interchangeably; strictly speaking, PDA is a misnomer, because pitch is inherently a continuous phenomenon and estimating the fundamental frequency of a signal is therefore an estimation problem rather than a detection problem. For this reason we will only use the expression “$F0$ estimation algorithm” to refer to the algorithms described in this study.

The assessment of vocal performance is typically achieved using either sustained vowel phonations or running speech.^{2} Clinical practice has shown that the use of sustained vowels, which avoids articulatory and other confounding factors in running speech, is very practical and sufficient for the assessment of general vocal performance; we refer to Titze and references therein for more details.^{2} In voice quality assessment sessions, subjects are often requested to produce the open back unrounded vowel /a/ at a comfortable pitch for as long and as steadily as possible.^{2} This vowel provides an open vocal tract configuration where the mouth is maximally open compared to other vowels, which minimizes the air pulse reflected back to the vocal folds (therefore, there is low acoustic interaction between the vocal folds and the vocal tract).^{2} Using the sustained vowel /a/ instead of running speech alleviates some of the difficulties highlighted previously, by (a) avoiding the need to characterize frames (segments of the original speech signal, usually pre-specified with a duration of a few milliseconds) as voiced or unvoiced, (b) reducing the range of possible $F0$ values, and (c) minimizing the possible masking effects formants may have on $F0$ during running speech (for example, when the formants of a word complicate the identification of $F0$ because they may match its multiples, a problem often referred to as pitch halving or pitch doubling).^{2}

Roark^{4} highlighted the existence of more than 70 algorithms to estimate $F0$, which reflects both the importance and the difficulty of the problem. Roark emphasized that there is no simple definition of $F0$ unless it simply refers to the period, and demonstrated that small disturbances in the parameters of typical $F0$ estimation algorithms may lead to divergent results. Overall, as Talkin suggests,^{3} it is probably impossible to find a universally optimal $F0$ estimation algorithm for all applications. Some $F0$ estimation algorithms may be better suited to particular applications, depending on the type of speech signals (e.g., conversational signals or singing); computational constraints may also need to be taken into account (for example, in speech coding applications).

Research comparing the accuracy of different $F0$ estimation algorithms is not new in the speech literature.^{5–8} However, most of these comparative studies focused on healthy, or mildly dysphonic voices. For example, Titze and Liang^{6} studied three $F0$ estimation algorithms when perturbations in $F0$ were lower than 5%. Parsa and Jamieson^{5} were the first to investigate the performance of various $F0$ estimation algorithms in the presence of vocal disorders, a topic which has received comparatively little attention because the potentially fraught task of accurately determining $F0$ is exacerbated in vocal disorders.^{2} Parsa and Jamieson^{5} ran a series of experiments to investigate the accuracy of $F0$ estimation algorithms in determining the $F0$ of the sustained vowel /a/. They produced synthetic signals using a stylized model which attempted to simulate the main characteristics of the vocal production mechanism generating the sustained vowel /a/. This simple model does not closely represent physiologically plausible characteristics of voice pathologies, as it is based on linear filtering of a series of impulses with added noise and perturbations. Furthermore, many more sophisticated $F0$ estimation algorithms have been proposed since the publication of Parsa and Jamieson's study in 1999.^{5} More recently (2007), Jang *et al.*^{7} compared seven $F0$ estimation algorithms in pathological voices using the sustained vowel /a/, where the ground truth $F0$ time series was obtained manually. However, the $F0$ estimation algorithms investigated by Jang *et al.*^{7} do not reflect contemporary advances (the two most recent $F0$ estimation algorithms in that study were proposed in 1993 and 2002).

Some studies have evaluated the performance of software tools in accurately estimating the ground truth *jitter* ($F0$ perturbations), which can be considered a proxy for the estimation of $F0$, see, for example, Manfredi *et al.*^{8} One problem with this approach is that jitter lacks an unequivocal mathematical definition;^{2} another is that the time windows (reference time instances) used by each algorithm to obtain the $F0$ time series may differ, which complicates the interpretation of the results. Moreover, as Parsa and Jamieson^{5} correctly argued, it is possible to have the same jitter values for different $F0$ time series: in other words, there is no unique mapping from jitter to $F0$ time series. See also the extended criticism by Ferrer *et al.*^{9} Manfredi *et al.*^{8} synthesized ten sustained vowel /a/ phonations with a physiologically plausible model and compared four $F0$ estimation algorithms in their ability to detect jitter. Although this methodology can provide a general impression of the accuracy of $F0$ estimation, we agree with Parsa and Jamieson^{5} and Ferrer *et al.*^{9} that assessing jitter does not directly quantify the accuracy of $F0$ estimation, and should be avoided when comparing the performance of the $F0$ estimation algorithms. Moreover, compared to the study of Manfredi *et al.*^{8} we examine a considerably larger database of speech signals, and a more comprehensive set of $F0$ estimation algorithms.
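Parsa and Jamieson's point that jitter does not uniquely determine the $F0$ time series is easy to verify numerically. Below is a minimal sketch using one common definition of local jitter; the function name and example period sequences are our own illustrations, not taken from the studies cited above:

```python
import numpy as np

def local_jitter(periods):
    """Local jitter: mean absolute difference between consecutive
    glottal periods, normalized by the mean period. This is only one
    of several jitter definitions used in the literature."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

# Two different period sequences (in ms) with the same mean period and
# the same mean absolute cycle-to-cycle difference: identical jitter,
# different F0 contours.
a = [10.0, 10.3, 10.0, 10.3]
b = [10.45, 10.15, 9.85, 10.15]
```

Both sequences yield a jitter of roughly 3% even though the underlying contours differ, which illustrates why jitter agreement is a weak proxy for $F0$ accuracy.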

The motivation for this study comes from our research on objective quantification of voice disorders using speech signal processing algorithms (*dysphonia measures*) to process sustained vowel /a/ phonations.^{10–13} Since disordered voices may be highly aperiodic or even stochastic,^{2} the task of $F0$ estimation algorithms is further complicated because some algorithms rely heavily on periodicity assumptions and their performance is known to degrade in the presence of noise.^{5} The dysphonia measures we typically investigate include $F0$ perturbation (jitter variants),^{10} and some dysphonia measures which explicitly require $F0$ estimates as an input;^{2} we refer to Tsanas *et al.*^{10} and references therein for algorithmic details. Thus, it can be inferred that those dysphonia measures which rely on $F0$ estimates would benefit from accurate $F0$ data.^{10} Moreover, researchers have attributed, at least partly, the success of some dysphonia measures to the fact that they quantify properties of the signal without requiring prior computation of $F0$ estimates.^{10,11,14} We clarify that although our main research interests are in pathological voice assessment, the aim of the present study is more general: obtaining accurate $F0$ estimates can be beneficial in many diverse applications which rely on speech signal processing.^{1,2} Therefore, the $F0$ estimates computed here are not intended to be used to compute any dysphonia measures.

Newly proposed $F0$ estimation algorithms have been validated in the following scenarios: (a) $F0$ values have been provided by expert speech scientists following visual inspection of the glottal cycles from plots of the signal, (b) using electroglottography (EGG) (a device placed externally to the larynx records EGG, and the glottal cycles are detected from the EGG signal), and (c) using synthetic signals where the ground truth $F0$ values are known in advance. All these validation approaches have been used to assess the performance of $F0$ estimation algorithms, but each approach has its limitations. First, speech experts observing a plot of a signal often do not agree on the exact length of each vocal period,^{3} and hence it is not clear how to define the ground truth unambiguously. Similarly, EGGs may provide faulty estimates of $F0$ (particularly for pathological voices) which are often corrected manually, casting doubt on the validity of this approach.^{15,16} Therefore, we argue that the third approach, using synthetic signals where the ground truth is known in advance, may be the most appropriate method for establishing the accuracy of $F0$ estimation algorithms, if signals that closely resemble actual speech signals can be generated. The ability to accurately replicate disordered voice signals is related to the nature of the model used to synthesize the signals, and its capacity to mimic the origin and effects of different voice disorders. On the other hand, it could be argued that the physiological speech production model might not be able to adequately express some diverse characteristics which appear in actual speech signals, or that the error caused by the speech production model may be more severe than the errors in the EGG method. 
This is because, in general, physiological models attempt to develop a mathematical framework to replicate the observed data, and therefore are inherently limited both by the finiteness and measurement error of the collected data (due to sources of physiological and environmental variability that affect data recorded in real-world experiments), and also the mathematical assumptions used in the model. Hence, in practice it may be useful to also investigate a database with actual signals where simultaneous EGG recordings are available.

In this study, we use both realistic synthetic signals where the ground truth $F0$ is exactly known, and also a database with actual speech signals where the ground truth $F0$ is derived from simultaneous EGG measurements. The physiological model of speech production generated realistic sustained vowel /a/ signals where the $F0$ values were determined from the glottal closure instants, i.e., vocal fold collision instants. If there is any type of voicing, the minimum glottal area signal (even under incomplete closure) captures all relevant physical interactions (tissue dynamics, airflow, and acoustics), and determines the periodicity of the speech signal.^{17} This is a more stable and reliable approach than using just the glottal airflow or the radiated acoustic pressure at the lips, because in those cases many additional components can impede the $F0$ estimation process (e.g., added harmonic components due to acoustic coupling, noise, and other acoustic sources). Specifically, we used a numerical lumped-mass model which was described in detail by Zañartu.^{18} The model was capable of mimicking various normal, hyper-functional (inappropriate patterns of vocal behavior that are likely to result in organic voice disorders) and pathological voices, where the exact system fluctuations were known.

The aim of this study is twofold: (a) to explore the accuracy of ten established $F0$ estimation algorithms (most of which were proposed relatively recently) in estimating $F0$ in both healthy and disordered voices and (b) to investigate the potential of *combining* the outputs of the $F0$ estimation algorithms, aiming to exploit the best qualities of each and thus improve the $F0$ estimates. With the exception of a simple combination of three $F0$ estimation algorithms,^{10} we are not aware of any systematic investigation into combining the outputs of $F0$ estimation algorithms in the speech literature. The combination of the $F0$ estimation algorithms can take place in a *supervised learning* setting and is known as *ensemble learning* in the statistical machine learning literature. Alternatively, the combination of information from various *sources* (here the $F0$ estimation algorithms) in an *unsupervised learning* setting is known as *information fusion* (or *data fusion*). Ensemble learning and information fusion are particularly successful in contexts where different methods capture different characteristics of the data, and have shown great promise in diverse applications.^{19,20} In this study, we extend a recently proposed information fusion framework, which relies on the adaptive Kalman filter (KF) and algorithmic robustness metrics, to weight the $F0$ estimates from each of the ten $F0$ estimation algorithms. We demonstrate that the adaptive KF fusion framework for estimating $F0$ outperforms, on average, the single best $F0$ estimation algorithm. Furthermore, we demonstrate that the KF fusion approach provides robust and accurate estimates for both noisy and low sampling frequency speech signals (conditions which cause considerable performance degradation in accurate $F0$ estimation for most $F0$ estimation algorithms).

The paper is organized as follows. In Sec. II we describe the data used in this study, including a brief description of the physiological model which was used to generate the simulated phonations. In Sec. III we review the $F0$ estimation algorithms used in this study, and describe in detail the information fusion scheme that combines the individual algorithms. Section IV compares the performance of the $F0$ estimation algorithms (both individually and their combinations). Finally, Sec. V summarizes the main findings, outlines the limitations of the current approach, and suggests potential areas of interest for future research.

## II. DATA

### A. Synthetic data: Model used to generate sustained vowel /a/ signals and computation of ground truth $F0$ time series

The physiological model used to generate the sustained vowel /a/ signals was described in detail by Zañartu;^{18} here we summarize its mechanisms. This physiological model is an extended version of the original body-cover model of the vocal folds by Story and Titze,^{21} and allows a realistic generation of normal and pathological voices. Asymmetric vibration of the vocal folds was controlled by a single factor proposed by Steinecke and Herzel^{22} for modeling superior nerve paralysis. The material properties of the vocal folds and their dependence on muscle activation followed Titze and Story,^{23} with an extension to include neural fluctuations that affect muscle activity. These fluctuations were modeled as a zero-mean, Gaussian white noise signal processed by a low-pass, finite impulse response filter. The flow model incorporated the effects of asymmetric pressure loading.^{24} The airflow solver allowed for interactions with the surrounding sound pressures at the glottis and the inclusion of incomplete glottal closure from a posterior glottal opening.
This model of incomplete glottal closure enhanced the ability to represent voiced speech, since incomplete closure is commonly observed in both normal and pathological voices.^{25} The effects of organic pathologies (e.g., polyps and nodules) were modeled as described by Kuo,^{26} including an additional component to reduce the vocal fold contact.^{24} Sound propagation was simulated using waveguide models of the supraglottal and subglottal tracts, with waveguide geometries determined from previous studies.^{27} In addition, the wave reflection model included the mouth radiation impedance and different loss factors for the subglottal and supraglottal tracts, which allowed for nonlinear interactions between the vocal folds and the vocal tract, and also affected the vocal fold dynamics.^{28} A time step corresponding to a sampling frequency of 44.1 kHz was used in a fourth-order Runge–Kutta ordinary differential equation solver.

Each simulation produced 1 s of voiced speech uttering a sustained vowel /a/, where the initial transients (typically about four periods) do not provide reliable information regarding the oscillating pattern of the vocal folds (until the model reaches a stable state depending on the initial conditions). To ensure that the ground truth is reliable, the initial 50 ms of each signal were discarded from further analysis. In total, 125 sustained vowel /a/ signals were generated. Cases which resulted in unnatural-sounding voices (following aural inspection by A.T.) were removed before any analysis. Thus, we processed 117 signals which were used to evaluate the performance of the $F0$ estimation algorithms. The period of each cycle was computed from the instant the vocal folds began to separate, that is, immediately after vocal fold collision (when collision occurred) or immediately after the glottal area reached its minimum (when no vocal fold collision took place). The distributions of the ground truth $F0$ values for all signals are summarized in Fig. 1, which presents the median and the interquartile range values for each speech signal. We remark that the speech signals were generated over a relatively wide range of possible $F0$ values, with variable $F0$ fluctuations (jitter). Care was taken to generate signals using a large range of average $F0$ for each phonation (60–220 Hz), including 20 signals with low $F0$ (<100 Hz), because recent research suggests such phonations are notoriously difficult for most of the commonly used $F0$ estimation algorithms.^{29}
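As an illustration of how a ground truth contour can be derived from cycle instants, the sketch below converts cycle onset times into an instantaneous $F0$ series, discarding the initial 50 ms transient as described above. The function name and defaults are our own, not code from the model:

```python
import numpy as np

def f0_from_cycle_onsets(onsets, discard_before=0.050):
    """Convert glottal cycle onset times (in seconds) into an
    instantaneous F0 series (Hz). Cycles starting before
    `discard_before` fall within the initial transient and are dropped."""
    onsets = np.asarray(onsets, dtype=float)
    onsets = onsets[onsets >= discard_before]
    return 1.0 / np.diff(onsets)   # one F0 value per glottal cycle

# perfectly regular 10 ms cycles give a constant 100 Hz series
f0 = f0_from_cycle_onsets(np.arange(0.0, 1.0, 0.010))
```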

The synthetic speech signals are available on request by contacting the first author.

### B. Database with actual speech signals and computation of $F0$ based on EGG

We used a database consisting of 65 sustained vowel /a/ phonations from 14 subjects diagnosed with Parkinson's disease (PD). They all had typical PD voice and speech characteristics as determined by an experienced speech-language pathologist, i.e., reduced loudness, monotone, breathy, hoarse voice, or imprecise articulation. The subjects' enrolment in this study and all recruiting materials were approved by an independent institutional review board. The 14 PD subjects (8 males, 6 females) had an age range of 51 to 69 years (mean ± standard deviation: 61.9 ± 6.5 years). They were instructed to produce phonations in three tasks regarding pitch: comfortable pitch, high pitch, and low pitch, subjectively determined by each subject. The sustained vowel phonations were recorded using a head-mounted microphone (DACOMEX-059210, which is omnidirectional and has a flat frequency response over a bandwidth of 20 Hz to 20 kHz) in a double-walled, sound-attenuated room. The voice signals were amplified using the M-Audio Mobile Pre model and sampled at 44.1 kHz with 16 bits of resolution (using the Tascam US-122mkII A/D converter). The data were recorded using a Sony Vaio computer which had an Intel display audio and Conexant 20672 SmartAudio HD device (high frequency cut-off 20 kHz). Simultaneously with the recording of the sustained vowels, EGGs were recorded using the VoceVista model. The glottal cycles were automatically determined from the EGGs using the sigma algorithm,^{30} which almost always correctly identifies the true vocal cycles. Visual inspection of the signals and their associated EGGs verified that the sigma algorithm was indeed very accurate at determining the vocal cycles.

## III. METHODS

This section comprises (a) a review of ten widely used $F0$ estimation algorithms which were tested in this study, (b) a description of a novel combination scheme using the outputs of multiple $F0$ estimation algorithms, and (c) a description of the framework for validating the $F0$ estimation algorithms. All the simulations and computations were performed using the matlab software package, although in some cases interfaces to other programs were used [for example, to access praat (Ref. 31), which is described in Sec. III A 2].

### A. $F0$ estimation algorithms

Overall, there may be no single best $F0$ estimation algorithm for *all* applications.^{3} Here, we describe some of the most established, longstanding algorithms, and some more recent, promising approaches. We tested widely used $F0$ estimation algorithms for which implementations were available and hence are convenient for testing; we do not claim to have made an exhaustive comparison of the full range of $F0$ estimation algorithms. There have been various approaches attempting to categorize $F0$ estimation algorithms, mainly for methodological presentation purposes.^{3} One useful way is to cluster them as time-domain approaches (most time-domain approaches rely on autocorrelation, such as praat presented in Sec. III A 2, and some rely on cross-correlation, such as rapt presented in Sec. III A 3), or frequency-domain approaches (frequency spectrum and cepstral approaches). A further distinction for time-domain approaches can be made according to whether $F0$ estimation algorithms work on windows (*frames*), thus providing local $F0$ estimates, or detect single glottal cycles, thus providing instantaneous $F0$ estimates. The $F0$ estimation algorithms that use short time windows are typically applied to a small, pre-specified segment of the signal (e.g., 10 ms), and the $F0$ estimates are obtained by a sliding window method. A further differentiation of time-domain $F0$ estimation algorithms is the method used to estimate $F0$, the most common being peak picking (for example, identifying successive negative or positive peaks) and waveform matching (matching cycle-to-cycle waveforms). The overall consensus is in favor of waveform matching because of its improved robustness against noise.^{32} We stress that the above general description is not the only practical categorization framework, and in fact some $F0$ estimation algorithms can equally well be interpreted as time- or frequency-domain approaches (for example, see ndf presented in Sec. III A 8).

Many of the $F0$ estimation algorithms we examine here have three main stages:^{3} (a) pre-processing, (b) identification of possible $F0$ candidates, and (c) post-processing to decide on the final $F0$ estimate. The pre-processing step depends on the actual $F0$ estimation algorithm requirements. One example of pre-processing is low-pass filtering of the speech signal to remove formants. This step is useful in general, but can also introduce problematic artifacts: reducing the bandwidth increases the inter-sample correlation and could be detrimental to $F0$ estimation algorithms which detect periodicity using correlations.^{3} Post-processing is typically used to avoid sudden jumps in successive $F0$ estimates, which may not be physiologically plausible (although this is not universally true in all applications). One straightforward post-processing approach is to use running median filtering (for example, see yin presented in Sec. III A 6) or dynamic programming (for example, see dypsa presented in Sec. III A 1) to refine the estimates; both approaches appear in the descriptions of specific $F0$ estimation algorithms below.
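As a concrete example of the post-processing stage, the sketch below applies a running median to an $F0$ contour to suppress an isolated octave error. This is our own minimal illustration of the median-filtering idea, not code taken from any of the algorithms discussed:

```python
import numpy as np

def median_smooth(f0_contour, width=5):
    """Running median over an F0 contour (width must be odd). Isolated
    spurious jumps (e.g., octave errors) are replaced by the local
    median, while slowly varying F0 is left essentially unchanged."""
    f0_contour = np.asarray(f0_contour, dtype=float)
    half = width // 2
    padded = np.pad(f0_contour, half, mode='edge')
    return np.array([np.median(padded[i:i + width])
                     for i in range(len(f0_contour))])

# a single doubled estimate (octave error) among ~100 Hz frames
contour = [100.0, 101.0, 200.0, 99.0, 100.0]
smoothed = median_smooth(contour)   # the 200 Hz outlier is removed
```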

In all cases we used the default settings for the $F0$ estimation algorithms. To ensure a fair comparison, where appropriate we set the $F0$ search range to 50–500 Hz. Although the expected physical maximum $F0$ cannot, realistically, be so high in the case of comfortably produced sustained vowel /a/ signals, we wanted to test the full range of inputs to the $F0$ estimation algorithms. Since this study only deals with voiced speech, there is no need to identify whether parts of the speech signal are voiced or unvoiced, so that (very interesting) aspect of the $F0$ estimation algorithms will not be addressed here. To avoid putting those $F0$ estimation algorithms that inherently detect voiced or unvoiced frames at a disadvantage, this option was disabled where possible.

#### 1. dypsa

The dynamic programming projected phase-slope algorithm (dypsa) (Ref. 33) is the only $F0$ estimation algorithm used in this study which aims to directly identify the glottal closure instants (i.e., it works on the vocal cycles and not on time windows). It identifies candidate glottal closure events and uses dynamic programming to select the most plausible events by finding the optimum compromise for a set of criteria (such as minimizing the time difference between successive glottal cycles).

#### 2. praat (two algorithms, praat_{1} and praat_{2})

The praat $F0$ estimation algorithm^{31} was originally proposed by Boersma.^{34} It can be viewed as a time-domain approach which relies on autocorrelation to compute $F0$ estimates. The signal is divided into frames using an appropriate window function to minimize spectral leakage, and $F0$ estimates are provided for each frame. praat normalizes the autocorrelation of the signal by dividing it by the autocorrelation of the window function. The original algorithm^{34} used the Hanning window, but Boersma later indicated that praat provides improved estimates when the Gaussian window is used. We tested both approaches: we call praat_{1} the $F0$ estimation algorithm using the Hanning window and praat_{2} the algorithm using the Gaussian window. praat uses post-processing to reduce large changes in successive $F0$ estimates (post-processing was used for both praat_{1} and praat_{2}).
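A minimal sketch of the window-corrected autocorrelation idea follows; this is our own simplified illustration of the normalization step and omits praat's candidate selection, interpolation, and post-processing:

```python
import numpy as np

def normalized_autocorr(frame, window):
    """Autocorrelation of the windowed frame divided by the
    autocorrelation of the window itself (Boersma's correction for the
    taper the window imposes along the lag axis)."""
    x = (frame - frame.mean()) * window
    r_x = np.correlate(x, x, mode='full')[len(x) - 1:]
    r_w = np.correlate(window, window, mode='full')[len(window) - 1:]
    return (r_x / r_x[0]) / (r_w / r_w[0])  # both normalized to 1 at lag 0

# the lag of the strongest peak (excluding lag 0) estimates the period
fs = 8000.0
t = np.arange(int(0.04 * fs)) / fs            # 40 ms frame
frame = np.sin(2 * np.pi * 100.0 * t)         # 100 Hz tone
r = normalized_autocorr(frame, np.hanning(len(frame)))
lag = np.argmax(r[20:120]) + 20               # search roughly 67-400 Hz
f0_est = fs / lag                             # expect about 100 Hz
```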

#### 3. rapt

rapt is a time-domain $F0$ estimation algorithm (like praat), but it uses the normalized cross-correlation instead of the autocorrelation function. It was originally proposed by Talkin.^{3} rapt compares frames of the original speech signal with *sub-sampled* frames of the original signal, and attempts to identify the time delay where the maximum of the cross-correlation is closest to 1 (excepting the zero time lag, where it equals 1 by definition). Once $F0$ candidates for each frame have been chosen, rapt uses dynamic programming to determine the most likely estimate for each frame.
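The core cross-correlation criterion can be sketched as follows; this is our own simplified illustration, omitting rapt's sub-sampled first pass and its dynamic-programming candidate selection:

```python
import numpy as np

def best_ncc_lag(frame, n, lag_min, lag_max):
    """Return the lag whose normalized cross-correlation between the
    first n samples and the n samples starting at that lag is largest
    (i.e., closest to 1 for a periodic signal)."""
    a = frame[:n] - np.mean(frame[:n])
    best_lag, best_r = lag_min, -1.0
    for lag in range(lag_min, lag_max + 1):
        b = frame[lag:lag + n] - np.mean(frame[lag:lag + n])
        r = np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r

fs = 8000.0
t = np.arange(int(0.06 * fs)) / fs
frame = np.sin(2 * np.pi * 100.0 * t)               # 100 Hz tone
lag, r = best_ncc_lag(frame, n=160, lag_min=20, lag_max=120)
f0_est = fs / lag                                   # expect 100 Hz (lag 80)
```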

#### 4. shrp

shrp computes $F0$ estimates in the frequency domain using the sub-harmonics-to-harmonics ratio, and aims to estimate pitch. It was proposed by Sun,^{35} who found in a series of experiments that pitch is perceived differently when the sub-harmonics in a signal increase. Therefore, he proposed a criterion for analyzing the spectral peaks that should be used to determine pitch.
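The quantity at the heart of shrp can be illustrated with a crude linear-frequency sketch; the actual algorithm works on a log-frequency spectrum with spectral shifting, and the toy signal and function below are our own:

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs                       # 1 s of signal
# harmonic source at 100 Hz with five harmonics
x = sum(np.sin(2 * np.pi * 100.0 * k * t) for k in range(1, 6))
spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1.0 / fs)

def shr(f0_cand, n_harm=5):
    """Sub-harmonics-to-harmonics amplitude ratio for a candidate F0:
    spectral amplitude summed at half-integer multiples of the
    candidate, divided by the sum at integer multiples."""
    amp = lambda f: spectrum[np.argmin(np.abs(freqs - f))]
    harm = sum(amp(k * f0_cand) for k in range(1, n_harm + 1))
    subh = sum(amp((k - 0.5) * f0_cand) for k in range(1, n_harm + 1))
    return subh / harm

# shr(100.0) is near zero (correct candidate), whereas shr(200.0) is
# large: the energy at 100, 300, 500 Hz looks sub-harmonic, exposing
# the octave error.
```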

#### 5. swipe

The sawtooth waveform inspired pitch estimator (swipe) algorithm was recently proposed by Camacho and Harris,^{36} and as with shrp, it is a frequency domain approach that estimates pitch. Instead of focusing solely on harmonic locations (peaks in the spectrum) as in shrp, swipe uses the available information on the entire spectrum using kernels. swipe identifies the harmonics in the square root of the spectrum and imposes kernels with decaying weights on the detected harmonic locations. We clarify that here we used swipe′, an extension of swipe which was also proposed in the original study,^{36} but we refer to it as swipe for notational simplicity.

#### 6. yin

Conceptually, yin is similar to praat and relies on the autocorrelation function^{37} to provide $F0$ estimates at pre-specified time intervals. It uses a modified version of the average squared difference function: expanding the squared expression results in the autocorrelation function and two additional corrective terms. The authors demonstrated that these two additional terms account for yin's improved performance over the naive use of autocorrelation. yin uses a final post-processing similar to median filtering to avoid spurious peaks in successive $F0$ estimates.
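The two core steps of yin (the squared difference function and its cumulative-mean normalization) can be sketched as follows. This is our own compact illustration; it omits yin's parabolic interpolation and the median-style post-processing mentioned above:

```python
import numpy as np

def yin_f0(x, fs, fmin=50.0, fmax=500.0, threshold=0.1):
    """Estimate F0 via the difference function d(tau) and the
    cumulative mean normalized difference function (CMNDF), taking the
    first dip below `threshold` and descending to its local minimum."""
    tau_max = int(fs / fmin)
    n = len(x) - tau_max
    d = np.zeros(tau_max + 1)
    for tau in range(1, tau_max + 1):
        diff = x[:n] - x[tau:tau + n]
        d[tau] = np.dot(diff, diff)          # squared difference function
    cmndf = np.ones(tau_max + 1)
    cumsum = np.cumsum(d[1:])
    cmndf[1:] = d[1:] * np.arange(1, tau_max + 1) / np.maximum(cumsum, 1e-12)
    tau = int(fs / fmax)
    while tau <= tau_max:
        if cmndf[tau] < threshold:           # first dip below threshold
            while tau + 1 <= tau_max and cmndf[tau + 1] < cmndf[tau]:
                tau += 1                     # walk down to the local minimum
            return fs / tau
        tau += 1
    return fs / (int(fs / fmax) + np.argmin(cmndf[int(fs / fmax):]))

fs = 8000.0
t = np.arange(640) / fs                      # 80 ms frame
f0_est = yin_f0(np.sin(2 * np.pi * 100.0 * t), fs)   # expect 100 Hz
```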

#### 7. tempo

The tempo algorithm was proposed by Kawahara *et al.*^{38} and uses the log frequency domain. A filter bank of equally spaced band-pass Gabor filters is used to map the central filter frequency to the instantaneous frequency of the filter outputs. The original proposal suggested using 24 Gabor filters in an octave, and the instantaneous angular frequency is obtained using the Hilbert transform.

#### 8. ndf

The nearly defect-free (ndf) $F0$ estimation algorithm was proposed by Kawahara *et al.*^{39} and relies on both time-domain and frequency-domain information to provide $F0$ estimates. The algorithm combines two components to determine $F0$ candidate values: (a) an instantaneous-frequency-based extractor and (b) a period-based extractor. The frequency-based extractor is similar to tempo, and the period-based extractor computes sub-band autocorrelations using the fast Fourier transform, where the power spectra are initially normalized by their spectral envelope prior to the computation of the autocorrelations. Then, the $F0$ candidates from the instantaneous frequency and period-based extractors are mixed using the normalized empirical distribution of side information to determine the most likely candidates.

#### 9. xsx

The excitation structure extractor (xsx) was recently proposed by Kawahara *et al.*^{40} These researchers wanted to provide a fast alternative to ndf (see the preceding section), which their experiments demonstrated to be very accurate, but also computationally demanding. xsx relies on spectral division using two power spectral representations. xsx uses a set of $F0$ detectors spaced equidistantly on the log-frequency axis which cover the user specified $F0$ range.

### B. Information fusion with adaptive KF

So far we have described ten popular $F0$ estimation algorithms, some of which are longstanding and established, and others which were proposed more recently. Since there is no single universally best $F0$ estimation algorithm,^{3,4} and different $F0$ estimation algorithms may be optimal under different signal conditions, it is possible that combining the outputs of the $F0$ estimation algorithms could lead to improved $F0$ estimates. Recently, Tsanas *et al.*^{10} proposed a simple ensemble approach to obtain the $F0$ time series by introducing fixed weights for three of the $F0$ estimation algorithms described in the preceding sections (praat_{1}, rapt, and shrp). In this study, we investigate more thoroughly the concept of combining an arbitrary number of $F0$ estimation algorithms with adaptive weights which reflect our trust in the estimate of each $F0$ estimation algorithm.

The KF is a simple yet powerful technique which can be used for fusing information from different sources, and has been successfully used in many applications over the last 40 years.^{41} The *sources* (here $F0$ estimation algorithms) provide information which may be potentially redundant or complementary in terms of estimating the underlying (physiological) quantity of interest, usually referred to as the *state* (here $F0$). The aim is to fuse the information from the measurements (ten scalar values, one for each of the ten $F0$ estimation algorithms at each step where we have $F0$ estimates), recursively updating the state over time (for $F0$ estimation applications, this is usually every 10 ms). Specifically, in its basic general form, the KF has the following mathematical formulation:
$$
x_k = A_k x_{k-1} + B_k u_k + w_k, \qquad w_k \sim \mathcal{N}(0, Q_k),
$$

$$
z_k = C_k x_k + v_k, \qquad v_k \sim \mathcal{N}(0, R_k),
$$
where $x_k$ is the state, $A_k$ is the state transition model used to update the previous state, $B_k$ is the control-input model which is applied to the control vector $u_k$, $w_k$ is the state process noise which is assumed to be drawn from a multivariate Gaussian distribution with covariance $Q_k$, $z_k$ is the measurement of the state $x_k$, $C_k$ is the measurement model which maps the underlying state to the observation, and $v_k$ is the measurement noise which is assumed to be drawn from a multivariate Gaussian distribution with covariance $R_k$.
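A minimal numpy sketch of one predict/update cycle of the KF in this general matrix form (variable names follow the notation above; the toy values in the usage lines are hypothetical):

```python
import numpy as np

def kf_step(x, P, z, A, B, u, C, Q, R):
    """One predict/update cycle of the Kalman filter in its general form."""
    # Predict: propagate the previous state and its covariance.
    x_prior = A @ x + B @ u
    P_prior = A @ P @ A.T + Q
    # Update: weight the innovation (z - C x_prior) by the Kalman gain.
    S = C @ P_prior @ C.T + R               # innovation covariance
    K = P_prior @ C.T @ np.linalg.inv(S)    # Kalman gain
    x_post = x_prior + K @ (z - C @ x_prior)
    P_post = (np.eye(len(x)) - K @ C) @ P_prior
    return x_post, P_post

# Toy usage: scalar state tracked from a single noisy measurement.
x, P = np.array([0.0]), np.eye(1)
x, P = kf_step(x, P, z=np.array([1.0]), A=np.eye(1), B=np.zeros((1, 1)),
               u=np.zeros(1), C=np.eye(1), Q=0.01 * np.eye(1), R=np.eye(1))
```

The posterior state lands between the prior (0) and the measurement (1), and the posterior covariance shrinks, as expected.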

It is known from the literature that the KF is the optimal state estimation method (in the least squares sense) for a stochastic signal under the following assumptions:^{42} (a) the underlying evolving process of successive states is linear and known, (b) the state noise $w_k$ and the measurement noise $v_k$ are Gaussian, and (c) the state noise covariance $Q_k$ and the measurement noise covariance $R_k$ are known. In practice, we often assume the first two conditions are met, but the KF may not give optimal results if the estimates of the state noise covariance and the measurement noise covariance are inaccurate.^{41} This requirement has led many researchers to intensively pursue the notion of inferring good covariance estimates from the data.^{20,42,43} Although techniques relying solely on the data to estimate the measurement noise covariance and the state noise covariance offer a convenient automated framework,^{42} they fail to take into account domain knowledge which may be critical. Therefore, methods which can incorporate this potentially useful additional information have recently been investigated more rigorously. Particularly promising in this regard is the approach pioneered by Li *et al.*^{20} and more recently also applied by Nemati *et al.*,^{43} with the introduction of physiologically informed *signal quality indices* (SQIs), which reflect the confidence in the measurements of each source. When the SQI is low, the measurement should not be trusted; this can be achieved by increasing the noise covariance. Algorithmically, the dependence of the measurement noise covariance on the SQIs is defined using the logistic function, where the independent variable is the SQI.^{20}

Both Li *et al.*^{20} and Nemati *et al.*^{43} used SQIs to determine *only* the measurement noise covariance; they set the state noise covariance to a constant scalar value which was empirically optimized. Effectively, using a constant state noise covariance corresponds to assuming that the confidence in the state value does not change as a function of the *a priori* estimate of the state, the measurements, and their corresponding SQIs, and may not make full use of the potential of SQIs. In this study, *both* the state noise covariance and the measurement noise covariance are adaptively determined based on the SQIs (whereas in Li *et al.*^{20} and Nemati *et al.*^{43} the state noise covariance was fixed *a priori*). Another difference between the current study and previous studies^{20,43} is that we process a *single* primary signal (the speech signal) from which we obtain various measurements of the quantity of interest ($F0$), whereas Li *et al.*^{20} and Nemati *et al.*^{43} extracted an estimate of their quantity of interest from each of the *multiple* primary signals they processed. Hence, the nature of the SQIs defined in those studies, which relied on the quality of each of the primary signals, is necessarily different from the SQIs defined here. Furthermore, those studies used a very simplified KF setting, processing each source independently from the other sources: this facilitates the algorithmic processing, since all matrices become vectors and all vectors become scalars for a scalar state. They then combined the multiple KF results with an additional *external* function based on the KF residuals and the computed SQIs. However, we argue that the approach of Li *et al.*^{20} and Nemati *et al.*^{43} fails to capitalize on the full strength of the adaptive KF as a data fusion mechanism where measurements from all sources are combined *within* the KF framework.
This is because in their approach each estimate from each source is only compared to the *a priori* state without also taking into account the estimates of the other sources. Moreover, we will demonstrate that we can advantageously exploit the fact that the information from all measurements is simultaneously processed in KF to adjust the SQIs.

#### 1. Formulation of the adaptive KF setting in this study

We have so far described the general notation of the KF. Here we explicitly describe the KF setting used in this study and assign values to the KF parameters. For convenience, we now simplify notation where appropriate, e.g., to denote vectors or scalars for the current application instead of the general formulation with matrices and vectors. We start by noting that the state in this application is a single scalar $x_k$. We also assume that consecutive $F0$ estimates are expected to remain unchanged; that is, the *a priori* estimate of the current state $\tilde{x}_k$ will be the previous state: $\tilde{x}_k = x_{k-1}$. Implicitly, we have assumed $A_k = 1$ and $B_k = 0$. Similarly, we set $C_k = \mathbf{1}$, where $\mathbf{1}$ denotes a vector with 10 elements equal to 1 (the length of $C_k$ equals the number of $F0$ estimation algorithms and is constant in this application). The aim of the adaptive KF is then to use the measurements $z_k$ (a vector with ten elements which correspond to the estimates of the ten $F0$ estimation algorithms at time *k*) to update $\tilde{x}_k$ to the new estimated state $x_k$. Next we focus on how to determine the state noise covariance $Q_k$ (a scalar, since the state is scalar) and the measurement noise covariance $R_k$ based on the SQIs.
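Under these simplifications the update step reduces to a scalar state observed through a vector of ten measurements. A sketch of this specialized update (the values of $Q$, $R$, and the measurements in the usage lines are hypothetical):

```python
import numpy as np

def f0_kf_update(x_prev, P_prev, z, Q, R):
    """Scalar-state KF update with a vector of F0 measurements (C = vector of ones)."""
    C = np.ones(len(z))                   # each algorithm measures the same scalar F0
    x_prior = x_prev                      # A = 1, B = 0: a priori state = previous state
    P_prior = P_prev + Q
    S = P_prior * np.outer(C, C) + R      # innovation covariance (10 x 10 in the paper)
    K = P_prior * C @ np.linalg.inv(S)    # Kalman gain: one weight per algorithm
    x_new = x_prior + K @ (z - C * x_prior)
    P_new = (1.0 - K @ C) * P_prior
    return x_new, P_new

# Toy usage with 10 identical measurements of 120 Hz and a prior of 100 Hz.
x_new, P_new = f0_kf_update(x_prev=100.0, P_prev=1.0,
                            z=np.full(10, 120.0), Q=0.5, R=np.eye(10))
```

With ten agreeing measurements the posterior is pulled strongly towards them (118.75 Hz here), and the state variance collapses accordingly.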

#### 2. SQIs

For the purposes of the current study, the SQIs can be thought of as *algorithmic robustness metrics*, expressing our confidence in the estimate of each $F0$ estimation algorithm at a particular instant. In this study, we define novel SQIs to continuously update *both* the measurement noise covariance and the state noise covariance as functions of the SQIs using the logistic function. The final SQI, which will be used to update the noise covariances, is a combination of *bonuses* and *penalties* for each of the individual $F0$ estimation algorithms at each discrete time step. The main underlying ideas for setting up the bonuses and penalties are: (a) in most cases, we expect successive $F0$ estimates not to vary considerably, (b) all $F0$ estimation algorithms occasionally give very bad $F0$ estimates, either at isolated instances or for entire speech signals, (c) ndf and swipe appear very robust in this application, and in most cases their estimates are trustworthy, and (d) ndf is typically closest to the ground truth but sporadically gives very bad $F0$ estimates, whereas swipe may be slightly less accurate but is more consistent (i.e., very rarely provides poor $F0$ estimates). These ideas were developed by first investigating the behavior of the individual $F0$ estimation algorithms, and will become clearer in Sec. IV A.

We use the standard S-shaped membership function (a spline-based curve, very similar to the sigmoid function) to map each bonus and each penalty to a scalar in the range 0 to 1. This function relies on two parameters $k_1$ and $k_2$ ($k_1 < k_2$) which set the thresholds, and is defined as
$$
SS(x; k_1, k_2) =
\begin{cases}
0, & x \le k_1, \\[4pt]
2\left(\dfrac{x - k_1}{k_2 - k_1}\right)^{2}, & k_1 < x \le \dfrac{k_1 + k_2}{2}, \\[8pt]
1 - 2\left(\dfrac{x - k_2}{k_2 - k_1}\right)^{2}, & \dfrac{k_1 + k_2}{2} < x < k_2, \\[8pt]
1, & x \ge k_2.
\end{cases}
$$
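A numpy sketch of this membership function, assuming the standard spline-based (smf-style) S-curve that matches the description in the text:

```python
import numpy as np

def smf(x, k1, k2):
    """Spline-based S-shaped membership function mapping x to [0, 1]."""
    x = np.asarray(x, dtype=float)
    mid = (k1 + k2) / 2.0
    return np.where(x <= k1, 0.0,
           np.where(x >= k2, 1.0,
           np.where(x <= mid,
                    2.0 * ((x - k1) / (k2 - k1)) ** 2,          # rising spline half
                    1.0 - 2.0 * ((x - k2) / (k2 - k1)) ** 2)))  # saturating half

# Values below k1 map to 0, above k2 to 1, with a smooth transition between.
vals = smf(np.array([0.0, 25.0, 50.0, 100.0]), 0.0, 100.0)
```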
The rationale for using this function is that we want to suppress values close to the thresholds and have a smooth transition in the range $k_1$ to $k_2$. We now outline the penalties and bonuses which determine the SQIs, and in turn $Q_k$ and $R_k$. Overall, the confidence in the current measurement $z_k$ is quantified via the SQIs and is given by
$$
\mathrm{SQI}_k = \mathbf{1} - p_{1k} - p_{2k} - p_{3k} - p_{4k} + b_k.
$$
The following paragraphs explain in detail how each of the penalties and bonuses is determined. The first penalty we introduce, $p_{1k}$, penalizes the $F0$ estimation algorithms for having large absolute differences in their successive estimates: $p_{1k} = 0.25\,SS(|z_k - z_{k-1}|, 0, 100)$. We also penalize the $F0$ estimation algorithms for exhibiting large absolute differences from their corresponding robust mean estimates (defined as the mean estimate of each of the $F0$ estimation algorithms using only the corresponding $F0$ estimates which fall within the 10th and 90th percentiles, denoted $z_{\mathrm{robust}}$): $p_{2k} = 0.25\,SS(|z_k - z_{\mathrm{robust}}|, 0, 100)$. We use the robust mean because some $F0$ estimation algorithms occasionally exhibit erratic behavior (i.e., very bad estimates for some instances). Similarly, we penalize the $F0$ estimation algorithms if the estimate of the current $F0$ is considerably different from the *a priori* estimate $\tilde{x}_k$ (to be mathematically precise, we create a vector with 10 entries equal to $\tilde{x}_k$, i.e., $\tilde{\mathbf{x}}_k = \mathbf{1}\cdot\tilde{x}_k$): $p_{3k} = 0.75\,SS(|z_k - \mathbf{1}\cdot\tilde{x}_k|, 0, 100)$. We clarify that with $p_{3k}$ we penalize considerably more the algorithms which are far from the *a priori* estimate of $F0$ than we penalize *inconsistency* (penalty $p_{1k}$, which penalizes large absolute successive differences within each individual $F0$ estimation algorithm).
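The three penalty vectors can be computed directly; the `ss` helper below is a hypothetical stand-in for the S-shaped membership function, and the thresholds (0, 100) and weights (0.25, 0.75) are those stated in the text:

```python
import numpy as np

def ss(x, k1, k2):
    """S-shaped membership function (smf-style spline, assumed form)."""
    x = np.asarray(x, dtype=float)
    mid = (k1 + k2) / 2.0
    return np.where(x <= k1, 0.0, np.where(x >= k2, 1.0,
           np.where(x <= mid, 2 * ((x - k1) / (k2 - k1)) ** 2,
                    1 - 2 * ((x - k2) / (k2 - k1)) ** 2)))

def penalties(z_k, z_prev, z_robust, x_prior):
    """First three penalty vectors over the ten F0 algorithms (illustrative sketch)."""
    p1 = 0.25 * ss(np.abs(z_k - z_prev), 0, 100)    # inconsistency across steps
    p2 = 0.25 * ss(np.abs(z_k - z_robust), 0, 100)  # deviation from robust mean
    p3 = 0.75 * ss(np.abs(z_k - x_prior), 0, 100)   # deviation from a priori state
    return p1, p2, p3
```

For example, an algorithm whose estimate jumps by 100 Hz between steps receives the full $p_{1k}$ penalty of 0.25, while a stable algorithm receives none.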

Then, we determine which $F0$ estimation algorithm is the “best expert at the current instant” in order to have good prior information for determining the current $F0$ estimate. This essentially reflects whether to trust ndf or swipe more, and is achieved by adding up the corresponding three penalties introduced so far for ndf and swipe. We then apply the following logic: (a) if the estimated $F0$ from ndf and swipe at the current discrete step differs by less than 50 Hz, and the sum of all penalties for both ndf and swipe is less than 0.2 (i.e., both ndf and swipe are considered trustworthy), then we trust the $F0$ estimate from ndf; (b) otherwise, we trust ndf or swipe, whichever has the lowest summed penalty score. The threshold of 50 Hz, used to quantify a large deviation in the $F0$ estimates of an $F0$ estimation algorithm with respect to ndf or swipe, was chosen empirically based on prior knowledge; we decided not to formally optimize this value to avoid overfitting the current data (also, it is possible that a relative threshold might be more appropriate).
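One reading of this selection rule as code (the text is ambiguous on whether the 0.2 threshold applies to each algorithm's summed penalties or to their joint sum; here we assume each, and the function name is hypothetical):

```python
def pick_best_expert(z_ndf, z_swipe, pen_ndf, pen_swipe,
                     diff_hz=50.0, pen_thresh=0.2):
    """Choose the 'best expert' between ndf and swipe (illustrative sketch).

    pen_ndf / pen_swipe are the summed first-three penalties for each
    algorithm; diff_hz and pen_thresh are the thresholds from the text.
    """
    if abs(z_ndf - z_swipe) < diff_hz and pen_ndf < pen_thresh and pen_swipe < pen_thresh:
        return "ndf", z_ndf              # both trustworthy: prefer ndf
    if pen_ndf <= pen_swipe:             # otherwise, lowest summed penalty wins
        return "ndf", z_ndf
    return "swipe", z_swipe
```

So two close, lightly penalized estimates default to ndf, while a heavily penalized ndf defers to swipe.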

We denote the estimate from ndf or swipe as $\check{x}_{k,\mathrm{best}} = z_{k,(\mathrm{ndf\ or\ swipe})}$. Next, we introduce another penalty for the $F0$ estimation algorithms which at the current instant have an estimate that differs considerably from $\check{x}_{k,\mathrm{best}}$: $p_{4k} = 0.75\,SS(|z_k - \mathbf{1}\cdot\check{x}_{k,\mathrm{best}}|, 0, 50)$. In this case, the $F0$ estimation algorithm which is believed to be “best” is not penalized. This is achieved by penalizing ndf or swipe (whichever is considered “best” at the current instant) by $p_{4k,\mathrm{best}} = (1 - p_{1k,\mathrm{best}} - p_{2k,\mathrm{best}} - p_{3k,\mathrm{best}})$.

It is possible that an $F0$ estimation algorithm may have been substantially misguided in its previous $F0$ estimate(s), but its estimate for the current $F0$ is close to the “right region,” which is defined as being close to the best $F0$ expert at the current instant (as described above, this is the estimate from ndf or swipe). In this case, we want to reduce the heavy penalty induced by the large successive difference in $F0$ estimates. Therefore, we introduce a bonus to compensate for the penalties $p_{1k}$ and $p_{2k}$, which takes into account how confident we are in the estimate of the best $F0$ estimation algorithm. Specifically, we define
$$
b_k = (1 - p_{4k,\mathrm{best}}) \cdot \big(p_{1k} + p_{2k}\big) \odot \big(\mathbf{1} - p_{4k}\big),
$$
where ⊙ denotes element-wise multiplication between two vectors. We clarify that we use the multiplication dot to denote multiplication between a scalar and a vector. Moreover, if $p_{4k,\mathrm{best}} < 0.2$ we give an extra bonus to the best $F0$ estimation algorithm: $b_{k,\mathrm{best}} = 3$. This effectively means we assign greater confidence to the estimate of the $F0$ estimation algorithm that we deem most accurate if the penalties introduced so far for this algorithm sum to a value less than 0.2. As a final check, any negative $\mathrm{SQI}_k$ entry is set to zero. Also, if the $F0$ estimate from an $F0$ estimation algorithm differs by 50 Hz or more from both the $F0$ estimates of ndf and swipe, the corresponding SQI is automatically set to zero. Following Li *et al.*,^{20} we use the logistic function to estimate the measurement noise covariance $R_k$. Note that Li *et al.*^{20} used a scalar $R_k$ for each source, which was processed *independently* from the other sources in the KF framework, and fused information from the sources *externally* to the KF to provide the final state estimate. Therefore, their scheme did not take advantage of the potential to fuse information *internally* in the KF, where we determine SQIs also using information conveyed from the remaining sources. Here we retain the matrix formulation

where $R_k^{0}$ has some pre-defined constant values. We set the diagonal entries of $R_k^{0}$ to values that reflect our prior confidence in each $F0$ estimation algorithm (a higher value denotes lower confidence). Here, we set the diagonal entries in $R_k^{0}$ corresponding to ndf and swipe to 1, and all other entries to 3 (hence, *a priori* we believe the estimates of ndf and swipe more, although this prior belief is subject to being updated by the SQIs, which in turn update $R_k$). Non-diagonal entries were set to zero: it is not straightforward to optimize appropriate non-diagonal entries so as to reflect possible interactions among the $F0$ estimation algorithms (for example, a setting where one $F0$ estimation algorithm provides poor estimates whereas another works particularly well).
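The general mechanism (a low SQI inflates the corresponding diagonal entry of $R_k$ so that measurement is distrusted) can be sketched as follows; the specific logistic mapping and the parameters `alpha` and `midpoint` are assumptions for illustration, not the study's exact expression:

```python
import numpy as np

def measurement_noise_cov(sqi, r0_diag, alpha=10.0, midpoint=0.5):
    """Diagonal measurement noise covariance R_k driven by the SQIs.

    alpha and midpoint are hypothetical logistic parameters: a low SQI
    pushes the logistic towards 1, inflating the corresponding entry.
    """
    sqi = np.asarray(sqi, dtype=float)
    scale = 1.0 / (1.0 + np.exp(-alpha * (midpoint - sqi)))  # logistic in -SQI
    return np.diag(r0_diag * (1.0 + scale))

# Prior confidences: e.g., entries of 1 for ndf/swipe, 3 for the others.
r0 = np.array([3.0, 3.0, 1.0, 1.0])
R = measurement_noise_cov([1.0, 0.0, 1.0, 0.9], r0)
```

Here the source with SQI 0 ends up with roughly double its baseline covariance, while fully trusted sources stay near their prior values.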

Finally, whereas the measurement noise covariance is estimated via the logistic function and SQI, the state noise covariance is estimated as follows:

where $L$ is the number of $F0$ estimation algorithms with corresponding $\mathrm{SQI}_{k,j}$ larger than 0.8. The concept behind the first clause of this expression is that the measurements of ndf and swipe cannot be trusted if both have relatively low SQIs, and hence the adaptive KF will tend to trust the *a priori* estimate more. Conversely (in the second clause), if all $F0$ estimation algorithms, weighted by their respective SQIs (when their SQI is larger than the threshold of 0.8), point towards a large change between successive steps in the $F0$ contour, we want to increase $Q_k$ so that the KF trusts the new measurements considerably more. Note that if the $F0$ estimation algorithms with large respective SQIs point towards the same direction of change in $F0$ (i.e., a sudden increase or decrease), then $Q_k$ increases considerably and hence the KF weights the current measurements heavily and does not trust the *a priori* $F0$ estimate.
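A sketch implementing only the two clauses described above; the constants `q0`, `low_sqi`, and the 0.1 shrink factor are hypothetical (the 0.8 threshold is from the text), so this illustrates the mechanism rather than reproducing the study's exact expression:

```python
import numpy as np

def state_noise_cov(sqi, z, x_prior, idx_ndf, idx_swipe,
                    q0=1.0, low_sqi=0.2, high_sqi=0.8):
    """Adapt the scalar state noise covariance Q_k (illustrative sketch)."""
    sqi = np.asarray(sqi, dtype=float)
    z = np.asarray(z, dtype=float)
    # Clause 1: ndf and swipe both untrustworthy -> trust the a priori estimate.
    if sqi[idx_ndf] < low_sqi and sqi[idx_swipe] < low_sqi:
        return 0.1 * q0
    # Clause 2: high-SQI algorithms agree on a large change -> inflate Q_k.
    trusted = sqi > high_sqi
    L = np.count_nonzero(trusted)
    if L > 0:
        mean_change = np.sum(sqi[trusted] * (z[trusted] - x_prior)) / L
        return q0 * (1.0 + abs(mean_change))
    return q0
```

A low $Q_k$ pins the state to its prior; a large SQI-weighted jump in the trusted measurements inflates $Q_k$ so the KF follows them.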

The matlab source code for the adaptive KF and the computation of the SQIs is available on request by contacting the first author.

### C. Benchmarks: Median and ensemble learning

As standard simple benchmarks for combining information from multiple sources, we used the median of all $F0$ estimation algorithms at each instant, and also two ensembles to weight the estimates of the $F0$ estimation algorithms: (a) standard ordinary least squares (OLS) and (b) a statistically robust form of least squares, iteratively reweighted least squares (IRLS), which is less sensitive to outliers.^{21} The ensembles were trained using all but one signal and tested on the left-out signal; the procedure was repeated for all signals and the results averaged. Because the two databases in the study have widely different ground truth $F0$ distributions (see Fig. 1), the ensembles were trained separately for the two databases.
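A hedged sketch of the two ensembles: OLS maps directly onto `numpy.linalg.lstsq`, while the absolute-residual reweighting used for IRLS below is one common choice (the exact weighting scheme of ref. 21 is not reproduced here):

```python
import numpy as np

def ols_weights(X, y):
    """Ordinary least squares weights for combining the algorithms' outputs."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def irls_weights(X, y, n_iter=20, eps=1e-6):
    """Iteratively reweighted least squares with inverse-absolute-residual
    weights, which progressively down-weight outlying samples."""
    w = ols_weights(X, y)
    for _ in range(n_iter):
        r = np.abs(y - X @ w) + eps          # residual magnitudes (eps avoids 1/0)
        W = 1.0 / r                          # large residual -> small weight
        Xw = X * W[:, None]
        # Solve the weighted normal equations  X^T W X w = X^T W y.
        w, *_ = np.linalg.lstsq(Xw.T @ X, Xw.T @ y, rcond=None)
    return w
```

In the leave-one-signal-out protocol described above, `X` would hold the stacked estimates of the training signals and `y` the corresponding ground truth.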

### D. Ground truth and validation framework

Most $F0$ estimation algorithms provide estimates at specific time intervals (typically at successive instances using a fixed time window of a few milliseconds). Here, wherever possible, we obtained $F0$ estimates from the $F0$ estimation algorithms every 10 ms, at the reference time instances [60, 70,…, 950] ms (thus, we have 90 $F0$ values for each synthetic phonation signal and for each $F0$ estimation algorithm or the ensemble of the $F0$ estimation algorithms). Given that the synthetic speech signals exhibit inherent instabilities because the physiological model requires some 4–5 vocal cycles to settle into stable oscillation (see Sec. II A), and that many $F0$ estimation algorithms provide reliable estimates only after a few milliseconds into the speech signal, we discarded the $F0$ estimates prior to 60 ms. A few $F0$ estimation algorithms do not provide $F0$ estimates at pre-specified time intervals, but at intervals which are identified as part of the algorithm (this is the case with rapt, for example). In those cases where the $F0$ estimation algorithms do not provide $F0$ estimates at the exact time instances described above, we used piecewise linear interpolation between the two closest time intervals of the $F0$ estimation algorithm to obtain the $F0$ estimate at the reference time instances. The time instances where $F0$ was estimated in rapt did not differ considerably from the reference time instances, and thus piecewise linear interpolation should not markedly affect its performance.

The ground truth $F0$ time series from the physiological model and the sigma algorithm^{30} is given in the form of glottal closure time instances, which are directly translated to $F0$ estimates in Hz. However, we need to obtain ground truth $F0$ values at the reference time instances. Hence, piecewise linear interpolation was used to obtain the ground truth at the reference instances. Similarly, we used piecewise linear interpolation to obtain $F0$ estimates from dypsa at the reference time instances (dypsa is the only $F0$ estimation algorithm in this study that aims to identify glottal closure instances, instead of using time windows).
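A sketch of this interpolation step with hypothetical glottal closure instants (GCIs); assigning each cycle's $F0$ to the cycle midpoint is an illustrative convention, and `numpy.interp` performs the piecewise linear interpolation:

```python
import numpy as np

# Hypothetical glottal closure instants (seconds) -> cycle-level F0, then
# resample onto the 10 ms reference grid with piecewise linear interpolation.
gci = np.array([0.055, 0.060, 0.065, 0.0701, 0.0752])
f0_cycle = 1.0 / np.diff(gci)            # F0 of each glottal cycle (Hz)
t_cycle = 0.5 * (gci[:-1] + gci[1:])     # assign F0 to the cycle midpoint
t_ref = np.arange(0.060, 0.081, 0.010)   # reference instants 60, 70, 80 ms
f0_ref = np.interp(t_ref, t_cycle, f0_cycle)
```

The first two cycles here span 5 ms each, so the interpolated value at the 60 ms reference instant is 200 Hz.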

Summarizing, each $F0$ estimation algorithm or ensemble of $F0$ estimation algorithms provides 90 $F0$ estimates for every speech signal. These estimates for every speech signal are compared against the 90 ground truth $F0$ scores at the reference instances. In total, we processed (a) 117 synthetic speech signals generated using the physiological model which provide $N=117\xd790=10\u2009530$ values, and (b) 65 actual speech signals which provide $N=65\xd790=5850$ values over which we compare the performance of the $F0$ estimation algorithms and ensembles. In a few cases, the algorithms praat_{2} and tempo failed to provide outputs (towards the beginning or end of the signal). Those instances were substituted with the estimates from ndf for computing the praat_{2} and tempo overall errors (for the KF fusion we simply assumed no measurement was available by the corresponding $F0$ estimation algorithm which had no estimate at those instances). Overall, the $F0$ outputs from the ten $F0$ estimation algorithms were concatenated into two matrices: $X1$ with $10\u2009530\xd710$ elements for the speech signals generated from the physiological model, and $X2$ with $5850\xd710$ elements for the actual speech signals. The ensembles of the $F0$ estimation algorithms are directly computed using these matrices. The ground truth was stored in two vectors: $y1$which comprised $N=10\u2009530$ elements for the generated speech signals, and $y2$ which comprised $N=5850$ elements for the actual speech signals.

The deviation from the ground truth for each signal and each $F0$ estimation algorithm is computed as $e_i = \hat{y}_i - y_i$, where $\hat{y}_i$ is the *i*th $F0$ estimate ($i \in 1, \ldots, 90$), and $y_i$ is the *i*th ground truth $F0$ value. We report three performance measures: (a) the mean absolute error (MAE), (b) the mean relative error (MRE), and (c) the root mean squared error (RMSE) (endorsed by Christensen and Jakobsson^{1} for evaluating $F0$ estimation algorithms). The MRE is similar to one of the performance measures used by Parsa and Jamieson,^{5} but without squaring the error and the ground truth values (thus placing less emphasis on large errors). The RMSE is always equal to or greater than the MAE, and is particularly sensitive to the presence of large errors: the larger the variability of the errors, the larger the difference between MAE and RMSE. Therefore, these metrics are complementary when assessing the performance of the $F0$ estimation algorithms. In this study we focus on approaches combining $F0$ estimates with the aim of minimizing the mean squared error (implicitly in the KF); therefore, RMSE is the primary error metric of interest when comparing our findings. The metrics are defined as follows:
$$
\mathrm{MAE} = \frac{1}{N}\sum_{i \in S} \left| \hat{y}_i - y_i \right|, \qquad
\mathrm{MRE} = \frac{100}{N}\sum_{i \in S} \frac{\left| \hat{y}_i - y_i \right|}{y_i}, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i \in S} \left( \hat{y}_i - y_i \right)^{2}},
$$
where *N* is the number of $F0$ instances to be evaluated for each speech signal (here 90), and $S$ contains the 90 indices of each speech signal in the estimate of each $F0$ estimation algorithm and in $y$. Error metrics from all speech signals are averaged, and are presented in the form mean ± standard deviation.
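The three metrics for one signal can be computed directly (the four-sample vectors in the usage lines are hypothetical):

```python
import numpy as np

def error_metrics(y_hat, y):
    """MAE (Hz), MRE (%), and RMSE (Hz) over one signal's F0 estimates."""
    e = y_hat - y
    mae = np.mean(np.abs(e))
    mre = 100.0 * np.mean(np.abs(e) / y)
    rmse = np.sqrt(np.mean(e ** 2))
    return mae, mre, rmse

mae, mre, rmse = error_metrics(np.array([101.0, 99.0, 102.0, 98.0]),
                               np.array([100.0, 100.0, 100.0, 100.0]))
```

As noted in the text, RMSE is never smaller than MAE, and the gap between them grows with the variability of the errors.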

## IV. RESULTS

This section follows the same structure as in Sec. III: first we compare the performance of the ten individual $F0$ estimation algorithms, and then we study the performance of the information fusion approach with the adaptive KF.

### A. Performance of the ten individual *F*_{0} estimation algorithms

Figure 2 presents the probability density estimates of the errors $(\hat{y}_i - y_i)_{i=1}^{N}$ for the ten $F0$ estimation algorithms. The probability densities were computed using kernel density estimation with Gaussian kernels. These results provide a succinct overview of the comparative accuracy of each $F0$ estimation algorithm, and also indicate whether each error distribution is symmetric (with respect to overestimating and underestimating the true $F0$). The error distributions of most $F0$ estimation algorithms are close to symmetric, suggesting that there is no large positive or negative bias in most of the algorithms. This is also quantitatively reflected in the median errors reported in Tables I and II, where most $F0$ estimation algorithms exhibit a bias lower than 1 Hz; two notable exceptions are yin and rapt, which considerably underestimate $F0$ for the database with the synthetic signals. Figure 3 presents the number of times that each of the $F0$ estimation algorithms was closest to the ground truth $F0$ (reflecting the success of each of the $F0$ estimation algorithms). Interestingly, there is no clear winner among the $F0$ estimation algorithms in terms of accurately estimating $F0$ for individual samples in the $F0$ contour for the two databases [Figs. 3(a) and 3(c)]. On the other hand, ndf is clearly the most successful $F0$ estimation algorithm in terms of being closest to the ground truth when studying the entire signal [Figs. 3(b) and 3(d)]. Table I summarizes the average results in terms of estimating $F0$ for the database with the generated speech signals, and Table II summarizes the results for the database with the actual speech signals. Overall, all $F0$ estimation algorithms have reasonably accurate performance.
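The Gaussian-kernel density estimates can be sketched as a direct kernel sum (the bandwidth choice below is an assumption; the study's bandwidth selection rule is not stated):

```python
import numpy as np

def gaussian_kde_1d(errors, grid, bandwidth):
    """Gaussian kernel density estimate of the error distribution on a grid."""
    errors = np.asarray(errors, dtype=float)[:, None]
    kernels = np.exp(-0.5 * ((grid[None, :] - errors) / bandwidth) ** 2)
    return kernels.sum(axis=0) / (len(errors) * bandwidth * np.sqrt(2.0 * np.pi))

# Toy usage: 100 zero errors yield a density peaked at 0 on the grid.
grid = np.linspace(-3.0, 3.0, 7)
density = gaussian_kde_1d(np.zeros(100), grid, bandwidth=1.0)
```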

**Table I.** Performance of the $F0$ estimation algorithms and combination schemes for the database with the synthetic speech signals (mean ± standard deviation).

| Algorithm | ME (Hz) | MAE (Hz) | MRE (%) | RMSE (Hz) |
|---|---|---|---|---|
| dypsa | 0.02 | 3.79 ± 5.57 | 3.30 ± 5.41 | 7.20 ± 13.44 |
| praat_{1} | 0.00 | 10.73 ± 22.09 | 7.42 ± 14.64 | 12.46 ± 22.33 |
| praat_{2} | 0.02 | 6.56 ± 15.46 | 4.68 ± 10.26 | 8.81 ± 17.43 |
| rapt | −3.98 | 9.20 ± 8.91 | 6.64 ± 6.17 | 19.95 ± 14.85 |
| shrp | −0.23 | 3.67 ± 7.06 | 2.83 ± 5.08 | 7.17 ± 10.34 |
| swipe | 0.18 | 2.88 ± 7.10 | 2.37 ± 5.57 | 3.59 ± 7.59 |
| yin | −10.71 | 17.41 ± 16.87 | 11.90 ± 10.76 | 29.90 ± 22.95 |
| ndf | 0.00 | 2.38 ± 6.71 | 1.90 ± 4.92 | 3.16 ± 7.74 |
| tempo | 0.00 | 2.53 ± 6.64 | 2.01 ± 4.87 | 3.34 ± 7.53 |
| xsx | 0.01 | 3.00 ± 7.10 | 2.38 ± 5.55 | 3.73 ± 7.58 |
| Median | −0.39 | 3.00 ± 7.28 | 2.31 ± 5.23 | 4.27 ± 8.91 |
| OLS | 0.02 | 3.49 ± 5.63 | 2.72 ± 4.14 | 4.60 ± 6.49 |
| IRLS | 0.00 | 2.34 ± 7.06 | 1.89 ± 5.21 | 3.34 ± 9.43 |
| KF | 0.02 | 2.19 ± 6.54 | 1.73 ± 4.70 | **2.72 ± 6.84** |


**Table II.** Performance of the $F0$ estimation algorithms and combination schemes for the database with the actual speech signals (mean ± standard deviation).

| Algorithm | ME (Hz) | MAE (Hz) | MRE (%) | RMSE (Hz) |
|---|---|---|---|---|
| dypsa | −0.78 | 14.42 ± 26.32 | 5.54 ± 8.44 | 25.86 ± 32.89 |
| praat_{1} | −0.03 | 29.22 ± 57.23 | 13.28 ± 24.08 | 31.67 ± 57.10 |
| praat_{2} | −0.03 | 29.05 ± 56.86 | 13.21 ± 24.00 | 31.47 ± 56.71 |
| rapt | −0.04 | 28.30 ± 63.47 | 8.63 ± 17.98 | 34.21 ± 65.89 |
| shrp | −0.01 | 18.78 ± 47.77 | 6.85 ± 16.86 | 26.91 ± 55.21 |
| swipe | 0.10 | 3.06 ± 7.01 | 1.18 ± 2.48 | 6.22 ± 13.46 |
| yin | −0.03 | 16.36 ± 47.34 | 6.16 ± 16.32 | 23.35 ± 51.77 |
| ndf | −0.01 | 15.12 ± 60.66 | 4.16 ± 15.24 | 17.66 ± 60.87 |
| tempo | −0.03 | 50.67 ± 99.23 | 17.69 ± 31.08 | 53.21 ± 100.92 |
| xsx | −0.08 | 33.43 ± 52.11 | 16.85 ± 25.90 | 39.57 ± 56.81 |
| Median | −0.17 | 18.90 ± 46.27 | 7.71 ± 18.11 | 24.71 ± 49.15 |
| OLS | −0.78 | 4.08 ± 7.76 | 1.55 ± 2.62 | 7.58 ± 13.82 |
| IRLS | −0.03 | 3.17 ± 7.03 | 1.23 ± 2.49 | 6.53 ± 13.57 |
| KF | −0.03 | 2.49 ± 5.04 | 0.97 ± 1.82 | **4.95 ± 9.19** |


The best individual $F0$ estimation algorithms, on average, are ndf for the database with the synthetic signals and swipe for the database with the actual speech recordings. Some algorithms temporarily deviate considerably from the ground truth, but overall there was good agreement between the actual and estimated $F0$ contours. Nevertheless, for some signals most of the $F0$ estimation algorithms consistently underestimated or overestimated $F0$ for the entire duration of the signal. This was particularly evident for the database with the actual speech signals: the only $F0$ estimation algorithm which did not exhibit such erratic behavior was swipe. The findings in Tables I and II might at first appear to contradict the findings in Fig. 3, where we might have expected ndf and tempo to dominate. In fact, they highlight that ndf and tempo may occasionally deviate considerably from the ground truth (this is reflected in the large standard deviation of the errors reported in Table II).

### B. Performance of *F*_{0} estimation combinations

The last four rows in Tables I and II summarize the performance of approaches which combine the outputs of the individual $F0$ estimation algorithms to obtain the final $F0$ estimates. We remark that the KF leads to considerable improvement both for the database with the generated speech signals (Table I) and for the database with the actual speech signals (Table II). The relative RMSE improvement of the adaptive KF over the single best $F0$ estimation algorithm, $|\mathrm{RMSE}_{\mathrm{KF}} - \mathrm{RMSE}_{\mathrm{ndf\ or\ swipe}}| / \mathrm{RMSE}_{\mathrm{KF}}$, is 16.2% compared to ndf for the database with the generated signals, and 25.6% compared to swipe for the database with the actual speech signals. Figure 4 presents the performance of the best individual $F0$ estimation algorithm versus the best combination scheme for all signals: in the vast majority of speech signals the adaptive KF scheme is more accurate than the single best $F0$ estimation algorithm, and when it is not, the drop in performance is negligible.
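These relative improvements can be reproduced from the mean RMSE values reported in Tables I and II (small differences from the quoted figures reflect rounding of the tabled values):

```python
# Relative RMSE improvement of the adaptive KF over the best single algorithm.
def rel_improvement(rmse_kf, rmse_best):
    return 100.0 * abs(rmse_kf - rmse_best) / rmse_kf

synthetic = rel_improvement(2.72, 3.16)   # KF vs ndf (Table I) -> ~16.2%
actual = rel_improvement(4.95, 6.22)      # KF vs swipe (Table II)
```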

We can investigate the contribution of each $F0$ estimation algorithm in the KF scheme by studying their corresponding SQIs, which are summarized in Table III. In both databases, the greatest contribution comes from ndf (and to a lesser degree from swipe). The results in Table III suggest that the KF scheme mostly considers ndf to be closest to the ground truth compared to the competing $F0$ estimation algorithms (particularly for the synthetic data). The $F0$ estimation algorithms were generally more accurate in predicting $F0$ in the database with the synthetic signals compared to the database with the actual speech signals, which is reflected in the SQIs for the two databases. In the database with the synthetic signals, the $F0$ estimation algorithms are typically not heavily penalized (the SQI values are fairly close to the default value 1); whereas in the database with the actual speech signals the SQI values for each $F0$ estimation algorithm were considerably more variable.

**Table III.** Mean ± standard deviation of the SQIs for each of the ten $F0$ estimation algorithms on the two databases.

| Algorithm | Synthetic signals | Actual signals |
|---|---|---|
| dypsa | 0.97 ± 0.11 | 0.89 ± 0.16 |
| praat_{1} | 0.87 ± 0.31 | 0.78 ± 0.41 |
| praat_{2} | 0.94 ± 0.20 | 0.78 ± 0.41 |
| rapt | 0.91 ± 0.09 | 0.85 ± 0.32 |
| shrp | 0.98 ± 0.03 | 0.92 ± 0.22 |
| swipe | 1.00 ± 0.06 | 1.12 ± 0.59 |
| yin | 0.81 ± 0.18 | 0.92 ± 0.23 |
| ndf | 3.99 ± 0.08 | 3.80 ± 0.85 |
| tempo | 1.00 ± 0.01 | 0.77 ± 0.42 |
| xsx | 0.99 ± 0.06 | 0.75 ± 0.40 |

Algorithm . | Synthetic signals . | Actual signals . |
---|---|---|

dypsa | 0.97 ± 0.11 | 0.89 ± 0.16 |

praat_{1} | 0.87 ± 0.31 | 0.78 ± 0.41 |

praat_{2} | 0.94 ± 0.20 | 0.78 ± 0.41 |

rapt | 0.91 ± 0.09 | 0.85 ± 0.32 |

shrp | 0.98 ± 0.03 | 0.92 ± 0.22 |

swipe | 1.00 ± 0.06 | 1.12 ± 0.59 |

yin | 0.81 ± 0.18 | 0.92 ± 0.23 |

ndf | 3.99 ± 0.08 | 3.80 ± 0.85 |

tempo | 1.00 ± 0.01 | 0.77 ± 0.42 |

xsx | 0.99 ± 0.06 | 0.75 ± 0.40 |

### C. Algorithmic robustness

Finally, we investigated the robustness of the $F0$ estimation algorithms when (a) reducing the sampling frequency from 44.1 to 8 kHz for each of the 65 actual speech signals and (b) contaminating each actual speech signal with 10 dB additive white Gaussian noise (AWGN) prior to the computation of the $F0$. A robust algorithm should produce similar outputs for the reduced-quality signals. Figure 5 illustrates the density estimates of the differences in the $F0$ values computed with respect to the original actual speech signals. Down-sampling the actual speech recordings from the original sampling frequency of 44.1 to 8 kHz mainly affects dypsa and shrp. praat_{1} and praat_{2} appear to be the least affected $F0$ estimation algorithms in terms of their $F0$ estimates. Interestingly, although the performance of the best individual $F0$ estimation algorithm (swipe) degraded considerably (the RMSE increased from 6.22 ± 13.46 Hz for the 44.1 kHz-sampled signals to 7.32 ± 15.37 Hz for the 8 kHz-sampled signals), the RMSE of the KF approach remained virtually unchanged (from 4.95 ± 9.19 Hz for the 44.1 kHz-sampled signals to 4.98 ± 9.25 Hz for the 8 kHz-sampled signals). That is, the KF approach is very robust in terms of accurately determining the $F0$ when the sampling frequency is reduced to 8 kHz. Similar findings were observed when contaminating the actual speech signals with 10 dB AWGN: the RMSE of the best individual algorithm (swipe) increased to 6.80 ± 15.25 Hz, but the RMSE of the KF approach changed only slightly (5.07 ± 9.30 Hz). We highlight the robustness of the KF fusion approach both for lower sampling frequency signals and in the presence of AWGN, whereas swipe and ndf both degraded considerably. Moreover, we stress that not only is the average performance of the KF better (reflected in the mean value), but it is also considerably more reliable (significantly lower standard deviation in both settings). dypsa, and to a lesser degree xsx, appear to be the $F0$ estimation algorithms most susceptible to noise, whereas praat_{1}, praat_{2}, ndf, and swipe are again very robust.
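The noise-contamination setting can be reproduced along the following lines; this is a sketch under our own assumptions (the 120 Hz sinusoid merely stands in for a phonation, and `add_awgn` is our name for the contamination step):

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Contaminate a signal with additive white Gaussian noise at a
    prescribed signal-to-noise ratio in dB."""
    rng = np.random.default_rng(0) if rng is None else rng
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / 10.0 ** (snr_db / 10.0)
    return signal + rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)

fs = 8000                              # the down-sampled rate used in the test
t = np.arange(fs) / fs                 # 1 s of signal
clean = np.sin(2 * np.pi * 120.0 * t)  # a 120 Hz stand-in for a phonation
noisy = add_awgn(clean, snr_db=10.0)
# empirical SNR of the realization, close to the nominal 10 dB:
snr_hat = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(round(snr_hat, 1))
```

The contaminated signal would then be passed to each $F0$ estimation algorithm in place of the original recording.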

## V. DISCUSSION AND SUMMARY

This study compared ten widely used $F0$ estimation algorithms, and investigated the potential of combining $F0$ estimation algorithms to provide $F0$ estimates for the sustained vowel /a/. We focused on $F0$ estimation algorithms which are widely used in clinical speech science, and on some recently proposed $F0$ estimation algorithms. We used two databases for our investigation: (a) a database with 117 synthetic speech signals generated with a sophisticated physiological model of speech production and (b) a database with 65 actual speech signals where simultaneous EGG recordings were available. Particular care was exercised to generate sustained vowel /a/ signals which may closely resemble pathological cases using the physiological model, and also signals with low $F0$, because these signals are particularly difficult for most of the commonly used $F0$ estimation algorithms.^{29} The ground truth $F0$ in the synthetic signals was inferred from the computation of the vocal fold cycles in the model, i.e., the computation of successive instances where the glottal area was minimized. The ground truth $F0$ in the actual speech signals was deduced using the sigma algorithm^{30} from the EGG recordings, and was also verified by visual inspection of the signals and the EGG plots. Therefore, in both cases the ground truth $F0$ is objective, and the aim of this study is to replicate it as accurately as possible using the speech signal alone. We remark that for the actual speech recordings the microphone and amplifier combination had a high-pass cut-off frequency that was low compared to the $F0$ in sustained vowel /a/ phonations. Raising the high-pass cut-off frequency may be beneficial for some $F0$ estimation algorithms but detrimental for others;^{33} moreover, in practice it is often desirable to use the higher frequencies of the spectrum for general voice assessment analysis (in addition to accurately determining $F0$).^{10} Therefore, we have not imposed a high-pass cut-off frequency which would have been closer to the upper limit of the expected $F0$ in the current application.

A ubiquitous problem in accurate $F0$ estimation is the presence of strong sub-harmonics.^{2,3} These sub-harmonics manifest as integer fractions of $F0$ in the spectrum, and in practice it is often difficult to determine whether the pitch period can be considered to be, for example, doubled as a result of the amplitude of the 1/2 sub-harmonic. Some of the $F0$ estimation algorithms use sophisticated methods to tackle the difficult task of overcoming sub-harmonics problems. For example, swipe imposes weight-decaying kernels on the first and prime harmonics of the signal to reduce the probability of mistakenly using the sub-harmonics as its $F0$ estimates.^{36} shrp explicitly identifies sub-harmonics and harmonics; the $F0$ is then determined depending on the value of the ratio of their sums.^{35} yin effectively relies on the autocorrelation function with two additional corrective terms that make it more robust to amplitude perturbations.^{37} It uses a free parameter for thresholding a normalized version of the autocorrelation function with the two corrective terms, in order to overcome the effect of strong sub-harmonics. tempo, ndf, and xsx use parabolic time-warping, using information in the harmonic structure to obtain the $F0$ estimates. praat and rapt do not use any explicit mechanism for mitigating the effect of sub-harmonics.
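To make the thresholding idea concrete, a minimal sketch of yin's cumulative-mean-normalized difference function is given below. This is illustrative only: the published algorithm additionally applies parabolic interpolation and a best-local-estimate search, and the parameter values here are assumptions:

```python
import numpy as np

def yin_f0(frame, fs, fmin=60.0, fmax=400.0, threshold=0.1):
    """Sketch of yin's core mechanism: the cumulative-mean-normalized
    difference (CMND) with an absolute threshold, which counters
    strong sub-harmonics by taking the *first* acceptable lag."""
    tau_max = int(fs / fmin)
    tau_min = int(fs / fmax)
    n = len(frame) - tau_max
    # difference function d(tau) = sum_t (x[t] - x[t+tau])^2
    d = np.array([np.sum((frame[:n] - frame[tau:tau + n]) ** 2)
                  for tau in range(tau_max + 1)])
    cmnd = np.ones_like(d)
    cmnd[1:] = d[1:] * np.arange(1, tau_max + 1) / np.maximum(
        np.cumsum(d[1:]), 1e-12)
    # first lag below the threshold (else the global minimum) ...
    below = np.where(cmnd[tau_min:tau_max] < threshold)[0]
    tau = tau_min + (below[0] if below.size
                     else int(np.argmin(cmnd[tau_min:tau_max])))
    # ... then descend to the local minimum past the crossing
    while tau + 1 < tau_max and cmnd[tau + 1] < cmnd[tau]:
        tau += 1
    return fs / tau

fs = 8000
t = np.arange(2048) / fs
f0_hat = yin_f0(np.sin(2 * np.pi * 150.0 * t), fs)
```

Taking the first lag below the threshold, rather than the global minimum, is what biases the estimate away from the longer lags produced by sub-harmonics.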

The results reported in Tables I and II strongly support the use of ndf and swipe as the most accurate individual $F0$ estimation algorithms. All $F0$ estimation algorithms occasionally deviated considerably from the ground truth, in particular yin and rapt. tempo was very inconsistent: its $F0$ contour estimates could be either accurate or largely inaccurate for the entire duration of the signal. The use of Gaussian windows in praat (praat_{2} in this study) is beneficial compared to Hamming windows (praat_{1} in this study), which is in agreement with Boersma's observation. swipe was the most consistent in terms of almost never deviating considerably from the ground truth. In Table I we have seen that, on average, $F0$ can be estimated to within less than 2.4 Hz deviation from the ground truth using ndf (swipe was slightly worse). Similarly, in Table II we reported that, on average, $F0$ can be estimated to within about 3 Hz deviation from the ground truth using swipe. In most cases the standard deviations of the errors (presented as the second term in the form mean ± standard deviation) are larger than the mean error value. In general, a high standard deviation indicates that the magnitude of the deviation between the $F0$ estimates of an algorithm and the ground truth $F0$ fluctuates substantially across samples, whereas a low standard deviation suggests that this deviation does not fluctuate considerably around the quoted mean value; hence, the mean error is then a good representation of the algorithm's $F0$ estimates compared to the ground truth $F0$. Therefore, a low standard deviation suggests that the quoted mean error of an $F0$ estimation algorithm can be trusted more (in that sense, the algorithm can be considered more reliable).
For example, swipe is not only noticeably more accurate in the database with the actual speech signals (lower mean error compared to the competing individual $F0$ estimation algorithms), but also more reliable (lower standard deviation). It could be argued that the good performance of swipe might merely reflect agreement with the sigma algorithm. However, the fact that swipe demonstrated overall excellent performance in both databases (one of which used data from synthetic speech signals generated by a sophisticated model where the ground truth $F0$ is known), and also that the “true” $F0$ in the database with actual speech signals was visually verified, strongly suggest that swipe appears to be very successful in accurately estimating $F0$ in sustained vowel /a/ phonations.

Figure 3 presents graphically the number of times each of the $F0$ estimation algorithms was closest to the ground truth $F0$ (for samples and also for signals in each of the two databases). However, these plots should be interpreted cautiously: first, the histograms do not quantify how much better an $F0$ estimation algorithm is compared to competing approaches for a particular sample (or signal); second, in those samples (or signals) where an $F0$ estimation algorithm is not best, its estimates might deviate considerably from the ground truth, and this is not reflected in the histogram. Therefore, although the plots in Fig. 3 nicely illustrate which $F0$ estimation algorithm outperformed the competing algorithms for individual samples (signals), for the purposes of assessing the overall performance of the $F0$ estimation algorithms one should be primarily interested in the results reported in Tables I and II.

Overall, the time-domain correlation based approaches investigated here (yin, praat, rapt) perform considerably worse than alternative $F0$ estimation algorithms such as ndf and swipe. In their current implementations, yin, praat, and rapt are prone to producing large deviations from the ground truth. This finding may reflect the inherent limitations of the tools based on linear systems theory (autocorrelation and cross-correlation) used in yin, praat, and rapt. For example, autocorrelation is sensitive to amplitude changes.^{37} Moreover, autocorrelation and cross-correlation inherently assume that the information contained in the signal can be expressed using the first two central moments; they are therefore suitable for Gaussian signals which may be embedded in noise, but fail to take into account nonlinear aspects of non-Gaussian signals. There is strong physiological and empirical evidence suggesting that speech signals (including research on sustained vowels) are stochastic or even chaotic, particularly in pathological cases.^{2,44,45} Therefore, nonlinear speech signal processing tools may be necessary to quantify some properties of speech signals. Interestingly, the two most successful $F0$ estimation algorithms in this study, ndf and swipe, rely on nonlinear properties of the speech signals to determine the $F0$ values.
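For reference, the core of the time-domain family discussed above reduces to something like the following naive autocorrelation estimator. This is a sketch of the shared underlying idea only; praat and rapt add windowing, normalization, candidate generation, and dynamic programming on top of it:

```python
import numpy as np

def autocorr_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Naive autocorrelation pitch estimator: pick the lag (within the
    admissible pitch-period range) that maximizes the autocorrelation."""
    frame = frame - np.mean(frame)
    # one-sided autocorrelation, r[tau] for tau = 0 .. len(frame)-1
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    tau_min, tau_max = int(fs / fmax), int(fs / fmin)
    tau = tau_min + int(np.argmax(r[tau_min:tau_max]))
    return fs / tau

fs = 8000
t = np.arange(2048) / fs
f0_est = autocorr_f0(np.sin(2 * np.pi * 150.0 * t), fs)
```

On a clean stationary tone this works well; the limitations described in the text (amplitude sensitivity, reliance on second-order statistics) arise for amplitude-perturbed and non-Gaussian signals.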
swipe identifies the harmonics in the square root of the spectrum and imposes kernels with harmonically decaying weights.^{36} Conceptually, the approach in swipe can be compared to kernel density estimation, which is widely considered one of the best non-parametric methods to estimate the unknown density of a continuous random variable, or to the extension of linear concepts to nonlinear cases (for example, principal component analysis and kernel principal component analysis, or standard linear support vector machines and kernel based support vector machines).^{19} Therefore, introducing kernels at harmonic locations and weighting the entire harmonic spectrum to determine $F0$ may explain the success of swipe over competing $F0$ estimation algorithms which rely on standard harmonic analysis of the spectrum (for example, shrp). ndf is a combination of an interval based extractor (based on autocorrelations at pre-specified Gaussian filterbanks) and an instantaneous frequency based extractor, relying on a Gabor filterbank and the Hilbert transform (which promotes the local properties of the signal). The final $F0$ estimate for a particular signal segment is decided by weighting the $F0$ estimates with a signal to noise ratio (SNR) estimation procedure (which is conceptually comparable to the SQIs introduced in this study). Effectively, ndf combines two different approaches in one algorithm: the standard linear autocorrelation approach, with some modifications for incorporating the SNR of each of the studied frequency bands, and a more complicated Hilbert-transform based weighting of Gabor filters (which is the tempo algorithm).^{39} The success of ndf may be attributed to incorporating information from both the modified, SNR-weighted autocorrelation of each frequency band and the Hilbert-transform-promoted estimates from the Gabor filterbank.

praat and rapt use dynamic programming, a potentially powerful optimization tool, to determine the best $F0$ value among a pool of $F0$ candidate values for a particular signal segment (e.g., 10 ms as in this study), so one might expect these algorithms to provide accurate $F0$ estimates. However, dynamic programming in the context of $F0$ estimation is a *post-processing* technique which relies heavily on the determination of good candidate $F0$ values, and requires the careful optimization of a number of free parameters. In addition to the limitations of autocorrelation (praat) and cross-correlation (rapt), a further possible reason for the relative failure of praat and rapt is that the dynamic programming parameters have probably been optimized by the developers of these $F0$ estimation algorithms for running speech rather than for sustained vowels.
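The post-processing step can be illustrated with a minimal Viterbi-style recursion over per-frame $F0$ candidates. The candidate costs and the `jump_penalty` weight below are hypothetical, and this is only a sketch of the idea, not praat's or rapt's actual cost functions:

```python
import numpy as np

def dp_track(candidates, costs, jump_penalty=0.5):
    """Dynamic-programming F0 tracker: per frame, choose the candidate
    minimizing local cost plus a penalty on large log-frequency jumps
    between consecutive frames (Viterbi recursion + backtracking)."""
    n_frames, n_cand = costs.shape
    total = costs[0].astype(float)
    back = np.zeros((n_frames, n_cand), dtype=int)
    for t in range(1, n_frames):
        # transition cost: |log ratio| between previous and current candidates
        trans = jump_penalty * np.abs(
            np.log(candidates[t][None, :] / candidates[t - 1][:, None]))
        step = total[:, None] + trans      # rows: previous, cols: current
        back[t] = np.argmin(step, axis=0)
        total = costs[t] + np.min(step, axis=0)
    path = [int(np.argmin(total))]         # backtrack the cheapest path
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return np.array([candidates[t][k] for t, k in enumerate(path)])

# Frame 1's local costs alone would pick the octave-down candidate (75 Hz);
# the transition penalty keeps the track on 150 Hz.
cands = np.array([[150.0, 75.0]] * 3)
costs = np.array([[0.1, 0.5], [0.4, 0.3], [0.1, 0.5]])
track = dp_track(cands, costs)
```

The example also shows why this is only a post-processing safeguard: if the candidate list itself omits the true $F0$, no transition penalty can recover it.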

The results in Tables I and II demonstrate that the adaptive KF approach consistently outperforms both the best individual $F0$ estimation algorithm and the simple linear ensembles. We stress that the adaptive KF improved the accuracy of the $F0$ estimates by 16.2% in the database with the synthetic signals, and by 25.6% in the database with the actual speech signals. Notably, this improvement is not only substantial, but also consistent across the vast majority of speech signals in both databases (see Fig. 4). Moreover, the adaptive KF is more reliable than the individual $F0$ estimation algorithms: in addition to exhibiting lower average deviation from the ground truth (reflected in the mean value of the error), the standard deviation around the mean value of the quoted error (e.g., RMSE) was consistently lower than that of competing approaches in all experiments. Furthermore, the KF approach was shown to be very robust (Sec. IV C): whereas the best individual $F0$ estimation algorithms (ndf and swipe) degraded considerably with increasing noise and lower sampling frequency, the KF approach was only marginally affected. Additional tests not shown in this study demonstrate that simple naive benchmarks, such as the mean or median of the best subset of the $F0$ estimation algorithms, are also considerably worse than the KF.

We have investigated the robustness of the algorithms in two settings: (a) reducing the sampling frequency of the actual speech signals from 44.1 to 8 kHz and (b) introducing AWGN to the actual speech signals. Although the speech scientists' recommendation for voice quality assessment is that the sampling frequency should be at least 20 kHz,^{2} in practice we might not have adequate resources to record such high-quality signals (for example, when recording signals over the telephone). Overall, in both cases we found only a small performance degradation in terms of accurate $F0$ estimation for most of the investigated $F0$ estimation algorithms; moreover, we verified the robustness of the proposed adaptive KF approach, where the performance degradation in terms of estimating the true $F0$ values was practically negligible.

The current findings are confined to the sustained vowel /a/, and therefore cannot be generalized to all speech signals solely on the evidence presented here. It would be interesting to compare the $F0$ estimation algorithms studied here, including the approaches for combining the individual $F0$ estimation algorithms, for other sustained vowels (most relevant would be the other corner vowels, which are also sometimes used in voice quality assessment^{4}). Future work could also investigate more sophisticated combinations of $F0$ estimation algorithms to build on the promising results of this study.

The adaptive KF approach described in this study is an extension of the approach proposed by Li *et al.*^{20} We developed a new methodology using SQIs (which can be thought of as *algorithmic robustness metrics*) where the confidence in the successive estimates of the $F0$ estimation algorithms is directly used to update both the measurement noise covariance and the state noise covariance. This was achieved using prior confidence in the individual $F0$ estimation algorithms, taking into account their interaction in terms of the differences among their estimates, and the difference from the *a priori* estimate, which is assumed to be constant over successive time frames. We remark that our approach, where all sources are *collectively* used to feed the adaptive KF, is essentially different from the methodology of Li *et al.*,^{20} where each source was introduced *independently* to the KF and the fusion of the different estimators was achieved in a subsequent step. The advantage of the new adaptive KF scheme is that we can jointly determine our confidence in the estimates of each $F0$ estimation algorithm by adjusting the SQIs, seamlessly integrating the entire process within the KF framework. The proposed methodology may find use in diverse applications relying on the adaptive KF, assuming the signal quality indices are suitably defined. For example, the presented methodology could be used in the applications studied by Li *et al.*^{20} (heart rate assessment) and Nemati *et al.*^{43} (respiration rate assessment). The adaptive KF is computationally inexpensive, and hence the proposed methodology may also be useful in real-time processing applications.
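The flavor of SQI-weighted fusion can be sketched as follows. This is an illustrative simplification under our own assumptions, not the paper's exact update rules: the mapping of each algorithm's SQI to a measurement-noise variance (`r0 / SQI`), the random-walk state model, and the parameter values are all assumed for illustration:

```python
import numpy as np

def fuse_f0(estimates, sqi, q=1.0, r0=4.0):
    """Illustrative SQI-weighted Kalman fusion of several F0 estimators.
    Each algorithm's measurement-noise variance is r0/SQI, so a higher
    SQI means more trust; the state is a random walk with variance q."""
    n_frames, n_alg = estimates.shape
    r = r0 / np.asarray(sqi, dtype=float)            # per-algorithm variance
    x = np.average(estimates[0], weights=1.0 / r)    # initial state
    p = float(np.mean(r))                            # initial state variance
    track = [x]
    for t in range(1, n_frames):
        p = p + q                                    # predict (random walk)
        for k in range(n_alg):                       # sequential scalar updates
            g = p / (p + r[k])                       # Kalman gain
            x = x + g * (estimates[t, k] - x)
            p = (1.0 - g) * p
        track.append(x)
    return np.array(track)

# One algorithm makes an octave error at frame 2; its low SQI keeps the
# fused track close to the true 150 Hz.
est = np.full((5, 2), 150.0)
est[2, 1] = 75.0
fused = fuse_f0(est, sqi=[3.0, 0.5])
```

In the paper's scheme the SQIs are themselves adapted over time from the behavior of the estimators; here they are fixed, purely to show how the weighting shapes the fused track.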

## ACKNOWLEDGMENTS

This study was partly supported by the Wellcome Trust through a Centre Grant No. 098461/Z/12/Z, “The University of Oxford Sleep and Circadian Neuroscience Institute (SCNi),” and partly from NIH Grant No. 1R43DC010956-01 which supported the data collection. M.Z. was supported by UTFSM and CONICYT, Grant No. FONDECYT 11110147. M.A.L. acknowledges the financial support of the Wellcome Trust, Grant No. WT090651MF.

## REFERENCES
