The efficient measurement of the threshold and slope of the psychometric function (PF) is an important objective in psychoacoustics. This paper proposes a procedure that combines a Bayesian estimate of the PF with either a look one-ahead or a look two-ahead method of selecting the next stimulus presentation. The procedure differs from previously proposed algorithms in two respects: (i) it does not require the range of possible PF parameters to be specified in advance and (ii) the sequence of probe signal-to-noise ratios optimizes the threshold and slope estimates at a performance level, $\varphi $, that can be chosen by the experimenter. Simulation results show that the proposed procedure is robust and that the estimates of both threshold and slope have a consistently low bias. Over a wide range of listener PF parameters, the root-mean-square errors after 50 trials were ∼1.2 dB in threshold and 0.14 in log-slope. It was found that the performance differences between the look one-ahead and look two-ahead methods were negligible and that an entropy-based criterion for selecting the next stimulus was preferred to a variance-based criterion.

## I. INTRODUCTION

The performance of a listener on a psychoacoustic task is conveniently characterized by the psychometric function (PF), which gives the listener's response as a function of a stimulus parameter such as intensity or signal-to-noise ratio (SNR). For example, a widely used PF in the field of speech intelligibility expresses the fraction of words that are correctly identified in a noisy speech sample as a function of the SNR of the speech sample (Miller *et al.*, 1951). The primary goal of a speech intelligibility test is generally to measure the speech-reception threshold ($\mathrm{SRT}_{\varphi}$), i.e., the SNR at which a particular fraction, $\varphi$, of words are correctly identified (Plomp and Mimpen, 1979; Wilson *et al.*, 2007). It is also, however, often important to measure the slope of the PF at the SRT (MacPherson and Akeroyd, 2014; Neuman *et al.*, 2010), since it is the slope that determines the perceptual benefit that can be obtained by improving the SNR.

This paper presents an adaptive procedure for measuring both the SRT and the slope of the PF. The procedure differs from previously proposed algorithms in two respects: (i) it does not require the range of possible PF parameters to be specified in advance and (ii) the sequence of probe SNRs is selected to optimize the estimates of the SRT and the slope of the PF at a performance level, $\varphi $, that can be chosen by the experimenter. The procedure makes efficient use of a limited number of measurements, has low bias, and is robust to listener variation. The primary application is in speech intelligibility testing but it is also applicable to other domains for which a suitable monotonic PF model exists.

### A. PF models

A number of sigmoid-shaped functions have been used to provide a parametric model of the PF, including the Weibull, Logistic, cumulative Gaussian, and Gumbel distributions (Wichmann and Hill, 2001). A common choice for speech intelligibility is the four-parameter transformed Logistic model (Lam *et al.*, 1996; King-Smith and Rose, 1997; Shen and Richards, 2012) given by

$$\Psi(x) = \gamma + \frac{1 - \gamma - \lambda}{1 + \exp\left(-\tilde{\beta}\,(x - \tilde{\alpha})\right)}, \tag{1}$$

where $\Psi(x)$ gives the proportion of words correctly identified at an SNR of $x$ dB. The parameters $\tilde{\alpha}$ and $\tilde{\beta}$ determine the SRT and slope as discussed below, while the remaining two parameters are the guess rate, *γ*, and the lapse rate, *λ*. In a context-free forced-choice experiment, *γ* is equal to the reciprocal of the number of alternative choices, while *λ* is the probability of making errors at very high SNRs due to indistinct stimuli, listener hearing impairment, or listener inattention. The slope of the PF at *x* may be calculated as

$$\Psi'(x) = \frac{\tilde{\beta}\,\left(\Psi(x) - \gamma\right)\left(1 - \lambda - \Psi(x)\right)}{1 - \gamma - \lambda}. \tag{2}$$
A model-independent parameterization can be obtained by selecting a convenient value of $\varphi$ (e.g., $\varphi = 0.5$ or 0.75) and defining $\alpha_{\varphi}$ and $\beta_{\varphi}$ to satisfy

$$\Psi(\alpha_{\varphi}) = \varphi \quad \text{and} \quad \Psi'(\alpha_{\varphi}) = \beta_{\varphi}. \tag{3}$$
For any given $\varphi$, $\alpha_{\varphi}$, and $\beta_{\varphi}$, the parameters $\tilde{\alpha}$ and $\tilde{\beta}$ in Eq. (1), respectively, can be found from the invertible relationships

$$\tilde{\alpha} = \alpha_{\varphi} - \frac{1}{\tilde{\beta}} \ln\!\left(\frac{\varphi - \gamma}{1 - \lambda - \varphi}\right) \tag{4}$$

and

$$\tilde{\beta} = \frac{\beta_{\varphi}\,(1 - \gamma - \lambda)}{(\varphi - \gamma)(1 - \lambda - \varphi)}. \tag{5}$$
For the special case of $\varphi = 0.5(1 - \lambda + \gamma)$, these equations reduce to $\tilde{\alpha} = \alpha_{\varphi}$ and $\tilde{\beta} = 4\beta_{\varphi}(1 - \gamma - \lambda)^{-1}$. The solid curve in Fig. 1 illustrates the PF generated by Eq. (1) using the parameters listed in Fig. 1. For very low and very high values of *x*, the PF asymptotes are *γ* and $1 - \lambda$, respectively. In this example $\varphi = 0.75$ and the circle on the PF identifies the point where $\Psi(\alpha_{\varphi}) = \varphi$. The tangent to the PF at this point is shown as a dashed line and has the gradient $\Psi'(\alpha_{\varphi}) = \beta_{\varphi}$. For clarity, the subscripts are omitted from $\alpha_{\varphi}$ and $\beta_{\varphi}$ in the remainder of this paper.
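As a concrete illustration, the PF model and the reparameterization above can be sketched in Python. The explicit Logistic form and the inversion formulas below are reconstructed from the special case quoted in the text, so they should be read as a sketch under those assumptions rather than as the paper's verbatim equations:

```python
import math

def psi(x, alpha_t, beta_t, gamma=0.01, lam=0.02):
    """Transformed Logistic PF: guess rate gamma, lapse rate lam."""
    return gamma + (1.0 - gamma - lam) / (1.0 + math.exp(-beta_t * (x - alpha_t)))

def to_model_params(alpha_phi, beta_phi, phi, gamma=0.01, lam=0.02):
    """Convert the model-independent (alpha_phi, beta_phi) at performance
    level phi into the model parameters (alpha~, beta~)."""
    beta_t = beta_phi * (1.0 - gamma - lam) / ((phi - gamma) * (1.0 - lam - phi))
    alpha_t = alpha_phi - math.log((phi - gamma) / (1.0 - lam - phi)) / beta_t
    return alpha_t, beta_t
```

With these definitions, $\Psi(\alpha_\varphi) = \varphi$ holds by construction, and the special case $\varphi = 0.5(1 - \lambda + \gamma)$ recovers $\tilde{\alpha} = \alpha_\varphi$.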

### B. Methods for estimating the PF

A widely used procedure for measuring PFs is the method of constant stimuli (McKee *et al.*, 1985). In this approach, the PF is sampled at a number of predetermined probe levels, and multiple tests are performed at each level in a randomized order to obtain the percent-correct score at each level. The approach has the advantage that it does not need to assume a specific PF model but it is normally less efficient than methods that use a parametric model for the PF, and so it may require a large number of trials in order to obtain a good PF estimate (Leek, 2001). Notwithstanding this, the method of constant stimuli can give reliable results in a reasonable number of trials especially if prior knowledge about the underlying PF is available.

To improve the efficiency of measuring the SRT and/or slope parameters, a large number of adaptive procedures have been proposed (Hall, 1981; King-Smith *et al.*, 1994; King-Smith and Rose, 1997; Brand and Kollmeier, 2002; Shen and Richards, 2012). Among the most popular are the up-down staircase procedures (Dixon and Mood, 1948; Levitt, 1971). The idea underlying these methods is to adapt the probe SNR at trial *n* according to the listener's responses in the preceding trials so that convergence toward the SRT can be achieved rapidly. From an initial stimulus with a high expected intelligibility, the probe SNR is typically decreased or increased according to whether the listener's previous response was correct or incorrect. In order to ensure rapid convergence, the step size governing the change in probe SNR can be adaptively selected (Kaernbach, 1991). An extension of this is the transformed up-down procedure (Levitt, 1971) in which the SNR change depends on the responses in two or more of the preceding trials. For example, with the two-down, one-up rule, the SNR is decreased after two successive correct responses, but increased after only a single incorrect response. By fitting a parametric model to the sequence of responses using a maximum likelihood criterion, it is possible to estimate both the SRT and slope at a chosen performance level. It has, however, been found that the slope estimates obtained in this way may have considerable bias (Leek *et al.*, 1992; Kaernbach, 2001; Treutwein and Strasburger, 1999).

Wetherill (1963) derived expressions for the asymptotic variance of the maximum likelihood estimators of $\tilde{\alpha}$ and $\tilde{\beta}$ in Eq. (1) assuming that $\gamma = \lambda = 0$. He showed that the variance was minimized when the probe levels, *x*, satisfied $\Psi(x) = 0.5$ for $\tilde{\alpha}$ and $\Psi(x) \in \{0.083, 0.917\}$ for $\tilde{\beta}$. These optimum probe levels were subsequently denoted the “sweetpoints” of the PF by Laming and Marsh (1988). When wishing to estimate $\tilde{\alpha}$ and $\tilde{\beta}$ jointly, Wetherill (1963) proposed minimizing the product of their variances, leading to sweetpoints satisfying $\Psi(x) \in \{0.176, 0.824\}$; he found that there was no benefit in probing at $\Psi(x) = 0.5$ in addition to these two sweetpoints. O'Regan and Humbert (1989) generalized this work and determined the sweetpoint locations for estimating $\tilde{\alpha}$ and/or $\tilde{\beta}$ when $\gamma \neq 0$ in Eq. (1).

In order to perform concurrent slope and threshold estimation, Lam *et al.* (1996) proposed an adaptive multimodal sampling scheme for the PF. Based on the analysis in Levitt (1971), four probe levels satisfying $\Psi(x) \in \{0.02, 0.16, 0.84, 0.98\}$ were sampled in each iteration. The four resultant listener responses were used to update the PF model parameters using a nonlinear fitting procedure and the sweetpoint positions were then recalculated for the next iteration.

Instead of fitting a PF model to the observed data after every new listener response, Green (1993) sampled the space of possible PF parameters onto a fixed two-dimensional grid covering 4 values of *γ* and 120 values of *α* for a total of 480 grid points. After each listener response, the likelihood of each point on the grid was updated to a new value as described in Watson and Pelli (1983). The probe level at the next trial was then determined as the *α*-sweetpoint of the PF corresponding to the maximum-likelihood point on the grid. A similar approach was used by King-Smith and Rose (1997) with a two-dimensional grid in $\tilde{\alpha}$ and $\log \tilde{\beta}$. Using the PF defined by the mean of the posterior distribution, the probe level at each trial was randomly selected from the two sweetpoints $\Psi(x) \in \{0.12, 0.88\}$. In Brand and Kollmeier (2002), the same technique was used in conjunction with two randomly interleaved adaptive-staircase procedures converging to the sweetpoints $\Psi(x) \in \{0.2, 0.8\}$ that were chosen as a compromise between the *α* and *β* sweetpoints. After each trial, the maximum likelihood PF was estimated, and the SNRs corresponding to the sweetpoints recomputed. Shen and Richards (2012) proposed a similar maximum-likelihood estimation of the parameters, but included the lapse rate, *λ*, as an additional unknown parameter. A sweetpoint for this extra parameter was derived, and the next observation placement was computed by selecting one of the *α*, *λ*, or *β* sweetpoints of the current maximum-likelihood PF estimate. Two selection schemes were evaluated: one used an adaptive procedure while the other chose randomly between the sweetpoints and was found to converge slightly faster to the SRT estimate.

In a Bayesian adaptive procedure, the search grid is viewed as a probability distribution over the parameter space. Before the first trial, the grid is initialized with a prior distribution that encapsulates any prior knowledge about the distribution of the PF parameters. Following each trial, the posterior distribution at each point on the grid is updated using Bayes' theorem [see Eq. (6)]. At the end of the experiment, the final parameter estimates may be taken as either the mean or the mode of the posterior distribution. Rather than selecting the probe SNR to lie at an estimated sweetpoint, it is chosen in King-Smith *et al.* (1994) and Gaubitch *et al.* (2010) to minimize the expected variance of the posterior distribution at the completion of the next trial (or alternatively the next two trials). A similar approach is adopted in the $\Psi $ procedure described in Kontsevich and Tyler (1999), but here the next probe is chosen to minimize the expected entropy of the distribution rather than its variance.

The present research extends the work on Bayesian adaptive methods and is outlined as follows. In Sec. II, the Bayesian adaptive framework for estimating PF parameters using an adaptive evaluation grid is presented. The placement of optimal degradation levels using both look one-ahead and look two-ahead methods is also derived. In Sec. III, simulations are run to test the efficiency and accuracy of the proposed estimation procedure under various test conditions. The simulation results are then discussed in Sec. IV, and conclusions are drawn in Sec. V.

## II. BAYESIAN ADAPTIVE ESTIMATION

In this section, a Bayesian method for estimating the PF is presented that extends the work of Kontsevich and Tyler (1999) and Gaubitch *et al.* (2010). Section II A details the estimation of the PF parameters from their posterior distribution and describes the adaptive rescaling and resampling scheme that is used in the algorithm. Section II B describes the next probe SNR selection based on the minimization of the expected cost after one or two predicted observations.

### A. Sequential Bayesian estimation

#### 1. Estimation procedure

In the following, $\theta = [\alpha, \log(\beta)]^T$ represents the threshold and log-slope parameters of a PF, where $\log(\cdot)$ denotes a base-10 logarithm. Parameterizing the PF using the slope in the log domain is convenient for several reasons: it is unbounded, its posterior distribution is found to be approximately Gaussian, and errors in the log-slope are approximately proportional to percentage errors in the slope. The parameter space is discretized with *K* grid points, $\theta_k$, for $k = 1, \ldots, K$. The parameterized PF, $\Psi(x|\theta_k)$, gives the probability of observing a correct response at an SNR of $x$ dB with PF parameters $\theta_k$. The guess rate, *γ*, and lapse rate, *λ*, are assumed to be fixed in advance; the choice of *λ* is discussed in Sec. II C.

At trial *n*, a degraded speech sample at an SNR of $x_n$ dB is presented to a listener and $r_n = 1$ or 0 according to whether their response is correct or incorrect, respectively. The row vector $z_n = [x_1, r_1, \ldots, x_n, r_n]$, which is always of even length, contains all the observed data up to trial *n*, and the discretized probability distribution, $P(\theta_k|z_n)$, gives the posterior distribution of the PF parameters after *n* trials. After trial *n*, the posterior distribution $P(\theta_k|z_n)$ is updated using Bayes' theorem,

$$P(\theta_k|z_n) = \frac{P(r_n|x_n, \theta_k)\, P(\theta_k|z_{n-1})}{\sum_{j=1}^{K} P(r_n|x_n, \theta_j)\, P(\theta_j|z_{n-1})}, \tag{6}$$

with

$$P(r_n|x_n, \theta_k) = \begin{cases} \Psi(x_n|\theta_k) & \text{if } r_n = 1,\\ 1 - \Psi(x_n|\theta_k) & \text{if } r_n = 0, \end{cases} \tag{7}$$

and $P(\theta_k|z_0)$ as the prior distribution.

Following trial *n*, the mean, $\mu(z_n)$, and variance, $v(z_n)$, of the parameter vector $\theta = [\alpha, \log(\beta)]^T$ are given by

$$\mu(z_n) = \sum_{k=1}^{K} \theta_k\, P(\theta_k|z_n) \tag{8}$$

and

$$v(z_n) = \sum_{k=1}^{K} \left(\theta_k - \mu(z_n)\right)^2 P(\theta_k|z_n), \tag{9}$$

where, in Eq. (9), $(\cdot)^2$ acts elementwise on a vector.
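The Bayesian update of Eq. (6) and the moments of Eqs. (8) and (9) can be sketched on a discretized grid as follows; the array layout and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def bayes_update(prior, like):
    """One trial of the Bayesian update: 'prior' is P(theta_k | z_{n-1}) over
    the K grid points and 'like' is P(r_n | x_n, theta_k), i.e. Psi(x_n|theta_k)
    for a correct response or 1 - Psi(x_n|theta_k) for an incorrect one."""
    post = like * prior
    return post / post.sum()

def posterior_moments(post, thetas):
    """Posterior mean and elementwise variance of theta = [alpha, log(beta)]^T;
    'thetas' is a K x 2 array of grid points."""
    mu = (post[:, None] * thetas).sum(axis=0)
    var = (post[:, None] * (thetas - mu) ** 2).sum(axis=0)
    return mu, var
```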

In intelligibility tests that use sentences as tokens, either a sentence may be scored as a single unit or else each content word in the sentence may be scored independently. In the latter case, the estimated PF relates to word intelligibility. Context effects, however, mean that the constituent words in a sentence are not perceived independently. This phenomenon may be modeled by modifying the expression for $P(rn|xn,\theta k)$ in Eq. (7) to take into account the number of correct and incorrect responses in the sentence together with the *j* factor described in Boothroyd and Nittrouer (1988).

#### 2. Approximation of the posterior distribution

The parameter space is sampled by a two-dimensional evaluation grid of possible values for $\theta $. If the grid were fixed, it would need to encompass the entire range of possible parameter values and be fine enough to provide adequate precision for accurate estimation of the PF parameters. In order to avoid imposing any prior restriction on the possible values of $\theta $, the evaluation grid is instead adaptively rescaled and resampled as required.

To obtain an updated evaluation grid, the current estimate of the posterior distribution $P(\theta|z_n)$ is used to compute the mean, $\mu(z_n)$, and variance, $v(z_n)$, according to Eqs. (8) and (9). The rescaled evaluation grid is then defined, for each component, to cover the range $\mu(z_n) \pm 4\sqrt{v(z_n)}$. The choice of ±4 standard deviations ensures (as a consequence of Chebyshev's inequality) that, regardless of the distribution's shape, at least 93.75% of its probability mass will lie within the search grid.

Whenever the grid is changed in either axis, it is necessary to interpolate the old values of $P(\theta|z_n)$ onto the new grid. The method of evaluating $P(\theta|z_n)$ for the rescaled grid depends on its relationship to the existing grid:

- If the limits of the rescaled grid match those of the existing grid within ±10%, then no rescaling is performed. This avoids repeated interpolation by small amounts.
- If the range of the rescaled grid lies entirely within the existing grid, then the new values of $P(\theta|z_n)$ are calculated by quadratic interpolation in $\log(P(\theta|z_n))$. This form of interpolation is chosen because it has low computational complexity and is exact for a Gaussian distribution. The interpolation procedure is illustrated in Fig. 2, in which the left plot depicts $\log(P(\theta|z_n))$ for a grid covering the range $\alpha \in [-4, 2]$ and $\log(\beta) \in [-4, 0.5]$. In the right plot this has been interpolated in the $\log(\beta)$ direction onto a grid covering $\alpha \in [-4, 2]$ and $\log(\beta) \in [-2.7, -0.6]$.
- If neither of the previous cases applies, $P(\theta|z_n)$ is recalculated from scratch from the values in $z_n$. Although this recalculation entails additional computation, it happens relatively rarely since $v(z_n)$ normally decreases with *n* after the first few trials.
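The three-way decision can be sketched per grid axis; the interpretation of the ±10% tolerance as a fraction of the grid span is an assumption made here:

```python
def rescale_action(old_lo, old_hi, new_lo, new_hi, tol=0.1):
    """Classify how one axis of the evaluation grid should be updated:
    'keep' if the new limits match the old ones within tol of the span,
    'interpolate' if the new range lies entirely inside the old grid,
    'recompute' otherwise (the posterior is re-evaluated from the raw trials)."""
    span = old_hi - old_lo
    if abs(new_lo - old_lo) <= tol * span and abs(new_hi - old_hi) <= tol * span:
        return "keep"
    if new_lo >= old_lo and new_hi <= old_hi:
        return "interpolate"
    return "recompute"
```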

This adaptive rescaling and resampling of the evaluation grid serves two purposes. From a computational point of view, it allows us to keep the number of discretized values for $\theta$ to a minimum because the grid contracts around the mean, $\mu(z_n)$, of the posterior distribution. This ensures that the method remains computationally tractable while still offering a fine discretization of the parameter space at convergence. The principal benefit of the technique is that it avoids the need to specify the range of possible parameter values in advance, since the grid will automatically expand if required.

An example of the grid adaptation is illustrated in Fig. 3, which shows, as a function of trial number, the bounds of the evaluation grid (dashed lines) and the current SRT estimate. In this example, the true SRT is $30$ dB, which lies outside the initial range for the SRT evaluation grid of $\pm 20$ dB. The first two responses result in an SRT estimate that approaches the upper limit of the evaluation grid and this causes the grid to expand rapidly to ensure that the SRT estimate lies near the center of the grid. From trial 7 onward, the grid contracts as both the SRT estimate and the evaluation grid converge together. The process is very stable and self-correcting. If, for example, a sequence of spurious early responses were to cause the evaluation grid to converge initially around a false maximum in $P(\theta|z_n)$, subsequent responses would drive the estimated SRT to the edge of this grid, which would trigger it to expand sufficiently to include the true SRT.

### B. Next probe SNR selection

In order to obtain a reliable estimate of $\theta$ with as few trials as possible, the SNR at which to probe at the next trial must be chosen carefully so as to optimize the efficiency. Thus, the next probe SNR is selected in order to minimize a cost function, $c(z_n)$, that measures the uncertainty in the estimate of $\theta$. Two alternative cost functions are considered here: (i) the variance of the $\theta$ estimate, $v(z_n)$, which is computed from Eq. (9), and (ii) the differential entropy, $h(z_n)$, which is computed as

$$h(z_n) = \begin{bmatrix} -\sum_k P(\alpha_k|z_n) \log\!\left(\dfrac{P(\alpha_k|z_n)}{\Delta\alpha}\right) \\[2ex] -\sum_k P(\log(\beta_k)|z_n) \log\!\left(\dfrac{P(\log(\beta_k)|z_n)}{\Delta\log\beta}\right) \end{bmatrix}, \tag{10}$$

in which $P(\alpha_k|z_n)$ and $P(\log(\beta_k)|z_n)$ are obtained by marginalizing $P(\theta_k|z_n)$ over $\log(\beta_k)$ and $\alpha_k$, respectively. The factors $\Delta\alpha$ and $\Delta\log\beta$ are the evaluation grid spacings and their inclusion ensures that the value of $h(z_n)$ is largely unaffected when the grid is rescaled. Although Kontsevich and Tyler (1999) suggest minimizing the entropy of the joint distribution, $P(\theta_k|z_n)$, this does not guarantee that the marginal distributions of the individual components have low entropy or low variance. Accordingly, the component entropies are included separately in $h(z_n)$ and the algorithm minimizes their weighted average as described in Eqs. (13) and (14). The entropy and the variance are both measures of how concentrated the distribution is around its mode or mean; the difference between them is that the variance penalizes large deviations from the mean more heavily.
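A sketch of the entropy cost on a 2-D grid, using the grid-spacing correction described above; the exact form of Eq. (10) is reconstructed here and should be treated as an assumption:

```python
import numpy as np

def entropy_cost(post, d_alpha, d_logbeta):
    """Differential entropies of the two marginals of P(theta | z_n).
    'post' is an (n_alpha, n_logbeta) array of grid probabilities summing to 1;
    dividing by the grid spacing makes the result insensitive to rescaling."""
    eps = 1e-300                      # guard against log(0)
    p_a = post.sum(axis=1)            # marginal P(alpha_k | z_n)
    p_b = post.sum(axis=0)            # marginal P(log(beta_k) | z_n)
    h_a = -np.sum(p_a * np.log(np.maximum(p_a, eps) / d_alpha))
    h_b = -np.sum(p_b * np.log(np.maximum(p_b, eps) / d_logbeta))
    return np.array([h_a, h_b])
```

For a uniform distribution each component reduces to the log of the grid extent, which is independent of how finely the grid is sampled.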

#### 1. Look one-ahead

A set, *X*, of possible probe SNRs is considered at each trial; the choice of *X* is discussed in Sec. II B 3. For all combinations of potential probe SNR, $x \in X$, and listener response, $r \in \{0, 1\}$, the posterior distribution of $\theta_k$ is found by using the Bayesian update

$$P(\theta_k|[z_n, x, r]) = \frac{P(r|x, \theta_k)\, P(\theta_k|z_n)}{\sum_{j=1}^{K} P(r|x, \theta_j)\, P(\theta_j|z_n)}. \tag{11}$$

From this, the cost function, $c([z_n, x, r])$, can be evaluated as either the variance, $v([z_n, x, r])$, from Eq. (9) or the differential entropy, $h([z_n, x, r])$, from Eq. (10).

The probability, $P(r|x, z_n)$, of observing response *r* from probe SNR *x* is computed by marginalizing $P(r, \theta|x, z_n)$ over $\theta$ as

$$P(r|x, z_n) = \sum_{k=1}^{K} P(r|x, \theta_k)\, P(\theta_k|z_n). \tag{12}$$
The probe SNR for trial $n+1$ is then chosen to minimize the weighted sum of the parameter costs in $c([z_n, x, r])$. That is,

$$x_{n+1} = \arg\min_{x \in X} \sum_{r \in \{0,1\}} P(r|x, z_n)\, w^T c([z_n, x, r]), \tag{13}$$

where

$$w = [\kappa,\; 1 - \kappa]^T \tag{14}$$

is a user-selected weight vector in which $\kappa \in [0, 1]$ determines the relative weights given to the costs of *α* and *β*. Note that, since the SRT and log-slope have different units, the significance of the weights depends on the units chosen, as well as on the cost function that is used. The choice of the *κ* value depends on the relative precision requirements of the SRT and the slope and is made available to the experimenter.
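The look one-ahead probe selection described above can be sketched as below. The grid of PF values is assumed precomputed as `psi_grid[k, j]` $= \Psi(x_j|\theta_k)$, and `cost_fn` returns the two-component cost vector (e.g., the variances of Eq. (9)); both names are illustrative:

```python
import numpy as np

def look_one_ahead(prior, psi_grid, cost_fn, kappa=0.5):
    """Pick the candidate-SNR index minimizing the expected weighted cost
    after the next (virtual) trial."""
    w = np.array([kappa, 1.0 - kappa])            # weight vector
    best_j, best_cost = 0, np.inf
    for j in range(psi_grid.shape[1]):
        exp_cost = 0.0
        for r in (0, 1):
            like = psi_grid[:, j] if r == 1 else 1.0 - psi_grid[:, j]
            p_r = float(np.dot(like, prior))      # probability of response r
            if p_r <= 0.0:
                continue
            post = like * prior / p_r             # predicted posterior
            exp_cost += p_r * float(np.dot(w, cost_fn(post)))
        if exp_cost < best_cost:
            best_j, best_cost = j, exp_cost
    return best_j
```

In a toy example with two threshold hypotheses and a steep PF, the midway candidate is the most informative and is duly selected.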

#### 2. Look two-ahead

The method described above determines the next probe SNR, $x_{n+1}$, by minimizing the expected value of the cost function after the next trial. Estimating the slope, however, requires test stimuli at two different locations along the PF (Levitt, 1971; Brand and Kollmeier, 2002; Shen and Richards, 2012). Although a look one-ahead search may provide such probe SNRs, it cannot do so in an informed manner, since the probe SNR that best improves the $\theta$ estimate over two trials may not be the one that causes the biggest immediate decrease in the cost function. Instead, it seems reasonable to choose a pair of probe SNRs jointly to give the greatest expected improvement in the cost function after the next two trials.

Consider $(x, y) \in X \times X$ and $(r, s) \in \{0, 1\} \times \{0, 1\}$ to be the next two probe SNRs and their corresponding responses. A natural strategy would be to find the pair of probe SNRs, (*x*, *y*), that minimizes the expected cost after trial $n+2$ and to select *x* as the next probe. This strategy is suboptimum, however, because it does not take into account that the second probe SNR, *y*, can be chosen after the response, *r*, to the first probe is known.

Instead, $d(z_n, x, r)$ is defined to be the lowest possible expected value of the cost function after trial $n+2$, given that the probe SNR and listener response at trial $n+1$ are *x* and *r*, respectively. This may be calculated as

$$d(z_n, x, r) = \min_{y \in X} \sum_{s \in \{0,1\}} P(s|y, [z_n, x, r])\, w^T c([z_n, x, r, y, s]), \tag{15}$$

where $c(\cdot)$ is either equal to $v(\cdot)$ from Eq. (9) or to $h(\cdot)$ from Eq. (10), depending on the choice of cost function, and $P(s|y, [z_n, x, r])$ is given by Eq. (12). The probe SNR for trial $n+1$ should minimize the expected value of this quantity and so

$$x_{n+1} = \arg\min_{x \in X} \sum_{r \in \{0,1\}} P(r|x, z_n)\, d(z_n, x, r). \tag{16}$$
It is worth noting that, although the look two-ahead procedure considers the effect of different choices for $x_{n+2}$ in Eq. (15), it only actually makes a decision about $x_{n+1}$. As will be seen in Sec. III C, the probe placements chosen by the look two-ahead method are, in practice, very similar to those chosen by the look one-ahead method. It appears that the expected reduction in the cost function is dominated by the choice made for $x_{n+1}$ and that the influence of the speculative choice for $x_{n+2}$ is less important.
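Eq. (15) and the minimization that follows it can be sketched directly (if computationally naively). The grid of PF values is assumed precomputed as `psi_grid[k, j]` $= \Psi(x_j|\theta_k)$ and `cost_fn` maps a posterior over the grid to the two-component cost vector; both names are illustrative assumptions:

```python
import numpy as np

def look_two_ahead(prior, psi_grid, cost_fn, kappa=0.5):
    """Choose the next probe index by minimizing the expected value, over the
    first response r, of the best achievable cost after a second probe y."""
    w = np.array([kappa, 1.0 - kappa])
    n_x = psi_grid.shape[1]
    best_j, best = 0, np.inf
    for j in range(n_x):
        exp_d = 0.0
        for r in (0, 1):
            like = psi_grid[:, j] if r == 1 else 1.0 - psi_grid[:, j]
            p_r = float(np.dot(like, prior))
            if p_r <= 0.0:
                continue
            post = like * prior / p_r
            # d(z_n, x, r): best expected cost over the second probe y
            d = np.inf
            for jy in range(n_x):
                c = 0.0
                for s in (0, 1):
                    like2 = psi_grid[:, jy] if s == 1 else 1.0 - psi_grid[:, jy]
                    p_s = float(np.dot(like2, post))
                    if p_s > 0.0:
                        c += p_s * float(np.dot(w, cost_fn(like2 * post / p_s)))
                d = min(d, c)
            exp_d += p_r * d
        if exp_d < best:
            best_j, best = j, exp_d
    return best_j
```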

#### 3. Candidate probe SNRs

In some circumstances the permitted probe SNRs are predetermined (e.g., if the test stimuli have been pre-recorded). In other circumstances, the stimuli can be generated on the fly at an arbitrary SNR. In this situation, a discrete set of candidate SNRs, *X*, is determined and one of these is then selected as the next probe SNR. To ensure that the optimum probe SNR is included in the evaluation, the set *X* should be chosen to encompass the range of SNRs corresponding to performance levels between 5% and 95% of the unknown true PF.

To ensure the desired range of probe SNRs is covered, *R* representative PF parameter sets, $\theta_i$ for $i \in \{0, \ldots, R-1\}$, are placed around an elliptical iso-probability contour at a distance of two standard deviations from the mean of the posterior distribution, as illustrated by the small circles in Fig. 4. In this work, the value *R* = 8 is used to ensure that the contour is adequately sampled.

Using the eigen-decomposition of the covariance matrix, $V(z_n)$, of the posterior distribution with

$$V(z_n) = Q D Q^T, \tag{17}$$

where **D** is a diagonal matrix and **Q** is orthogonal, then $\theta_i$ can be written as

$$\theta_i = \mu(z_n) + 2\, Q \sqrt{D} \begin{bmatrix} \cos(2\pi i/R) \\ \sin(2\pi i/R) \end{bmatrix}, \tag{18}$$

where $\sqrt{\cdot}$ acts elementwise on a matrix and $i = 0, \ldots, 7$.

For each of the *R* representative PFs defined by the $\theta i$, the SNRs corresponding to the 5% and 95% correct scores are computed and the two extreme SNR values from this set are used to define the range that needs to be covered by the list of candidate probe SNRs. The candidate probe SNRs are then linearly spaced in dB within this range. Any external restrictions on the range of permitted SNRs can be imposed by the experimenter at this stage.
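A sketch of the representative-parameter construction, taking the two-standard-deviation ellipse points to be $\theta_i = \mu + 2Q\sqrt{D}\,[\cos(2\pi i/R), \sin(2\pi i/R)]^T$; this explicit form is an assumption consistent with the description above:

```python
import numpy as np

def representative_params(mu, cov, R=8):
    """R points on the two-standard-deviation iso-probability contour of a
    Gaussian with mean mu and covariance cov = Q D Q^T."""
    d, q = np.linalg.eigh(cov)                              # eigen-decomposition
    angles = 2.0 * np.pi * np.arange(R) / R
    circle = np.stack([np.cos(angles), np.sin(angles)])     # unit circle, 2 x R
    return (mu[:, None] + 2.0 * q @ (np.sqrt(d)[:, None] * circle)).T
```

Regardless of the orientation of the covariance, every returned point lies at Mahalanobis distance 2 from the mean.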

### C. Lapses of attention—Handling outliers

In the UML algorithm of Shen and Richards (2012), the lapse rate, *λ*, is optionally regarded as an unknown parameter of the PF that can be estimated alongside *α* and *β*. This capability can be especially useful as a diagnostic tool to estimate the intelligibility in quiet for hearing-impaired listeners. However, in a psychoacoustic experiment with normal-hearing listeners, we believe that the value of *λ* is likely to vary erratically over time. Accordingly, instead of considering *λ* to be a free parameter, it is given a fixed small value (e.g., 2%). Based on the evaluations in Sec. III, the precise choice of *λ* is unimportant unless it is very much less than the true lapse rate. To accommodate the possibility of a listener having an unusually high rate of attention lapses, the experimenter is given the option of re-computing the posterior pdf, $P(\theta|z_n)$, from the saved experimental results while excluding outliers. This means that, based on the current estimate of $P(\theta|z_n)$, incorrect responses at probe SNRs for which the PF exceeds, for example, 0.95 are marked as outliers and ignored when re-computing the posterior density.
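The outlier-exclusion rule can be sketched as a filter over the saved trial list; the function and threshold names are illustrative:

```python
def exclude_outliers(trials, psi_hat, threshold=0.95):
    """Drop incorrect responses (r == 0) recorded at probe SNRs where the
    current PF estimate psi_hat(x) exceeds the threshold; the remaining
    trials are then used to re-compute the posterior distribution."""
    return [(x, r) for (x, r) in trials if not (r == 0 and psi_hat(x) > threshold)]
```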

## III. SIMULATION RESULTS

Two sets of Monte Carlo simulations were run in order to evaluate the proposed Bayesian framework under various test conditions and to compare it to competing methods. The first set evaluated six specific experimental setups and examined the convergence of the estimated PF parameters as a function of trial number, *n*. The second set of simulations fixed the number of trials at a realistic value of *n* = 50 and investigated the effect on performance of systematically varying the parameters of the true PF of a virtual listener.

Two competing methods were compared with four versions of the proposed estimation algorithm, making six algorithms in all:

- The adaptive maximum-likelihood (AML) technique presented in Brand and Kollmeier (2002) using the parameter values given for the “adaptive procedure A2.” The adaptive level placement used in this method, described briefly in Sec. I B, is based on two interleaved staircase procedures that target a pair of sweetpoints, $\Psi(x) \in \{0.2, 0.8\}$, independently of the estimated PF. The authors adapted their procedure to the sentence scoring used in their evaluation example and so it has a somewhat different goal from the other algorithms presented in this study. The method was modified to include a prior distribution and used a fixed evaluation grid of 150 × 150 linearly spaced values for $\theta$ parameterized by SRT and log-slope.
- The updated maximum-likelihood (UML) procedure presented in Shen and Richards (2012) with a two-down, one-up stimulus placement strategy. The implementation used was taken from V2.2 of the toolbox presented in Shen *et al.* (2015), with the *λ* parameter fixed at $\lambda_{\mathrm{Alg}}$ to ensure a fair comparison with the other algorithms. The fixed evaluation grid was defined by 81 logarithmically spaced values for the slope and 81 linearly spaced SRT values. Note that this algorithm uses $\tilde{\alpha}$ and $\tilde{\beta}$ from Eq. (1) as internal parameters rather than *α* and *β*.
- The proposed Bayesian adaptive procedure using either look one-ahead (L1A) or look two-ahead (L2A) for the next probe SNR selection and using either an entropy-based cost function (L1A-Ent and L2A-Ent) or a variance-based cost function (L1A-Var and L2A-Var). The adaptive evaluation grid comprised 40 linearly spaced values for the SRT and 21 linearly spaced values for the log-slope. At each trial, 30 candidate probe SNRs were considered. The default value of the weighting factor *κ* from Eq. (14) was 0.5. No rejection of outliers was used when computing $\mu(z_n)$. The MATLAB implementation associated with this paper is freely available as the *psycest* function from the Voicebox Toolbox (Brookes, 2016).

All methods were initialized with probe SNRs in the range $(-20, 20)$ dB, and evaluation grids covering the range $(-20, 20)$ dB for *α* and $(0.01, 0.5)$ dB$^{-1}$ for *β* on a logarithmic scale. A Gaussian prior, $P(\theta|z_0)$, was used, with a mean at the center of the evaluation grid and the standard deviation for each variable set to half the range of the grid. The algorithms used a Logistic PF model, $M_{\mathrm{Alg}}$, with fixed parameters $\gamma_{\mathrm{Alg}} = 0.01$ and $\lambda_{\mathrm{Alg}} = 0.02$.

For each experimental setup and each estimation algorithm, $N = 1000$ Monte Carlo simulations were run where each simulation comprised a sequence of 300 trials. At each trial, the algorithm under test selects a probe SNR and receives a binary-valued response from a virtual listener. The virtual listener's response is generated according to a “true” PF that is defined by a set of five parameters: $\alpha_{\mathrm{True}}$, $\beta_{\mathrm{True}}$, $\gamma_{\mathrm{True}}$, $\lambda_{\mathrm{True}}$, and $M_{\mathrm{True}}$. The PF model, $M_{\mathrm{True}}$, was either a Logistic function [following Eq. (1)] or a cumulative Gaussian (Levitt, 1971; Hall, 1981). In order to assess performance, the estimation errors were computed as

$$e_\alpha(n) = \hat{\alpha}_n - \alpha_{\mathrm{True}} \quad \text{and} \quad e_\beta(n) = \log(\hat{\beta}_n) - \log(\beta_{\mathrm{True}}), \tag{19}$$

where $\hat{\alpha}_n$ and $\hat{\beta}_n$ are the estimated parameters after *n* trials obtained from $\mu(z_n)$ in Eq. (8). Values of $e_\beta$ equal to ±0.1 correspond to percentage errors of +26% and −20%, respectively, in the estimation of the slope. The root-mean-squared error (RMSE) and bias of parameter $* \in \{\alpha, \beta\}$ are computed as

$$\mathrm{RMSE}_*(n) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} e_*^2(n, i)} \tag{20}$$

and

$$\mathrm{Bias}_*(n) = \frac{1}{N} \sum_{i=1}^{N} e_*(n, i), \tag{21}$$

where $e_*(n, i) = e_\alpha(n)$ or $e_\beta(n)$ for simulation run *i*.
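The error metrics can be sketched directly; note that the slope error is measured in base-10 log units, so $e_\beta = 0.1$ corresponds to a factor of $10^{0.1} \approx 1.26$ in slope:

```python
import numpy as np

def estimation_errors(alpha_hat, beta_hat, alpha_true, beta_true):
    """Threshold error in dB and slope error in log10 units."""
    return alpha_hat - alpha_true, np.log10(beta_hat / beta_true)

def rmse_and_bias(errors):
    """RMSE and bias of one parameter's errors over the N simulation runs."""
    e = np.asarray(errors, dtype=float)
    return float(np.sqrt(np.mean(e ** 2))), float(np.mean(e))
```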

### A. Convergence evaluations

The first set of experiments uses the six experimental setups shown in Table I, whose true PFs are plotted in Fig. 5. For each setup, A–F, Table I lists the model parameters of the true PF (subscripted “True”) and the model parameters assumed by the algorithms (subscripted “Alg”). Setup A represents typical parameter values for the PF: at $\varphi = 0.5$, the SRT is $\alpha_{\mathrm{True}} = -5$ dB and the slope is $\beta_{\mathrm{True}} = 0.075$ dB$^{-1}$, the mean slope observed in MacPherson and Akeroyd (2014). The guess rate, $\gamma_{\mathrm{True}} = 1\%$, and lapse rate, $\lambda_{\mathrm{True}} = 2\%$, correspond to low token predictability and an attentive listener. In each of the remaining five setups, B–F, one of the model parameters is changed substantially while the others retain their values from setup A.

| Setup | $\varphi$ | α_{True} (dB) | β_{True} ($\mathrm{dB}^{-1}$) | γ_{True} | λ_{True} | M_{True} | γ_{Alg} | λ_{Alg} | M_{Alg} |
|---|---|---|---|---|---|---|---|---|---|
| A | 50% | −5 | 0.075 | 1% | 2% | Logistic | 1% | 2% | Logistic |
| B | 50% | −5 | 0.075 | 1% | 2% | Cumulative Gaussian | 1% | 2% | Logistic |
| C | 50% | 25 | 0.075 | 1% | 2% | Logistic | 1% | 2% | Logistic |
| D | 50% | −5 | 0.020 | 1% | 2% | Logistic | 1% | 2% | Logistic |
| E | 50% | −5 | 0.075 | 1% | 10% | Logistic | 1% | 2% | Logistic |
| F | 75% | −5 | 0.075 | 1% | 2% | Logistic | 1% | 2% | Logistic |

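The virtual listeners defined in Table I can be simulated in a few lines. The sketch below assumes a common parameterization of the Logistic PF in which $\beta$ is the slope of $\Psi$ at $x = \alpha$; the exact form of Eq. (1) may differ:

```python
import numpy as np

def psi_logistic(x, alpha, beta, gamma, lam):
    """Logistic PF with guess rate gamma and lapse rate lam.

    Parameterized (as an assumption) so that beta equals the slope
    dPsi/dx at x = alpha.
    """
    s = 4.0 * beta / (1.0 - gamma - lam)  # slope of the inner logistic
    return gamma + (1.0 - gamma - lam) / (1.0 + np.exp(-s * (x - alpha)))

def virtual_listener(x, rng, alpha=-5.0, beta=0.075, gamma=0.01, lam=0.02):
    """Binary-valued response to a probe at SNR x (setup A defaults)."""
    return rng.random() < psi_logistic(x, alpha, beta, gamma, lam)
```

With the setup A parameters, $\Psi(\alpha) = \gamma + (1-\gamma-\lambda)/2 = 0.495$, so probes at the threshold are answered correctly about half the time.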

The convergence curves for setups A–F are shown in Fig. 6. Each column presents the results for one of the experimental setups and, in each case, the upper two plots give the RMSE of the SRT and slope, while the lower two plots give the corresponding mean bias. All quantities are plotted against the trial number, *n*, on a logarithmic scale for $5 \le n \le 300$. Each plot includes curves for four algorithms: (i) AML (Brand and Kollmeier, 2002), (ii) UML (Shen and Richards, 2012), (iii) L1A-Ent, and (iv) L2A-Ent. The latter two are the proposed look one-ahead and look two-ahead algorithms, respectively, using an entropy cost function. For clarity, L1A-Var and L2A-Var are not shown in these plots. The four evaluated algorithms are distinguished by line color and/or line style according to the legend above Fig. 6. The true PF parameter values for setup A are shown in the top left plot; the other plots in the top row show only the parameter values that differ from setup A. For all plots, $\gamma_{Alg} = 0.01$, $\lambda_{Alg} = 0.02$, and $M_{Alg}$ = Logistic.

The performance curves for setup A are plotted in the leftmost column. The upper plot in this column demonstrates that all four algorithms have very similar convergence curves for SRT RMSE. The L1A-Ent and L2A-Ent algorithms converge slightly faster than AML and UML; the horizontal spacing between L1A-Ent and AML corresponds to a factor of up to 1.5 in *n*. The second plot in the column shows that for *n* > 70 the log-slope RMSE is very similar for all algorithms, but that for low *n* the UML algorithm has the lowest RMSE. The lower two plots in this column show that, for large *n*, all algorithms give unbiased estimates of both SRT and log-slope. The UML algorithm underestimates both the SRT and the slope for *n* < 70, whereas the other algorithms have near-zero SRT bias but overestimate the slope.

In setup B, the true PF follows a cumulative Gaussian, whereas the estimation algorithms assume a Logistic function. The small difference between these is visible in Fig. 5 at an SNR of around 5 dB. The similarity between the first two columns of Fig. 6 shows that the model mismatch has little effect on any of the plots. The only noticeable difference is that the UML algorithm converges slightly more slowly in SRT and slightly faster in log-slope.

In setup C, shown in column three of Fig. 6, the true SRT is $+25$ dB and, because this value lies outside the search grid of the AML and UML algorithms, they are unable to converge. These two algorithms are omitted from the first and third plots of the column because their errors exceed the plotted range. In contrast, the adaptive search grid of the L1A-Ent and L2A-Ent algorithms means that their convergence curves are almost the same as for setup A, with only very slightly delayed convergence.

In setup D, the true PF has a shallow slope of $\beta_{True} = 0.02\,\mathrm{dB}^{-1}$. As is clear from the top plot in column four of Fig. 6, the effect of this is that all four algorithms converge more slowly both in SRT (by a factor of 5 in *n*) and in log-slope (by a factor of 1.2 in *n*). The UML algorithm again converges fastest in log-slope, but only for *n* < 30.

In setup E, shown in the fifth column, the lapse rate is increased to the high value of $\lambda_{True} = 10\%$. All algorithms converge somewhat more slowly than for setup A; the AML algorithm is most affected in SRT RMSE and the L1A-Ent and L2A-Ent algorithms are most affected in slope RMSE. As can be seen from the lower two plots, all four algorithms now have biased SRT estimates and all but UML have biased log-slope estimates. At *n* = 300, the L1A-Ent and L2A-Ent algorithms have an SRT bias of $-0.4$ dB and a log-slope bias of −0.04, which corresponds to an 8% underestimate of the slope. The UML algorithm has the lowest bias: $-0.25$ dB in SRT and +0.01 in log-slope, which corresponds to $+2\%$ in slope.

Finally, in setup F, the target SRT is changed from a level of 50% to a level of 75%, which, for speech intelligibility, is a more practically relevant threshold. The plots in the rightmost column show that the L1A-Ent and L2A-Ent algorithms are affected very little but that the AML and UML algorithms converge somewhat more slowly for *n* < 30 in both SRT and log-slope.

### B. Parameter sensitivity

The second set of experiments examines how the performance of each algorithm depends on the true PF parameters and on the weight, *κ*, from Eq. (14) when the number of trials is fixed at *n* = 50. Each column of Fig. 7 shows the effect of varying one parameter while the remaining parameters are fixed at the values from setup A in Table I. In each column, the vertical dotted line corresponds exactly to setup A and, therefore, to the results in the leftmost column of Fig. 6 at *n* = 50. Curves are plotted for six algorithms: (i) AML (Brand and Kollmeier, 2002), (ii) UML (Shen and Richards, 2012), (iii) L1A-Ent, (iv) L2A-Ent, (v) L1A-Var, and (vi) L2A-Var. The variance-based algorithms, L1A-Var and L2A-Var, have been omitted from columns 1, 3, and 4 for clarity. The AML and UML algorithms are omitted from column 5 since they do not use the *κ* parameter.

The leftmost column of Fig. 7 shows the effect of varying *α*_{True} over the range $\pm 20$ dB, which corresponds to the range of the default search grid. The upper two plots show that, except for the UML algorithm, this has little effect on the RMSE of the SRT and the log-slope. For the UML algorithm, however, the RMSE of both SRT and log-slope rises rapidly when *α*_{True} comes within 5 dB of the upper edge of the search grid. The lower two plots in this column show that this is associated with a rapidly increasing negative bias in both SRT and log-slope.

The second column of Fig. 7 shows the effect of varying *β*_{True} over the range 0.01–$0.5\,\mathrm{dB}^{-1}$; this encompasses the range of slopes found by MacPherson and Akeroyd (2014). The upper plot confirms the earlier finding for setup D that all the algorithms converge to the correct SRT much more slowly when *β*_{True} is low. The second plot shows that this is also true for the log-slope, although the effect is less pronounced than for the SRT. The variance-based algorithms, L1A-Var and L2A-Var, are more affected by low values of *β*_{True} than the other algorithms, which are all very similar. Uniquely, the slope estimate of the UML algorithm is also adversely affected by high values of $\beta_{True} > 0.15\,\mathrm{dB}^{-1}$, which result in a large negative bias and a high RMSE.

The third and fourth columns of Fig. 7 show that *γ*_{True} and *λ*_{True} have relatively little effect on the performance of any of the algorithms over the range 0.1%–10%. At high values of *γ*_{True}, the SRT estimate of the AML algorithm and the slope estimate of the UML algorithm become increasingly biased, with accompanying increases in RMSE. At high values of *λ*_{True}, all algorithms have a negative SRT bias and an associated increase in RMSE, although the L1A-Ent and L2A-Ent algorithms are affected less than AML and UML. The maximum plotted value of $\lambda_{True} = 10\%$ corresponds to the point *n* = 50 in the fifth column of Fig. 6.

Finally, the fifth column shows the effect of varying the weight, *κ*, from Eq. (14), which controls the relative importance of SRT errors and slope errors in the cost function of the proposed algorithms. Varying *κ* has almost no effect on the bias of the SRT or the log-slope but, as would be expected, the SRT RMSE increases for high *κ* and the slope RMSE increases for low *κ* because the relative rate of convergence is affected. The effect on the SRT RMSE is smallest for L1A-Var and L2A-Var, and the effect on the slope RMSE is smallest for L1A-Ent and L2A-Ent. From these curves, there seems to be little value in changing *κ* from its default value of 0.5, and it should certainly not be set lower than this.

### C. Probe placement

Figure 8 shows the histogram of probe SNRs for each of the six algorithms over all *N* 300-trial runs of setup A from Table I. The vertical dotted lines show the sweetpoints (Shen and Richards, 2012) for these PF parameters, and the overall percentage of correct responses is also shown for each algorithm. The horizontal axis indicates the probe SNR in dB, and the corresponding PF probability is marked along the top of the graph. The AML algorithm targets the 20% and 80% points of the PF as a compromise between the SRT and slope estimation sweetpoints. The UML algorithm targets all three sweetpoints, which are recalculated after each listener response from the current maximum likelihood estimate of the PF. The values of $\Psi$ at these sweetpoints, combined with the use of a two-down, one-up selection rule, result in a 66% correct-response rate. The algorithm is so strongly biased toward the upper two sweetpoints that the broad peak centered on the lowest sweetpoint is barely visible in the histogram. It seems reasonable to suppose that this accounts for the negative SRT bias visible in the third row of Fig. 6 for *n* < 100.

The L1A-Ent and L2A-Ent algorithms result in very similar histograms that peak at $\Psi \approx \{22\%, 76\%\}$, while the variance-based algorithms, L1A-Var and L2A-Var, have peaks that are even closer to the SRT at $\Psi \approx \{33\%, 65\%\}$. For all four algorithms, the positions of the peaks are affected by the choice of *κ*: the peaks lie on the two outer sweetpoints when *κ* = 1 but, as *κ* is reduced toward 0, the peaks move closer together until they coalesce into a single peak at the central sweetpoint. The variance-based algorithms have narrower peaks than the entropy-based algorithms and, because of the scaling effects discussed in Sec. II B 1, lie closer to the central sweetpoint for any given value of *κ*. This difference explains why, in Fig. 7, the variance-based algorithms have higher RMS errors in log-slope but lower RMS errors in SRT. All four algorithms, but especially the variance-based ones, select probe SNRs in the interval between the two histogram peaks more frequently than do the AML and UML algorithms. There are no significant differences in probe SNR placement between the look one-ahead and look two-ahead algorithms, and this is reflected in their almost identical performance curves. Because the probe placement of the proposed algorithms is concentrated closer to the SRT, it might be expected that they would have reduced sensitivity to model mismatch when it is more severe than in setup B. This hypothesis was not investigated in the current study.

## IV. DISCUSSION

Under most circumstances, all the evaluated algorithms converge to unbiased estimates of both SRT and log-slope for large *n*. The only exception to this noted in the simulations was when *λ*_{True} was much larger than the value assumed by the estimation algorithms; under this circumstance, all algorithms showed a negative bias in SRT and all but UML showed a bias in log-slope. We expect that a similar effect would result from a mismatch in *γ*_{True} but we did not include this in the tests since the correct value of *γ*_{True} is normally known. In all tests with a performance target of $\varphi =0.5$, the L1A-Ent and L2A-Ent algorithms converged fastest in SRT, whereas the UML algorithm converged fastest in slope. When the performance target was changed to $\varphi =0.75$, the convergence rates of AML and UML degraded and the L1A-Ent and L2A-Ent algorithms converged fastest in both parameters.

When the number of trials was restricted to *n* = 50, we observed some consistent differences in performance. The L1A-Ent and L2A-Ent algorithms almost always had a lower SRT RMSE than AML and UML, typically by ∼0.4 dB. Conversely, the UML algorithm usually had a lower log-slope RMSE than the other algorithms, although this advantage disappeared for *β*_{True} outside the range $[0.05, 0.11]\,\mathrm{dB}^{-1}$.

Because they use a fixed search grid, the AML and UML algorithms cannot converge if the SRT or slope lie outside this predefined range; for these algorithms the initial search grid should therefore be chosen to be large enough to encompass all possible outcomes. In contrast, the adaptive search grid of the proposed algorithm means that it will converge to the correct values of SRT and slope even if they lie outside the initial search range. The UML algorithm gave poor slope estimates when the true SRT, *α*_{True}, or true slope, *β*_{True}, lay close to the upper limit of the search grid.

The UML (Shen and Richards, 2012) and $\Psi $ (Kontsevich and Tyler, 1999) procedures have been compared in the literature, and results generally show similar performance for both SRT and slope estimation (Shen, 2013; Shen *et al.*, 2015; Hatzfeld *et al.*, 2016). The $\Psi $ procedure corresponds closely to the look one-ahead entropy-based version of the proposed method (L1A-Ent). The results obtained from our simulations broadly agree with those reported by Hatzfeld *et al.* (2016) who found, like us, that the UML algorithm was rather less sensitive than $\Psi $ to high values of *λ*_{True} but recommended that *n* > 80 trials were needed to maximize the precision of the SRT estimate from the UML algorithm. Since all the algorithms use an essentially identical technique to estimate $P(\theta k|zn)$, differences in their performance are primarily a consequence of differences in the probe SNR placements whose histograms are shown in Fig. 8. Given the substantial differences in these histograms, it is perhaps surprising how similar the convergence curves of the six algorithms are.
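The shared technique for estimating $P(\theta_k|z_n)$ referred to above is a Bayesian grid update with a Bernoulli likelihood. A minimal sketch follows; the grid layout and the source of the $\Psi$ values are left abstract, and the function name is illustrative:

```python
import numpy as np

def update_posterior(posterior, psi_at_probe, response):
    """One Bayesian update of P(theta | z_n) on a discrete grid.

    posterior:    current P(theta | z_{n-1}); any shape, sums to 1.
    psi_at_probe: Psi(x_n; theta) at the chosen probe SNR, evaluated
                  for every grid point theta (same shape as posterior).
    response:     1 for a correct response, 0 for an incorrect one.
    """
    likelihood = psi_at_probe if response else 1.0 - psi_at_probe
    posterior = posterior * likelihood    # Bayes' rule, unnormalized
    return posterior / posterior.sum()    # renormalize over the grid
```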

The simulation results indicate that performance differences between the look one-ahead versions (L1A-Ent and L1A-Var) of the proposed algorithm and the look two-ahead versions (L2A-Ent and L2A-Var) are, in most cases, very small. This is reflected in the near-identical histograms in Fig. 8 and is consistent with the finding of King-Smith *et al.* (1994), who recommended on the grounds of computational complexity that a look one-ahead procedure should be used except in the case of two-alternative forced-choice (2AFC) experiments, where the performance difference was significant. We found the lack of a performance difference surprising because we expected the look two-ahead procedures to make probe SNR choices that would result in faster slope convergence. It may be, as King-Smith *et al.* (1994) suggest, that there are some situations in which this is true, but we have not discovered them. Although the look two-ahead procedure entails more computation, this is not a significant issue with current computers, which can easily perform the necessary computation in real time.
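Since L1A-Ent corresponds closely to the $\Psi$ procedure, the look one-ahead step can be sketched as follows: for each candidate probe, compute the entropy of the posterior that would result from each possible response, weight by the predictive probability of that response, and choose the probe with the smallest expected entropy. This sketch omits the $\kappa$ weighting of Eq. (14), and the function names are illustrative:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a normalized distribution (natural log)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def look_one_ahead_probe(posterior, psi_table):
    """Index of the candidate probe minimizing expected posterior entropy.

    posterior: P(theta | z_n) over the theta grid (1-D, sums to 1).
    psi_table: shape (n_probes, n_theta); Psi(x_j; theta_k) for each
               candidate probe x_j and grid point theta_k.
    """
    expected_h = np.empty(psi_table.shape[0])
    for j, psi in enumerate(psi_table):
        p_correct = np.sum(posterior * psi)  # predictive P(correct | x_j)
        post_1 = posterior * psi             # unnormalized posteriors for
        post_0 = posterior * (1.0 - psi)     # each possible response
        expected_h[j] = (p_correct * entropy(post_1 / post_1.sum())
                         + (1.0 - p_correct) * entropy(post_0 / post_0.sum()))
    return int(np.argmin(expected_h))
```

An informative probe, i.e., one whose response sharply divides the parameter grid, yields a lower expected entropy than an uninformative one, so it is selected.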

If *κ* is adjusted to account for the scaling effects discussed in Sec. II B 1, the differences between the entropy-based and variance-based cost functions are negligible for large *n*. For small *n*, the entropy-based cost function results in slightly better estimates of the log-slope but slightly worse estimates of the SRT. Overall, we prefer the entropy-based versions since their estimates of log-slope are more robust especially when *β*_{True} is small.

## V. SUMMARY AND CONCLUSIONS

In an extension to previous work on the Bayesian formulation of the PF estimation problem, a new algorithm has been presented that uses an adaptive evaluation grid for the PF parameters together with optimal presentation-level selection using either a look one-ahead or a look two-ahead procedure. The procedure converges slightly faster than competing algorithms in SRT estimation but slightly slower in log-slope estimation, and both estimates have a very low bias. The procedure differs from previously proposed algorithms in two respects: (i) it does not require the range of possible PF parameters to be specified in advance and (ii) the sequence of probe SNRs optimizes the SRT and log-slope estimates at a performance level, $\varphi$, that can be chosen by the experimenter rather than being fixed at $\varphi = 0.5$. The proposed procedure is robust to variations in the listener PF and, over a wide range of listener PF parameters, the RMS errors were 1.2 dB in SRT and 0.14 in log-slope for a 50-trial measurement. It was found that the performance differences between the look one-ahead and look two-ahead methods were negligible and that an entropy-based criterion for selecting the next stimulus was preferred to a variance-based criterion.

## ACKNOWLEDGMENTS

The research leading to these results has received funding from the EU 7th Framework Programme (FP7/2007-2013) under Grant No. ITN-GA-2012-316969, and from the Engineering and Physical Sciences Research Council under Grant No. EP/M026698/1. Data associated with this paper are available online.^{1} The authors would like to thank the anonymous reviewers for their helpful comments and suggestions.

^{1}

http://www.commsp.ee.ic.ac.uk/~sap/projects/psycest/ (Last viewed March 30, 2017).