This review presents a high-level overview of the uses of machine learning (ML) to address several challenges in spatial auditory display research, primarily using head-related transfer functions. This survey also reviews and compares several categories of ML techniques and their application to virtual auditory reality research. This work addresses the use of ML techniques such as dimensionality reduction, unsupervised learning, supervised learning, reinforcement learning, and deep learning algorithms. The paper concludes with a discussion of the usage of ML algorithms to address specific spatial auditory display research challenges.
I. INTRODUCTION
High-fidelity three-dimensional (3D) auditory rendering uses digital filters to model the path a sound takes to reach the ears. In the last few decades, research in the spatial auditory field has advanced dramatically. It is widely agreed that virtual auditory displays that use individualized head-related transfer functions (HRTFs, described later) produce the best and most accurate binaural audio reproduction.^{1} However, many unanswered questions still exist.
Machine learning (ML) arises as a possibility for researching HRTFs because it is a popular, broad family of techniques that can predict and classify data even when the causal relationships between specific components of the data are not initially understood. ML can be used as a tool to tackle some of the "black box" challenges in spatial auditory research. However, one must keep in mind that HRTF research is complex, and there are no simple, one-size-fits-all ML algorithms to analyze the wide variety of spatial auditory content.
Human beings perceive a sound as emanating from a point in 3D space when the ears receive two different signals. In the natural environment, two cues largely affect the perception of spatial sound: interaural time difference (ITD) and interaural level difference (ILD).^{2}
These concepts have been known for over a century and are described by Lord Rayleigh's duplex theory^{3} and depicted in Fig. 1. In this theory, the ITD is defined as the difference in the sound's arrival time at each ear.^{4} The ILD is defined as the difference in the sound's level at each ear.
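As an illustration of the duplex theory's time cue, the classic Woodworth spherical-head approximation (not part of the reviewed work, used here purely as a sketch) estimates the ITD from the source azimuth and an assumed head radius:

```python
import numpy as np

def woodworth_itd(azimuth_deg, head_radius=0.0875, c=343.0):
    """Approximate ITD (seconds) for a rigid spherical head using
    Woodworth's formula: ITD = (a/c) * (theta + sin(theta)).
    Valid for azimuths in [0, 90] degrees on the horizontal plane."""
    theta = np.radians(azimuth_deg)
    return (head_radius / c) * (theta + np.sin(theta))

print(woodworth_itd(0.0))    # 0 for a frontal source (no time difference)
print(woodworth_itd(90.0))   # maximum ITD, ~0.66 ms for an 8.75 cm radius
```

The head radius and speed of sound above are typical textbook values, not measurements from any database discussed here.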
Spatial auditory display research advanced partly due to the rapid development of computer hardware and improvements in acoustic measurement techniques. These developments and improvements provide the necessary computational resources and theory for the empirical measurement, analysis, and synthesis of the auditory cues that influence spatial hearing.^{4} As a result, a sound's path to the listener's ears can be measured in carefully controlled settings. The HRTF is the acoustical transfer function between a sound source and the entrance of a listener's ear canal. From a signal processing perspective, the HRTF describes the sound in the frequency domain, whereas the head-related impulse response (HRIR) is the time domain representation of the HRTF. Figure 2 depicts the HRTF measurement and synthesis.
In the HRTF measurement, sounds (for example, Golay codes, impulses, and sweep signals) are individually played and recorded at various locations around the listener at a fixed distance (see Fig. 3). Microphones in the listener's ears record the sound received at the eardrum. Because the sounds are meticulously measured in a soundproof room with carefully calibrated speakers and microphones, the ITD and ILD are readily observable within the HRIR. When transformed into the frequency domain, the HRTF exhibits complex characteristics, possessing many peaks and notches. In essence, the HRIR/HRTF contains information about the ITD and ILD.
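The HRIR/HRTF relationship mentioned above is a simple Fourier-transform pair. The sketch below uses a synthetic stand-in impulse response rather than real measurement data to show how a measured HRIR would be converted to an HRTF magnitude spectrum:

```python
import numpy as np

# Illustrative only: a synthetic 200-tap "HRIR" standing in for a measured
# impulse response (real HRIRs come from datasets such as CIPIC).
fs = 44100
hrir = np.zeros(200)
hrir[10] = 1.0    # direct sound
hrir[25] = -0.4   # a single pinna-like reflection

# The HRTF is the Fourier transform of the HRIR; its magnitude (in dB)
# exhibits the peaks and notches described in the text.
hrtf = np.fft.rfft(hrir)
magnitude_db = 20 * np.log10(np.abs(hrtf) + 1e-12)
freqs = np.fft.rfftfreq(len(hrir), d=1.0 / fs)
```

Plotting `magnitude_db` against `freqs` for a real measurement would reveal the direction-dependent spectral features discussed below.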
The HRIRs are measured at various azimuth and elevation angles around the listener to make a complete virtual auditory reality environment. The HRTFs are unique to each person because sound waves scatter in a pattern that is unique to each listener's anthropometry. This interaction greatly influences how we perceive spatial sound. Using generic, non-individualized HRTFs may cause perceptual distortions, meaning individualized HRTFs are necessary. Measuring individual HRTFs requires specialized equipment such as in-ear probe-tube microphones and access to an anechoic chamber. Because the process is both costly and time-consuming, it is not suitable for all applications. Theoretical calculation methods exist, such as the snowman model^{6} and the boundary element method (BEM);^{7} however, these methods require amounts of computation that even the most state-of-the-art hardware cannot handle cost effectively. Considering that the HRTF data are high-dimensional and nonlinearly separable, many researchers are currently seeking a method to individualize spatial hearing in a more efficient and predictive way. This paper highlights the research and findings of many of these investigators.
Practically, it is impossible to directly measure the HRTFs for each virtual auditory reality user. Researchers are continually seeking faster and easier ways to obtain accurate, individualized HRTF measurements. Unlike the deterministic approach traditionally used for most programs and theories in computer science, the ML approaches discussed in this paper use computational methods to obtain information from data without relying on a predetermined equation as a model. The algorithms discover meaningful natural patterns in the data and use the trained model for future prediction. The present work explores some of the efforts researchers have made in applying ML techniques to HRTF research.
A. Motivation
Spatial auditory research, specifically, that which uses the HRTFs, is complex with many unanswered questions. ML is becoming a more popular research tool to address those challenges, including

HRTF decomposition: What are the most critical components of the HRTFs? Which elements affect the perception of spatial cues?

HRTF prediction: Can a person's HRTF be predicted from a database of other premeasured HRTFs and anthropometric features?

HRTF categorization: Do HRTFs fall into distinct groups with similar qualities? Is it possible to quantify the HRTF variation across the general population?

Unknown HRTFs: Is it possible to use premeasured HRTFs for known locations to predict the HRTF of unknown locations?

Customizing and improving HRTFs: Is it possible to modify a known HRTF to improve the perception and localization quality?
In general, ML is a branch of artificial intelligence (AI) that allows computer systems to learn directly from examples, data, and experience. Enabling computers to perform specific tasks intelligently, ML systems can perform complex processes by learning from the data.^{8} To some degree, ML involves tools and techniques from the field of statistics. ML is instrumental in HRTF research to find underlying relationships within and across the HRTFs, map HRTF datasets from higher to lower dimensions, and predict individual HRTF preferences based on anthropometric measurements within the HRTF datasets.
B. HRTF data
As previously mentioned, directly measuring the HRTFs is an expensive, time-consuming process; however, ML requires an adequate amount of data to make meaningful predictions. Fortunately, many researchers have shared their HRTF measurement data, but each research laboratory's collection method differs, making comparison across databases and aggregating data extremely challenging. Several publicly available, widely used datasets include:

CIPIC^{5} is a widely popular HRTF dataset provided by researchers at the University of California, Davis, which includes HRIRs for 45 subjects at 25 different azimuths and 50 different elevations (1250 directions). The measurements are taken at approximately 5° increments from a distance of 1 m. In addition, the database contains anthropometric measurements for each subject.

The LISTEN HRTF database,^{9} constructed by IRCAM, contains the HRTF measurements of 51 subjects at 187 positions, with elevations from −45° to 90° spaced roughly 15° apart, measured from a distance of 1.95 m. The subjects' anthropometric data were measured in the same manner as in the CIPIC database.

The KEMAR Massachusetts Institute of Technology (MIT) HRTF dataset^{10} is an extensive set of HRTF measurements taken from a KEMAR dummy head at MIT. The measurements consist of the left and right ear impulse responses. A total of 710 different positions were sampled at elevations from −40° to 90°.

The FIU DSP (Digital Signal Processing) laboratory HRTF database^{11} is a set of HRTF measurements taken from 15 subjects at Florida International University. The measurements were taken at six azimuths on the median plane and six elevations from −36° to 54°. The 3D models of each subject's ear are also included within the database.

The MARL HRTF repository^{12} from New York University (NYU) consists of 113 datasets from the previously mentioned CIPIC, LISTEN, FIU, and MIT KEMAR datasets. All of the data were converted to a new standard MARL-NYU file format to allow for smooth comparison and usage among the sets originating from different databases.

SYMARE^{13} is a database provided by the University of Sydney. The database contains the HRIR measurements of 61 subjects, 10 of which are freely available. The measurements were taken from 393 directions around the head from a distance of 1 m. This database also provides multiple high-resolution 3D mesh models of each listener, which are obtained from the magnetic resonance imaging (MRI) data and the corresponding simulated HRIR data.

The ITA database^{14} is provided by the Institute of Technical Acoustics. The database contains the HRTF measurements of 48 subjects taken at approximately 5° increments from a distance of 1.5 m. This database also provides detailed ear models, which are reconstructed from the MRI scans for each individual subject.

The RIEC^{15} is a database provided by the Advanced Acoustic Information Systems Laboratory at Tohoku University. The database contains the HRTF measurements of 105 subjects, including 39 3D head and torso models. The HRTFs were measured at 72 azimuths and 13 elevations from a distance of 1.5 m.

The SADIE II HRTF database^{16} is provided by the University of York. The database contains the HRTF measurements of 20 subjects, including 3D head models. The HRTFs were measured at 15° elevations and variable azimuth resolutions from a distance of 1.2 m.

The ARI HRTF database^{17} is a database provided by the Acoustics Research Institute of the Austrian Academy of Sciences. It is currently the largest HRTF database, containing the measurements of over 200 subjects. The HRTFs were measured at 1550 positions at roughly 5° increments, including elevations from −30° to 80° from a distance of 1 m. The subjects' anthropometric data were measured in the same manner as those in the CIPIC and LISTEN databases.
Although these HRTF databases differ in structure and size, the data that are usually contained within them include the HRIR, which is represented in the time domain; the HRTF, which is the Fourier transform of the HRIR; and the anthropometric parameters, represented as raw measurements or a 3D mesh of a user's head and/or torso. All of these data can be used to train ML algorithms. Because many of the public HRTF databases have between 20 and 200 listeners, the larger databases are more suitable for data-intensive ML techniques to ensure more accurate predictions and classifications.
C. ML algorithms
There are many ways to categorize the ML algorithms and techniques. For example, they can be grouped according to the algorithm similarity as regression algorithms, regularization algorithms, Bayesian algorithms, neural network (NN) algorithms, deep learning algorithms, dimensionality reduction algorithms, etc.
This work uses a topical grouping structure, explaining the individual challenge within the HRTF research and the application of a specific ML technique, and discusses research findings and implications. The ML algorithms are gradually introduced as they are applicable to a particular topic. Generally speaking, there are three categories of ML algorithms: supervised learning, unsupervised learning, and reinforcement learning (displayed in Fig. 4). In supervised learning, the data points are labeled or associated with categories or values of interest. This approach is driven by the error between the model's output and the target; the error signal is provided by an external agent called the supervisor. In contrast, unsupervised learning does not require labeled data and is driven not by error signals but by the agent's experience of the task. An unsupervised learning algorithm aims to organize the data into categories or find their underlying correlation. Reinforcement learning is driven by positive or negative reward signals. A reinforcement system chooses an action in response to each data point, and the learning algorithm receives a reward signal based on a previous decision. Based on the feedback, the reinforcement learning algorithm iteratively updates its strategy to obtain greater rewards.
When deciding the type of ML algorithm to use for a particular dataset, a few factors should be considered. The general workflow consists of obtaining a dataset from an input source, training the model with the input data, and then using the model to reliably predict general data in the future. If the input data are labeled, then it is appropriate to use a supervised learning technique. If unlabeled data are being used, an unsupervised learning technique should be considered.
Regardless of the technique, after a model has been trained, it can be used to improve the prediction accuracy. However, some issues may be encountered concerning the trained model, as illustrated in Fig. 5. The first issue is overfitting. In this situation, the model works perfectly to predict the training data, but it describes the noise and errors rather than the underlying relationship among the data. Therefore, the model cannot be generalized to make predictions on the untrained data. This situation usually occurs when a model is too complicated, which can happen if the data have too many dimensions as compared to the number of total observations within the dataset. The opposite of overfitting is underfitting, in which the model is too simple and cannot describe the underlying relationship that governs all of the data, leading to poor performance. Underfitting can occur for several reasons such as having insufficient training data or applying inappropriate ML models to the data. Ideally, the goal is to find the model with the best capacity for generalization within an acceptable accuracy rate.
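The overfitting/underfitting trade-off can be demonstrated with a toy polynomial regression; the data are synthetic and the polynomial degrees are illustrative choices, not values drawn from the reviewed studies:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)  # noisy samples

x_test = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_test)   # the underlying relationship

errors = {}
for degree in (1, 3, 15):             # underfit, reasonable, overfit
    coeffs = np.polyfit(x, y, degree)
    errors[degree] = np.mean((np.polyval(coeffs, x_test) - y_true) ** 2)
    print(f"degree {degree:2d}: test MSE = {errors[degree]:.3f}")
```

The degree-1 model is too simple to follow the sine shape (underfitting), while very high degrees chase the noise in the 20 training points (overfitting); the intermediate model generalizes best.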
The structure of the later sections of this work is largely guided by the challenges described in Sec. I A. This paper is organized in a manner that introduces each aforementioned challenge within the HRTF research and discusses the ML algorithms that are used to address the issues. Section II discusses the ML techniques for determining the influential components of the HRTFs. Section III discusses the use of premeasured HRTFs to predict a new HRTF. Section IV discusses using ML to categorize the HRTFs into distinct categories. Section V discusses using ML to predict the HRTF of unknown and unmeasured locations. Section VI describes the limited research on using ML to improve the existing HRTFs. Oftentimes, the research topics, techniques, and challenges are interrelated. As such, each section may also include overlapping work pertaining to other HRTF research challenges or use previously discussed techniques. This work concludes with recommendations and potential directions for future research.
II. HRTF DECOMPOSITION
Before applying any ML technique to the HRTF research, it is important to determine the right features to use to represent the data and whether it is necessary to reduce the available dataset to a more appropriate size. In most real-world applications, the data are rarely described in the most informative feature space (dimension) and often require transformation to improve the signal and reduce the noise within it. The number of variables measured within each observation is described as the dimension of the data. In traditional statistical methods, models may fail, in part, due to an increased number of observations; however, a more significant source of failure is an increase in the number of variables associated with each observation.^{18} The HRTFs are considered high-dimensional datasets, which present mathematical challenges and opportunities, and are bound to give rise to new theoretical developments.^{19} From an information theory perspective, important parts of the high-dimensional data can be projected into a lower-dimensional space without losing information.
When dealing with ML problems, under certain circumstances, the dimensions of the datasets need to be increased. Certain computer-intensive methods can construct better predictive models using high-dimensional data. For example, kernel methods can handle datasets that cannot be linearly separated in lower-dimensional feature spaces by mapping them to higher-dimensional feature spaces. Dimensionality reduction will result in a loss of information from a dataset, but in many cases, not all of the features of the high-dimensional datasets are directly related to the relevant underlying patterns in the data. In addition, some of the features may not have been directly measured. Before applying any model or algorithm to the dataset, learning a suitable low-dimensional manifold from the high-dimensional data is an essential process in ML. Through dimensionality reduction, some redundant features will be filtered out, which also lowers the computational complexity.
As previously mentioned, the HRTF datasets are high dimensional and may also suffer from the "curse of dimensionality" problem found in other ML models. According to the curse of dimensionality, variance, especially when contributed by meaningless noisy variables in high-dimensional data, confounds the ML methods.^{20} The size of a HRTF filter can be greatly reduced by eliminating the noisy variables and less important features; however, the feature importance is a debated topic with many possible options. For example, relevant features could include the band power within specific frequency bands, the ITD, the ILD, and the locations of peaks and notches within the HRTF. In addition to helping with compression, reducing data to a more compact representation can potentially lead to improved generalization and classification, which will, in turn, reduce the computational time required for the HRTF filtering.
The remainder of this section will overview some methods for decomposing the HRTFs. The dimensionality reduction algorithms are depicted within Fig. 6 and grouped into two main categories: linear and nonlinear dimensionality reduction.
A. Linear dimensionality reduction: Principal component analysis and factor analysis
Because of its high-dimensional nature, the HRTF data are commonly reduced to a lower-dimensional subspace for further processing. This reduction is an effective means of compressing data and follows the assumed model. The principal component analysis (PCA) is one of the best tools in the mean-square error sense and one of the most commonly used linear dimensionality reduction techniques. It is used to describe a dataset with only a few orthogonal components and corresponding weights. If a high-dimensional dataset contains a large number of interrelated measures, the PCA can reduce the dimensionality while retaining most of the variation present in the data. The PCA aims to reduce the dimensions of a d-dimensional dataset by projecting it onto a k-dimensional subspace (where k < d). A complete algorithm can be found in the paper by Martens.^{21} Because the PCA is one of the easiest and most effective dimensionality reduction techniques, it has been widely adopted in the HRTF research.^{21–33}
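A minimal sketch of this projection, using randomly generated data as a stand-in for log-magnitude HRTFs (real studies would load a database such as CIPIC rather than synthesize data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative stand-in: 45 "subjects" x 200 log-magnitude bins, generated
# from a few latent factors so that the data have a genuine low-rank structure.
rng = np.random.default_rng(42)
latent = rng.normal(size=(45, 5))
mixing = rng.normal(size=(5, 200))
hrtf_logmag = latent @ mixing + 0.01 * rng.normal(size=(45, 200))

# Project the 200-dimensional data onto k = 10 orthogonal components.
pca = PCA(n_components=10)
weights = pca.fit_transform(hrtf_logmag)          # per-subject PC weights
explained = pca.explained_variance_ratio_.sum()   # retained variance fraction
reconstruction = pca.inverse_transform(weights)   # approximate HRTFs
```

Because the synthetic data are effectively rank 5, ten components retain almost all of the variance; with real HRTFs, the number of components needed is an empirical question, as the studies below discuss.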
For example, Martens^{21} presented some of the earliest and seminal research in applying the PCA to the HRTF modeling in a technique referred to as the spectral PCA. The work computed the principal components (PCs) from the critical-band-filtered HRTFs, which were measured from 35 source positions on the horizontal plane. In this work, the PCA was used as a data reduction method for the spectral variation in the logarithm of the direction-dependent transfer function (DTF) magnitude associated with the changing direction. Once this was done, the HRTF analysis could be simplified to describe a few global spectral variations. Similarly, Kistler and Wightman^{22} and Middlebrooks and Green^{23} applied the PCA to the logarithms of the HRTF magnitudes to generate low-dimensional and orthogonal representations of the HRTF sets. The work by Kistler and Wightman^{22} exploits the PCA to reconstruct the HRTFs for individual listeners.
In the time domain, Hwang et al.^{28,29,34} modeled the median plane HRIRs from the CIPIC database as linear combinations of 12 basis functions (PCs). Their work used these weights to allow the listeners to interactively selftune the HRTFs. Each basis function was found to represent the pinna, shoulder, and torso effects. The 12 PCs were found to cover the interindividual and interelevation variations within a 5% error bound. Also, in the time domain, Fink and Ray^{31} investigated the effects of the PCs and associated weights on the horizontal plane HRTFs derived from 34 HRIRs within the CIPIC database. The PCs included the weighted spectral distortion. These experiments were designed to investigate whether the average HRTFs can be tuned to match the individual HRTFs. The subjects were not able to distinguish the difference between the full and reduced representation HRTFs.
In addition to using the PCA to reduce the HRTFs on the frequency and time domains, researchers also used the PCA in the spatial domain.^{24,32} Chen et al.^{24} employed the PCA to parameterize the HRTFs by expressing them as a weighted combination of a set of complexvalued eigentransfer functions (EFs). The EFs were an orthogonal set of frequencydependent functions. The weights applied to the EFs are functions of the spatial location, which are called spatial characteristic functions (SCFs). In addition, Xie and Zhang^{32} proposed a method to perform the PCA in the spatial domain, which is called the spatial PCA. The goal of the method was to reduce the dimensionality of the nearfield HRTF data. They found that using the spatial PCA accounted for 98.7% of the variance in the data.
The PCA is not only useful for the HRTF dataset reduction but also for reducing the amount of storage space and computation required to render HRTF-based auditory displays. For example, Keyrouz and Diepold^{26} proposed a binaural impulse response interpolation algorithm for synthesizing moving sounds. In the algorithm, the PCA is used to minimize the amount of storage space needed for the HRTF dataset by selecting several representatives from the whole dataset. With a reduced HRTF dataset, the proper matrix transfer functions or, equivalently, appropriate state-space realizations are created. In another sound source localization study by Keyrouz et al.,^{27} the PCA was also used to minimize the HRTF dataset storage space, which yielded a correct estimated localization rate of 42%–91%. The approach enables fast localization of multiple sources and is simpler than other localization algorithms.
To understand the variation of the HRTF data across subjects, Jin et al.^{25} and Schönstein and Katz^{30} investigated the analysis of the HRTFs using morphological measurements. In this work, a subset of relevant morphological parameters is gleaned using a feature selection, which is then used to predict a subject's HRTF preference and understand how the HRTF perception is affected. In addition, Fan et al.^{33} demonstrated that the PCA decomposition is a valid tool, which has perceptual significance when analyzing the HRTFs.
Simply put, the PCA looks for the PCs that account for the greatest variance in the data and ranks them. The underlying model is implicit. In contrast, another linear dimensionality reduction technique, called factor analysis (FA), explicitly attempts to model the underlying latent variables. It is a statistical method established by Thurstone^{35} to explain variability among the observed random variables in terms of fewer unobserved random variables (factors). The FA is closely related to the PCA. In fact, oftentimes, the two methods are confused for each other and their names are incorrectly used interchangeably. Admittedly, the FA also starts with the PCs. The components are rotated to improve their interpretive significance. The typical goal in the FA is to identify the variables that are related to one another and separate them from the others (a form of variable clustering).
In terms of the HRTF research, Xu et al.^{36} used the FA to identify the most representative anthropometric measurements from 17 data points in the U.S. Army Anthropometric Survey database (ANSUR).^{37} Three representative factors were identified: the vertical factor, horizontal factor, and ear factor. The vertical factor represented vertical measurements such as stature, cervicale height, suprasternale height, and neck height. The horizontal factor included the chest circumference, bideltoid breadth, chest depth, and shoulder circumference. The ear factor included the ear size. The stature, bideltoid breadth, and ear length were chosen as the representative measurements from the three factor groupings because together they explained 62% of the variance in the data.
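A hedged sketch of the FA workflow on synthetic data shaped like the three-factor structure reported above (the measurement groupings and loadings below are invented for illustration, not taken from ANSUR):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical stand-in for anthropometric data: 100 subjects x 9
# measurements driven by three latent factors, loosely mirroring the
# vertical / horizontal / ear grouping described in the text.
rng = np.random.default_rng(1)
factors = rng.normal(size=(100, 3))
loadings = np.zeros((3, 9))
loadings[0, 0:3] = [1.0, 0.9, 0.8]   # "vertical" measurements
loadings[1, 3:6] = [1.0, 0.8, 0.7]   # "horizontal" measurements
loadings[2, 6:9] = [1.0, 0.9, 0.6]   # "ear" measurements
X = factors @ loadings + 0.1 * rng.normal(size=(100, 9))

fa = FactorAnalysis(n_components=3, random_state=0)
scores = fa.fit_transform(X)          # per-subject factor scores
print(fa.components_.round(2))        # estimated loadings per measurement
```

Inspecting `fa.components_` shows which observed measurements load on which latent factor, which is exactly the interpretive step that distinguishes the FA from the PCA.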
The Bayesian model selection (BMS) is another technique that can be used to identify important features for the ML model generation. The BMS is primarily used in contexts in which there are multiple competing models, and the goal is to select the best model. The BMS is also a useful feature selection tool. Botts et al.^{38} used the BMS for dimensionality reduction. The researchers used the Bayesian inference to select the most concise filter order and estimate the associated filter coefficients simultaneously. Their work demonstrated that if each HRTF is to be concisely represented, the same order should not be applied across the database. The model orders between five and ten produced the most concise models.
In the studies previously discussed, the PCA was validated as a simple, effective method to reduce the dimensionality of the HRTF datasets for modeling. The reduced datasets use less storage space,^{26,27} which can improve the interpolation performance and simplify the HRTF or HRIR analysis procedure. Even considering the dimensionality reduction, the PCA maintains an acceptable performance rate. Mathematically, the PCA is the simplest type of FA. It only requires the eigenvector extraction with no rotations or transformations. However, from a practical standpoint, the PCA is an observational approach and the FA is a modeling approach. Few studies employ the FA as a dimensionality reduction technique in the HRTF research. One reason for this observation is that the PCA is simple to implement. The FA as a statistical method is focused on testing theoretical models of the latent factors causing the observed variables. For simply reducing the correlated observed variables to a smaller set of important independent composite variables, the PCA is usually the first choice.
III. HRTF PREDICTION
Because the HRTF is idiosyncratic, many researchers have suggested using existing publicly available HRTF databases with corresponding anthropometric measurements for each subject to personalize the HRTFs.^{5,30,39–41} Methods like the PCA and other related linear models, such as the FA, assume that the data lie on a linear subspace of the input space. These linear algorithms maximize the variance in the linear subspace and then compute the maximum likelihood estimation. The algorithms seek to maximize the retained variance so that the most important features are kept after the dimensionality reduction. However, the performance of such an algorithm suffers when the data follow a nonlinear distribution or when outliers are present. Furthermore, some datasets still cannot be linearly separated after the dimensionality reduction. In addition to the linear reduction methods discussed in Sec. II, nonlinear dimensionality reduction is appropriate for the HRTF research, especially when categorizing high-dimensional anthropometric data.
A. Nonlinear dimensionality reduction
The isometric feature mapping (isomap),^{42} proposed by Tenenbaum et al., and the locally linear embedding (LLE),^{43} proposed by Roweis and Saul, are two representative nonlinear dimensionality reduction algorithms that provide an implicit description of the mapping between the high and low dimensions. The LLE algorithm can be summarized as follows.
Let R^d be a Euclidean space containing a d-dimensional domain Y, and let f : Y → R^D be a mapping from this domain to a Euclidean space of larger dimension, i.e., D > d. The nonlinear reduction aims to recover Y and f given N points in R^D. Given X = {x_i ∈ R^D | i = 1, …, N}, find Y = {y_i ∈ R^d | i = 1, …, N} such that x_i = f(y_i) for i = 1, …, N.

Find K nearest neighbors for each data point x_{i}.

From the neighbors of the data point x_i, calculate the weights W_ij that best linearly reconstruct x_i by solving the constrained least-squares problem, which sums the squared distances between all of the data points and their reconstructions. If the data lie on or near a smooth nonlinear manifold of lower dimensionality, then there exists a linear mapping, consisting of a translation, rotation, and rescaling, that maps the higher-dimensional coordinates of each neighborhood to the global internal coordinates on the manifold.

Calculate the lower-dimensional embedding vectors y_i, which are best reconstructed by the weights W_ij.
The intrinsic geometric properties of each neighborhood are described by these weights. By design, the weights minimize the reconstruction errors and are invariant to rotation, rescaling, and translation of the data points. The same weights that reconstruct the data points in D dimensions should also reconstruct them in a d-dimensional manifold.
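The three LLE steps above can be sketched with an off-the-shelf implementation; the "swiss roll" here is a standard toy manifold standing in for high-dimensional HRTF data, not a dataset from the reviewed studies:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

# A 3-D "swiss roll" whose points actually lie on a 2-D nonlinear manifold.
X, _ = make_swiss_roll(n_samples=1000, random_state=0)

# K nearest neighbors -> local reconstruction weights W_ij ->
# low-dimensional embedding vectors y_i, matching the steps listed above.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Y = lle.fit_transform(X)   # each 3-D point mapped to 2-D coordinates
```

The choice of `n_neighbors` controls how local the reconstruction is; too few neighbors fragments the manifold, too many flattens its curvature.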
Matching anthropometric parameters against a baseline anthropometric feature database is one of the most straightforward ways to personalize the HRTFs, and this technique has been adopted by several researchers.^{40,44,45} Middlebrooks^{46} proposed using frequency scaling of the HRTFs based on the anthropometric parameters. The method assumes that a frequency shift in the individualized HRTFs is produced by the differing anthropometric features among the subjects. The anthropometric approach breaks down the inherent multicomponent (frequency, direction, and subject) nature of the HRTF. The success of this work suggests that a HRTF can be predicted for a given subject because the optimal HRTF can be estimated using the physical measurements of a subject's pinna cavity and head width.
In the isomap, the process of mapping from a high-dimensional space to a lower-dimensional space is transformed into a graph problem. By preserving each pairwise geodesic distance (the shortest distance traveled between any two points on the manifold), the isomap maintains the correlations among the neighborhoods of the data points. The algorithm can be summarized in the following steps:

On the high-dimensional dataset X, construct a graph G(X, E). For each point x_i, determine the set of neighbors either by taking the K nearest neighbors or finding all neighbors within a given radius. Then a neighborhood graph is constructed with each edge between two data points assigned a weight corresponding to the length of the edge.

Compute the shortest path between each pair of points in graph G to get the geodesic distance. After all of the geodesic distances between all of the data points have been calculated, store them pairwise in a matrix D_{G}.

Apply multidimensional scaling on D_{G} to construct the d-dimensional embedding.
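The three steps above correspond directly to scikit-learn's `Isomap` estimator. The sketch below uses synthetic manifold data rather than HRTFs, and the neighborhood size is an arbitrary choice.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# Synthetic 2-D manifold embedded in 3-D space.
X, _ = make_s_curve(n_samples=400, random_state=0)

# Steps above: build the K-nearest-neighbor graph, compute the
# shortest-path (geodesic) distance matrix D_G, then apply
# multidimensional scaling on D_G for the d-dimensional embedding.
iso = Isomap(n_neighbors=10, n_components=2)
Y = iso.fit_transform(X)

# dist_matrix_ holds the pairwise geodesic distances D_G.
print(Y.shape, iso.dist_matrix_.shape)
```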
Grijalva et al.^{47} used the isomap to predict the low-dimensional HRTFs from the anthropometric features. The work proposed a method to customize the individual HRTFs based on their anthropometric features. This work exploits the existing correlations among the HRTFs across ears, directions, and individuals and then integrates the prior knowledge of binaural hearing into a single manifold for all of the subjects. In this work, the isomap was applied for the azimuth and elevation over the intraconic component of the HRTFs, resulting from a spectral decomposition, in which the HRTF intraconic components encode the most important cues for the HRTF individualization, leaving out the subject-independent cues. The output was used in a NN to predict the low-dimensional HRTFs from the anthropometric features, which will be discussed in Sec. V.
The HRTF prediction is a challenging task, especially when the features used for the prediction are not known. Luckily, in real-world classification problems, data from the same category often tend to be close to one another. The Laplacian score (LS) was introduced as a method of evaluating the importance of a feature by the power of its locality preserving projection (LPP). The LPP is a dimensionality reduction algorithm that builds a graph incorporating the neighborhood information of an n-dimensional dataset. The Laplacian of the graph is used to compute a transformation matrix, mapping the data points to a subspace. This linear transformation preserves the neighborhood information optimally. The representation map created can be treated as a linear discrete approximation to a continuous map, which naturally arises from the geometry of the manifold. As the LPP is designed to preserve the local structure, a nearest neighbor search in the low-dimensional space will likely yield results similar to a nearest neighbor search in the high-dimensional space.^{48} In the LS, all of the features are relevant, although some may prove redundant. The features selected by the LS that are close neighbors in the input space will also be close in the lower-dimensional space, preserving the data locality. The complete algorithm and justification can be found in He et al.^{49} After obtaining the LS of every feature, the most important features can be selected from the original dimensions based on their scores.
Li and Huang^{50} used the LS and correlation analysis to predict the individual HRTFs according to the anthropometric parameters. The score was used to select the important anthropometric parameters among all of the measured parameters and the individual core tensor, or output of the model. These parameters were then used as the input for a personalization model incorporating a NN. Similarly, Huang and Li^{51} used the LS to select the key anthropometric features. In their work, a tensor model was proposed to describe the HRTFs depending on the frequencies, sound directions, and anthropometric features. The model keeps the multidimensional structure of the measured HRTF. In their individualized HRTF modeling algorithm, the LS and correlation analysis are used for selecting the key anthropometric parameters, which are used as their multilinear regression input. The work found that the HRTFs predicted using this method elicited a performance in the listening tests similar to that of the measured HRTFs.
Overall, in selecting the correct dimensionality reduction technique for the HRTF prediction research, a few trade-offs should be considered. For example, with the isomap, the dimensionality reduction is transformed into a global graph problem, and the correlation among the data points is described through the use of geodesic distances. The graph problem assumption relies on the existence of well-sampled HRTF data. If the neighborhood contains holes due to missing data, the prediction performance will suffer. In addition, the isomap is sensitive to outliers and can be unstable, depending on the topology of the data. Thus, the isomap may present challenges when using the anthropometric features, which may have outliers. Locally linear embedding preserves the locally linear configurations of the nearest neighbors and is more robust to outliers. Compared to the isomap, the LLE may have a better performance because of its locally optimal nature. However, the LLE lacks the capability to describe the intrinsic correlation globally in the high-dimensional space, which is a positive property of the isomap.
IV. HRTF CATEGORIZATION
One focus of the HRTF research is on categorization. If the HRTFs can be grouped into distinct categories, a variety of challenges can be solved. Putting listeners and the HRTFs into distinct groups can aid rapid HRTF customization, reduce the number of different HRTFs that a virtual auditory reality system needs to store to represent all of the listener preferences, and aid in understanding the listener perceptual variability. Unfortunately, the data are not always labeled categorically, which suggests the use of an unsupervised learning technique such as k-means or hierarchical clustering.
A. Unsupervised learning
Unsupervised learning is used to find the hidden patterns or intrinsic structures in the datasets. The technique can be used to derive structure from the data without knowing the effects of the variables. It is used to draw inferences from the datasets consisting of input data with unlabeled responses. The structure can be derived by clustering the data based on the relationships among the variables in the datasets.
k-means clustering, a categorization algorithm, iteratively partitions the data into k mutually exclusive clusters. In each iteration, the data points are assigned to the cluster with the nearest mean. The data in a constructed cluster have greater similarity to one another than to the data in other clusters. In most implementations, the squared Euclidean distance is used as the similarity measurement when clustering.^{52} The algorithm first chooses k centroids, then iteratively calculates and minimizes the sum of the distances from each of the points to its cluster centroid over all of the clusters. The points are moved between the clusters until the total sum cannot be further minimized.
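A minimal sketch of this procedure with scikit-learn, using synthetic two-dimensional data as a stand-in for, e.g., PCA-reduced HRTF features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groups (2 features per point).
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Choose k centroids, then iteratively reassign points to the nearest
# centroid and recompute the centroids until the total within-cluster
# sum of squared distances stops decreasing.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_.shape)  # one centroid per cluster
print(sorted(set(km.labels_)))    # each point assigned to one of k clusters
```

Note that k must be fixed in advance, and the result can vary with the centroid initialization; both points are discussed later in this section.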
In HRTF research, Fan et al.^{33} used k-means clustering to determine whether the PCA features on which the HRTFs were clustered have a perceptual linkage, as discovered through a subjective selection procedure. The work highlights the challenge of accurately decomposing a HRTF such that it can be represented as a set of points in space. The researchers found that the most informative HRTF representation is centered around the band energy. This finding suggests that the clusters are more distinctly differentiated when the PCA examines the HRTFs for one azimuth. Using hierarchical clustering (discussed later in this section), Lemaire et al.^{53} revealed groups of HRTFs with similar profiles. After this, k-means was used to cluster the HRTFs based on the HRTF positions expressed as Cartesian coordinates. One representative HRTF was selected from each cluster as part of the input for their NN model, which will be discussed in Sec. V.
Hierarchical clustering constructs a hierarchy of clusters and has two different types: hierarchical agglomerative clustering (HAC) and divisive clustering. In HAC, the clusters are formed from each data point, for example, a HRTF. Initially, each data point is considered its own cluster. The clusters are then merged pairwise from the bottom up, based on similarity, until the entire hierarchy tree is built. The merges are determined greedily by computing the similarity to all of the clusters and then merging with the most similar cluster. A dendrogram-like diagram is commonly used for representing the hierarchical clustering. One of the advantages of this approach is the flexibility to adjust the clustering granularity; thus, hierarchical clustering has been adopted as a popular analysis method by many researchers.^{53–56}
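The bottom-up merging and the granularity flexibility can be sketched with SciPy's hierarchical clustering routines. The "subjects" below are synthetic feature vectors, not measured HRTFs; cutting the same linkage tree at different levels yields clusterings at different resolutions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two synthetic groups of 10 "subjects", each described by 8 features.
X = np.vstack([rng.normal(0.0, 0.3, (10, 8)),
               rng.normal(3.0, 0.3, (10, 8))])

# Agglomerative (bottom-up) merging; Z encodes the full dendrogram.
Z = linkage(pdist(X), method="average")

# Flexibility of granularity: cut the same tree at different levels.
two = fcluster(Z, t=2, criterion="maxclust")
four = fcluster(Z, t=4, criterion="maxclust")
print(len(set(two)), len(set(four)))  # coarse vs. finer partitions
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the dendrogram-like diagram mentioned above.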
Andreopoulou and Katz^{54} used hierarchical clustering to analyze the HRTF similarity based on the user perception in a listening task. In their work, the participants listened to sounds generated using different HRTF databases and identified the perceived locations. Their responses were used to quantify the spatial distortion. They proposed a global perceptual distance metric that was derived from subjective evaluations of the binaural sound renderings that move along predefined trajectories in the horizontal and median planes. The evaluation involved several different hierarchical clusterings. The mixed results indicate that subjective measures of the perception are not reliable data points on which to cluster.
Xie et al.^{55} showed that hierarchical clustering on the binaural HRTF magnitudes could be used to analyze the similarity in the HRTF datasets. They measured the HRTFs of 52 participants and then clustered the participants into 7 groups based on their HRTF magnitudes. The distance, or dissimilarity, in the HRTF magnitudes of each subject pair was calculated based on the mean cross-correlation coefficient. The HAC was used to cluster the HRTFs. In a localization study, spatialized sounds were played using the HRTFs within each listener's cluster and outside of the cluster. The results indicated that for most people, localization improved when the HRTFs from within their own cluster were used.
Another study by Xie et al.^{56} used the HAC to cluster the individual HRTFs and analyze the similarity among the different frequency ranges. In their research, the individualized wavelet coefficients from a distance matrix were clustered into eight groups. Then, a two-level wavelet decomposition was applied to analyze the similarity of the individualized HRTFs in the different frequency ranges. Through this research, they concluded that the HRTFs of most of the subjects are highly similar in the 0–6 kHz range but vary significantly in the 6–12 kHz range and above, suggesting that the 6–12 kHz range is more important for the localization. Their work suggests that the HRTFs can be customized by replacing the wavelet components in the 6–12 kHz range in the general HRTFs from each cluster. This solution would greatly simplify the HRTF customization.
In deciding which clustering algorithm to use, there are some trade-offs to consider. Both k-means and hierarchical clustering are useful unsupervised clustering techniques for the spatial auditory display research. The k-means is easy to use and has a better computational time complexity compared to the hierarchical clustering. However, the results of k-means clustering are not consistent. Also, the clustering requires a predetermined number of clusters k. The HRTF clustering research is too young to suggest a definitive, feasible, agreed-upon value for k. Even with the same k, due to variations in the centroid selection strategy (traditionally, random selection), the clustering results will sometimes still differ. In contrast, the hierarchical clustering will always construct the same hierarchical structure as long as it is given the same split/merge decision rule and data. Another advantage of the hierarchical clustering is the flexibility of the clustering granularity: the clustering can be at any resolution level in the hierarchical tree structure. For the small datasets, such as the HRTF databases, the hierarchical clustering provides better flexibility and consistency. But if the size of the dataset is extremely large and the data distribution is hyperspherical (or circular in two dimensions), the k-means is a useful and efficient tool. In the HRTF research, the choice of the algorithm is based on what kind of problem needs to be solved. Currently, most of the studies discussed, which used hierarchical clustering to analyze the HRTF similarity, had relatively small datasets. It is reasonable to expect an increase in the research and applications using both of these algorithms to solve clustering problems in the HRTF research.
In terms of the clustering data, using the HRTF magnitude data seems to be more promising than using the subjective participant-reported measures. The participant self-reporting tends to be inconsistent with high subject variability. The HRTF magnitude data do not face such challenges and may be the better data choice. In addition, the binaural differences at various frequencies seem to be promising data on which to cluster. Currently, to the authors' knowledge, research does not exist that clusters the HRIR or anthropometric data.
V. SUPERVISED LEARNING METHODS FOR UNKNOWN HRTFs
In ML, a dataset is used to train a model that accurately captures the underlying data. After training, the model can be used to make predictions for new data that do not belong to the training dataset. The supervised learning algorithms, which differ from the unsupervised learning algorithms, use labeled data as the input for model training. This technique is useful for estimating the HRTFs for a new listener using only easily measured anthropometric data and for estimating a listener's HRTF for a sound emanating from an unknown or unmeasured location.
Supervised learning considers an input variable (x) and an output variable (y) and uses an algorithm to learn the mapping function from the input to the output y = F(x). If the mapping function is well approximated, then any new input data substituted for x can accurately predict the output value of y. Supervised learning can be divided into two categories: classification problems and regression problems. The classification techniques predict discrete responses and are used in applications like junk electronic mail filtering, finding tumors in medical images, etc. The classification models separate the input data into distinct categories. Compared to the classification techniques, regression predicts the continuous data responses. Over the years, many supervised learning algorithms have been proposed. This section presents three representative algorithms: a support vector machine (SVM), Gaussian models, and NNs.
A. SVM
The SVM^{57} is a linear supervised learning model, which is widely used for data analysis, pattern recognition, and classification. The SVM uses a linear decision boundary (hyperplane) to classify the data. When the data are linearly separable, the data points of one class are separated from the other data points by the hyperplane with the largest margin between them (as shown in Fig. 7). If the data are not linearly separable, a loss function is used to penalize the points on the wrong side of the hyperplane. Through a kernel transformation, the SVMs can map the nonlinearly separable data into higher dimensions where a linear decision boundary can be found, as depicted in Fig. 8.
In their 3DTV research, Tang et al.^{41} used support vector regression (SVR) to customize the individual HRTFs based on the anthropometric features. The researchers used the anthropometric features defined and described by Algazi et al.^{5} in their analysis. Specifically, they used the 14 anthropometric parameters with the largest coefficients compared among the HRTFs at all directions and anthropometries. Those measurements were x1, x3, x6, x8, x9, x11, x12, x14, x17, d2, d5, a1, w1, and θ1. Tang et al. first used non-negative matrix factorization to decompose the HRTFs into a lower-dimensional matrix. Then, they trained a SVR model using the anthropometric parameters and the previously constructed lower-dimensional HRTF matrix. Using the CIPIC database to test their model, they found that some of the anthropometric parameters had a greater influence than others on their regression model.
Schönstein and Katz^{30} used the SVM to select a subset of the listeners' most significant anthropometric parameters from the CIPIC HRTF dataset. The most significant morphological parameters were normalized and determined using feature selection. The PCA was used to select the most relevant frequency range of the HRTFs; then, a SVM model was trained with a backward elimination method, iteratively eliminating the features based on the SVM classification output. The features remaining after this procedure were considered to be the most significant anthropometric parameters for the subjects associated with the HRTF database. The process predicted the listeners' preferred HRTFs significantly better than random selection. This work supports the notion that the SVM models can be used to separate the HRTFs and that those separations have a perceptual significance to the listeners.
In the research by Huang and Fang,^{58} the SVR was used to construct a nonlinear regression model for the personalized HRIRs. The PCA was first used on the anthropometric features from the CIPIC database, and 12 important weight vectors correlated to the anthropometric parameters were selected. These weight vectors were then used as the output of the nonlinear model, whereas the anthropometric parameters were used as the input. After being trained on the CIPIC datasets, the model could be used to predict a listener's unknown HRIR based solely on the anthropometric parameters. The prediction results outperformed a three-layer back-propagation NN model using the same data.
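A hedged sketch of this style of SVR-based personalization: anthropometric parameters as the input, a PCA-style weight as the regression target. The data below are synthetic stand-ins (not the CIPIC measurements), and the model settings are illustrative assumptions rather than the cited studies' configurations.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Hypothetical data: 40 subjects x 12 anthropometric parameters, and
# one PCA weight of their HRIRs (here a synthetic nonlinear function).
anthro = rng.normal(size=(40, 12))
weight = anthro[:, 0] * 0.8 + np.sin(anthro[:, 1]) \
    + 0.05 * rng.normal(size=40)

# RBF-kernel SVR: a nonlinear regression from anthropometry to weight.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(anthro, weight)

# Predict the weight for a new, unmeasured subject.
new_subject = rng.normal(size=(1, 12))
pred = model.predict(new_subject)
print(pred.shape)
```

In practice, one such model would be trained per retained PCA weight, and the predicted weights would then be recombined into an HRIR, which is exactly the per-dimension limitation discussed at the end of this subsection.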
Wang and Chan^{59} used the joint SVR, which is slightly different from the algorithms of the previous studies, for the HRIR customization; it outperforms the original SVR algorithm. The authors employed their own two-dimensional common factor decomposition (2DCFD) algorithm to reduce the HRIR dimensionality by representing the HRIR as a subject-dependent impulse response (SDIR) convolved with a common set of direction-dependent impulse responses (DDIRs).
In all, using a SVM will yield a greater generalization capability and performance than many of the classical statistical learning algorithms. The fact that the SVM can be applied to time series prediction and nonlinear regression makes it a good choice for the HRTF research. As demonstrated in the work presented in this section, a SVM is mainly used for predicting the unknown, individualized HRTFs and HRIRs using the anthropometric parameters. A standard SVM acts as a mapping from a high-dimensional space to a one-dimensional space, making it necessary to train a separate model for each dimension/weighting of the HRTF/HRIR, which leaves some underlying correlations unexposed. However, the joint SVR model proposed in the research by Wang and Chan^{59} can be considered a good solution for such problems.
B. Gaussian models
In a complete, realistic virtual auditory reality environment, a listener should be able to perceive the position of a sound source at any arbitrary location within the 3D space. Unfortunately, the HRTFs are only sampled at specific discrete points in the 3D space, separated by as much as 15° in some cases. To simulate the unknown HRTFs of these unsampled locations, oftentimes, the HRTFs are interpolated. This approach may be adequate for the far-field sound sources, which can allow some inaccuracy; however, the resulting simulation cannot perfectly recreate the spatial cues for all of the possible source locations.^{60} The Gaussian models are an alternative approach that can yield a viable solution.
Similar to the SVM, which is a popular “kernel machine” learning algorithm, the Gaussian process (GP) is a generic supervised learning method designed to solve regression and probabilistic classification problems. By definition, “A Gaussian process is a generalization of the Gaussian probability distribution. Whereas a probability distribution describes random variables which are scalars or vectors (for multivariate distributions), a stochastic process governs the properties of functions.”^{52} In the GP model, the observations occur in a continuous, infinite domain. Every point in the input space is associated with a normally distributed random variable, and each finite collection of these variables follows a multivariate normal distribution. Joining all of these random variables, of which there are infinitely many, yields the distribution of a GP. In the HRTF setting, this process assumes that all of the HRTFs (for a specific subject) are drawn from the same nonparametric latent function f. A covariance structure is also assumed in which topologically similar direction-frequency inputs produce similar magnitude HRTF frequency responses. This assumption is consistent with the observation that the HRTFs generally vary smoothly over the space and frequency coordinates.
For example, Luo et al.^{61} used a joint spatial-frequency GP regression model for the HRTF representation and interpolation. The model is based on the spatial-frequency relationship among the HRTF measurements. The work used the CIPIC HRTF measurements to infer the HRTF for a previously unseen location or frequency. In that work, half of the grid directions for a subject's HRTF were randomly chosen to comprise the measurement dataset. The GP regression model was used to infer the remaining grid directions. Their method achieved better interpolation and extrapolation accuracy than the existing spherical interpolation methods.
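A minimal one-dimensional sketch of GP interpolation in this spirit, using a synthetic, smoothly varying "magnitude" curve in place of measured HRTF data; the kernel choice and sampling grid are assumptions, and the full joint spatial-frequency model of Luo et al. is considerably richer.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical data: magnitude (dB) at one frequency as a smooth
# function of azimuth, sampled on a coarse measurement grid.
az = np.linspace(0.0, 2.0 * np.pi, 12)[:, None]
mag = -6.0 * np.cos(az).ravel() + 2.0 * np.sin(2.0 * az).ravel()

# Smoothness prior (RBF covariance) plus a small noise term.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-4)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(az, mag)

# Infer the magnitude, with uncertainty, at an unmeasured azimuth.
query = np.array([[np.pi / 5.0]])
mean, std = gp.predict(query, return_std=True)
true = -6.0 * np.cos(np.pi / 5.0) + 2.0 * np.sin(2.0 * np.pi / 5.0)
print(mean[0], std[0])
```

Unlike plain interpolation, the GP also returns a predictive standard deviation, which quantifies how trustworthy the inferred value is between measurement points.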
As mentioned previously, many HRTF databases are publicly available for research;^{5,9–12,62} however, each measurement laboratory has its own specific setup and unique measured participants. Each database uses its own measurement grid, which varies significantly, making an HRTF comparative study nearly impossible. Few studies are able to compare a common subject or model inter-dataset variances. In response to this limitation, Luo et al.^{63} presented a joint spatial-frequency GP fusion method over different spherical-measurement grids for combining the common subject HRTF datasets via a set of linear transformations. This is critical because learning a set of transformations that captures the inter-laboratory variances can help to generalize the variances among the measurement processes, which can then be applied to the HRTFs of the subjects unique to each database. Their experimental results indicated that the transformed datasets generalize to the target datasets better than the nontransformed cases.
In another study, Luo et al.^{64} used the GP model for sound source localization. The model first learns a mapping between the binaural features that are invariant to the sound source (log-magnitude ratio, phase difference, average magnitude ratio, and magnitude pair) and the direction of a sound source. The model was then used to predict the location of an unknown sound source from the binaural cue observations. The trained model was tested using the CIPIC database. The GP model of Luo et al.^{64} outperformed other models such as the NNs and the least squares method. The advantage of this model is that during the training stage, it can learn the implicit HRTF model embedded in the parameters; this implicit model can then replace an explicit HRTF model. Admittedly, this method has some limitations. For example, it can only handle one broad-spectrum sound source, such as white noise. As a solution, Deleforge et al.^{65} proposed an improved model based on Gaussian locally linear mapping (GLLiM). In their model, multiple sound sources can be localized simultaneously without relying on the sound source separation or monaural segregation. This research provides solutions and inspiration for future research on objective sound source localization.
In all, the GPs are optimal for the HRTF applications in which the HRTF varies smoothly over the space and frequency and there is a need to compute the HRTF for a location that was not measured. Typically, interpolation is used in these cases; however, if the application has the time and computational power, a GP should be used. In addition, a GP can be used to predict the location of a sound. This method is best for scenarios in which the actual sound is unknown and the signal has been observed by two separated microphones because binaural features that are invariant to the sound source can be used to determine the sound's location.
C. NN
NNs have also been used to predict the sound source locations, reduce HRTF dimensionality, and create customized HRTFs for an unknown person. A NN, sometimes referred to as an artificial neural network (ANN), can be defined as “a computing system made up of a number of simple, highly interconnected processing elements, which process information by their response to external inputs.”^{66}
An ANN is an interconnected assembly of simple processing elements called nodes. The ANN processing power resides in the inter-node connection strengths, or weights, obtained by learning patterns from a set of training data. Before reaching a node, each input is multiplied by a weight. The weighted signals are summed to supply a node activation. Using threshold logic, the activation is compared to a predefined threshold: if the activation exceeds the threshold, the node produces a high-value output (usually one); otherwise, it outputs zero. The ANNs can be thought of as a series of linear transformations, followed by nonlinearities.^{67}
For the ANN to classify a given set of input data, the decision is determined by the weight vector and threshold, which must be adjusted to produce the desired result. Generally, the adjustment is performed iteratively by repeatedly presenting examples of the required task and making small changes to the weights and thresholds to correct the classification. This process is called training, and the subset of examples used to train the ANN is called the training set.^{68}
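The weighted sum, threshold logic, and iterative small weight adjustments described above can be sketched as a single-node network (a classic perceptron) trained on synthetic, linearly separable data; the learning rate and epoch count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic training set: 100 two-feature examples, labeled by a
# linearly separable rule.
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)   # inter-node connection weights
b = 0.0           # bias, playing the role of the threshold
for _ in range(20):                      # training epochs
    for xi, yi in zip(X, y):
        out = float(w @ xi + b > 0)      # weighted sum + threshold logic
        w += 0.1 * (yi - out) * xi       # small corrective change
        b += 0.1 * (yi - out)

acc = np.mean((X @ w + b > 0) == y)      # training-set accuracy
print(acc)
```

Each misclassified example nudges the weights and threshold slightly toward the correct decision, which is exactly the iterative training loop the paragraph describes.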
In 1943, McCulloch and Pitts^{69} were the first to create a mathematical model for a NN. Jenison,^{70} a pioneer in the application of NNs to HRTF research, used a NN to approximate the parameters of a pole-zero model of an HRTF. Normally, the ANNs have been used to solve problems that exist in a multidimensional coordinate space. However, the HRTFs exist in a spherical (polar) coordinate system. As such, Jenison proposed a new ANN, called the von Mises basis function (VMBF) NN, based on the well-known spherical inferential statistics function bearing the same name. This model was used to approximate a functional mapping from the spherical coordinates to the parameters of a pole-zero model. This method can be used to approximate the HRTFs of the unmeasured locations. In addition, the pole-zero model from this work can be used as an additional dimensionality reduction technique.
Palomäki et al.^{71} explored the use of NNs to simulate humans' perception of spatial sound. They proposed a binaural model in which 32 ITD and ILD values were generated and used as the training data for their NN model. The corresponding sound source directions were used as the output data. The trained model was then tested with different virtual sound sources. Their investigation yielded a poor performance when the model attempted to learn directions close to the median plane. This was likely because the ITD and ILD values alone did not provide enough information. From this work, one may observe that although it has been previously demonstrated that the interaural cues are optimal to predict the unknown HRTFs, reducing the dataset to the ITD and ILD values alone does not suffice.
Lemaire et al.^{53} sought to use the NNs to synthesize the HRTFs for any user, given some HRTF measurements of the other users. In their work, the training data were an input vector derived from the HRTFs in the CIPIC dataset. The output vector is the target unmeasured HRTF for a user. Lemaire et al.^{53} wanted to define a function f as $\mathrm{HRTF}_{o}^{\lambda} = f(\mathrm{HRTF}_{1}^{\lambda}, \ldots, \mathrm{HRTF}_{n}^{\lambda}, \theta_{o}, \varphi_{o})$, where the output vector is the desired, unmeasured HRTF o for a user λ, denoted as $\mathrm{HRTF}_{o}^{\lambda}$, and the input vector is composed of the azimuth $\theta_{o}$ and elevation $\varphi_{o}$ of the target unmeasured $\mathrm{HRTF}_{o}^{\lambda}$ and one or more HRTFs ($\mathrm{HRTF}_{1}^{\lambda}, \ldots, \mathrm{HRTF}_{n}^{\lambda}$) measured on user λ. Their work demonstrated that NN-based modeling can yield acceptable HRTFs for any user, with an average error of less than 3 dB. It should also be noted that although there are 1250 HRTFs in the CIPIC database, a target HRTF can be created using only 50 of those HRTFs. This work is promising; however, one must caution that perceptual studies have not been employed to assess the effect on the localization performance or virtual auditory perception.
Correa et al.^{72} used a NN to localize an anechoic broadband spatial sound. To create the NN, the system extracted the ITD, ILD, and interaural spectral level difference (ISLD), which is the energy difference per frequency band between the signals arriving at the two ears. These extracted parameters were combined in a feature vector and used as the input of the NN. The network then predicted the location of the sound sources on the median plane and with elevation. The system achieved a good performance with an average error of 5.3°. However, one limitation of the work is that the model only works for broadband sound sources, and the localized sound is assumed to exist in an anechoic environment, which is not representative of real-world contexts; thus, the model cannot be applied directly to such problems.
Attacking this problem in the time domain, Hu et al.^{73} proposed a NN model for the individualized HRIRs. First, they selected key anthropometric parameters as the input for the NN. The model was trained in each direction with the selected anthropometric parameters, and the output of the model was the PCs of the HRTF and the ITD at specific directions. The trained model was used to reconstruct the individual HRIRs based on the new anthropometric parameters. Li and Huang^{50} also demonstrated the use of NNs for the HRTF individualization. They used a tensor to describe the dependence of the HRTFs on the frequencies, sound directions, and anthropometric parameters. The researchers first extracted the core tensor and the key anthropometric parameters among all of the measured parameters through the LS, which was described in Sec. III. These parameters were then used as the input for a personalization model based on a radial basis function (RBF) NN. The trained model was used to predict the HRTF preference based on a listener's anthropometric parameters.
In all, the NN is a classical supervised learning technique, which can be used to solve both the linear and nonlinear problems. As indicated in the relevant literature, using NNs in the HRTF research is increasing in popularity compared to the SVM applications, which are mainly focused on anthropometric features. When using NNs to create unknown HRTFs, there are several points to keep in mind. A NN is only as powerful as the data input. Using the ITD and ILD alone is not sufficient to predict the HRTFs. Other features (such as interaural magnitude differences at various frequencies) were demonstrated to improve the performance. In addition, because the HRTF data exist in a spherical system, the spherical statistical methods should be used instead of the linear methods. Many of these models were only observed using the HRTFs that were generated from sounds recorded in an anechoic environment. The researchers who wish to adapt these techniques to realworld applications should keep this in mind and consider other features to adapt their models.
VI. PERSONALIZING A HRTF
Generally, in the HRTF research, identifying a suitable HRTF for a listener is the end goal. However, this HRTF is not guaranteed to elicit accurate perceptual cues. A small amount of research has been devoted to improving a HRTF through personalization. This section will briefly discuss how reinforcement learning and deep learning have been used to personalize the HRTFs.
A. Reinforcement learning
Overall, the general nature of learning involves interacting with one's environment and observing the effects. For example, when an infant is learning about motion, vision, and hearing, there is no explicit teacher; however, infants have a sensorimotor connection to their environments. Iteratively practicing the connection generates feedback about cause-and-effect and the consequences of their actions in the environment. Reinforcement learning operates in a similar manner in which the model is not explicitly told which action to perform. The learner must explore to discover the action that yields the greatest reward.^{74}
In the standard reinforcement learning model depicted in Fig. 9, the agent is connected to its environment. In each interaction with the environment, the agent receives an input i and an indication of the current state s. Based on this input, the agent chooses an action a to generate as output. The action changes the environment's state, and this change is communicated to the agent through the reinforcement signal r. The agent's behavior B should choose actions that maximize the long-run sum of the reinforcement signal, and learning proceeds through trial and error under different reinforcement algorithms.^{74,75}
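The agent-environment loop described above can be sketched with tabular Q-learning on a toy chain task. This is not an HRTF problem; the states, actions, and reward here are invented purely to make the loop concrete.

```python
import random

random.seed(0)

# Toy chain environment: states 0..4, reward only on reaching state 4.
# Action 0 moves left (or stays at 0), action 1 moves right.
N_STATES, GOAL = 5, 4
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, eps = 0.5, 0.9, 0.2

for episode in range(200):
    s = 0
    while s != GOAL:
        # The agent observes state s and chooses action a (epsilon-greedy)...
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = int(Q[s][1] > Q[s][0])
        s2 = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
        r = 1.0 if s2 == GOAL else 0.0          # reinforcement signal
        # ...and updates its value estimate from reward and next state.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# After learning, "move right" should dominate in every non-goal state.
policy = [int(Q[s][1] > Q[s][0]) for s in range(GOAL)]
```

The reward signal alone, fed back through repeated interaction, is enough to shape the behavior; no labeled examples are ever provided.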
Although few researchers have studied the application of reinforcement learning to the HRTF personalization, the results are quite promising. Morioka et al.^{76} proposed a model based on reinforcement learning to simultaneously customize and improve the individual HRTFs. In their system, a HRIR was modeled using an autoregressive moving average (ARMA) model. An out-of-head sound source was presented to a simulated user to evaluate the localization accuracy. The actor-critic reinforcement learning algorithm was used to evaluate states based on the reward (spectrum distortion) and generate output. Their work showed that this method was effective in reducing the spectral distortion such that effective sound localization could occur; however, the listener's performance was simulated. Later, in their follow-up work, Nambu et al.^{77} tested the model on human subjects and confirmed the improvement in localization accuracy for the learned HRTF.
Reinforcement learning has not (yet) been widely adopted in spatial auditory display research. A major challenge in reinforcement learning systems for the HRTF data is that many trials are needed to personalize the HRTF because the HRTF search space is large. One solution to this problem could be to constrain the search space to only contain the HRTFs of measured listeners whose anthropometric measurements closely match those of the experiment participant. As stated by the authors, the aforementioned work would also be improved by modifying the ITD and ILD in addition to the HRTF spectrum.
Overall, reinforcement learning is often challenged by the tradeoff between exploration and exploitation, a situation that never occurs in supervised learning systems. The agent prefers actions that have already been tried and proven effective in producing rewards; however, to find such actions, the system must try actions that have never been used. Because neither exploitation nor exploration can be pursued exclusively, the agent has to try different actions (each of which must be attempted more than once to yield a reliable reward estimate) and progressively favor the best one. This dilemma is computationally intense even before introducing complex, high-dimensional HRTF datasets.
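A minimal two-armed bandit illustrates the dilemma: an agent that only exploits its current estimates can lock onto an inferior action, while an epsilon-greedy agent that occasionally explores finds the better one. The payoff probabilities below are arbitrary illustrative values.

```python
import random

random.seed(1)

# Two-armed bandit: arm 0 pays off 30% of the time, arm 1 pays 70%.
PAYOFF = (0.3, 0.7)

def pull(arm):
    return 1.0 if random.random() < PAYOFF[arm] else 0.0

def run(eps, steps=2000):
    """Epsilon-greedy agent; eps=0.0 means pure exploitation."""
    value = [0.0, 0.0]   # running estimate of each arm's payoff
    count = [0, 0]
    total = 0.0
    for _ in range(steps):
        if random.random() < eps:
            arm = random.randrange(2)        # explore
        else:
            arm = int(value[1] > value[0])   # exploit current estimate
        r = pull(arm)
        count[arm] += 1
        value[arm] += (r - value[arm]) / count[arm]
        total += r
    return total / steps

greedy = run(0.0)    # never explores: locks onto arm 0 here
explorer = run(0.1)  # keeps sampling both arms
```

The purely greedy agent never learns that the second arm exists in any meaningful sense; a small exploration rate is enough to escape that trap, at the cost of occasional suboptimal pulls.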
B. Deep learning
As mentioned previously, the NN applications for unsupervised and supervised learning have become increasingly popular, especially the subfield known as deep learning.^{78} Goodfellow et al.^{79} describe deep learning as follows: “The hierarchy of concepts allows the computer to learn complicated concepts by building them out of simpler ones. If we draw a graph showing how these concepts are built on top of each other, the graph is deep, with many layers. For this reason, we call this approach [neural network] deep learning.” There has been extensive research in deep learning, and in recent years, numerous algorithms have been proposed and used in real-world applications. Deep learning has been studied in relation to computer vision for years, but it is still a relatively new concept in the HRTF research.
Luo et al.^{80} proposed the first application of deep learning autoencoders to the HRTF research. Their goal was to personalize generic HRTFs using the minimum number of listening test probes. To achieve this, they developed low-dimensional feature representations using a denoising autoencoder, which learned the hidden representation vectors of the HRTF measurement inputs. The reconstructed output had smooth curves and removed some of the variance resulting from noise. The feature representations were applied to their proposed techniques for obtaining a user's HRTF. The model had five layers of stacked autoencoders and learned nonlinear feature spaces of the given HRTFs. The results suggest that deep learning can be used to personalize non-individualized HRTFs from latent feature spaces and ultimately learn a reduced representation of the listener's HRTF.
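A single-hidden-layer denoising autoencoder, far shallower than the five-layer stack used by Luo et al., can illustrate the core idea: noisy curves in, clean curves out, with a low-dimensional hidden code in between. The "HRTF-like" curves below are synthetic, and all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for HRTF magnitude curves: 200 smooth 32-point
# curves built from 3 fixed shapes (purely illustrative).
t = np.linspace(0, 1, 32)
shapes = np.stack([np.sin(np.pi * t), np.sin(2 * np.pi * t), t - 0.5])
clean = rng.normal(size=(200, 3)) @ shapes
noisy = clean + rng.normal(scale=0.3, size=clean.shape)

# One-hidden-layer denoising autoencoder: noisy input, clean target.
W1 = rng.normal(scale=0.1, size=(32, 8))   # encoder -> 8-dim code
W2 = rng.normal(scale=0.1, size=(8, 32))   # decoder
lr = 0.05

def reconstruct(A):
    return np.tanh(A @ W1) @ W2

err_before = float(np.mean((reconstruct(noisy) - clean) ** 2))
for _ in range(500):
    H = np.tanh(noisy @ W1)             # hidden representation vectors
    err = H @ W2 - clean                # denoising objective
    gW2 = H.T @ err / len(clean)
    dH = (err @ W2.T) * (1 - H ** 2)
    gW1 = noisy.T @ dH / len(clean)
    W1 -= lr * gW1
    W2 -= lr * gW2
err_after = float(np.mean((reconstruct(noisy) - clean) ** 2))
```

The learned 8-dimensional code plays the role of the reduced HRTF representation; stacking several such layers yields the deep architecture described in the cited work.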
VII. DISCUSSION AND FUTURE WORK
Sections II–VI presented current challenges in the HRTF research and then discussed how various ML methods are being used to solve those problems. This section briefly discusses the observations and thoughts about future areas of interest that are related to the ML applications in spatial auditory display research.
It should be noted that currently there is not a sufficient amount of research on the use of ML techniques in the spatial auditory field to provide absolute, definitive recommendations for the best and worst techniques to address each of the HRTF-related research questions. The utility of each technique depends on the context, data, application, and goal of the research problem. As such, the authors hope that the work described here spurs more interest in this line of research, elucidates current techniques being used, and inspires other researchers to apply ML techniques as they wish.
A. Discussion

Decomposition: One of the first steps when using ML for the HRTF data is decomposition (in instances in which the manual feature selection is desired). In the frequency domain, the PCA arises as the simplest and most common technique used by researchers, yielding acceptable results. Other research has identified significant anthropometric features for the decomposition as well.
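A minimal sketch of the PCA step, using synthetic stand-ins for HRTF magnitude responses (the dimensions and the number of latent shapes are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a set of HRTF magnitude responses:
# 100 measurements x 64 frequency bins, built from 4 latent shapes
# plus a little noise (all synthetic).
latent = rng.normal(size=(100, 4))
basis = rng.normal(size=(4, 64))
hrtfs = latent @ basis + 0.01 * rng.normal(size=(100, 64))

# PCA via SVD of the mean-centered data matrix.
mean = hrtfs.mean(axis=0)
U, S, Vt = np.linalg.svd(hrtfs - mean, full_matrices=False)

# Fraction of variance captured by the first k components.
var = S ** 2 / np.sum(S ** 2)
k = 4
explained = float(np.sum(var[:k]))

# Each HRTF is now described by only k PC weights.
weights = (hrtfs - mean) @ Vt[:k].T      # shape (100, 4)
recon = weights @ Vt[:k] + mean          # low-rank reconstruction
```

Downstream models then operate on the k-dimensional weights rather than the full frequency response.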

Prediction: When using nonlinear dimensionality reduction to predict a HRTF, there are many tradeoffs to consider. The isomap reduces the data to a global graph problem; however, if the data are not well sampled, holes will appear and the prediction performance will suffer. Local linear embedding is not affected by the holes and is more robust to outliers. On the surface, the LLE seems to be an optimal solution; however, it cannot describe the intrinsic correlation, which the isomap can.
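The contrast can be observed on a toy manifold, assuming scikit-learn is available; a noisy 1-D curve in 3-D stands in for HRTF data that vary smoothly along one latent dimension:

```python
import numpy as np
from sklearn.manifold import Isomap, LocallyLinearEmbedding

rng = np.random.default_rng(0)

# Toy manifold: a noisy 1-D helix in 3-D, standing in for data that
# vary smoothly along a single latent parameter t.
t = np.sort(rng.uniform(0, 3 * np.pi, 300))
X = np.column_stack([np.cos(t), np.sin(t), 0.3 * t])
X += 0.01 * rng.normal(size=X.shape)

# Isomap preserves geodesic distances through a global graph.
iso = Isomap(n_neighbors=10, n_components=1).fit_transform(X)

# LLE preserves only local linear neighborhoods.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=1).fit_transform(X)

# A good 1-D embedding should track the latent parameter t.
iso_corr = abs(float(np.corrcoef(iso[:, 0], t)[0, 1]))
```

With well-sampled data both methods recover the latent ordering; the tradeoffs described above appear when the sampling is sparse or the data contain holes and outliers.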

Clustering: In determining the commonalities among the HRTFs and clustering them, there are also tradeoffs that one must consider. k-means clustering is simpler and faster than hierarchical clustering; however, the results are inconsistent, which could lead to unstable listener experiences. Hierarchical clustering offers flexibility in granularity, whereas k-means requires the number of clusters k to be known ahead of time. Most HRTF datasets are small; thus, hierarchical clustering may be a more appropriate approach. However, as the available HRTF data increase, k-means becomes the more efficient and useful tool.
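A plain k-means sketch on synthetic "listener feature" groups illustrates the requirement that k be fixed in advance. The per-group seeding here is only for a deterministic illustration; practical k-means uses random restarts or k-means++ initialization.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature vectors for 60 listeners, drawn from three
# synthetic groups (illustrative stand-ins for HRTF features).
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

def kmeans(X, k, init, iters=20):
    """Plain k-means: note that k must be chosen in advance."""
    C = X[init].copy()                   # initial centroids
    for _ in range(iters):
        # Assign each point to its nearest centroid...
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # ...then move each centroid to the mean of its cluster.
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return labels, C

# One seed point per true group, for a deterministic illustration.
labels, C = kmeans(X, k=3, init=[0, 20, 40])
```

Hierarchical clustering would instead build a full merge tree and let the analyst cut it at any granularity, at a higher computational cost per dataset.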

Calculating unknown HRTFs: To calculate unknown HRTFs from the anthropometric data, using the SVM is the most appropriate; however, due to the high dimensionality of the HRTF data, one may need to train several models (one for each dimension) to expose the underlying correlations. On the other hand, to calculate unknown HRTFs from other HRTFs, it is generally agreed that the GPs are optimal, especially because the HRTF data usually vary smoothly over the space and frequency. NNs are a promising technique; however, one must ensure that the input data are sufficient to determine the HRTF.
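A minimal GP regression sketch shows why smoothness helps: with a squared-exponential kernel, sparse measurements of a smooth synthetic curve (a stand-in for HRTF magnitude versus azimuth) interpolate accurately. The kernel length scale and jitter are arbitrary choices, not tuned values.

```python
import numpy as np

# Hypothetical smooth "HRTF magnitude vs azimuth" curve (synthetic).
def f(az):
    return np.sin(az) + 0.3 * np.cos(2 * az)

az_train = np.linspace(0.0, 2 * np.pi, 20)   # sparse measurements
y_train = f(az_train)

def rbf(a, b, length=0.5):
    """Squared-exponential kernel, encoding smoothness over azimuth."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length ** 2)

# GP posterior mean with a small noise (jitter) term for stability.
K = rbf(az_train, az_train) + 1e-6 * np.eye(len(az_train))
alpha = np.linalg.solve(K, y_train)

az_test = np.linspace(0.0, 2 * np.pi, 100)
y_pred = rbf(az_test, az_train) @ alpha      # predicted curve

max_err = float(np.max(np.abs(y_pred - f(az_test))))
```

Because the kernel encodes the prior that nearby directions have similar responses, the GP fills in unmeasured directions from few samples, which is exactly the property exploited when interpolating HRTFs over space and frequency.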

Improving HRTFs: Reinforcement learning and deep learning are lesser-studied methods for the HRTF research; however, the available results suggest that these methods are a promising area for further exploration.
B. Future areas of inquiry
1. Sufficient HRTF datasets
ML is all about data. Different ML models perform differently on the same dataset, and there is no silver bullet algorithm that can excel in all of the applications. In addition to learning algorithms, datasets are also an indispensable factor in the success of ML models. When considering the history of ML development, we find that there is a direct relationship between the appearance of large public datasets and subsequent research advances. In 2001, Banko and Brill^{81} demonstrated that for problems in which different algorithms exhibit virtually the same performance, adding more examples to the training set monotonically increased the accuracy of the model. Halevy et al.^{82} also support this conclusion concerning the importance of large public datasets.
In the past decade, large research datasets have become more commonplace, and deep learning has become more popular. As such, to continue the HRTF research and address its many challenges, generating large volumes of HRTF data is a critical necessity. In reference to the public HRTF databases listed in the Introduction, the size of the data samples contained in these datasets is far from sufficient. Without sufficient highquality training data, complex algorithms cannot be used to their full potential.
The future of ML in the HRTF research depends on the emergence of large-scale public HRTF datasets. The ideal datasets will contain different sound sources (e.g., broadband white noise, speech, instruments, etc.) in different auditory scenes (small rooms, large rooms, open spaces, etc.) with different configurations for the signal acquisition [two-dimensional (2D)/3D microphone array, in-ear binaural microphones with accurate head model measurements] and high accuracy localization information. Such databases are in high demand for research, including but not limited to virtual auditory localization, virtual reality, navigation, etc. Constructing such a database is exceptionally challenging. One possible solution is to combine all of the public databases into a commonly agreed-upon format and require future databases to abide by this measurement format.
2. Exploring other ML techniques
There is a plethora of ML techniques and applications; however, researchers tend to gravitate toward ML techniques that others have attempted and used successfully. More applications of a technique lead to more research in that area. This phenomenon seems to influence which techniques are regarded as the state-of-the-art and to shape the research landscape. As a result, popular techniques may be regarded as more effective for solving a specific problem simply because many people use them. Because of this tendency, researchers should be encouraged to employ techniques that have not garnered as much popularity in order to assess their utility.
3. Future of deep learning
The ML research, especially work related to deep learning, has exploded in recent years. New deep learning architectures and algorithms are frequently proposed; however, few researchers have explored deep learning in the HRTF research. Compared to the long history of ML applications in the spatial auditory display, deep learning is still a new area of virtual auditory display research. There are many new deep learning methods and architectures worth exploring in spatial auditory display applications. For example, the deep reinforcement learning algorithms used in AlphaGo^{83} may be applicable to iteratively approaching individual HRTFs. The generative adversarial networks^{84} proposed by Goodfellow et al., which are unsupervised learning systems, contain two NNs that compete against each other in a zero-sum game framework. Transfer learning^{85} can also be applied in auditory settings by training a general model and transferring it to specific individual settings. Although there are many ML techniques that can be explored in spatial auditory display research, ensuring the availability of a sufficiently large amount of high-quality training data is the most critical factor. As the prevalence of ML in the HRTF research has increased over the years, it is reasonable to expect the HRTF data and the applications of new, cutting-edge algorithms to increase as well.