Multi-point room equalization (EQ) aims to achieve a desired sound quality within a wider listening area than single-point EQ. However, multi-point EQ necessitates the measurement of multiple room impulse responses at a listener position, which may be a laborious task for an end-user. This article presents a data-driven method that estimates a spatially averaged room transfer function (RTF) from a single-point RTF in the low-frequency region. A deep neural network (DNN) is trained using only simulated RTFs and tested with both simulated and measured RTFs. It is demonstrated that the DNN learns a spatial smoothing operation: notches across the spectrum are smoothed out while the peaks of the single-point RTF are preserved. An EQ framework based on a finite impulse response filter is used to evaluate the room EQ performance. The results show that while not fully reaching the level of multi-point EQ performance, the proposed data-driven local average RTF estimation method generally brings improvement over single-point EQ.

## I. INTRODUCTION

Digital room correction, also known as room equalization (EQ), seeks the improvement of the perceived sound quality in a listening environment influenced by the acoustical and geometrical factors including the room size and shape, wall properties, floor carpeting, furniture, and loudspeaker positioning. This is achieved by applying digital filters such that the difference between the acoustic system response and a target curve is minimized across the spectrum.^{1,2}

EQ filters are designed based on the room impulse response (RIR), or equivalently the room transfer function (RTF), measured between the loudspeaker and the positions in the listening area. Depending on the number of RTFs involved in the filter design, room EQ techniques can be split into two classes: *single-point EQ*^{1,3–6} and *multi-point EQ.*^{7–14} The application of single-point EQ filters may lead to poor performance resulting in perceptually undesired sound quality, since RTFs vary significantly across the room with changing characteristics in different frequency ranges. This implies that the applied EQ may only be effective at the given measurement point. To overcome this limitation, multi-point EQ techniques rely on multiple RTF measurements collected within the listening area in the filter design. However, the need for the measurement of RTFs at multiple positions may become a laborious procedure when a commercial application is considered, making it susceptible to additional errors due to potential faulty measurements conducted by the end-user. In this manuscript, we introduce a data-driven approach that estimates a local average RTF from a single-point RTF measured between the loudspeaker and the listener position in an effort to achieve a room EQ performance comparable to the multi-point EQ when the EQ filter is designed using the estimated local average RTF.

A review of the multi-point EQ methods proposed in the literature is presented here. Miyoshi and Kaneda^{7} proposed a method for the inverse filtering of multi-point RIRs, where they assumed that the measured RIRs had no common zeros and there were more loudspeakers than microphone positions. Elliott and Nelson^{8} developed a multi-point EQ framework, where adaptive filtering was employed through the minimization of the least squares error between the equalized signals recorded by multiple microphones and the delayed input signal. Haneda *et al.*^{9} proposed an FIR-based multi-point EQ technique using the common acoustical poles of multiple RTFs to suppress the peaks arising from the room resonances. Bharitkar *et al.*^{10} proposed a multi-point EQ method using fuzzy *c*-means clustering, in which similar RIRs were grouped into clusters to obtain a time-domain general prototype response and showed that the multi-point EQ performance was improved compared to the simple time-domain average over RIRs. Cecchi *et al.*^{11} compared the fuzzy *c*-means RIR clustering with different frequency-domain prototype designs including the mean-magnitude and the root mean square RTF averaging, the minmax and median RTFs, and reported that all the techniques achieved similar multi-point EQ performances. Carini *et al.*^{12} developed a frequency-domain implementation of the fuzzy *c*-means clustering for multi-point EQ with reduced computational complexity and showed with subjective listening tests that both the frequency-domain fuzzy *c*-means and the mean-magnitude RTF averaging received overall good performance ratings. Hess *et al.*^{13,14} tested different microphone movements to obtain a spatial weighted mean-magnitude RTF within a human head movement area for car-cabin EQ. 
Pepe *et al.*^{15,16} proposed a deep-learning based multi-point EQ approach for parametric IIR filter optimization using a neural network-based differentiable IIR filter design, and tested it in room and car-cabin scenarios. Pedersen^{17} demonstrated that the root mean square average of several RTFs measured at randomly selected positions across a room yields the total acoustic power radiated by the loudspeaker. The same author also proposed the use of this global information on the sound field acquired via spatially averaged RTFs to calculate the gain limiters for the local equalizer designed based on the RTF measured at the listener position.^{18}

At low frequencies, the room itself becomes the key acoustic player affecting the sound quality, resulting from the room modal behavior and the loudspeaker coupling that depends on its positioning relative to the room boundaries.^{19–22} This may lead to substantial spectral variations between different RTF measurement points. The room effects gradually diminish beyond the Schroeder frequency^{23} and the loudspeaker properties such as its frequency response and directivity determine the perceived sound at the listening area towards high frequencies, resulting in the smoothed spectrum of distinct locations in a room having relatively small differences in magnitude.^{21,22} Methods have been proposed in the literature for low-frequency room equalization.^{24–26} Mäkivirta *et al.*^{24} developed a modal equalization framework, which controls the modal decay rates by either modifying the audio signal of the primary loudspeaker or playing an additional sound from a second loudspeaker. Welti and Devantier^{25} demonstrated that using multiple subwoofers reduced the seat-to-seat variation over the spectrum, enabling a more effective low-frequency EQ over the listening area. Kolundžija *et al.*^{26} proposed a method for low-frequency room EQ over a broad listening area, where the response of a primary loudspeaker was equalized *via* auxiliary loudspeakers and measuring multi-point RIRs of each loudspeaker. In a related research area called sound field reconstruction, model-based^{27,28} and data-driven methods^{29} have been proposed to estimate from a limited number of RIR measurements the sound field of a room in low frequencies, which may potentially then be used in multi-point room EQ.

This work presents a data-driven local average RTF estimation method operating at low frequencies to simplify a multi-point room EQ procedure, by reducing the requirement of multi-point RTF measurements down to only a single-point RTF measured at the listener position. The main contributions include the generation of millions of simulated low-frequency RTFs via the finite element method (FEM), the measurement of thousands of real RIRs with a robotic arm, and the design of a convolutional encoder-decoder network architecture for the data-driven estimation of a local average RTF from a single-point RTF, adapted from the prior network designs initially proposed for speech enhancement.^{30,31} The proposed method is validated in an extensive number of rooms with diverse geometric and acoustic properties using an FIR filter-based room EQ framework.

The remainder of this paper is structured as follows. Section II presents an overview of the proposed data-driven method. Section III details the generation of simulated and measured data. Section IV explains the deep learning framework including the dataset and the network architecture. Section V describes the room EQ framework. Section VI presents the local average RTF estimation results and evaluates the room EQ performance. Section VII concludes the manuscript.

## II. METHOD OVERVIEW

The proposed framework for data-driven local average RTF estimation from a single-point RTF is summarized in Fig. 1. The reference RTF used as the “single-point RTF,” from which the local average RTF is to be estimated, is fed into the deep neural network (DNN) as the input. During the network training, the additional RTFs surrounding the reference RTF position are included to compute the multi-point average-RTF, or in other words, the target RTF, to which the network output is compared through a loss function in a supervised fashion. After training, the network output, referred to as the “network RTF,” is fed to a room equalizer to obtain the EQ filter.

For multi-point averaging, a measurement grid consisting of seven points is chosen. As illustrated in Fig. 2, the grid comprises a center single-point with two neighboring points spaced by 30 cm in each of three dimensions. This grid is considered to mimic an average human head size and its local movement area. The spatial averaging is realized via the square-root of the power spectral average, or in other words, the root mean square of the magnitude-RTFs at a given frequency *f*,^{11,12}

$$H_{\mathrm{avg}}(f) = \sqrt{\frac{1}{K}\sum_{k=1}^{K} \left|H_k(f)\right|^{2}},$$

where $H_k(f)$ denotes the RTF at point *k* in the measurement grid and *K* = 7 is the total number of RTFs included in multi-point averaging.
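The root-mean-square averaging described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `rms_average` and the toy magnitudes are assumptions made for the example.

```python
import math

def rms_average(rtfs):
    """Root-mean-square average of K RTFs, evaluated per frequency bin.

    rtfs: list of K equal-length sequences of real or complex RTF values.
    """
    num_points = len(rtfs)
    num_bins = len(rtfs[0])
    return [
        math.sqrt(sum(abs(h[b]) ** 2 for h in rtfs) / num_points)
        for b in range(num_bins)
    ]

# Toy grid (hypothetical values): the center point has a deep notch in the
# second bin, while its six neighbours do not.
center = [1.0, 0.01]
neighbours = [[1.0, 1.0] for _ in range(6)]
avg = rms_average([center] + neighbours)
```

Because the average is taken over power, the notch at the center point is largely filled in by its neighbours, whereas a level shared by all points is left unchanged.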

FEM is used to generate the simulated RTF dataset, which is used in network training, validation, and testing. The RTFs are simulated using a dense grid in hundreds of rooms with varying room sizes and shapes, and wall impedance characteristics. The frequency response of an 8-in. subwoofer is used as the simulated sound source, which is placed at several positions within each room. Additionally, a robotic arm that allows the simultaneous measurement of multi-point RTFs on the seven-point grid is designed to collect data for the further testing of the network with real RTFs.

For the data-driven local average RTF estimation, an encoder-decoder network architecture with long skip connections is designed, where the encoder and decoder are composed of convolutional and transposed convolutional layers, respectively, and fully connected layers are used in the bottleneck. The DNN is trained on the simulated RTF dataset via supervised learning and tested with both simulated and measured data.

For the evaluation of room EQ performance, an FIR filter prototype design is adopted, in which a minimum-phase FIR filter is computed from a magnitude-only, fractional-octave smoothed version of the raw RTF. A previously proposed psychoacoustically driven metric^{6} is used for EQ testing over the seven-point grid. Two additional EQ filters are also generated from the single-point and multi-point averaged RTFs to compare the performance of the three EQ filter designs.
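The magnitude part of such a filter design can be sketched as follows, using the 1/6-octave smoothing and the 12 dB gain limit specified in Sec. V. Two simplifications are assumptions of this sketch: the smoothing uses a rectangular power average over each band, and the gain limit is a hard clip, whereas the paper uses a soft limiter whose shape is not given here. The function names are hypothetical.

```python
import math

def fractional_octave_smooth(freqs, mag, fraction=6):
    """Power-average `mag` over a 1/fraction-octave band around each frequency."""
    half = 2.0 ** (1.0 / (2 * fraction))
    out = []
    for fc in freqs:
        band = [m * m for f, m in zip(freqs, mag) if fc / half <= f <= fc * half]
        out.append(math.sqrt(sum(band) / len(band)))
    return out

def eq_magnitude(freqs, rtf_mag, target_mag, max_gain_db=12.0):
    """Target divided by the smoothed RTF, with the boost capped at max_gain_db."""
    smoothed = fractional_octave_smooth(freqs, rtf_mag)
    cap = 10.0 ** (max_gain_db / 20.0)
    return [min(t / s, cap) for t, s in zip(target_mag, smoothed)]

# Toy RTF (hypothetical): 6 dB too quiet below 200 Hz, 20 dB too quiet above.
freqs = list(range(20, 471))
rtf = [0.5 if f < 200 else 0.1 for f in freqs]
gains = eq_magnitude(freqs, rtf, [1.0] * len(freqs))
```

In this toy case the low band receives the full 6 dB correction, while the 20 dB deficit in the upper band is only boosted up to the 12 dB cap.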

## III. DATA GENERATION

This section explains the generation of the RTF data used in the proposed data-driven local average estimation method. First, simulated data generation via FEM is detailed, including the geometric and acoustic characteristics of the simulated rooms, modelling of the loudspeaker used in the simulations, and the subsampling procedure on the available grid within a simulated room for data balancing. Then, the design of the robotic arm used to measure real-world RTFs is described and the measurement campaign details are presented.

### A. Simulated data

The training of a DNN generally requires a large amount of data. However, collecting extensive real RIR measurements using a dense grid in many rooms with varying wall properties is labor-intensive, and hence not feasible. Therefore, a simulated dataset of millions of RTFs is generated with the FEM for the supervised learning task. This method has been chosen due to the accuracy with which it can model acoustic wave problems, and its ability to handle complex geometries. The frequency range up to 500 Hz with a 1 Hz frequency resolution is used for the simulation of low-frequency RTFs.

The simulations include mainly three types of rooms: shoebox-shaped rooms, L-shaped rooms, and rooms with one tilted wall. All rooms are assumed to have parallel floors and ceilings. The volumes of the rooms are varied, by independently changing the lengths, widths, and heights of the rooms. In addition, a variety of acoustic conditions are achieved by varying the wall impedances or reverberation times in simulated rooms. Models of actual rooms are also used to generate additional simulated data. In large rooms, the transition region from the low-frequency behavior to the high-frequency regime may be described by the Schroeder frequency,^{23} which is dependent on the room volume and the reverberation time. In small rooms, however, this transition may occur within a region centered at a frequency higher than the one obtained via Schroeder's original formula.^{21,22} Thus, the simulated dataset includes rooms containing single or multiple frequency regimes (i.e., transition and high-frequency regions) depending on the geometric and acoustic properties within the chosen upper operating frequency limit of 500 Hz.

A finite element discretization of the wave equation has the form

$$\mathbf{A}\mathbf{p} = \mathbf{f},$$

where $\mathbf{A} = \mathbf{K} + i\omega\mathbf{C} + (i\omega)^{2}\mathbf{M}$, $\omega$ is the angular frequency, $\mathbf{p}$ is the unknown pressure, $\mathbf{f}$ is a source term, and $i$ is the unit imaginary number. The element stiffness, damping, and mass matrices are given, respectively, by

where $\varphi_i$ and $\varphi_j$ are interpolating shape functions, $\Omega$ describes a room's geometry, $\partial\Omega$ describes a room's boundaries, $\Gamma_k$ identifies the $k$th surface with impedance $Z_k$, $\zeta = Z(\rho c)^{-1}$ is a normalized impedance, $c$ is the speed of sound, and $\rho$ is the acoustic medium density. In this work, $c = 343$ m s$^{-1}$ and $\rho = 1.2$ kg m$^{-3}$. The source term is given by

where $V_n$ is the normal velocity of surface $\Gamma_s$, which in this work is the vibrating surface of a loudspeaker driver.
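For reference, the element matrices and source term referred to above can be written in a standard Galerkin form that is consistent with $\mathbf{A} = \mathbf{K} + i\omega\mathbf{C} + (i\omega)^{2}\mathbf{M}$ and $\zeta = Z(\rho c)^{-1}$. The expressions below are an assumption based on the textbook weak form of the Helmholtz equation with an $e^{+i\omega t}$ time convention and $V_n$ taken positive into the fluid; the normalization and sign conventions of the original derivation may differ.

$$
K_{ij} = \int_{\Omega} \nabla\varphi_i \cdot \nabla\varphi_j \, d\Omega, \qquad
C_{ij} = \frac{1}{c}\sum_{k} \frac{1}{\zeta_k}\int_{\Gamma_k} \varphi_i \varphi_j \, d\Gamma, \qquad
M_{ij} = \frac{1}{c^{2}}\int_{\Omega} \varphi_i \varphi_j \, d\Omega,
$$

$$
f_i = i\omega\rho \int_{\Gamma_s} V_n \, \varphi_i \, d\Gamma .
$$

With these definitions, the impedance boundary condition $\partial p/\partial n = -i\omega p/(c\,\zeta_k)$ on $\Gamma_k$ produces the damping term $i\omega\mathbf{C}$, and the Helmholtz term $-(\omega/c)^{2}p$ produces $(i\omega)^{2}\mathbf{M}$.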

A geometrical model of an 8-in. subwoofer, which is also used in the real RTF measurements, is considered in all simulations. The loudspeaker is modelled as a rectangular cuboid, with dimensions 0.28 × 0.28 × 0.30 m^{3} and a cone shaped driver with a diameter of 0.18 m. The transfer function of the driver velocity is estimated using the conservation of momentum

where *P*_{1} and *P*_{2} are the acoustic pressure transfer functions measured at two on-axis positions close to the driver, with a spacing of $\Delta x$. The source description includes phase information, which allows for more realistic simulated impulse responses (for more details, see the work by Prinn^{32}).
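A finite-difference form of the momentum equation gives one way to turn the two pressure transfer functions into a velocity estimate. The sketch below assumes an $e^{+i\omega t}$ time convention and that the second position lies $\Delta x$ farther from the driver than the first; the function name and the plane-wave check are illustrative and not taken from the paper.

```python
import cmath
import math

RHO = 1.2    # air density in kg/m^3, as stated in the text
C = 343.0    # speed of sound in m/s, as stated in the text

def driver_velocity(p1, p2, freq_hz, dx, rho=RHO):
    """Finite-difference Euler estimate of the velocity between two points.

    Assumes exp(+i*omega*t) time dependence and that p2 is measured
    dx metres farther from the driver than p1.
    """
    omega = 2.0 * math.pi * freq_hz
    return (p1 - p2) / (1j * omega * rho * dx)

# Sanity check on a plane wave p(x) = exp(-i k x), for which v = p / (rho c).
freq, dx, x1 = 100.0, 0.01, 0.05
k = 2.0 * math.pi * freq / C
p1 = cmath.exp(-1j * k * x1)
p2 = cmath.exp(-1j * k * (x1 + dx))
v_est = driver_velocity(p1, p2, freq, dx)
v_ref = cmath.exp(-1j * k * (x1 + dx / 2.0)) / (RHO * C)
```

The estimate refers to the midpoint between the two positions, and its error grows with $(k\,\Delta x)^2$, which is why closely spaced positions are needed.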

A broad range of impedance conditions, $\zeta_k(f)$, is considered. In most of the simulated rooms, different impedances are specified for the floor, walls, and ceiling. In these cases, real-valued impedances are used. These impedances are obtained from the absorption coefficient tables given by Vorländer,^{33} for a selection of surfaces typically found in rooms (e.g., carpet, linoleum, brick, and wood). For the simulations of real rooms, an impedance estimation method, proposed by Prinn *et al.*,^{34} is used to obtain a complex-valued impedance, which is imposed (uniformly) on the floor, walls, and ceiling. The simulated room dimensions range from a very small cube-shaped room of size $2\,\mathrm{m} \times 2\,\mathrm{m} \times 2\,\mathrm{m}$ to a large shoebox-shaped room of size $9.5\,\mathrm{m} \times 6.3\,\mathrm{m} \times 3\,\mathrm{m}$. The simulated acoustic pressure fields are sampled using a grid of equally spaced receivers, with 0.15 m spacing in 3D.

The local variation around a single-point RTF is quite small at lower frequencies, where the standing waves associated with the room modes produce structured patterns across the room, whereas the variation increases significantly towards higher frequencies, eventually exhibiting a random behavior in 3D.^{22} If the entire sampling grid from a simulated room were included in a dataset, the dominant pattern of minimal change in the low-frequency region would be overrepresented, since the room modes are very strong near the walls and particularly close to the room corners, and hence the variations can only be observed around the points located away from the walls towards the center of the room. This would in turn potentially cause the network to learn only an identity mapping from the single-point RTF to the multi-point averaged RTF in the low-frequency region. Thus, a heuristic data balancing approach based on the subsampling of the available room grid is applied as follows. First, the points within 1 m distance from the loudspeaker are excluded to avoid any potential near-field effects. Then, the Manhattan distance between the single-point and multi-point averaged RTFs is computed at each remaining grid point for the frequency range from 20 to 60 Hz, and the points with the largest distances, amounting to 15% of the total number of available grid points in the room, are included in the dataset. The next 15% are subsampled from the rest of the grid points based on the Manhattan distance computed for the range from 60 to 100 Hz. No additional procedure is followed for the frequencies above 100 Hz, since it is assumed that there is a sufficient level of local variation around a point at higher frequencies. 
Therefore, as the final step, another 30% are randomly subsampled from the remaining points across the room, leading to a total number of points equivalent to 60% of the entire room grid included in the dataset from a given room. Another consequence is that the exclusion of 40% of the available grid per room may also prevent the DNN from potentially memorizing the rooms in the training dataset. Figure 3 illustrates an example of the room subsampling procedure, where the points included in the dataset are shown on the 2D cross-sectional maps of a room with a loudspeaker placed at a corner. It can be seen in the 2D grid maps that the mid-sections of the room are more densely sampled than the areas closer to the walls and corners as intended.
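The three-stage subsampling can be sketched as below. Several details are assumptions of this sketch rather than statements from the paper: the percentages are taken relative to the points remaining after the near-field exclusion, ties are broken by sort order, and the function names and random toy data are purely illustrative.

```python
import random

def manhattan(a, b):
    """Sum of absolute differences between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def subsample_room(dist_20_60, dist_60_100, near_field, seed=0):
    """Return the sorted indices of the grid points kept for the dataset.

    dist_20_60, dist_60_100: per-point Manhattan distance between the
    single-point and seven-point averaged RTF in each band.
    near_field: per-point flag, True if within 1 m of the loudspeaker.
    """
    available = [i for i, near in enumerate(near_field) if not near]
    n15 = round(0.15 * len(available))
    n30 = round(0.30 * len(available))

    # Stage 1: the 15% of points with the largest 20-60 Hz distance.
    by_low = sorted(available, key=lambda i: dist_20_60[i], reverse=True)
    picked, rest = by_low[:n15], by_low[n15:]

    # Stage 2: the next 15%, ranked by the 60-100 Hz distance.
    by_mid = sorted(rest, key=lambda i: dist_60_100[i], reverse=True)
    picked, rest = picked + by_mid[:n15], by_mid[n15:]

    # Stage 3: another 30% drawn uniformly at random from what is left.
    picked += random.Random(seed).sample(rest, n30)
    return sorted(picked)

# Toy data (hypothetical): 100 grid points, magnitudes on 20-100 Hz in 1 Hz steps.
rng = random.Random(1)
n_points = 100
single = [[rng.random() for _ in range(81)] for _ in range(n_points)]
average = [[rng.random() for _ in range(81)] for _ in range(n_points)]
d_low = [manhattan(s[:41], a[:41]) for s, a in zip(single, average)]   # 20-60 Hz
d_mid = [manhattan(s[40:], a[40:]) for s, a in zip(single, average)]   # 60-100 Hz
kept = subsample_room(d_low, d_mid, [False] * n_points)
```

With 100 available points, the three stages contribute 15, 15, and 30 points, so 60% of the grid is kept and 40% is left out.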

### B. Measured data

A measurement system using a robotic arm was designed to simultaneously collect two seven-point RTF measurements with two arrays of seven *DPA 4060 series* miniature omni-directional microphones, as shown in Fig. 4. This arm was attached to a *VariSphear* robot originally designed for spherical microphone-array measurements^{35} to acquire data at varying azimuth and elevation angles. The measurement campaign was performed in a laboratory room with slanted walls using three loudspeakers including the same 8-in. subwoofer used in simulations as the reference sound source, and two full-range speakers: *Genelec 8030B* and *KS Digital C8-Coax*. The room temperature was adjusted to 20 °C to achieve $c \approx 343$ m s^{−1}. The ISO-standard^{36} reverberation time of the room was measured to be 0.7 s. As illustrated in Fig. 4, the loudspeakers were located at four positions across the room. At each position, the loudspeakers were placed on a turntable on the floor, which was rotated by 45° between successive measurement sessions. The entire campaign was repeated at a second turntable height of 77 cm for all loudspeaker positions and orientations, resulting in a total of 64 different configurations per loudspeaker (four positions, eight orientations, and two heights). An evenly sampled angular grid was used for the robotic arm rotation in both azimuthal and elevational directions. A rotation angle of 22.5° yielded 16 azimuth angles for the range of [0°, 360°) and nine elevation angles for [0°, 180°], giving a total of 16 × 9 × 2 × 64 = 18,432 seven-point RTF recordings per loudspeaker type (16 azimuth angles, 9 elevation angles, 2 microphone arrays, 64 speaker configurations).
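The recording count follows directly from the angular steps and the loudspeaker configurations, as the short check below shows. The variable names are illustrative only.

```python
arm_step_deg = 22.5
turntable_step_deg = 45.0

n_azimuth = int(360 / arm_step_deg)           # [0, 360): end point excluded
n_elevation = int(180 / arm_step_deg) + 1     # [0, 180]: both end points included
n_orientations = int(360 / turntable_step_deg)
n_configs = 4 * n_orientations * 2            # positions x orientations x heights
n_arrays = 2                                  # two seven-microphone arrays

total_recordings = n_azimuth * n_elevation * n_arrays * n_configs
```

The elevation range includes both poles, which is why it yields nine angles rather than eight.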

A logarithmic sweep with a duration of 2 s, starting at 20 Hz and ending at 20 kHz, followed by a silence period of 2 s was used to obtain the RIRs from a given loudspeaker to the two microphone arrays on the robotic arm through a standard deconvolution at a sampling rate of $f_s = 48$ kHz. The system delay was also obtained via a loopback measurement to ensure that any initial delay in a recorded RIR corresponded to the acoustic direct path propagation distance.

To use a relatively low-order FIR length in room EQ filter design while keeping the same 1 Hz frequency resolution as in the simulations, the recorded RIRs were first low-pass filtered with a cut-off frequency of $f_c = 750$ Hz (chosen to be higher than the operating frequency limit of 500 Hz to apply the pre-processing step described in Sec. IV A) and downsampled to a sampling frequency of $f_s = 3$ kHz. RTFs were then computed by applying the fast Fourier transform (FFT) over the initial $N = 2048$ samples of the downsampled RIRs.

## IV. PROPOSED METHOD

This section details the deep learning framework for the proposed local average RTF estimation method. First, the pre-processing applied on the simulated and measured RTFs and the steps taken during the selection of the training, validation, and test datasets are explained. Then, the network architecture and the training parameters chosen for the proposed data-driven method are presented.

### A. Training, validation, and test datasets

The simulated and measured RTFs are pre-processed before being fed into the DNN as follows. The spectrum is first trimmed down to the frequency range between 10 and 500 Hz in the simulated and measured RTFs, resulting in an input and output data vector of length 491 with 1 Hz frequency resolution. Frequency-shaping is applied on the linear-scale magnitude-RTF between 10 and 20 Hz and between 400 and 500 Hz to achieve a fade-in and fade-out effect using the first half and the second half of two Hann windows, respectively. The main purpose of this step is to achieve a smooth decay towards zero on both ends of the spectrum, which may help mitigate the potential problems at the data boundaries arising from the use of convolutional layers in the network architecture. This issue may also be resolved using padding, but an acoustically more relevant solution is considered here. An alternative would be to use a crossover filter such as Linkwitz-Riley,^{1,37} which would, however, require a much broader operating frequency range to accommodate the gradual decay at the lower and higher ends. Furthermore, the magnitude of the RTF is aligned around a reference 0-decibel (dB) line by applying a global gain, which is needed for the EQ filter design in digital room correction.^{1,6} In this work, the gain alignment is obtained by computing the square-root of the power-spectral mean of the single-point RTF taken over the 1-octave band centered at the pre-determined reference frequency of 200 Hz (i.e., the frequency range of 141–282 Hz). The same gain is also applied on the target multi-point averaged RTF. As the final step in data pre-processing, the single-point and multi-point averaged RTFs are converted to dB-scale and rescaled such that the normalized range of $[0,1]$ corresponds to the original dB-range between –50 and 30 dB, based on the empirical observation that RTFs generally lie within this dB-range after gain alignment. The frequency bins occasionally taking a negative value after rescaling are set to zero.
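The pre-processing chain can be sketched as follows for magnitude RTFs already sampled on the 10 to 500 Hz grid. This is a minimal sketch under stated assumptions: the alignment gain is taken as the inverse of the band RMS level, a small floor avoids taking the logarithm of zero at the faded band edges, and all function names are hypothetical.

```python
import math

F_MIN, F_MAX = 10, 500        # Hz, 1 Hz bins, hence 491 values
DB_LO, DB_HI = -50.0, 30.0    # dB range mapped onto [0, 1]

def shaping_curve():
    """Hann fade-in over 10-20 Hz, flat passband, Hann fade-out over 400-500 Hz."""
    curve = []
    for f in range(F_MIN, F_MAX + 1):
        if f <= 20:
            w = 0.5 * (1.0 - math.cos(math.pi * (f - 10) / 10))
        elif f >= 400:
            w = 0.5 * (1.0 + math.cos(math.pi * (f - 400) / 100))
        else:
            w = 1.0
        curve.append(w)
    return curve

def octave_band_gain(mag, f_ref=200.0):
    """Inverse RMS level of `mag` over the one-octave band centred at f_ref."""
    lo, hi = f_ref / math.sqrt(2.0), f_ref * math.sqrt(2.0)
    band = [m for f, m in zip(range(F_MIN, F_MAX + 1), mag) if lo <= f <= hi]
    return 1.0 / math.sqrt(sum(m * m for m in band) / len(band))

def normalise_db(mag, floor=1e-12):
    """Convert to dB, map [-50, 30] dB to [0, 1], set negative values to zero."""
    out = []
    for m in mag:
        db = 20.0 * math.log10(max(m, floor))
        out.append(max((db - DB_LO) / (DB_HI - DB_LO), 0.0))
    return out

def preprocess(single_mag, target_mag):
    """Shape, gain-align, and normalise a single-point RTF and its target."""
    curve = shaping_curve()
    single = [m * w for m, w in zip(single_mag, curve)]
    target = [m * w for m, w in zip(target_mag, curve)]
    gain = octave_band_gain(single)       # derived from the single-point RTF
    single = [m * gain for m in single]
    target = [m * gain for m in target]   # the same gain is reused for the target
    return normalise_db(single), normalise_db(target)

# Toy input (hypothetical): flat magnitudes, target 6 dB above the input.
x, y = preprocess([2.0] * 491, [4.0] * 491)
```

After alignment the flat input sits at 0 dB, which maps to 0.625 on the normalized scale, and the fully faded band edges are clipped to zero.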

The simulated dataset used in DNN training and validation is summarized in Table I. The test datasets used in performance evaluation are given in Table II. For the training dataset, three source positions are used in each room, as depicted in Fig. 5. For the validation and simulated test datasets, RTFs are also simulated for source positions that are at different distances from the walls and varying heights. The training dataset includes a combination of 111 rooms, 7 wall-impedance conditions, and 3 loudspeaker positions, resulting in a total of 753 different room configurations. The validation dataset consists of 71 room configurations including a combination of 9 rooms, 11 wall-impedance conditions, and 20 loudspeaker positions. The simulated test dataset (unseen by the network during training and validation) is made up of 163 room configurations containing a combination of 17 rooms, 12 wall-impedance conditions, and 20 loudspeaker positions. The data measured with the robotic arm using three loudspeakers are used only for testing and include 64 different room configurations per loudspeaker.

TABLE I. Simulated datasets used in DNN training and validation.

| Dataset | # Room config. | # 7-pt avg. RTFs |
|---|---|---|
| Training | 753 | 4,562,820 |
| Validation | 71 | 553,680 |

TABLE II. Test datasets used in performance evaluation.

| Test dataset | # Room config. | # 7-pt avg. RTFs |
|---|---|---|
| Simulation | 163 | 515,158 |
| 8 in. Subwoofer | 64 | 18,432 |
| KS Dig. C8-Coax | 64 | 18,432 |
| Genelec 8030B | 64 | 18,432 |

### B. Network architecture and training

The encoder-decoder network used for the data-driven local average RTF estimation, depicted in Fig. 6, is adapted from the U-Net designs^{30,31} originally proposed for speech enhancement, with the exception of the bottleneck, where the recurrent layers are replaced with fully connected layers, since there is no temporal component. The symmetric network architecture consists of three convolutional and three transposed convolutional layers in the encoder and decoder, respectively, and a bottleneck comprising two fully connected layers in between. All encoder and decoder layers use kernels of length 6, which are slid along the features with a stride of 4 for downsampling, resulting in convolutional layers with output lengths of 122, 30, and 7, and transposed convolutional layers with input lengths of 7, 30, and 122, following the symmetric structure with equal input and output lengths of 491. Similar to the U-Net implementation in the work by Braun *et al.*,^{31} a long skip connection from an encoder output is added to its associated decoder input instead of being concatenated with it. Initial trials have shown that this does not yield any noticeable performance reduction while lowering the computational complexity, since concatenation would double the number of decoder input channels. The three convolutional layers have 16, 32, and 64 output channels, and the three transposed convolutional layers correspondingly have 64, 32, and 16 input channels. The bottleneck is made up of two symmetric fully connected layers of the *input* × *output* size 448 × 32 and 32 × 448, respectively. The rectified linear unit (ReLU) function is used as the activation function at all layer outputs, except for the final layer, where the linear activation is used. The bias component is disabled in the input of all network layers.
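The stated layer lengths can be checked with the usual length formulas for unpadded strided convolutions. The sketch assumes no input padding in any layer and one extra output sample in the last transposed convolution (the `output_padding` option found in common deep-learning frameworks) so that the decoder returns to 491 samples; neither detail is stated in the text.

```python
def conv_out_len(n, kernel=6, stride=4):
    """Output length of an unpadded 1D convolution."""
    return (n - kernel) // stride + 1

def tconv_out_len(n, kernel=6, stride=4, output_padding=0):
    """Output length of an unpadded 1D transposed convolution."""
    return (n - 1) * stride + kernel + output_padding

# Encoder: 491 -> 122 -> 30 -> 7
enc = [491]
for _ in range(3):
    enc.append(conv_out_len(enc[-1]))

# Bottleneck input: 64 channels times 7 features
bottleneck_in = 64 * enc[-1]

# Decoder: 7 -> 30 -> 122 -> 491 (assumed extra sample in the last layer)
dec = [enc[-1]]
for pad in (0, 0, 1):
    dec.append(tconv_out_len(dec[-1], output_padding=pad))
```

The flattened encoder output of 64 × 7 = 448 values matches the 448 × 32 size of the first fully connected layer.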

For the network training, the loss function is selected as the mean absolute error (MAE) between the target *seven-point average RTF* and the estimated *local average RTF* at the network output. The batch size is chosen to be 20, and AdamW^{38} is used as the network optimizer with a learning rate of $\lambda = 10^{-3}$ and a weight decay rate of $10^{-2}$. Early stopping is used to prevent overfitting during training.

## V. ROOM EQUALIZATION

The local average RTF estimated by the DNN may be used with any room EQ framework of choice. The filter design adopted in this work to compare the room EQ performance of the proposed data-driven local average RTF estimation method with the single-point RTF and multi-point averaged RTF involves generating a minimum-phase FIR filter prototype with a length of 2048 (the same as the FFT length *N*) from the raw linear-scale magnitude RTF. The target curve is pre-defined as the same frequency-shaping curve used in RTF pre-processing, since its flat bandpass response with a smooth decay at both ends of the frequency region of interest gives it the desired response characteristics.^{1} As the RTF has already been centered around the 0-dB line in the pre-processing step, no additional gain alignment is needed. Next, a magnitude-only 1/6-octave smoothing^{39} is applied on the RTF, and the selected target curve is divided by the resulting smoothed RTF to obtain the initial raw FIR filter magnitude response. An additional soft gain limiter is used at this point to avoid excessive boosting of the dips in the spectrum, where the maximum gain is limited to 12 dB. Then, a frequency limiter is applied between 20 and 40 Hz and between 450 and 470 Hz to limit the FIR filter to the 0-dB gain on both ends of the spectrum, resulting in an effective EQ frequency range between 20 and 470 Hz. Finally, the minimum-phase response of the FIR filter is synthesized from the cepstrum of the magnitude response, which is obtained via the Hilbert transform. An example of a room EQ filter design is illustrated in Fig. 7.
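The final step, synthesizing a minimum-phase response from a magnitude response, can be sketched with the folded real cepstrum, which has the same effect as applying the Hilbert transform to the log-magnitude. The sketch uses a naive discrete Fourier transform to stay self-contained and is not the authors' implementation; the function names and the two-tap test filter are assumptions.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform, adequate for a short demo."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        acc = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(acc / n if inverse else acc)
    return out

def minimum_phase_fir(mag):
    """Minimum-phase impulse response whose DFT magnitude equals `mag`.

    mag: N strictly positive magnitudes on the full, symmetric DFT grid (N even).
    """
    n = len(mag)
    # Real cepstrum: inverse DFT of the log-magnitude.
    ceps = [c.real for c in dft([math.log(m) for m in mag], inverse=True)]
    # Fold the anti-causal half of the cepstrum onto the causal half.
    folded = [0.0] * n
    folded[0] = ceps[0]
    folded[n // 2] = ceps[n // 2]
    for m in range(1, n // 2):
        folded[m] = 2.0 * ceps[m]
    # Return to the frequency domain, exponentiate, and transform back.
    spectrum = [cmath.exp(v) for v in dft(folded)]
    return [v.real for v in dft(spectrum, inverse=True)]

# Check against a filter known to be minimum phase: h = [1, 0.5, 0, ...].
N = 64
h_true = [1.0, 0.5] + [0.0] * (N - 2)
mag = [abs(v) for v in dft(h_true)]
h_min = minimum_phase_fir(mag)
mag_min = [abs(v) for v in dft(h_min)]
```

The synthesized filter reproduces the prescribed magnitude while concentrating its energy at the start of the impulse response. For the length of 2048 used in the paper, a fast Fourier transform would replace the naive transform.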

FIR-based equalizers^{3,4,40–42} generally need high-order filter implementations, making automated design procedures proposed for parametric IIR equalizers with low-order filters a more attractive solution in real-world applications.^{1,5,6} Those parametric equalizers comprise a cascade of second-order peaking EQs, also known as biquad or second-order-section (SOS) filters, whose parameters including gain, center frequency, and bandwidth are adjusted through an optimization procedure. However, the design of such an automated procedure is beyond the scope of this work and the synthesized FIR filter prototype is regarded here as the ideal EQ filter that needs to be approximated by a parametric IIR equalizer.^{1,5,6} Another alternative may be to use graphic equalizers, which have filters with fixed center frequencies and bandwidths, only allowing the filter gains to be adjusted.^{43,44}
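For illustration, a single second-order peaking section and a cascade of such sections can be sketched as follows. The coefficient formulas are the widely used Audio EQ Cookbook form, which is an assumption here since the paper does not specify a biquad design; the center frequencies, gains, and Q values are hypothetical.

```python
import cmath
import math

def peaking_biquad(f0, gain_db, q, fs):
    """Peaking-EQ biquad coefficients (b, a), normalised so that a[0] == 1."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    b = [1.0 + alpha * a_lin, -2.0 * math.cos(w0), 1.0 - alpha * a_lin]
    a = [1.0 + alpha / a_lin, -2.0 * math.cos(w0), 1.0 - alpha / a_lin]
    return [v / a[0] for v in b], [v / a[0] for v in a]

def cascade_gain_db(sections, f, fs):
    """Magnitude response in dB of cascaded biquads at frequency f."""
    z1 = cmath.exp(-2j * math.pi * f / fs)   # z^-1 evaluated on the unit circle
    h = 1.0 + 0.0j
    for b, a in sections:
        num = b[0] + b[1] * z1 + b[2] * z1 * z1
        den = a[0] + a[1] * z1 + a[2] * z1 * z1
        h *= num / den
    return 20.0 * math.log10(abs(h))

fs = 3000.0
cut = peaking_biquad(60.0, -6.0, 4.0, fs)     # attenuate a modal peak
boost = peaking_biquad(250.0, 3.0, 2.0, fs)   # mild boost
gain_at_60 = cascade_gain_db([cut], 60.0, fs)
gain_at_dc = cascade_gain_db([cut, boost], 0.0, fs)
```

Each section reaches its nominal gain at its center frequency and returns to 0 dB away from it, so an optimizer only has to adjust three parameters per section to approximate the FIR prototype.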

## VI. PERFORMANCE EVALUATION

### A. Local average RTF estimation

To evaluate the local average RTF estimation performance, the MAE and the root mean squared error (RMSE) are computed on three different magnitude-RTF scales: the linear-scale magnitude, the dB-scale magnitude, and the normalized dB-scale magnitude (the one used in network training). The results for the test datasets are presented in Table III, where the mean and standard deviation of each case are computed over the entire dataset of interest. Furthermore, ten examples of the data-driven local average RTF estimates randomly selected from the simulated test dataset are plotted in Fig. 8 in comparison to their corresponding single-point and multi-point RTFs, using a linear frequency scale in line with the actual problem setup.
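The two error measures on the three scales can be sketched as follows for one estimated RTF and its target. The helper names, the logarithm floor, and the two-bin toy data are assumptions of this sketch.

```python
import math

def mae(est, ref):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(e - r) for e, r in zip(est, ref)) / len(ref)

def rmse(est, ref):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(est, ref)) / len(ref))

def to_db(mag, floor=1e-12):
    return [20.0 * math.log10(max(m, floor)) for m in mag]

def to_norm_db(mag, lo=-50.0, hi=30.0):
    return [max((d - lo) / (hi - lo), 0.0) for d in to_db(mag)]

def errors_on_all_scales(est_mag, ref_mag):
    """Return {scale: (MAE, RMSE)} for linear, dB, and normalized dB magnitudes."""
    scales = {"linear": list, "dB": to_db, "norm. dB": to_norm_db}
    return {
        name: (mae(fn(est_mag), fn(ref_mag)), rmse(fn(est_mag), fn(ref_mag)))
        for name, fn in scales.items()
    }

# Toy case (hypothetical): the estimate is exact at a peak and 6 dB off in a dip.
errs = errors_on_all_scales([1.0, 0.1], [1.0, 0.2])
```

In this toy case the mismatch in the dip amounts to only 0.05 in linear MAE but about 3 dB in dB-scale MAE, which illustrates why dips weigh more heavily on the dB-scale errors.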

TABLE III. MAE and RMSE (mean ± standard deviation) of the local average RTF estimates on the test datasets.

| Test dataset | MAE (linear-scale) | MAE (dB-scale) | MAE (norm. dB-scale) | RMSE (linear-scale) | RMSE (dB-scale) | RMSE (norm. dB-scale) |
|---|---|---|---|---|---|---|
| Simulation | 0.1153 ± 0.0449 | 1.4520 ± 0.3162 | 0.0181 ± 0.0040 | 0.1718 ± 0.0721 | 2.3457 ± 0.3602 | 0.0293 ± 0.0045 |
| 8 in. Subwoofer | 0.1228 ± 0.0280 | 1.7400 ± 0.1881 | 0.0217 ± 0.0024 | 0.1712 ± 0.0404 | 2.8603 ± 0.2664 | 0.0358 ± 0.0033 |
| KS Dig. C8-Coax | 0.1316 ± 0.0284 | 2.0637 ± 0.2752 | 0.0258 ± 0.0034 | 0.1756 ± 0.0364 | 3.6180 ± 0.3635 | 0.0452 ± 0.0045 |
| Genelec 8030B | 0.1368 ± 0.0305 | 2.0940 ± 0.2545 | 0.0262 ± 0.0032 | 0.1837 ± 0.0401 | 3.6874 ± 0.3262 | 0.0461 ± 0.0041 |

The mean errors for the dB-scale and the normalized dB-scale presented in Table III are closest between the simulated and the measured subwoofer data, as expected, and slightly higher for the other loudspeakers, since their frequency responses differ from that of the subwoofer used as the sound source in the simulations. The mean errors for the linear-scale are very close to each other, which may be explained by the fact that the dips contribute significantly less to the overall error in the linear-scale than in the dB-scale. As seen in Fig. 8, the estimation performance degrades with increasing frequency, because the magnitude variation in the local proximity gradually increases within a room, and also among different rooms, above the Schroeder frequency. Visual inspection also reveals that the spatial smoothing operation learned by the network tends to preserve the peaks of the single-point RTFs and smooth out the notches, similar to the actual multi-point averaging, but in a more structured way. This suggests that the DNN has captured the common patterns shared by many different rooms in the training dataset, since no auxiliary information on individual rooms, such as the room geometry, modes, or wall properties, is fed to the network during training.
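The different weighting of dips in the linear and dB scales can be checked with a small numerical example. The sketch below uses a hypothetical flat reference magnitude and two hypothetical estimates that deviate by the same linear amount at one bin, once as a notch and once as a peak. The normalized dB-scale is left out because its definition is not restated in this section, and the helper names are illustrative only.

```python
import numpy as np

def mae_rmse(estimate, reference):
    """Return (MAE, RMSE) between two equally sized arrays."""
    err = np.asarray(estimate, dtype=float) - np.asarray(reference, dtype=float)
    return np.mean(np.abs(err)), np.sqrt(np.mean(err ** 2))

def to_db(mag, floor=1e-6):
    """Convert linear magnitude to dB, with a small floor to avoid log(0)."""
    return 20.0 * np.log10(np.maximum(mag, floor))

# Hypothetical flat reference over 100 frequency bins.
reference = np.ones(100)

# Two hypothetical estimates with the same linear deviation (0.9) at one bin.
dip = reference.copy()
dip[50] = 0.1    # 20 dB notch
peak = reference.copy()
peak[50] = 1.9   # peak of roughly 5.6 dB

mae_lin_dip, rmse_lin_dip = mae_rmse(dip, reference)
mae_lin_peak, rmse_lin_peak = mae_rmse(peak, reference)
mae_db_dip, rmse_db_dip = mae_rmse(to_db(dip), to_db(reference))
mae_db_peak, rmse_db_peak = mae_rmse(to_db(peak), to_db(reference))

print("linear-scale MAE:", mae_lin_dip, mae_lin_peak)
print("dB-scale MAE:    ", mae_db_dip, mae_db_peak)
```

In this toy case the notch and the peak contribute equally to the linear-scale error, while the notch contributes several times more than the peak to the dB-scale error.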

To expand the analysis into an entire room, the 2D cross-sectional maps representing the magnitude distribution of the single-point, network and multi-point averaged RTFs across a simulated rectangular room from the test dataset with the subwoofer placed at the corner are shown in Fig. 9 at several frequencies at the height of 1.20 m, which may be regarded roughly as the average ear height of a seated listener (top row), and at different heights at the frequency of 250 Hz (bottom row). The 2D maps indicate that the trained network is able to capture the 3D spatial information from the 1D RTF data to a great extent despite the significantly varying behavior across the spectrum. Although it can be seen from the 2D maps that the network may fail to produce some of the details that would require extra knowledge about the room, it is still able to successfully extract the overall 3D pattern in a room. In this room example, the sizes of the locally smoothed areas do not fully match between the network and multi-point average RTF maps, which can also be noticed in some of the 1D examples given in Fig. 8, where the multi-point average RTF significantly deviates from the single-point RTF at relatively higher frequencies while the estimated local average RTF tends to stay closer to the single-point RTF.

### B. Room equalization

Human hearing tends to be more sensitive to the peaks than the dips across the spectrum.^{21,22} Interestingly, this psychoacoustic pattern is also captured by the multi-point averaged RTF, which can be described as a room-aware peak-preserving notch smoothing operation that cannot be achieved via any traditional frequency-domain smoothing technique applied to the single-point RTF. The sum of squared errors (SSE) proposed by Vairetti *et al.*^{6} for use as a cost function in IIR parametric equalizer design is adopted here as a perceptually driven spectral deviation measure for the multi-point equalization scenario. The SSE is given by

where $H_{\mathrm{EQ}}(f)$ and *T*(*f*) denote the EQ FIR filter and the target curve, respectively, and *F* is the total number of frequency bins. As a result of using the linear-scale magnitude-RTF, the SSE puts more emphasis on the peaks than the dips in the spectrum while measuring the spectral deviation from the selected target curve.

The SSE is computed for the pre-EQ [i.e., $H_{\mathrm{EQ}}(f)=1$], single-point, network, and multi-point EQs over 1/6-octave smoothed RTFs using Eq. (8) in the frequency range from 30 to 450 Hz, over which the signal level was assumed to be sufficiently high to be audible.
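Fractional-octave smoothing of the kind mentioned above can be sketched as follows. This is a minimal example that assumes a rectangular window of 1/6-octave width centered on each bin and averaging in the power domain. Neither choice is confirmed by the text, and the function name and test signal are hypothetical.

```python
import numpy as np

def fractional_octave_smooth(freqs, mag, fraction=6):
    """Smooth a magnitude response with a rectangular 1/fraction-octave window.

    Assumption: the average is taken over squared magnitudes (power domain).
    """
    freqs = np.asarray(freqs, dtype=float)
    power = np.asarray(mag, dtype=float) ** 2
    half = 2.0 ** (1.0 / (2.0 * fraction))  # ratio from center to window edge
    out = np.empty_like(power)
    for i, fc in enumerate(freqs):
        sel = (freqs >= fc / half) & (freqs <= fc * half)
        out[i] = power[sel].mean()
    return np.sqrt(out)

# Hypothetical 1 Hz grid over the evaluated band with one narrow notch.
freqs = np.arange(30.0, 451.0, 1.0)
mag = np.ones_like(freqs)
mag[freqs == 100.0] = 0.05
smoothed = fractional_octave_smooth(freqs, mag)
print(smoothed[freqs == 100.0])
```

Because the window width grows with the center frequency, the same narrow notch would be attenuated more strongly by the smoothing at higher frequencies on a linear frequency grid.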

Three additional SSE-based *relative performance scores* (RPS) were also used in EQ evaluation. The first RPS is defined as

where a value of 1 corresponds to a perfect match with the target curve (i.e., $\mathrm{SSE}^{(\mathrm{EQ})}=0$) and any negative value implies that the room EQ undesirably leads to a degradation with respect to the pre-EQ SSE. The second RPS is computed only for the network and multi-point EQs to assess the performance relative to the single-point EQ,

The third RPS is defined as

which provides a comparison between the single-point EQ and the other two EQ methods relative to the pre-EQ level. The distributions over the simulated test dataset for all four metrics are shown in Fig. 10 using histogram plots. The mean and standard deviation computed over the entire set for the simulated and measured test datasets are presented for SSE in Table IV and for $\mathrm{RPS}^{(1)}$ and $\mathrm{RPS}^{(2)}$ in Table V.
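One reading of the metrics that is consistent with the verbal descriptions above can be sketched in code. The forms used here are assumptions inferred from the prose, not the article's equations: the SSE is taken as the squared linear-magnitude deviation from the target averaged over all bins and measurement points, the first score normalizes the SSE reduction by the pre-EQ SSE, the second normalizes the reduction relative to the single-point EQ by the single-point SSE, and the third normalizes that same reduction by the pre-EQ SSE. All names and the toy data are hypothetical.

```python
import numpy as np

def sse(h_eq_mag, rtf_mags, target):
    """Mean squared linear-magnitude deviation of equalized RTFs from a target.

    Assumption: rtf_mags has shape (points, bins); normalization by the total
    number of points and bins is a choice made for this sketch.
    """
    equalized = np.atleast_2d(rtf_mags) * np.asarray(h_eq_mag, dtype=float)
    return float(np.mean((equalized - np.asarray(target, dtype=float)) ** 2))

def rps1(sse_eq, sse_pre):
    """Assumed form: 1 for a perfect match, negative if EQ degrades the SSE."""
    return (sse_pre - sse_eq) / sse_pre

def rps2(sse_eq, sse_single):
    """Assumed form: improvement relative to the single-point EQ."""
    return (sse_single - sse_eq) / sse_single

def rps3(sse_eq, sse_single, sse_pre):
    """Assumed form: difference to the single-point EQ, scaled by pre-EQ SSE."""
    return (sse_single - sse_eq) / sse_pre

# Toy data: two measurement points, four frequency bins, flat unit target.
target = np.ones(4)
rtf_mags = np.array([[2.0, 1.0, 1.0, 1.0],
                     [2.0, 1.0, 0.5, 1.0]])

sse_pre = sse(np.ones(4), rtf_mags, target)                 # no equalization
sse_eq = sse(np.array([0.5, 1.0, 1.0, 1.0]), rtf_mags, target)  # cut the peak
print(sse_pre, sse_eq, rps1(sse_eq, sse_pre))
```

In the toy case, attenuating the shared peak removes most of the deviation, while the position-dependent dip in the second RTF remains as the residual error.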

| Test dataset | Pre-EQ | Single-point EQ | Network EQ | Multi-point EQ |
|---|---|---|---|---|
| Simulation | 0.3291 ± 0.4022 | 0.1332 ± 0.1375 | 0.0953 ± 0.0564 | 0.0676 ± 0.0261 |
| 8″ Subwoofer | 0.1919 ± 0.1626 | 0.0676 ± 0.0220 | 0.0608 ± 0.0142 | 0.0471 ± 0.0009 |
| KS Dig. C8-Coax | 0.1998 ± 0.1404 | 0.0664 ± 0.0226 | 0.0607 ± 0.0133 | 0.0458 ± 0.0083 |
| Genelec 8030B | 0.1819 ± 0.1203 | 0.0656 ± 0.0210 | 0.0610 ± 0.0133 | 0.0465 ± 0.0085 |

| Test dataset | $\mathrm{RPS}^{(1)}$, single-point EQ | $\mathrm{RPS}^{(1)}$, network EQ | $\mathrm{RPS}^{(1)}$, multi-point EQ | $\mathrm{RPS}^{(2)}$, network EQ | $\mathrm{RPS}^{(2)}$, multi-point EQ |
|---|---|---|---|---|---|
| Simulation | 0.4848 ± 0.3666 | 0.5969 ± 0.2268 | 0.7019 ± 0.1458 | 0.1637 ± 0.1938 | 0.3522 ± 0.1980 |
| 8″ Subwoofer | 0.5515 ± 0.2029 | 0.5943 ± 0.1651 | 0.6839 ± 0.1215 | 0.0751 ± 0.1244 | 0.2719 ± 0.1259 |
| KS Dig. C8-Coax | 0.5915 ± 0.1986 | 0.6269 ± 0.1507 | 0.7163 ± 0.1108 | 0.0497 ± 0.1635 | 0.2772 ± 0.1262 |
| Genelec 8030B | 0.5664 ± 0.1988 | 0.5972 ± 0.1531 | 0.6909 ± 0.1124 | 0.0379 ± 0.1525 | 0.2608 ± 0.1228 |

The SSE-based spectral deviation measures demonstrate that the network EQ achieves a performance level between those of the single-point EQ and the multi-point EQ. When applied to the raw RTFs on the seven-point grid, the multi-point EQ always brings an improvement, whereas minor degradation may occur after the single-point EQ and the network EQ in a small number of cases, as seen in the histogram plots of $\mathrm{RPS}^{(1)}$. Based on the $\mathrm{RPS}^{(2)}$ histograms, the multi-point EQ outperforms the single-point EQ in all cases, while the network EQ results in an SSE higher than that of the single-point EQ in roughly 20% of the samples in the simulated test dataset. However, the $\mathrm{RPS}^{(3)}$ histograms reveal that the degradation may generally be deemed small for these samples relative to the pre-EQ SSE levels. As seen in Table IV, the measured test datasets have a tighter SSE distribution than the simulated test dataset, because the same angular grid was used for the robotic arm in all loudspeaker configurations in one single room. This has limited the potential improvement that the multi-point EQ and the network EQ could achieve over the single-point EQ, as reflected in the RPS results given in Table V, but the mean $\mathrm{RPS}^{(1)}$ and the mean $\mathrm{RPS}^{(2)}$ remain positive. As expected, the performance is noticeably worse for the two full-range loudspeakers, whose frequency responses were not used at all in the dataset generation for DNN training.

The 2D cross-sectional maps showing the SSE and $\mathrm{RPS}^{(1)}$ distributions across a simulated rectangular room at varying elevations, with the subwoofer positioned at a corner, are plotted in Fig. 11. The multi-point EQ achieves nearly uniform SSE levels across the entire room, even though different parts of the room exhibit varying characteristics before any EQ is applied. The single-point EQ yields limited improvement in various regions across the room. In these regions, where the pre-EQ SSE is generally high, the network EQ brings particular benefit, although still somewhat less than the multi-point EQ. In contrast, the 2D $\mathrm{RPS}^{(1)}$ maps show that the single-point EQ degrades the response or brings no noticeable improvement in the regions where the pre-EQ SSE is already low, meaning that the room EQ may only have a limited effect there. The multi-point EQ still reduces the SSE in these regions, whereas the network EQ also struggles somewhat, but to a much lesser extent than the single-point EQ.

### C. Discussion

Estimating a local average RTF from a single-point RTF is not a trivial task, as the necessary information on the sound field around the given point is lost with the removal of multi-point measurements. Yet, it is remarkable that deep learning can infer the general 3D patterns in room acoustics from 1D data, still yielding benefit in single-point based digital room correction. Including auxiliary information obtained by existing techniques for room geometry inference^{45} or wall impedance estimation^{34} in the DNN setup could potentially further close the gap between the network and multi-point EQ performances. This work aims at the development of a data-driven local average RTF estimator that is independent of any EQ scheme. Alternatively, a data-centric AI approach, in which the network is trained with a dataset revised depending on the EQ framework, may also be advantageous. For instance, the most problematic cases determined in the initially available dataset through the selected EQ approach may intentionally be overrepresented in the final training dataset to achieve an EQ-oriented boosted performance. Alternative data-centric strategies may also include the diversification of the loudspeaker positions in the training dataset, which could potentially improve the generalization of the trained model against unseen room scenarios. In this work, however, the focus was on operating in the low-frequency region, and hence a subwoofer, which is typically placed on the floor, was used as the reference sound source in the simulations. Therefore, only three on-floor positions, considered to be placements commonly found in living room setups, were used for the training dataset. Furthermore, the goal of this work was the development of a proof of concept that could operate in a variety of room sizes ranging from very small rooms to very large ones. However, as another potential data generation strategy, the focus of the selected dataset could be shifted towards more commonly found living room sizes and conditions or some application-dependent sets of rooms.

At the moment, the EQ performance comparison is based on an objective evaluation using a psychoacoustically driven metric. However, only actual subjective testing could verify whether the SSE metric aligns with human perception or a revised metric is necessary, how audibly beneficial the selected multi-point EQ scheme is compared to the single-point EQ, and whether the network EQ yields an audible improvement over the single-point EQ. The current version of the proposed method focuses on operation at low frequencies. Extending it to full-range loudspeakers would require increased simulation capability. This could potentially be realized by combining the low-frequency FEM simulations with simulated data generated using the image method^{46} for high frequencies. Another alternative would be the collection of real-world RIR measurements in various room conditions based on a denser sampling grid. Furthermore, the frequency-range extension may require a reformulation of the multi-point EQ problem, such as resampling the spectrum to a logarithmic frequency scale^{1} to reduce computational complexity. Additional future aspects may also include adopting multi-point EQ approaches existing in the literature^{7–14} to evaluate their potential benefits and drawbacks both objectively and subjectively, and altering the currently used multi-point RTF grid in terms of the number of measurement points, the grid size, or its structure, for instance, to further expand the EQ zone over a larger listening area. Another potential modification to the problem setup would be the design of a DNN that serves as an EQ filter synthesizer, in which the FIR filter would be estimated directly via deep learning in an end-to-end fashion, potentially using a differentiable metric such as the SSE as the loss function.

## VII. CONCLUSION

A data-driven method for estimating a local average RTF from a single-point measurement in the low-frequency region has been proposed. The deep neural network has been trained and tested with RTFs simulated via FEM in many room conditions. Additional evaluation has been made with the real-world data measured in a small room using a robotic arm. It has been shown that the network has learned a spatial smoothing operation, in which the peaks of the single-point RTFs are preserved while the notches across the spectrum are smoothed out in line with the intended spatial RTF averaging. It has also been pointed out that the proposed data-driven method can capture the general 3D room patterns despite only processing 1D RTFs.

An FIR filter design approach has been adopted for digital room correction, and a psychoacoustically driven metric has been used for EQ performance evaluation. It has been found that although the network EQ has still fallen short of the multi-point EQ performance, the use of the estimated local average RTF has generally yielded improvement over single-point EQ.

Future work may include extending the proposed method to support a wider frequency range and testing different spatial averaging strategies and multi-point EQ methods. In addition, expanding the existing simulated RTF dataset and collecting more real-world data using a diverse set of loudspeaker positions would also be beneficial to improve the model generalization against unseen room acoustic conditions. Listening tests may also be conducted to subjectively evaluate the room EQ performance.

## ACKNOWLEDGMENTS

Parts of this work have been funded by the Free State of Bavaria in the DSAI project. The authors would like to thank Felix Knauff and Lukas Friede for the robotic arm design and the RIR measurement campaign, and Julien Haibach for the feedback on the room EQ framework.