Immersive and spatial sound reproduction has been widely studied using loudspeaker arrays. However, flat-panel loudspeakers, which use thin flat panels driven by force actuators, are a promising alternative to traditional coaxial loudspeakers for practical applications, with benefits such as a low visual profile and diffuse radiation. The literature has addressed the sound quality and applications of flat-panel loudspeakers in three-dimensional sound reproduction, such as wave field synthesis and sound zones. This paper revisits the spatial sound perception of flat-panel loudspeakers, specifically the localization mismatch between the perceived and desired sound directions when using amplitude panning. Subjective tests with 24 subjects in an anechoic chamber yield a mean azimuth direction mismatch within ±6.0° and a mean elevation mismatch within ±10.0°. The experimental results show that the virtual source created by amplitude panning over a flat-panel loudspeaker achieves spatial localization accuracy close to that of a real sound source, despite not using complex algorithms or acoustic transfer function information. The findings establish a benchmark for virtual source localization in spatial sound reproduction using flat-panel loudspeakers, which can serve as a starting point for future research and algorithm optimization.
I. INTRODUCTION
Extensive research has been dedicated to using loudspeaker arrays to reproduce spatial sound for creating immersive and realistic listening experiences.1 Such systems recreate the auditory sense of space2 and the localization of perceived sound sources,3–5 which is essential for various applications such as augmented or mixed reality (AR/MR), multimedia content creation,1,4,6 and personalized sound zones.7,8 However, taking this technology from laboratory prototypes to real-world settings, especially complex acoustic environments such as buildings, requires further exploration. Previous research has explored various approaches, including, but not limited to, equalization of room responses,9 optimization of robustness,6,10 optimization of loudspeaker placement,11,12 simplified implementations that reduce acoustic transfer function measurements,8,13 and distributed systems.7 However, loudspeakers are limited by their physical structure and spatial placement in sound reproduction. For example, coaxial loudspeakers can be impractical for certain applications because of their weight, cost, or other factors. Additionally, reproducing sound with sufficient spatial coverage requires the loudspeakers to span an appropriate spatial extent while maintaining spacing small enough to control sound waves at high frequencies.
The flat-panel loudspeaker is a promising alternative to traditional coaxial loudspeakers, with advantages in low visual profile and wide sound dispersion.14 It uses a thin, flat panel with force actuators on the rear side to generate acoustic radiation through panel vibration. It can adapt to various indoor environments and can even utilize existing displays, e.g., an organic light-emitting diode (OLED) screen, to generate spatialized audio.15 In the form of the multi-actuator panel (MAP), the flat-panel loudspeaker has multiple exciters, each driven with its own signal, so that signal processing allows dynamic control of the panel's spatial vibration profile with a diffuse sound radiation characteristic.16 This characteristic helps avoid the beaming of piston loudspeakers at high frequencies17 and reduces modal excitation within rooms.18 Compared with conventional loudspeakers, achieving good sound quality with the flat-panel loudspeaker can be challenging. Though outside the scope of this paper, the existing literature has widely addressed sound quality improvement19 and applications of flat-panel loudspeakers in three-dimensional (3D) sound reproduction, such as wave field synthesis14,20 and directional sound fields.21
The main aim of sound reproduction is to provide listeners with a clear spatial perception by utilizing psychoacoustic cues that lead to perceptual satisfaction. However, the human auditory perception mechanism is complex and selectively emphasizes or suppresses sounds even under unfavorable conditions, as in the cocktail party effect. Therefore, considering subjective perception is crucial for evaluating, designing, and optimizing sound reproduction methods. The flat-panel loudspeaker has been evaluated perceptually in terms of loudness,22,23 sound localization and perceived sound distance under wave field synthesis,24 and sound quality enhancement.25 Furthermore, it has been compared with the electrodynamic loudspeaker using objective and subjective measures for wave field synthesis.26
So far, vector-based amplitude panning (VBAP)27 remains an effective and straightforward method to create virtual sound sources using traditional loudspeakers arbitrarily placed in space, with ongoing research and development.28,29 VBAP has several practical advantages, including low computational complexity, no destructive interference within the sweet spot, superior timbral quality, and gradual sound quality degradation outside the sweet spot.29 Moreover, it does not require precise information on the acoustic transfer function for implementation. This feature is essential for controlling flat-panel loudspeakers since their sound radiation is affected by several factors, such as material, boundary conditions, and coupling. While VBAP has been utilized with flat-panel loudspeakers,30 a complete and thorough evaluation of spatial sound panning with the flat-panel loudspeaker is still lacking.31
This paper revisits the spatial sound perception of flat-panel loudspeakers, specifically the localization mismatch between the perceived and desired sound directions when using amplitude panning. An experiment involving subjective and objective tests utilized a vector-based amplitude panning algorithm to create 81 virtual sources with four actuators placed at the corners of a flat panel. The subjective listening tests included 24 normal-hearing subjects, while the objective tests included measurements of the interaural time difference (ITD) and the interaural level difference (ILD) in an anechoic chamber. The study aims to determine the spatial localization accuracy of amplitude panning using a flat-panel loudspeaker. As amplitude panning does not rely on complex algorithms or acoustic transfer function information, the experimental results can serve as a benchmark for virtual source localization in spatial sound reproduction using flat-panel loudspeakers. This information can be useful for future research and algorithm optimization in this area.
II. THEORY
A. Flat panel with actuator array
Note that the derivation presented here may not be sufficiently precise for the near-field condition, where the sound pressure radiated by the source is considerably more complex and exhibits intricate oscillatory features that cannot be accurately approximated. However, near-field scenarios are common in practice, for example, when the user sits within 1 m of the display screen. Furthermore, the practical boundary conditions can differ from those assumed in the derivation, making it challenging to obtain an analytical solution.
Though the sound field produced by a flat-panel loudspeaker depends strongly on frequency and on the observation position relative to the source,21 laser Doppler vibrometer measurements confirm that each actuator element can vibrate largely independently of neighboring exciters and the panel edges.32 This implies that individual exciters can be treated as independent sources for spatial sound reproduction, so that a virtual sound source can be created using amplitude panning.
B. Amplitude panning using the flat-panel loudspeaker
Three-dimensional amplitude panning using a flat-panel loudspeaker with four actuators ACT.1–4. For example, actuators 1, 2, and 4 form the active triplet that creates a virtual sound source in the desired direction; unit vectors from the listener to each actuator are expressed in Cartesian coordinates.
VBAP includes a geometric determination of the triangle of active loudspeakers and an algebraic solution for the panning gains such that the velocity vector of the synthesized sound field matches the direction of the virtual source. Though it does not require acoustic transfer function information, the spatial localization accuracy achieved by conventional loudspeakers with VBAP is within  and  in azimuth and elevation, respectively.34 In comparison, perceiving a real sound source gives a mean azimuth mismatch ranging from – , and a mean elevation mismatch in the median plane ranging from  for white noise to  for speech.3 Conventional loudspeakers are spatially discrete sound sources. A flat panel with multiple actuators, on the other hand, is a continuous sound source, e.g., in Eq. (3). The sound received from the flat-panel loudspeaker is contingent upon the plate's vibration, i.e., in Eq. (4). Since the actuators have different gains but the same phase under VBAP, the larger vibration amplitudes occur in the vicinity of the actuators driven with higher gains. Thus, owing to spatial masking, the perceived sound of VBAP on the flat panel may be comparable to that of conventional loudspeakers. The following section presents the experimental characteristics of virtual source localization with amplitude panning on a flat panel with actuators.
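As a minimal illustration of the gain computation summarized above, the following Python sketch solves the standard three-by-three VBAP system for one active triplet and power-normalizes the result; the actuator and virtual source directions are hypothetical unit vectors, not the exact geometry of our setup.

```python
import numpy as np

def vbap_gains(triplet_dirs, source_dir):
    """VBAP gains for one active triplet.

    triplet_dirs: (3, 3) array whose rows are unit vectors from the
                  listener to the three active actuators.
    source_dir:   (3,) unit vector toward the desired virtual source.
    """
    L = np.asarray(triplet_dirs, dtype=float)
    p = np.asarray(source_dir, dtype=float)
    g = np.linalg.solve(L.T, p)        # velocity-vector match: p = g1*l1 + g2*l2 + g3*l3
    if np.any(g < 0):                  # a negative gain means the source lies outside
        g = np.clip(g, 0.0, None)      # this triplet; another triplet should be chosen
    return g / np.linalg.norm(g)       # power normalization keeps loudness roughly constant

# Hypothetical geometry: three actuator directions and one virtual source direction.
l1 = np.array([ 0.25, 0.94,  0.25])
l2 = np.array([-0.25, 0.94,  0.25])
l3 = np.array([ 0.25, 0.94, -0.25])
p  = np.array([ 0.10, 0.99,  0.05])
dirs = np.vstack([v / np.linalg.norm(v) for v in (l1, l2, l3)])
print(vbap_gains(dirs, p / np.linalg.norm(p)))
```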
III. OBJECTIVE EVALUATION OF VIRTUAL SOURCE DIRECTION
The experiment was designed to evaluate, both objectively and subjectively, the spatial sound perception of flat-panel loudspeakers, specifically the localization mismatch between the perceived and desired sound directions when using amplitude panning. As illustrated in Fig. 2, we consider an indoor display scenario with the listener in the near field, at a distance  from the center of the subject's head O to the center of the panel. The flat panel is a 0.2 mm thick aluminum stencil of dimensions  with fixed edges. Virtual sound sources were created at 81 locations within a square region of size . The four actuators are at the vertices , and , respectively, with , as shown in Fig. 3. Virtual sources are denoted as Sij, where i = 1, 2,…, 9 and j = 1, 2,…, 9 represent the row and column numbers, respectively. Thus, the available azimuth and elevation ranges of the virtual sound sources were within . The experiment was carried out in the anechoic chamber at the Institute of Acoustics, Chinese Academy of Sciences, which is 6.40 m long, 4.70 m wide, and 4.70 m high, with a usable height of 3.20 m.
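For readers reproducing the source grid, the short Python sketch below converts a grid position on the panel into the desired azimuth and elevation seen by the listener at the 0.7 m distance used in the experiment; the half-width of the square source region is a placeholder assumption, since the exact dimensions are given in Fig. 3.

```python
import numpy as np

D = 0.70       # listening distance in meters (Fig. 2)
HALF = 0.20    # assumed half-width of the square source region in meters (placeholder)

ys = np.linspace(HALF, -HALF, 9)   # rows i = 1..9, top to bottom
xs = np.linspace(-HALF, HALF, 9)   # columns j = 1..9, left to right

for i, y in enumerate(ys, start=1):
    for j, x in enumerate(xs, start=1):
        azimuth = np.degrees(np.arctan2(x, D))                 # horizontal angle
        elevation = np.degrees(np.arctan2(y, np.hypot(x, D)))  # vertical angle
        if (i, j) in [(1, 1), (5, 5), (9, 9)]:                 # print a few samples
            print(f"S{i}{j}: azimuth {azimuth:+.1f} deg, elevation {elevation:+.1f} deg")
```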
(Color online) Experimental setup using the KEMAR Head and Torso simulator at the listener location with a distance of 70 cm from the panel.
(Color online) The distance from the center of the listener's head to that of the panel was . Four actuators ACT.1–4 were on the rear side of the panel with . Eighty-one virtual sources Sij, i = 1, 2,…, 9 and j = 1, 2,…, 9, were within the square region (contoured in black), with the four actuators at the vertices.
Stimuli were generated at a sampling rate of 48 kHz. VBAP gains were calculated based on codes from Refs. 35 and 36. Output signals were played as multi-channel .flac files, with a pink noise signal in each channel for the corresponding actuator. The computer was equipped with a Fireface UC audio interface (RME Audio, Haimhausen, Germany) for digital-to-analog conversion. Separate power amplifiers drove the four actuators. The stimuli were reproduced within an effective volume range, determined by a pre-test of channel distortion to ensure a total harmonic distortion (THD) of less than 10%. Furthermore, the reproduction process involved equalization37 and inter-channel calibration to achieve flattened and aligned frequency responses from all actuators over 100 Hz to 20 kHz.
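The stimulus preparation can be sketched as follows in Python, assuming hypothetical per-actuator VBAP gains and file name and omitting the equalization filters: pink noise is generated by 1/f spectral shaping and written as a four-channel FLAC file with the soundfile package.

```python
import numpy as np
import soundfile as sf   # pip install soundfile

FS = 48000
DURATION = 5.0           # seconds, as in the test stimuli

def pink_noise(n, rng):
    """Pink noise via 1/f power shaping of white noise in the frequency domain."""
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n, 1.0 / FS)
    freqs[0] = freqs[1]              # avoid division by zero at DC
    spectrum /= np.sqrt(freqs)       # -3 dB per octave power slope
    x = np.fft.irfft(spectrum, n)
    return x / np.max(np.abs(x))

rng = np.random.default_rng(0)
noise = pink_noise(int(FS * DURATION), rng)

# Hypothetical VBAP gains for actuators ACT.1-4 (the fourth actuator is inactive here).
gains = np.array([0.62, 0.71, 0.33, 0.0])
multichannel = np.column_stack([g * noise for g in gains]) * 0.5   # 6 dB of headroom

sf.write("virtual_source_S55.flac", multichannel, FS, subtype="PCM_24")
```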
In the experiment, the ITD and ILD for different virtual sources were calculated from recordings made with the Head and Torso Simulator (GRAS Sound & Vibration, Holte, Denmark) using 5 s of pink noise as the test signal. The corresponding perceptual virtual source directions were then obtained by referencing a high-resolution lookup table of simulated ITD and ILD values based on a head-related transfer function database and interpolation.41
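One plausible way to estimate these cues from the binaural recordings, not necessarily the exact processing chain used here, is sketched below: the ITD is taken as the lag of the broadband cross-correlation peak, and the ILD as the band-limited level ratio around 2500 Hz or 5000 Hz, with the left and right signals assumed to be available as NumPy arrays sampled at 48 kHz.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, correlate

FS = 48000

def estimate_itd(left, right, max_lag_ms=1.0):
    """ITD in seconds from the broadband cross-correlation peak."""
    max_lag = int(FS * max_lag_ms / 1000.0)
    xcorr = correlate(left, right, mode="full")
    lags = np.arange(-len(right) + 1, len(left))
    keep = np.abs(lags) <= max_lag          # restrict to physiologically plausible lags
    return lags[keep][np.argmax(xcorr[keep])] / FS

def estimate_ild(left, right, center_hz, bandwidth_hz=500.0):
    """ILD in dB within a band centered at center_hz (e.g., 2500 or 5000 Hz)."""
    band = [center_hz - bandwidth_hz / 2, center_hz + bandwidth_hz / 2]
    sos = butter(4, band, btype="bandpass", fs=FS, output="sos")
    rms = lambda x: np.sqrt(np.mean(sosfiltfilt(sos, x) ** 2))
    return 20.0 * np.log10(rms(left) / rms(right))
```

For each virtual source, the resulting ITD/ILD pair can then be matched against the lookup table described above.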
Figure 4 presents the localization mismatch between the perceptual virtual source direction (obtained from the ITD measurement) and the desired sound direction for each virtual source. The measured values are smoothed for visualization. For virtual sources at most locations on the panel, the azimuth and elevation mismatch values are relatively small. A few negative azimuth mismatches, no worse than , appear in the lower right area ( ), while a few positive elevation mismatches, no worse than , are located in the lower right area ( ).
(Color online) Localization mismatch between the perceptual virtual source direction (obtained from the ITD measurement) and the desired sound direction for each virtual source in Fig. 3. (a) Azimuth mismatch, (b) elevation mismatch.
Figure 5 presents the localization mismatch between the perceptual virtual source direction (obtained from the ILD measurement) and the desired sound direction for each virtual source at 2500 and 5000 Hz, respectively. The measured values are smoothed for visualization. Similar to Fig. 4, the azimuth and elevation mismatch values are relatively small for virtual sources at most locations on the panel. ILD values are frequency dependent. Mismatch values at the lower frequency of 2500 Hz exhibit larger deviations across different virtual source heights. At 2500 Hz, the lower region near the center of the panel ( ) exhibits larger azimuth localization mismatch values, whereas the elevation mismatch is more consistent across locations and falls within a range of 0.3. At 5000 Hz, larger mismatch values occur when the virtual source is positioned in the lower right area of the panel ( ).
(Color online) Localization mismatch between the perceptual virtual source direction (obtained from the ILD measurement) and the desired sound direction for each virtual source in Fig. 3. (a) Azimuth mismatch at 2500 Hz, (b) elevation mismatch at 2500 Hz, (c) azimuth mismatch at 5000 Hz, (d) elevation mismatch at 5000 Hz.
When the virtual sound source is at ear height, the localization trend of the flat-panel loudspeaker is more consistent with the behavior of traditional coaxial loudspeakers. However, when the source is below ear level, the ILD localization mismatch values become more pronounced.
IV. SUBJECTIVE EVALUATION OF VIRTUAL SOURCE DIRECTION
Due to variations in the principles and reproduction methods among different spatial sound techniques, no established standard exists yet for assessing spatial sound, including evaluation criteria, methodologies, experimental conditions, and data processing. The International Telecommunication Union (ITU) has set standards for the subjective assessment of spatial sound, which can be referenced in specific evaluations.42,43
A. Setup
Twenty-four normal-hearing listeners, 22–49 years of age, took part in the subjective tests. Among them, 13 were male and 11 were female. The subjects sat naturally on a chair in the anechoic chamber with the flat panel located 0.7 m in front of them. During the listening test, the subjects were instructed to orient their head toward the direction of the unknown perceptual virtual source.43 They were also asked to maintain a stable body position, rotating only their head slightly, and to keep their eyes and ears at approximately the same height as the center of the panel.
A pre-study with three groups of training was conducted. First, virtual sound sources at different azimuth angles and the same height were played in sequence from left to right, namely "S51," "S55," and "S59" in Fig. 3. Then, virtual sound sources at different elevation angles in the median plane were played sequentially from top to bottom, namely "S15," "S55," and "S95." Finally, the virtual sound sources at the four corners were played, namely "S11," "S19," "S99," and "S91." During the pre-study, the subjects were instructed to identify the perceived direction of each virtual source before being told the correct direction.

The subjects then proceeded to the formal testing phase after a 5 min rest interval. Each equalized VBAP pink noise sample lasted 5 s and was presented in random order to simulate virtual sound sources at various locations. A 3 s pause was introduced between consecutive samples to allow the subject sufficient time to respond. In a localization experiment, the choice of reporting method is of utmost importance: it should demonstrate an accuracy at least as high as human localization accuracy, which is approximately 1° for frontal sound incidence. Therefore, we opted for absolute evaluation over auditory comparison and discrimination experiments. Once the direction of the virtual source was determined, the subjects were instructed to point a laser pointer in that direction. After confirming the direction, the subjects pressed the button of a Bluetooth remote (Xiaomi Co., Beijing, People's Republic of China) connected to a phone, and a mobile phone positioned behind the subject took a picture to record the location of the laser mark. The laser marks were to be confined within the black frame indicating the target region on the panel. The controlling computer was used to verify the results and avoid any omissions. If there was any error in operation, the subjects were allowed to revise their answers before the end of the audio playback. After a 3 s rest, the subsequent trial began. If the virtual source direction was indeterminate, the subject was asked to point the laser marker at the most likely position. The listening test lasted 14 min in total for each subject to avoid fatigue.
The mobile phone archived the laser marks indicating the chosen azimuth and elevation angles. Because the phone may have shifted slightly due to vibration of the steel-net floor as subjects walked in the anechoic chamber, it was carefully repositioned at the beginning of each subject's experiment.
B. Result
Figures 6 and 7 depict the mean and standard deviation of the subjective localization mismatch. The subjective results exhibit a more nuanced distribution than the objective ones, with pronounced mismatch values over larger regions. For virtual sources positioned toward the two sides of the panel, the perceptual azimuth angles behave like those of traditional coaxial loudspeakers, with lower accuracy toward the sides. However, flat-panel loudspeakers show less accuracy for central positions than their coaxial counterparts; the azimuth accuracy at the panel center at ear level is only moderate. Localization of upper sources is slightly more accurate than that of lower sources, suggesting a concentrated but imprecise lower localization area. Horizontal localization on the left is less accurate than on the right, a trait that may be attributed to the right-handedness of all subjects when using the laser pointer.
(Color online) Localization mismatch between the perceptual virtual source direction (averaged over the 24 subjects in the listening test) and the desired sound direction for each virtual source in Fig. 3. (a) Azimuth mismatch, (b) elevation mismatch.
(Color online) Standard deviation of the localization mismatch in (a) azimuth and (b) elevation among 24 subjects in the listening test for each virtual source in Fig. 3.
Previous research on VBAP localization with coaxial loudspeakers indicates that elevation accuracy is highest only when the virtual sound source is aligned with the loudspeakers' height and lacks precision in other scenarios. However, this localization behavior differs for flat-panel loudspeakers. Notably, right-side horizontal localization performs worse than the left, while bottom-side vertical localization is less accurate than the top.
For the entire panel, the mean azimuth mismatch values are within , generally smaller than the elevation mismatch within , implying more accurate horizontal than vertical localization. This aligns with the established notion that human hearing has better horizontal than vertical resolution. Figure 6 reveals that when the virtual source lies precisely on the edge of the target region, the azimuth and elevation mismatch values are large, yet precision improves just inside this edge. For azimuth, when the virtual sound source is close to but not on the frame edge ( , like the cross-lines labeled 2 and 8 in Fig. 3), the mismatch values are relatively small, within . Conversely, when the source is at the left and right borders , the mean mismatch values are high, within . A similar pattern occurs for the elevation angles. We term this phenomenon the "edge-deterioration effect," suggesting that subjects subconsciously shift the laser mark inward to avoid exceeding the control area. Its substantial presence might arise from the boundary conditions and the constraints of the actuator distribution.
The mismatch observed in our experiment is related to various factors, such as the limited directional resolution of human hearing, the limitations of VBAP or pairwise amplitude panning itself, and the use of flat-panel loudspeakers as reproduction sources. First, human hearing of a real sound source has a mean azimuth mismatch ranging from – , and a mean elevation mismatch in the median plane ranging from  for white noise to  for speech. Second, multichannel sound with conventional loudspeakers using the VBAP algorithm shows a similar localization mismatch pattern, in which the azimuth mismatch is much smaller than the elevation mismatch. Specifically, the azimuth mismatches of the flat-panel loudspeaker are comparable to those of conventional loudspeakers using VBAP34 for the desired virtual sources at ( ), ( ), and ( ), where the difference in the median value, the interquartile range, or the data range is within , except that the data range of the flat-panel loudspeaker at ( ) is larger. For the elevation mismatches, the difference in the median value is within , while the interquartile range and the data range at ( ) and at ( ) are  to  larger, probably due to disparities in the array configurations, as the triplets are placed differently for the flat-panel loudspeaker in our experiment and for the conventional loudspeakers.34
Thus, the perceived location mismatch can be largely caused by the limitation of VBAP itself, the limited directional resolution of human hearing, and the edge effect of the flat panel when the desired virtual source is geometrically close to the panel edge.
Standard deviations of the azimuth and elevation mismatch values are illustrated in Fig. 7. The azimuth deviations are fairly uniformly distributed, with small values near the real-source actuators at the vertices and slightly larger values elsewhere. In the upper part of the central panel area, the deviations are smaller, probably because slight head rotation improves localization accuracy there. The elevation deviations are substantial around the central height, in contrast with the smaller deviations at the top and bottom, indicating that the smallest localization deviation does not align with the height of the human ear.
We also analyzed individual differences using the per-subject standard deviation, as shown in Fig. 8. Subjects 5, 8, 15, 21, and 22 show large horizontal localization deviations, while subjects 1, 7, 11, 15, and 19 exhibit notable vertical deviations; subject 15's overall deviation is substantial. Subject 8's azimuth mismatch deviation is large, while the elevation mismatch shows a smaller deviation, in line with established disparities in vertical localization accuracy. A Spearman's ρ test between the perceptual and ideal localization angles reveals correlations in the range of 88%–98% (p < 0.05), indicating a robust correlation between the ideal and tested values.
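The reported rank correlation can be reproduced with a single SciPy call, as in the following sketch; the arrays are placeholders standing in for the per-trial target and perceived azimuth angles of one subject.

```python
from scipy.stats import spearmanr

# Placeholder data: target azimuths and the angles perceived by one subject (degrees).
target_azimuth = [-15.0, -7.5, 0.0, 7.5, 15.0]
perceived_azimuth = [-13.2, -8.1, 1.0, 6.4, 16.3]

rho, p_value = spearmanr(target_azimuth, perceived_azimuth)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```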
(Color online) Analysis of individual differences through the standard deviation of the subjects. The abscissa represents the subject number, ranging from 1 to 24. The dark-filled bars represent the standard deviation of the azimuth mismatch, while the light-filled bars represent the standard deviation of the elevation mismatch.
The test results were categorized in two ways, by the horizontal and by the vertical position of the virtual sound source. The impact of these two factors on the perceptual azimuth and elevation angles was analyzed, resulting in four datasets. Normality was assessed using the Lilliefors test, and equality of variances was checked using Bartlett's test. Since the null hypotheses were rejected at the 5% significance level, the Kruskal–Wallis test was chosen over analysis of variance (ANOVA).
Furthermore, the Kruskal–Wallis test was used to determine whether there were statistically significant differences between the azimuth/elevation of the virtual source and the mean azimuth/elevation mismatch of the perceived localizations. The results show that the mean perceptual azimuth mismatch differs significantly across the azimuth angles of the virtual sources (p < 0.001). Similarly, the mean perceptual elevation mismatch differs significantly across the elevation angles of the virtual sources (p < 0.001). The mean perceptual azimuth mismatch differs less markedly than the elevation mismatch across the elevation angles of the virtual sources, but still with p < 0.001. In contrast, the dependence of the mean perceptual elevation mismatch on the azimuth angle of the virtual sources is the weakest, with p = 0.03. These results highlight the substantial influence of virtual source placement on both perceptual azimuth and elevation localization accuracy (p < 0.05).
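A minimal sketch of this statistical pipeline, assuming the mismatch data are arranged as one array per virtual-source angle group with placeholder values, is given below; the Lilliefors test comes from statsmodels and the remaining tests from SciPy.

```python
import numpy as np
from scipy.stats import bartlett, kruskal
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(1)
# Placeholder data: azimuth-mismatch samples grouped by virtual-source azimuth angle.
groups = [rng.normal(loc=m, scale=2.0, size=24) for m in (-2.0, 0.5, 1.5)]

# Normality (Lilliefors) and equality of variances (Bartlett), as in the text.
for g in groups:
    _, p_norm = lilliefors(g, dist="norm")
    print(f"Lilliefors p = {p_norm:.3f}")
print(f"Bartlett p = {bartlett(*groups)[1]:.3f}")

# Non-parametric comparison across groups (Kruskal-Wallis).
h_stat, p_kw = kruskal(*groups)
print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_kw:.3f}")
```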
C. Discussion
Though the experiment was conducted with a particular panel with corner-positioned actuators, the findings may have broader implications. In practice, flat-panel loudspeakers can be used alone or in multi-panel setups for immersive sound with extended spatial coverage. Under VBAP, each panel can operate independently, so we assessed a corner-actuated single-panel scenario as a representative module. Since VBAP does not require exact knowledge of the sound source or its propagation, and considering auditory perception and masking effects, the results may hold for other flat-panel loudspeakers with comparable geometry.
This experiment assesses the performance of VBAP on the flat-panel loudspeaker, and the results can serve as a baseline for perceptual evaluation in current and future research on flat-panel loudspeakers, since VBAP is a simple but effective spatial sound reproduction approach that does not require acoustic transfer function information. This study also deepens the understanding, from a perceptual point of view, of the transition from conventional loudspeakers to flat panels for sound reproduction. These findings extend the existing theory and practical value of flat-panel loudspeakers, especially for auditory display in buildings.
V. CONCLUSIONS
This paper explored the localization mismatch between the desired and perceived sound directions using amplitude panning with flat-panel loudspeakers. The study involved creating virtual sound sources at various locations and evaluating the perceived source direction through both objective and subjective tests. The subjective tests resulted in a mean azimuth direction mismatch within ±6.0° and a mean elevation mismatch within ±10.0°. Additionally, the objective tests using the head and torso simulator and auditory localization cues indicated a good match. These findings suggest that the virtual source created by amplitude panning over a flat-panel loudspeaker can achieve spatial localization accuracy comparable to that of a real sound source without the need for complex algorithms or acoustic transfer function information. Future research will focus on optimizing algorithms for virtual source localization in spatial sound reproduction using flat-panel loudspeakers, along with perceptual evaluation.
ACKNOWLEDGMENTS
This work was supported by the Beijing Natural Science Foundation (Grant No. L223032). The authors would like to express gratitude to all the participants who took part in the listening test, Professor Feiran Yang for insightful discussions on binaural modeling, Dr. Yuzhen Yang for engaging discussions on acoustic modeling, and Cong Wang, Yong Chen, Kai Wang, and Xuyang Zhu for their assistance during the experiment.