HRTF For the Masses: Current Approaches and Challenges / by Xiao Quan

            The perception of spatial sound is an inherent part of our natural listening experience. However, spatial sound reproduction remains a challenge for commercial implementation in 2020. Binaural sound through headphones appears to be the most cost-effective way, hardware-wise, to bring the experience of virtual spatial audio to the masses at an individual level, yet obstacles remain. In this essay, I will briefly describe the factors that influence how we perceive spatial sound. I will then explain what HRTFs are and how they can be measured, modeled, or selected; and lastly, I will discuss the current approaches to, and challenges of, finding the right fit of HRTFs for the average consumer.

            Three main factors influence how we locate a sound source: Interaural Time Difference (ITD), Interaural Intensity Difference (IID), and spectral shaping by the pinnae (outer ear) structures (Wenzel et al., 2017). While ITD and IID are primarily responsible for helping us locate sound on the horizontal plane, our pinnae structures help us locate sound on the vertical plane. The ITD and IID are determined by the size of our head and upper shoulders: the width and density of our head and shoulders cause subtle time and intensity differences for sounds coming from the side compared to sounds coming from the front. These subtle interaural differences in intensity and time enable us to determine the spatial characteristics of a sound source on the horizontal plane (Middlebrooks & Green, 1991). Recent research has shown that for broadband sounds, ITDs are responsible for determining the location of frequencies up to 4000 Hz (Bernstein, 2001), whereas for temporal fine structure in sound, ITDs are useful when identifying the location of frequencies up to 1400 Hz (Brughera, Dunai, & Hartmann, 2013). On the vertical dimension, the material and shape of our outer ear (pinnae) act as a spectral coloration filter on all incoming sounds. This coloration effect of the pinnae is highly direction-dependent, especially on the vertical dimension. Therefore, our brain can use this spectral variation to determine the vertical location of a sound source (Searle et al., 1975; Wenzel et al., 2017).
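The ITD cue described above can be approximated with the classical Woodworth spherical-head model, which predicts the interaural delay from just the head radius and the source azimuth. The sketch below is illustrative only; the 8.75 cm default head radius and 343 m/s speed of sound are conventional textbook values, not figures from the essay's sources.

```python
import math

def woodworth_itd(azimuth_deg: float, head_radius_m: float = 0.0875,
                  speed_of_sound: float = 343.0) -> float:
    """Approximate far-field ITD (seconds) with the Woodworth
    spherical-head model: ITD = (r / c) * (theta + sin(theta))."""
    theta = math.radians(azimuth_deg)
    return (head_radius_m / speed_of_sound) * (theta + math.sin(theta))

# A source directly ahead (0 degrees) produces no delay; a source at
# 90 degrees (directly to one side) produces the maximal ITD, roughly
# 0.66 ms for an average-sized head.
for az in (0, 30, 60, 90):
    print(f"azimuth {az:2d} deg -> ITD {woodworth_itd(az) * 1e6:6.1f} us")
```

Even this simple geometric model makes the frequency limits above intuitive: a ~0.66 ms maximal delay is a large fraction of a cycle only at lower frequencies, which is where ITD cues for fine structure are usable.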

            All of the factors above that influence how we perceive spatial sound can be measured and transferred into mathematical functions, with azimuth (θ), elevation (φ), distance (d), and angular frequency (ω) as variables, known as ‘Head-Related Transfer Functions’ (HRTFs). These transfer functions can then be selected and applied to an audio signal to mimic how it is perceived as it reaches our ears, imparting spatial characteristics to it (Roginska, 2017). The acoustic measurement process for HRTFs is expensive and laborious. To do it properly, one must first create an anechoic environment. Then, a speaker array must be set up along various points on the vertical plane, equidistant from the subject. For the horizontal plane, either the test subject or the speaker array can be rotated to obtain the positional data of the test signal. Next, the subject (which can be either a human being or a test mannequin head) needs to have binaural microphones inserted in their ears. The subject must remain stationary while test signals are played from various points along the speaker array in the virtual sphere that surrounds the subject’s head. These signals are then picked up by the binaural microphones. The differences in spectral information between the recorded signal and the original signal are coded in alignment with the changes in the aforementioned variables, such as azimuth and elevation, to formulate the HRTFs of the subject. The whole process takes at least an hour (Roginska, 2017).

            In an ideal world, one set of laboriously measured HRTFs could be generalized and applied to audio signals so that everyone could enjoy spatial sound in 3D audio applications. However, studies have shown that individual variations in pinnae structure contribute significantly to where the notches of spectral coloration are located (Middlebrooks & Green, 1992). Thus, a particular set of HRTFs measured for one person can yield vastly different sound localization performance for other listeners. Yet it is unrealistic to create customized HRTFs for every individual consumer. Therefore, methods are needed to strike the optimal balance between the accuracy of HRTFs and the ease of acquiring them, for binaural sound design to be commercially viable.
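One common way to quantify how poorly one listener's HRTF "fits" another is log-spectral distortion: the RMS difference, in dB, between two magnitude responses. The toy responses below, with a single pinna-like notch shifted in frequency between two hypothetical subjects, are fabricated for illustration only.

```python
import numpy as np

def log_spectral_distortion(mag_a: np.ndarray, mag_b: np.ndarray) -> float:
    """RMS difference in dB between two magnitude responses — a simple
    objective score of mismatch between two HRTF measurements."""
    db_a = 20 * np.log10(np.abs(mag_a) + 1e-12)
    db_b = 20 * np.log10(np.abs(mag_b) + 1e-12)
    return float(np.sqrt(np.mean((db_a - db_b) ** 2)))

# Toy magnitude responses: identical except for a spectral notch that
# sits at a different (normalized) frequency for each "subject",
# mimicking the inter-subject pinna variation described above.
freqs = np.linspace(0, 1, 512)
subject_a = 1 - 0.9 * np.exp(-((freqs - 0.5) ** 2) / 0.001)
subject_b = 1 - 0.9 * np.exp(-((freqs - 0.6) ** 2) / 0.001)

print(log_spectral_distortion(subject_a, subject_a))  # 0.0: same ears
print(log_spectral_distortion(subject_a, subject_b))  # large: shifted notch
```

A metric like this is one way database-selection schemes can rank candidate HRTF sets for a new listener, though perceptual listening tests remain the ground truth.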

            Besides individual acoustic measurement, there are three main approaches aimed at overcoming this dilemma: 1) reconstructing HRTFs from 3D model scans of the test subject (Katz, 2001); 2) user-selected HRTFs with customized IID and ITD characteristics, based on simple measurements of head width and torso size (Algazi et al., 2001); and 3) user-selected HRTFs from a database (Roginska et al., 2010). In conclusion, finding the optimal balance between accuracy and cost in HRTF selection and synthesis remains an ongoing area of binaural research. Various fields such as gaming and virtual reality have recently rolled out hardware support for processing 3D audio. The results of upcoming studies on this topic will directly affect how we experience reproduced sound in the future.


Citations

Algazi, V. R., Duda, R. O., Morrison, R. P., & Thompson, D. M. (2001, October). Structural composition and decomposition of HRTFs. In Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No. 01TH8575) (pp. 103-106). IEEE.

 

Bernstein, L. R. (2001). Auditory processing of interaural timing information: new insights. Journal of Neuroscience Research, 66(6), 1035-1046.

 

Brughera, A., Dunai, L., & Hartmann, W. M. (2013). Human interaural time difference thresholds for sine tones: The high-frequency limit. The Journal of the Acoustical Society of America, 133(5), 2839-2855.

 

Katz, B. F. (2001). Boundary element method calculation of individual head-related transfer function. I. Rigid model calculation. The Journal of the Acoustical Society of America, 110(5), 2440-2448.

 

Middlebrooks, J. C., & Green, D. M. (1991). Sound localization by human listeners. Annual Review of Psychology, 42(1), 135-159.

 

Middlebrooks, J. C., & Green, D. M. (1992). Observations on a principal components analysis of head-related transfer functions. The Journal of the Acoustical Society of America, 92(1), 597-599.

 

Roginska, A., Santoro, T. S., & Wakefield, G. H. (2010, November). Stimulus-dependent HRTF preference. In Audio Engineering Society Convention 129. Audio Engineering Society.

 

Roginska, A. (2017). Binaural audio through headphones. In Immersive Sound (pp. 88-123). Routledge.

 

Searle, C. L., Braida, L. D., Cuddy, D. R., & Davis, M. F. (1975). Binaural pinna disparity: another auditory localization cue. The Journal of the Acoustical Society of America, 57(2), 448-455.

 

Wenzel, E. M., Begault, D. R., & Godfroy-Cooper, M. (2017). Perception of spatial sound. In Immersive sound (pp. 5-39). Routledge.