This is my master's thesis. In this study, I investigate a personalization method that estimates a listener's Head-Related Transfer Function (HRTF) using two machine learning models. A localization test with a VR headset was conducted to evaluate the perceptual accuracy of the personalized HRTFs.
Virtual reality (VR) requires rendering accurate head-related transfer functions (HRTFs) to ensure a realistic and immersive virtual auditory space. An HRTF characterizes how each ear receives sound from a certain location in space based on the shape of the head, torso, and pinnae, and provides a unique head-related impulse response (HRIR) for each given source location.
Since HRTFs are person-specific and difficult to measure, machine learning has become a popular way to estimate personalized HRTFs.
On this page, I explain the methodology I used to apply machine learning to generate personalized HRTFs, as well as the ABX and localization tests used to validate its feasibility.
Spatial audio creates a 3D listening experience over headphones, and one important aspect of it is being able to hear where a sound is coming from.
It turns out humans rely on two primary cues during sound localization: interaural differences and spectral cues.
Interaural differences are the time and intensity differences between our right and left ears when we hear a sound. The time difference is called the interaural time difference (ITD), and the intensity difference is called the interaural intensity difference (IID). Naturally, both are heavily related to one's head size.
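To make the head-size dependence concrete, here is a minimal Python sketch of the classic Woodworth spherical-head approximation of the ITD. This is a textbook model, not the method used in this study, and the default head radius is just a nominal average.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature

def woodworth_itd(azimuth_rad, head_radius=0.0875):
    """Approximate ITD (in seconds) for a source at a given azimuth,
    using Woodworth's spherical-head model: ITD = (a / c) * (theta + sin(theta))."""
    theta = np.abs(azimuth_rad)
    return (head_radius / SPEED_OF_SOUND) * (theta + np.sin(theta))

# A source at 90 degrees azimuth gives roughly 0.66 ms of delay.
print(woodworth_itd(np.pi / 2))
```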
Besides the ITD and IID, spectral cues also play a big role in sound localization. As sound comes from different places and travels into our ear canals, our unique pinna shape produces different reflections of the sound. The sum of all these reflections creates different spectral cues at each ear.
A pair of head-related impulse responses (HRIRs) incorporates all these aspects of sound localization. It is a pair of impulse responses for a particular sound source at a particular location. A head-related transfer function (HRTF) is the HRIR represented in the frequency domain.
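Since the HRTF is just the frequency-domain view of the HRIR, converting between the two is a single Fourier transform. A minimal sketch (the function name and FFT size are my own placeholders):

```python
import numpy as np

def hrir_to_hrtf(hrir, n_fft=512):
    """Return the HRTF magnitude in dB for a single-ear HRIR.
    The HRTF is simply the Fourier transform of the HRIR."""
    spectrum = np.fft.rfft(hrir, n=n_fft)
    return 20 * np.log10(np.abs(spectrum) + 1e-12)  # epsilon avoids log(0)
```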
For the reasons mentioned above, HRTFs are tied to one's body shape, so each of us has a unique set of HRTFs. We could measure HRTFs by placing microphones in a person's ears and recording the impulse responses from every source location. Naturally, this is expensive and time-consuming.
This is what motivated us to use machine learning to estimate personalized HRTFs.
We use the CIPIC HRTF Database as our training data. It contains HRIRs for 45 subjects, each measured at 1,250 source positions, along with anthropometric measurements of each subject's head, torso, and pinnae.
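For reference, here is a sketch of loading one subject from the MATLAB release of CIPIC, which stores the left- and right-ear HRIRs as 25 azimuths × 50 elevations × 200 samples (the file path is an example):

```python
from scipy.io import loadmat

def load_cipic_subject(path):
    """Load one CIPIC subject file, e.g. 'subject_044/hrir_final.mat'.
    Returns left and right HRIR arrays of shape (25, 50, 200)."""
    mat = loadmat(path)
    return mat["hrir_l"], mat["hrir_r"]
```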
Previous research by Chun et al. (2017) has shown some promising results. While trying to replicate their findings with a fully connected neural network, we achieved a low MSE (-26.8 dB), but the estimated HRIR did not resemble the actual one in shape, so we reexamined our HRIRs.
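For clarity, an MSE figure like -26.8 dB would typically be computed as 10·log10 of the raw mean squared error; the sketch below shows that convention, which I am assuming here:

```python
import numpy as np

def mse_db(pred, target):
    """Mean squared error expressed in decibels; more negative is better."""
    return 10 * np.log10(np.mean((pred - target) ** 2))
```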
If we take a closer look at a pair of HRIRs, we can see that each actually consists of two parts that correspond to what we discussed before: interaural differences and spectral cues. So we can align all the HRIRs with respect to their peaks, isolating these two factors and estimating them separately.
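One simple way to do this alignment is to shift each HRIR so its main peak lands at a fixed index; here is a sketch of that idea (the 10-sample lead-in margin is my assumption, not the thesis's exact choice):

```python
import numpy as np

def align_hrirs(hrirs, lead_samples=10):
    """Shift each HRIR so its main peak sits at a fixed index.
    Returns the aligned shapes plus the original peak delays,
    which carry the interaural timing information."""
    aligned = np.empty_like(hrirs)
    delays = np.empty(len(hrirs), dtype=int)
    for i, h in enumerate(hrirs):
        peak = int(np.argmax(np.abs(h)))
        delays[i] = peak
        aligned[i] = np.roll(h, lead_samples - peak)
    return aligned, delays
```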
Now, looking at just the shape of the aligned HRIRs, we find something interesting. If we take subject 44's HRIRs at the same elevation but mirrored azimuth angles, the two look very similar. Why is that? We conclude that it is because the shape of the HRIR depends heavily on one's pinna shape, which shouldn't differ much between the same person's left and right ears. So what we can do is mirror our dataset by flipping the right-ear data to the left side.
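As a sketch of this augmentation (assuming the CIPIC array layout above, where reversing the azimuth axis maps an azimuth of +az to -az):

```python
import numpy as np

def mirror_right_to_left(hrir_l, hrir_r):
    """Double the left-ear training data: a right-ear HRIR at azimuth +az
    is treated as an extra left-ear example at azimuth -az."""
    return np.concatenate([hrir_l, hrir_r[::-1, :, :]], axis=0)
```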
Now that we have isolated the HRIR into two core components and mirrored our dataset, we can go ahead and train our models. For this study, we use a fully connected ANN to estimate the shape of the HRIR, and we use a gradient-boosted tree to estimate the ITD.
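In scikit-learn terms, the two-model setup could look like the sketch below. The hyperparameters and feature set (anthropometric measurements plus source position) are placeholders, not the exact configuration used in the study:

```python
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Fully connected net: anthropometry + source angles -> aligned 200-sample HRIR shape.
shape_model = MLPRegressor(hidden_layer_sizes=(512, 512), max_iter=2000)

# Gradient-boosted trees: same inputs -> a single ITD value (in samples).
itd_model = GradientBoostingRegressor(n_estimators=300)

# shape_model.fit(X_train, hrir_shapes)
# itd_model.fit(X_train, itds)
```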
The neural network showed a significant improvement in the estimated HRIR shape.
The ITD estimation also showed only small errors, mostly falling below the just-noticeable difference.
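A quick way to check this is to convert the ITD error from samples to microseconds and compare it against a JND threshold; the 20 µs figure below is a commonly cited ballpark, not a value from this study:

```python
import numpy as np

FS = 44100          # CIPIC sampling rate, Hz
ITD_JND_US = 20.0   # assumed just-noticeable difference, in microseconds

def itd_error_us(pred_samples, true_samples):
    """ITD error converted from samples to microseconds."""
    return np.abs(pred_samples - true_samples) / FS * 1e6

# One sample of error at 44.1 kHz is about 22.7 microseconds.
print(itd_error_us(100.5, 100.0) < ITD_JND_US)
```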
And here is the complete HRIR reconstructed from our inference.
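Reconstruction simply undoes the alignment step: shift the estimated shape so its peak sits at the estimated onset delay. A sketch, matching the assumed 10-sample lead-in from the alignment above:

```python
import numpy as np

def reconstruct_hrir(shape, delay_samples, lead_samples=10):
    """Combine the estimated HRIR shape with the estimated delay
    by undoing the peak alignment."""
    return np.roll(shape, int(round(delay_samples)) - lead_samples)
```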
I presented this study at the Acoustical Society of America (ASA) 177th Meeting; you can access the recorded presentation here.
The abstract of this study is published in The Journal of the Acoustical Society of America 145, 1883 (2019); https://doi.org/10.1121/1.5101823
P.S. Since this is still an ongoing study, I will post the code and remaining materials after I finish the project.