You would be less tired at the end of the day.
As the video switches back and forth between the original Zoom audio and our Spatial Audio, do you notice how much easier it is to tell who is speaking when Spatial Audio is on? It should also be easier to make out the words when two or more people talk at the same time. Over the course of a day, that makes a real difference: less fatigue and more enjoyable conversations.
How We Did It
We set up a Zoom call and a meeting in one of High Fidelity's sample applications (a virtual audio space) simultaneously, so all six of us (on a mix of PCs and MacBooks) were connected to both services at once. Two people then used a screen recorder to capture the whole session, one in Zoom and one in High Fidelity. In the High Fidelity session, the ‘middle’ person was the one whose audio you actually hear. Then, in the final video, we switched back and forth between the two audio recordings. Below, you can see and hear the original recordings of the Zoom call and the High Fidelity gathering.
The Science Behind Spatial Audio
You might be wondering how we convince your brain that a sound is in front of or behind you, or off to one side. Your first guess might be that if something is on the left, we simply make it louder in the left channel than in the right, like sliding a balance control. Strangely enough, that guess is wrong, except in the special case called ‘near field’, where we also want something to sound extremely close to your ear (like those ASMR recordings).
Instead, moving sounds left and right, or in front of and behind you, actually uses two different tricks. Collectively, these techniques are referred to by the acronym ‘HRTF’, which stands for ‘Head Related Transfer Function’. The long name really captures two fairly simple things.

The first is just the time delay between your ears: a sound coming from your right reaches your right ear a little before it reaches your left ear, so we delay the arrival at your ‘far’ ear to create the effect.

The second effect is a little more complicated, and is why our ears have such a funny shape (called the ‘pinna’). Basically, the different frequencies of a sound are boosted or attenuated differently for every direction the sound could be coming from. Your brain is trained to recognize those changes in any sound and tell you which direction it is coming from. For example, the higher frequencies of a sound coming from behind you are harder to hear because they are muted by the skin of your ear that they have to pass through. You can experience this yourself: find a very quiet room and rub your thumb against your fingers while moving your hand around your head. You will hear the balance of frequencies change with direction and angle. Want to learn more, including why we tilt our heads to spatialize better? The Wikipedia article on Sound Localization is a good place to start.
So to spatialize audio, we take the original sound and do these two things — shift the time delay between the two channels and adjust the loudness of the frequencies — according to where that sound is supposed to be relative to your head. Although you don’t hear it much in this video, we also adjust the frequencies and loudness of sounds depending on how far away they are from you, which further helps with conversations in a larger group.
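To make those two tricks concrete, here is a minimal sketch in Python. The head width, sample rate, and the one-pole low-pass standing in for pinna filtering are illustrative assumptions, not High Fidelity's actual DSP, which is far more detailed:

```python
import numpy as np

SAMPLE_RATE = 48_000     # Hz
HEAD_WIDTH = 0.18        # metres between the ears (rough assumption)
SPEED_OF_SOUND = 343.0   # m/s

def one_pole_lowpass(x, alpha=0.4):
    """Crude high-frequency roll-off, standing in for real pinna filtering."""
    y = np.empty_like(x)
    acc = 0.0
    for i, sample in enumerate(x):
        acc = alpha * sample + (1.0 - alpha) * acc
        y[i] = acc
    return y

def spatialize(mono, azimuth_deg):
    """Pan a mono signal with an interaural time delay (trick 1) and
    muffle the highs for sources behind the head (trick 2).
    azimuth_deg: 0 = straight ahead, 90 = hard right, 180 = behind."""
    az = np.radians(azimuth_deg)
    # Trick 1: delay the far ear by the extra travel time across the head.
    itd = (HEAD_WIDTH / SPEED_OF_SOUND) * np.sin(az)    # seconds
    delay = int(round(abs(itd) * SAMPLE_RATE))          # samples
    near = mono.astype(float)
    far = np.concatenate([np.zeros(delay), near])[: len(near)]
    # Trick 2: sounds from behind lose high frequencies at both ears.
    if abs(((azimuth_deg + 180) % 360) - 180) > 90:
        near, far = one_pole_lowpass(near), one_pole_lowpass(far)
    left, right = (far, near) if itd > 0 else (near, far)
    return np.stack([left, right], axis=1)
```

With the numbers above, a source at 90 degrees (hard right) delays the left channel by about 25 samples, roughly half a millisecond, which is all the brain needs for left/right localization.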
Try It For Yourself
You can do the same thing we did by using this link to start your own online audio space for your meeting and then moving around just like we did. Our beta is free for up to 20 people, with no time limits.
Why Haven’t Zoom or Teams Already Added Spatial Audio As A Feature?
Because it’s quite difficult. We’ve been working on it for six years with a highly skilled audio engineering and networking team.
When you hear spatial sounds in video games, the spatialization (as described above) is done on your computer or phone, right before the sounds reach your headphones. But if you have many sounds coming from different directions (say, two people talking to you from the right and the left at the same time), this approach breaks down. If there are too many sounds to process at once, your phone or computer runs out of CPU and the audio starts breaking up, because spatialization is compute-intensive. As you can imagine, a chip in your AirPods or iPhone isn't going to spatialize 20 people talking at once! And even if it could, it still wouldn't work, because you'd need enough bandwidth to send everyone's original audio down to your phone, headphones, or laptop for local spatialization. Finally, spatialization has to be combined with another critical requirement for communication audio: low latency. If spatialization adds latency (and the client-side approach often does), the conversation starts breaking down as people talk over each other, like on a cell phone call or a laggy video connection.
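The bandwidth half of that argument is easy to check with back-of-the-envelope numbers. The 64 kbps per-voice figure below is a typical bitrate for an Opus-style voice codec, chosen purely for illustration:

```python
# Downstream bandwidth per listener: client-side vs. cloud-side spatialization.
VOICE_KBPS = 64          # one compressed mono voice stream (illustrative assumption)
participants = 20

# Client-side: every other talker's stream must reach your device separately.
client_side_kbps = (participants - 1) * VOICE_KBPS

# Cloud-side: one pre-mixed stereo stream, no matter how many people talk.
cloud_side_kbps = 2 * VOICE_KBPS

print(client_side_kbps, cloud_side_kbps)   # 1216 128
```

The client-side number also keeps growing with the group size, while the pre-mixed stereo stream stays constant.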
To address these problems, we've moved the spatialization to the cloud. That way, each person receives just one stereo stream of audio with the spatialized sounds of everyone talking already mixed into it. We've built the spatialization into the compression and mixing pipeline, and it operates on short audio frames, so no latency is added.
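A server-side mixer of that kind can be sketched like this. The equal-power pan is a hypothetical stand-in for full HRTF filtering, and the field names and frame size are invented for illustration, not High Fidelity's actual API:

```python
import numpy as np

FRAME = 480  # samples per frame: 10 ms at 48 kHz, short enough to keep latency low

def bearing(listener, src):
    """Angle of src relative to listener, in degrees: 0 = ahead, +90 = right."""
    dx = src["pos"][0] - listener["pos"][0]
    dy = src["pos"][1] - listener["pos"][1]
    return float(np.degrees(np.arctan2(dx, dy)))

def pan_gains(azimuth_deg):
    """Equal-power pan; a real system would apply full HRTF filtering here."""
    az = np.clip(azimuth_deg, -90.0, 90.0)
    theta = (az + 90.0) / 180.0 * (np.pi / 2)   # map [-90, 90] to [0, pi/2]
    return np.cos(theta), np.sin(theta)         # (left gain, right gain)

def mix_for_listener(listener, sources):
    """One stereo frame for `listener`: everyone else's mono frame,
    spatialized and summed on the server."""
    out = np.zeros((FRAME, 2))
    for src in sources:
        if src["id"] == listener["id"]:
            continue                            # never echo your own voice back
        gl, gr = pan_gains(bearing(listener, src))
        out[:, 0] += gl * src["frame"]
        out[:, 1] += gr * src["frame"]
    return out
```

Each listener then downloads just that one stereo frame stream, regardless of how many sources were mixed into it, which is what makes large crowds possible.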
Spatializing the audio in the cloud means we can support hundreds of people at the same time, as you can see (and hear) in the video below of a crowd environment.
Hopefully Coming Soon
In summary, the good news is that, other than the usual integration work, it is straightforward to build this cloud-based spatialization into existing videoconferencing solutions. Hopefully you will see it coming soon to your favorite video platform, and can look forward to understanding people more easily and feeling less fatigued after a day full of meetings.
Interested in Spatial Audio API? Learn More.