Are you one of the 10 million weekly active users on Clubhouse who have listened in this past week? If so, you’ll notice a key difference — being able to more clearly understand the speakers on stage as they have a natural back-and-forth conversation. You feel as though you’re actually *in the room* with them, sitting at a dinner table all together.
How was this achieved? Clubhouse integrated High Fidelity's spatial audio to make conversations feel more natural and immersive.
And Clubhouse isn’t the only one: Skittish, a new playful space for online events, also uses spatial audio. “Many of the design decisions — no camera, spatial audio, animal avatars, the visual look of the space — were designed to get people comfortable interacting with others. From the moment you join, it’s clear this isn’t a meeting or conference call,” says Andy Baio, creator of Skittish.
Remote workers have certainly felt the need for this sort of more natural conversation this past year and a half, too. Tanya Basu writes for MIT Technology Review, “Indeed, many of the innovations in meeting technology over the past year have focused on re-creating the ‘water cooler’ moments that help employees bond. These low-stakes conversations (about weather or sports or TV, perhaps) are crucial to creating a sense of trust and perspective for future problem-solving. But those interactions require a sense of connection — one that Zoom boxes aren’t conducive to creating.” Proximity chat platforms and social audio apps are two important components of that — it’s clear there’s a future there.
So how does “spatial audio” work to reduce negative effects of virtual communication? Let’s go over some great research.
Reduce Negative Effects of Virtual Communication with Spatial Audio
First, a quick summary of the research: Spatial audio improves speech intelligibility while reducing cognitive load. Now we’ll break that down into more easily understandable terms.
“Speech intelligibility” is defined as how clearly a person speaks so that his or her speech is comprehensible to a listener. “Reduced speech intelligibility leads to misunderstanding, frustration, and loss of interest by communication partners.”
(We’ve already discussed some cool research on why audio-only interactions are easier to comprehend than those with video in another article. Here, we’ll focus specifically on spatial audio.)
In the experiment by Guillaume Andéol et al., listeners had to report the speech of a “target” talker in the presence of a concurrent “masker” talker. Imagine you are standing at a party, chatting with another person while people around you chat simultaneously (the “masker” talkers). The ability to focus your attention on the person you’re speaking with, while filtering out the others, is called the “cocktail party effect.” The researchers investigated which conditions made it easier for people to understand each other in this sort of scenario.
Guillaume Andéol et al. write, “Previous studies (reviewed in Bronkhorst, 2015) have found that two cues are particularly important: the voice frequency characteristics and the spatial separation between talkers. The most favorable voice characteristic condition is reached when the target and the masker have different genders. Likewise, higher intelligibility can be attained when the target and the masker are spatially separate.”
“Spatially separate” means that the voices are coming from different locations in space. How does High Fidelity’s Local Spatializer handle that for Clubhouse? Ken Cooke, principal audio engineer at High Fidelity, says: “Our HRTF filters continuously interpolate to avoid artifacts when sound sources move. We handle the extreme dynamic range of many people talking at once (or right into your ear) without distortion. And finally, the algorithms have been optimized to preserve battery life on mobile devices.”
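To make “spatially separate” more concrete, here is a minimal sketch of placing a mono voice to the left or right of a listener. It is not High Fidelity’s HRTF implementation. It uses only two crude cues, a constant-power level difference between the ears and an interaural time delay estimated with Woodworth’s formula. The function names, the head radius, and the speed of sound are illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second (assumed room-temperature value)
HEAD_RADIUS = 0.0875    # metres (assumed average head radius)


def itd_seconds(azimuth_deg):
    """Woodworth's approximation of the interaural time difference.

    Valid for azimuths between -90 (hard left) and +90 (hard right) degrees.
    """
    theta = abs(math.radians(azimuth_deg))
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))


def pan_gains(azimuth_deg):
    """Constant-power pan gains (left, right) for an azimuth in [-90, 90]."""
    angle = (azimuth_deg + 90.0) / 180.0 * (math.pi / 2.0)
    return math.cos(angle), math.sin(angle)


def spatialize(mono, azimuth_deg, sample_rate=48000):
    """Return (left, right) sample lists for a mono source at azimuth_deg."""
    if not -90.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth_deg must be between -90 and 90")
    left_gain, right_gain = pan_gains(azimuth_deg)
    delay = round(itd_seconds(azimuth_deg) * sample_rate)
    pad = [0.0] * delay
    left = [s * left_gain for s in mono]
    right = [s * right_gain for s in mono]
    if azimuth_deg > 0:
        # Source on the right: the left (far) ear hears it later.
        left, right = pad + left, right + pad
    else:
        # Source on the left (or centred): the right ear hears it later.
        left, right = left + pad, pad + right
    return left, right
```

Calling `spatialize(voice_a, -30.0)` for one talker and `spatialize(voice_b, 30.0)` for another, then summing the left channels and the right channels, gives each voice its own apparent direction. A real HRTF renderer replaces these two cues with measured, direction-dependent filters.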
Spatial Audio Decreases Cognitive Load
Then, Andéol et al. looked at “cognitive processing load” (assessed using prefrontal functional near-infrared spectroscopy), meaning: How taxing is an activity on our brains?
First, recall that Jeremy Bailenson’s research on Zoom fatigue identifies the increased cognitive load of using video in a virtual meeting as one cause. “Participants in the video condition made more mistakes on the secondary task than in the audio condition. In explaining the reason for the increased load from video, Hinds argues that dedicating cognitive resources to managing the various technological aspects of a videoconference is a likely cause, for example, image and audio latency. On Zoom, one source of load relates to sending extra cues. Users are forced to consciously monitor nonverbal behavior and to send cues to others that are intentionally generated. Examples include centering oneself in the camera’s field of view, nodding in an exaggerated way for a few extra seconds to signal agreement, or looking directly into the camera (as opposed to the faces on the screen) to try and make direct eye contact when speaking. This constant monitoring of behavior adds up.”
Indeed, it does.
And when Guillaume Andéol et al. looked at cognitive load, they found: “Spatial separation can dramatically improve speech intelligibility without increasing the cognitive load. In fact, the spatial separation can even decrease the cognitive load for [specifically] the intermediate target-masker-ratio (TMR). The cognitive resources of listeners can often be limited in everyday life situations, either by age or pathology or when task demands exceed the listener's mental capacity — for instance, in a multitasking environment. Moreover, those same people can also suffer from low speech intelligibility because of weak or no access to spatial cues.”
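The target-masker-ratio (TMR) in that quote describes how loud the target voice is relative to the competing voice, conventionally expressed in decibels. Here is a minimal sketch of that calculation, assuming both voices are available as lists of samples. The function names are illustrative and do not come from the paper.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def tmr_db(target, masker):
    """Target-to-masker ratio in decibels.

    Positive values mean the target is louder than the masker,
    0 dB means both are equally loud, negative means the masker dominates.
    """
    return 20.0 * math.log10(rms(target) / rms(masker))
```

A TMR of 0 dB means the two voices are equally loud. An “intermediate” TMR is one where the target is neither buried under the masker nor clearly louder than it.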
Janto Skowronek and Alexander Raake found the same in their research when assessing cognitive load, speech communication quality, and quality of experience for spatial and non-spatial audio conferencing calls. “Spatial audio reproduction, i.e. system properties, can reduce cognitive load,” they write.
In short, spatialized audio makes speech easier to follow without adding mental effort.
Add Spatial Audio To Your Native App For Better Virtual Communication
The best part of all this? It’s now possible to add spatial audio to native apps (such as Clubhouse). High Fidelity’s Local Spatializer is a C++ codebase that works with your existing audio streaming. Each audio stream is spatialized automatically, and the spatializer is designed to be independent of your native application’s user interface. Curious to learn more?
Taylor Hatmaker describes Clubhouse’s spatial audio experience: “While Clubhouse and other voice chat apps bring people together in virtual social settings, the audio generally sounds relatively flat, like it’s emanating from a single central location. But at the in-person gatherings Clubhouse is meant to simulate, you’d be hearing audio from all around the room, from the left and right of a stage to the various locations in the audience where speakers might ask their questions.”
Virtual communication is only going to increase. Whether it’s in social audio apps, virtual conferences, digital networking events, or videoconferencing, spatial audio is one way to improve conversation between humans.