Isolation. Travel restrictions. Same four walls.
Video and web conferencing only help so much to create meaningful human connection in our new COVID-induced normal and do little to feed our cultural appetites.
Imagine a web application that combines video conferencing technology and spatialized audio to create an immersive experience of togetherness in the physical world.
Ideally, this hypothetical application would provide:
- A shared environment that is highly detailed and realistic, yet permits exploration on a massive scale.
- Tools to share our virtual location and orientation with other users for an accurate avatar representation.
- Low-latency, high-quality spatialized audio to help the user feel fully immersed.
- The possibility of sharing your webcam with others as part of your own avatar.
Using High Fidelity’s Spatial Audio and Twilio To Meet on Google Street View
With the StreetMeet experiment, we were able to achieve all of these goals by blending High Fidelity’s Spatial Audio with Twilio and the Google Street View API to create a unique experience that allows you to meet friends, family, and colleagues anywhere. The result is the ability to see and listen to each other face to face while exploring an existing physical environment together.
StreetMeet shares a lot of similarities with other Augmented Reality applications, where a three-dimensional scene is projected on top of a real image or camera feed. In our case, we will be using a static panoramic image, taken from a fixed point, provided by Google Street View. This allows us to make certain simplifications, but it also imposes some constraints that we will have to work around.
We need to create a way to represent all of the users that are connected to our 3D world with an extremely simple design but one that is able to transmit a unique personalized image of each user. Our design resembles a floating monitor pivoting around two anchor points on each side. The dimensions of the model have been chosen to guarantee an accurate perception of the avatar position and orientation from any other point on the scene.
We use Twilio to get other users’ webcam video feed and render them on the screen of the avatar model as a texture. Click here for a complete example of how to integrate Twilio and High Fidelity’s Spatial Audio API in your application.
With High Fidelity’s Spatial Audio API we will be able to send the microphone audio stream together with the avatar position and orientation, and receive the resulting spatialized audio together with all of the data (positions and orientations) necessary to render the other connected avatars accurately on the 3D scene with Three.js. Click here for a full guide on how to use High Fidelity’s Spatial Audio API in your projects.
Since the frequency of the data received is lower than the frequency of our rendering loop (60Hz), we interpolate this data on the client in order to render smoother transitions between positions and orientations at every frame.
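To make this concrete, here is a minimal sketch of that client-side interpolation. The field names and the normalized time parameter `t` are our assumptions for illustration, not the actual Spatial Audio API payload:

```javascript
// Linearly interpolate a scalar between two server updates.
// `t` is the normalized time (0..1) elapsed since the last update arrived.
function lerp(a, b, t) {
  return a + (b - a) * t;
}

// Interpolate an angle along the shortest arc, so a heading change from
// 350° to 10° sweeps 20° instead of spinning 340° the wrong way.
function lerpAngleDeg(a, b, t) {
  const delta = ((b - a + 540) % 360) - 180;
  return (a + delta * t + 360) % 360;
}

// Called once per rendered frame (60Hz) for each remote avatar, blending
// the last two received states.
function interpolateAvatar(prev, next, t) {
  return {
    x: lerp(prev.x, next.x, t),
    y: lerp(prev.y, next.y, t),
    z: lerp(prev.z, next.z, t),
    heading: lerpAngleDeg(prev.heading, next.heading, t),
  };
}
```

The shortest-arc handling matters for orientation: naively lerping raw heading values produces a visible full spin whenever an avatar's heading crosses the 0°/360° boundary.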
The name and color of each avatar is computed based on the ID that is assigned to them by the Spatial Audio API. Keep in mind that these IDs are generated randomly and will change when a new connection is established.
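One simple way to derive a stable color from such an ID is to hash it into a hue. The hashing scheme below is illustrative, not the one StreetMeet actually uses; any deterministic hash works:

```javascript
// Derive a display color from a session ID by hashing it into an HSL hue.
// The same ID always yields the same color for the lifetime of the session.
function colorFromId(id) {
  let hash = 0;
  for (const ch of id) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  const hue = hash % 360; // spread users around the color wheel
  return `hsl(${hue}, 70%, 50%)`;
}
```

Fixing saturation and lightness while varying only the hue keeps every generated color readable against the panorama background.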
Google Street View API provides the tools to explore and visualize, via the web, 98% of the inhabited Earth (over 10 million miles) in the form of a series of panoramic images captured along multiple paths, including parks and the interior of monuments and museums.
To use Google’s API, we should log in to Google's Cloud Console with a Gmail account, create a new application that makes use of the Street View API, and generate an API key for our application. Here's more info on how to get set up in Google's Cloud Console.
Google Street View API adds its own canvas and controls to the DOM. It also provides some useful callbacks that will help us draw our scene on top of their canvas, using a classic Augmented Reality implementation consisting of an overlapping WebGL canvas to render our avatars with Three.js.
Now our main challenge is to maintain some degree of spatial coherence between the two scenes. For this we need to estimate some parameters that are not explicitly exposed by Google’s API.
We need some extra information in order to properly synchronize the views from both canvases:
- Panorama camera position from GPS coordinates. Street View will render a panoramic image based on the latitude and longitude coordinates used as input on the API. We need to find a function that maps these coordinates to our local orthogonal projection.
- Movement paths from GPS coordinates and azimuth. Every time a new panorama gets loaded, a set of links is generated with coordinates and the azimuth (heading) value used to render the control arrows. We can use the azimuth to estimate the direction of the path, and the latitude and longitude to estimate the distance between panoramas.
- Transition speed and interpolation along paths. When transitioning from one panorama to another, Street View seems to use a fixed amount of time to cover the distance. We have estimated a transition time of half a second, with a linear movement interpolation.
- Three.js camera FOV (field of view) and rotation from the panoramic camera's zoom, heading, and pitch values.
We can extract the panoramic camera rotation quaternion from the provided heading and pitch values: first we compute a point in space using heading and pitch, and then we use it as a look-at point for the Three.js camera.
- Panoramic camera height. This varies among different panoramas. We have estimated that it might be between 1.5 and 2.5 meters, but since we are rendering floating avatars we can get away with this for now.
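Two of the estimates above can be sketched in a few lines: an equirectangular projection from latitude/longitude into local meters around a reference panorama, and a look-at direction computed from heading and pitch. The constants and conventions here (WGS84 radius, heading measured clockwise from north, pitch up from the horizon) are our assumptions for illustration:

```javascript
const EARTH_RADIUS_M = 6378137; // WGS84 equatorial radius
const DEG_TO_RAD = Math.PI / 180;

// Project (lat, lng) onto a local tangent plane centered on `origin`,
// in meters. An equirectangular approximation is accurate enough at
// meeting-point scale (tens of meters between panoramas).
function gpsToLocalMeters(origin, point) {
  const x = (point.lng - origin.lng) * DEG_TO_RAD *
            EARTH_RADIUS_M * Math.cos(origin.lat * DEG_TO_RAD); // east
  const z = (point.lat - origin.lat) * DEG_TO_RAD * EARTH_RADIUS_M; // north
  return { x, z };
}

// Turn a heading (degrees clockwise from north) and pitch (degrees up
// from the horizon) into a unit look-at vector for the camera.
function lookAtFromHeadingPitch(headingDeg, pitchDeg) {
  const h = headingDeg * DEG_TO_RAD;
  const p = pitchDeg * DEG_TO_RAD;
  return {
    x: Math.cos(p) * Math.sin(h),
    y: Math.sin(p),
    z: Math.cos(p) * Math.cos(h),
  };
}
```

The resulting look-at point, offset from the camera position, can be fed to Three.js's `camera.lookAt()` to keep the WebGL view aligned with the Street View panorama.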
We can think of a meeting point as a cylindrical space centered on the position of the panoramic camera, where a specific number of users can congregate. Since all users at a meeting point share the same panorama as their background, we need a strategy to keep them from all ending up at the same spot. We limit the number of positions a user is allowed to occupy inside a meeting point, and assign them according to two rules:
- Assign the same relative position if possible. When a user moves from one meeting point to another along a connecting path, they will try to maintain the same relative position they occupied at the previous meeting point. This guarantees that users move from one meeting point to the next along lines that are parallel to each other and to the virtual camera path provided by Street View.
- Assign the next best position. If the same relative position at the target meeting point is already taken, we compute the next unoccupied position that still produces a natural arrangement. This is achieved by filling concentric circles around the center of that meeting point.
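The two rules above can be sketched as a slot-assignment scheme. The ring spacing, slots per ring, and slot count below are illustrative tuning parameters, not values from the actual StreetMeet implementation:

```javascript
// Generate candidate positions inside a meeting point as concentric
// rings around the panorama camera at the origin.
function slotPositions(maxSlots, ringSpacing = 1.5, slotsPerRing = 6) {
  const slots = [];
  for (let i = 0; i < maxSlots; i++) {
    const ring = Math.floor(i / slotsPerRing) + 1; // innermost ring first
    const angle = (2 * Math.PI * (i % slotsPerRing)) / slotsPerRing;
    slots.push({
      x: ring * ringSpacing * Math.cos(angle),
      z: ring * ringSpacing * Math.sin(angle),
    });
  }
  return slots;
}

// Rule 1: prefer the slot index the user held at the previous meeting
// point. Rule 2: otherwise take the first free slot, which fills the
// concentric rings from the inside out.
function assignSlot(preferredIndex, occupied, maxSlots) {
  if (preferredIndex < maxSlots && !occupied.has(preferredIndex)) {
    return preferredIndex;
  }
  for (let i = 0; i < maxSlots; i++) {
    if (!occupied.has(i)) return i;
  }
  return -1; // meeting point is full
}
```

Because slot indices are stable across meeting points, a user who keeps their index while walking a path traces a line parallel to the camera's own path, which is exactly the behavior the first rule asks for.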
Web technology is evolving to a point that allows developers to create, with a few lines of code, a compelling application that is one click away for most web users. With a little more effort, and by selecting and combining the best APIs out there, we can create a useful product that offers a completely different experience.
Spatial audio delivers an incredible, synchronous, and immersive experience. Recently, we paired High Fidelity’s Spatial Audio with Zoom and the results were amazing.
Getting started with High Fidelity’s Spatial Audio API has never been easier. Start by following our simple guide. From there, the guides page has additional documentation related to other applications built using our API. With just a few lines of simple code, you can create a high-quality, real-world audio experience for users.