We are obsessed with reducing latency because we have observed aspects of 1:1 interaction that break down when there is too much of it. The head-mounted-display world has had a similar discussion, with writers like John Carmack, Valve, and Oculus Rift describing different mechanisms for detecting and mitigating latency in display and rendering technology. In our initial work, we have also started using high-speed cameras and blinking lights to do ‘black box’ visual latency testing. We needed to build a different solution for audio, though, and we have been using it to test various sources of latency, including sound traveling through air, Skype, and cell phones.
We connect two high-quality microphones directly to the two input channels of a digital oscilloscope, and then use a metronome, a finger snap, or a clap to create a sharp audio signal that both microphones can detect. By positioning one microphone at the input of an audio system and the other at its output, we can easily and reliably use the scope to capture the delay over multiple samples, down to millisecond resolution.
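The same measurement can be sketched in software for anyone without a scope handy. This is only an illustration, assuming both microphones are recorded as two sample buffers at a known sample rate; the function names, the 48 kHz rate, and the simple threshold detector are all our assumptions here, not a description of what the oscilloscope does internally.

```python
def first_onset(samples, threshold):
    """Return the index of the first sample whose magnitude reaches threshold."""
    for index, value in enumerate(samples):
        if abs(value) >= threshold:
            return index
    return None


def delay_ms(input_channel, output_channel, sample_rate_hz, threshold=0.5):
    """Estimate the delay between the sharp onsets seen on two channels."""
    start = first_onset(input_channel, threshold)
    end = first_onset(output_channel, threshold)
    if start is None or end is None:
        raise ValueError("no onset found in one of the channels")
    return (end - start) * 1000.0 / sample_rate_hz


def synthetic_click(length, onset_index, amplitude=1.0):
    """Build a silent buffer with a short decaying click (a stand-in for a clap)."""
    buffer = [0.0] * length
    for k in range(20):
        if onset_index + k < length:
            buffer[onset_index + k] = amplitude * (0.8 ** k)
    return buffer


SAMPLE_RATE_HZ = 48_000  # assumed recording rate

# Simulated recordings: the output microphone hears the click 4800 samples later.
mic_in = synthetic_click(SAMPLE_RATE_HZ, 1_000)
mic_out = synthetic_click(SAMPLE_RATE_HZ, 1_000 + 4_800)

measured = delay_ms(mic_in, mic_out, SAMPLE_RATE_HZ)
print(f"measured delay: {measured:.1f} msecs")
```

With real recordings the threshold would need to sit well above the background noise, and repeating the measurement over several claps helps smooth out jitter.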
A fun and easy way to verify that this system works is to separate the two microphones by 10 feet or so and then make a sound at one that is loud enough to be picked up by the other. Sound travels through air at a little over 1 foot per millisecond (about 0.888 milliseconds per foot, to be exact), so a 10-foot separation should show up as roughly 9 milliseconds on the scope. That lets you easily verify that you are in fact living in the world of classical physics and not in the matrix (or at least that’s what they want us to think).
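The delay to expect for a given microphone separation is easy to work out ahead of time. A minimal sketch, assuming a speed of sound of 343 meters per second (dry air at about 20 °C); the constant and the function name are ours, chosen for illustration.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # assumed: dry air at roughly 20 degrees C
METERS_PER_FOOT = 0.3048


def expected_delay_ms(distance_feet, speed_m_per_s=SPEED_OF_SOUND_M_PER_S):
    """Time in milliseconds for sound to travel distance_feet through air."""
    distance_m = distance_feet * METERS_PER_FOOT
    return distance_m / speed_m_per_s * 1000.0


delay_1ft = expected_delay_ms(1.0)
delay_10ft = expected_delay_ms(10.0)
print(f"1 foot:  {delay_1ft:.3f} msecs")
print(f"10 feet: {delay_10ft:.2f} msecs")
```

Under that assumption the result is a little under 0.9 msecs per foot, so microphones 10 feet apart should show a gap of just under 9 msecs on the scope.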
Running this test with a variety of cell phones on different carriers, and with traditional telecom landlines, indicates that if you want to enjoy witty repartee with others, at least in San Francisco, you are much better off going with Verizon than AT&T or T-Mobile. Verizon’s one-way latency in our tests was about 280 msecs, compared to 400–450 msecs for AT&T and T-Mobile.
Skype, by comparison, generally outperforms the cell phones in terms of end-to-end latency: we measured audio delays of 100–200 msecs for various combinations of audio and video calls where the two endpoints were on the same WiFi network. This means that with a packet delay of about 40 msecs (which is what we typically see when pinging Boston from San Francisco), a cross-country audio or even video call on Skype is going to come in with about 250 msecs of delay and be a bit better than using a cell phone.
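The cross-country estimate is simple addition, sketched below. It assumes, as the paragraph above does, that the 40 msecs of packet delay simply adds to the delay measured on a local network; the function and variable names are hypothetical.

```python
def estimated_call_delay_ms(local_delay_ms, network_delay_ms):
    """Add network packet delay to a delay measured on a local network."""
    return local_delay_ms + network_delay_ms


SKYPE_LOCAL_RANGE_MS = (100.0, 200.0)  # measured on the same WiFi network
SF_TO_BOSTON_PACKET_DELAY_MS = 40.0    # typical figure from pinging Boston
VERIZON_ONE_WAY_MS = 280.0             # best cell phone result in our tests

low, high = (
    estimated_call_delay_ms(delay, SF_TO_BOSTON_PACKET_DELAY_MS)
    for delay in SKYPE_LOCAL_RANGE_MS
)
print(f"estimated cross-country Skype delay: {low:.0f} to {high:.0f} msecs")
print(f"worst case beats Verizon: {high < VERIZON_ONE_WAY_MS}")
```

The top of that range lands just under the 250 msecs quoted above, which is where the comparison with the cell phone numbers comes from.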
In our own internal work on shared audio for High Fidelity, we’ve been able to get the audio delay down to about 75–90 msecs. We’ve also tested how the experience changes with different amounts of audio delay. We found that when you hear your own voice played back, the delay becomes imperceptible below about 50 msecs, and that when you watch another person speaking, a gap between their audio and video feeds usually becomes indistinguishable below 125 msecs or so.