Dissonance VoIP Pipeline In Which A Pipeline Is Dissected
TL;DR
I’ve released a voice communications asset on the Unity store!
A Warning To Readers
This blog post was written in February of 2017, just a couple of weeks after the very first version of Dissonance was released onto the asset store. I wrote it because I think it’s a pretty fascinating topic with a lot of subtle complexity that isn’t initially apparent.
Almost everything about the pipeline has changed/improved since this was written. If you’re making modifications to Dissonance be cautious about using this post as a reference!
Placeholder Software
In my 2016 retrospective I mentioned that I had started a company called Placeholder Software and that we had released Dissonance Voice Chat onto the Unity store. Dissonance has been available for a little over a month now and we’ve been working flat out fixing bugs, fulfilling feature requests, upgrading bits and pieces of the system, and providing support. It’s a part of software development I’ve not done much of before and it’s been a lot of fun!
This blog post is not just an advert for Dissonance - I’m going to break down the long and complex process involved in getting high quality VoIP to work (specifically in Unity, but the concepts transfer readily to any platform).
Voice Pipeline
First off, what is a pipeline and what does it do? For a good conversation over VoIP we need 5 things out of the pipeline:
- Low Latency
- High Quality Audio
- Low Bandwidth Usage
- Tolerance To Packet Loss
- Tolerance To Clock Skew (transmitter and receiver clocks running slightly out of sync)
Getting all of these things to work at the same time can be quite challenging! A further requirement in Unity is that as little of the pipeline as possible runs on the main thread - that thread runs all of the game logic, and we don’t want a low frame rate disrupting voice quality.
Digital Signal Processing Basics
Before we look at the complete Dissonance pipeline it’s important to know a little about how digital signal processing works.
Signal
A signal is just a sound - in physical terms that’s a wave. There are many ways to represent a signal - the most common is to store the displacement at each moment in time. Each of these displacement values is a sample.
Sample
A sample is a single numeric value in the signal. There are a lot of ways to represent numbers in a computer - the most common formats for audio are 16 bit integers, 24 bit integers or 32 bit floating point numbers. Unity uses 32 bit floats for all of its audio and so does Dissonance.
Frame
A digital signal processing pipeline does not operate on a stream of individual samples - it instead operates on blocks called frames. Each frame is a short period of time (for example in Dissonance this can be tweaked to 20ms, 40ms or 60ms). There are some places in the pipeline where samples are added to a buffer, and at these points the frame size is generally being converted (hence the buffer, to accumulate the excess samples).
Sample Rate
Recall that a signal is formed of a series of samples which store some data about the signal - the sample rate is how frequently the underlying signal (which has conceptually infinite resolution) is converted into samples. A common audio sample rate is 44100Hz, or 44100 samples every second.
The Dissonance pipeline operates at different sample rates at different places - the resample steps in the pipeline are where the sample rate is changed to whatever we need it to be.
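As a quick worked example (just the arithmetic from the definitions above, not any particular Dissonance configuration), the number of samples in a frame is simply the sample rate multiplied by the frame length:

// Samples per frame = sample rate * frame length
// e.g. 48,000Hz * 20ms = 960 samples, 44,100Hz * 20ms = 882 samples
int SamplesPerFrame(int sampleRate, int frameLengthMs)
{
    return sampleRate * frameLengthMs / 1000;
}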
Ok, now that we have the glossary out of the way, let’s take a look at the complete pipeline for Dissonance - this is everything from the microphone on the sender side to the speaker on the receiver side. We’ll break this into bits and look at what they all do:
[Diagram: the complete Dissonance pipeline, from microphone capture through the network to speaker playback.]
Wow that’s larger than I expected when I started making the diagram! First off let’s break this down and look just at the sender.
Capture Pipeline
The sender side of the system is called the “capture” pipeline because it captures the audio from the user.
The work of the capture pipeline is split across two threads. The Mic Capture thread in this diagram is the main Unity thread - we want to do as little work as possible here! The Unity API is completely single threaded so we have to read the data off the mic on the main thread, but we want to move it off to somewhere else as soon as possible.
The Unity API for reading data from the mic (AudioClip.GetData, called on the clip returned by Microphone.Start) looks like this:
public bool GetData(float[] data, int offsetSamples);
Unity keeps a rolling buffer of audio from the mic (in our case up to 1 second long) and we can read data from it as it becomes available. Once a second has passed it wraps around and begins overwriting the old audio data. In an early version of Dissonance we read from this buffer once enough audio was available to process (one frame, usually 40ms), but this turned out to produce subtle audio artifacts every time the buffer looped around. To fix this we moved to a new technique: read data from the microphone as soon as it is available and keep our own buffer of samples until a full frame is ready to go. These are the first two steps in the diagram.
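Roughly, the per-update read looks something like this (a sketch of the approach rather than the real Dissonance code; `_micName`, `_clip`, `_readHead` and `_frameBuffer` are illustrative names, with `_clip` being the AudioClip returned by Microphone.Start):

// Runs on the main thread every Update - copy out whatever the mic has
// recorded since the last read and hand it to our own frame buffer
void DrainMicrophone()
{
    var writeHead = Microphone.GetPosition(_micName);
    if (writeHead == _readHead)
        return;

    // Number of new samples, accounting for the 1 second rolling buffer wrapping around
    var newSamples = (writeHead - _readHead + _clip.samples) % _clip.samples;

    // GetData wraps around the end of the clip, so a single read is enough
    // (in practice this temporary array would be pooled, not allocated every frame)
    var samples = new float[newSamples];
    _clip.GetData(samples, _readHead);
    _frameBuffer.Write(samples);

    _readHead = writeHead;
}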
Once a complete frame is in the buffer it is copied out into an array of the correct size - this array is then sent to the encoding thread for the real work to happen. The encoding thread spends its entire life processing frames; when no frames are available it sleeps until a new frame is delivered from the main thread.
Preprocessing
The preprocessing step itself consists of two phases - we first want to run the Voice Activity Detector (VAD) on the plain audio coming from the microphone, and then we want to clean up the audio (noise removal, automatic gain control). These steps have to happen in this order because the VAD can get very confused if gain control is done first!
The VAD and the preprocessor are both external open source components - the WebRTC VAD and speexdsp. Both these components work with 16 bit integer audio but the Dissonance pipeline natively operates with 32 bit floating point audio, and in addition the VAD only operates at certain specific sample rates.
The microphone hardware may not supply the sample rate we want so the very first step is to convert the audio to the right sample rate (the capture rate mentioned in the diagram) and then convert it to 16 bit audio. The floating point value will be in the -1 to 1 range so the conversion is very simple:
for (var i = 0; i < count; i++)
{
    var sample = input.Array[i + input.Offset];

    //Clip the sample into the allowable range
    short converted;
    if (sample >= 1.0f)
        converted = short.MaxValue;
    else if (sample <= -1.0f)
        converted = short.MinValue;
    else
        converted = (short)(sample * 0x8000);

    output.Array[i + output.Offset] = converted;
}
And back to float again:
for (var i = 0; i < count; i++)
    output.Array[i + output.Offset] = input.Array[i + input.Offset] / (float)0x8000;
Once the conversion is done the data is pushed into the VAD, which classifies the frame as speech or not-speech. After that the data is pushed through the speex preprocessor - this runs two processes: Automatic Gain Control (AGC) and Noise Removal. AGC automatically tweaks the volume of the input signal so that the output signal is always at roughly the same level - this means that in a group conversation with a collection of different people, speaking at different volumes into different microphone hardware, everyone will sound roughly the same. Noise removal… removes noise.
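To give a rough feel for what the AGC stage is doing, here’s a bare-bones conceptual sketch (the speexdsp implementation is far more sophisticated; `_gain` and the constants are purely illustrative):

float _gain = 1.0f;

// Measure how loud this frame is, slowly nudge a gain factor towards
// whatever would bring it to the target level, then apply that gain
void ApplyAgc(float[] frame, float targetRms = 0.1f, float adaptRate = 0.05f)
{
    var sumSquares = 0f;
    for (var i = 0; i < frame.Length; i++)
        sumSquares += frame[i] * frame[i];
    var rms = Mathf.Sqrt(sumSquares / frame.Length);

    if (rms > 1e-6f)
    {
        var desired = targetRms / rms;
        _gain += (desired - _gain) * adaptRate;   // adapt slowly to avoid audible pumping
    }

    for (var i = 0; i < frame.Length; i++)
        frame[i] *= _gain;
}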
The next step of the pipeline is the somewhat odd sounding delay buffer. Recall that one of the things this pipeline should do is be low latency - so why is there a deliberate delay buffer? Voice detectors are not perfect and one place they particularly struggle is the first frame of voice - it might only be 25% voice but it still needs to be classified as voice otherwise the start of what someone says is cut off. The delay buffer delays the voice signal by one single frame but the VAD operates before the buffer - this allows the VAD to have ~20-60ms of foreknowledge and almost entirely fixes the cut off problems.
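In sketch form the lookahead works something like this (`_vad.IsSpeech` is an illustrative stand-in for the WebRTC VAD call, not the real Dissonance code):

float[] _previousFrame;
bool _previousWasSpeech;

// Run the VAD on the newest frame but hand the *previous* frame onwards -
// a frame is treated as speech if either it or the frame after it contains speech
bool DelayOneFrame(float[] currentFrame, out float[] frameToTransmit)
{
    var currentIsSpeech = _vad.IsSpeech(currentFrame);

    frameToTransmit = _previousFrame;
    var transmit = frameToTransmit != null && (_previousWasSpeech || currentIsSpeech);

    _previousFrame = currentFrame;
    _previousWasSpeech = currentIsSpeech;
    return transmit;
}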
Finally, once all this is done (assuming the VAD is active, or push-to-talk is pressed) we need to transmit the audio to the other players. To do this we need to encode the audio using a codec which will reduce the size of the raw data from the rather ridiculous:
48,000 samples/second * 4 bytes = 192,000 bytes/second
For this, Dissonance (like almost every other piece of VoIP software) uses Opus. The bandwidth Opus uses depends upon the quality settings - anywhere from 750 bytes/second (extremely low quality voice) up to 63,750 bytes/second (extremely high quality, full orchestral music).
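Those numbers fall straight out of the sample rate and the Opus bitrate range (roughly 6kbit/s to 510kbit/s) - purely for illustration:

// Raw 48kHz mono float audio: 48,000 samples/s * 4 bytes = 192,000 bytes/s
const int RawBytesPerSecond = 48000 * 4;

// Encoded audio:   6,000 bits/s / 8 =    750 bytes/s (lowest quality)
//                510,000 bits/s / 8 = 63,750 bytes/s (highest quality)
int EncodedBytesPerSecond(int bitrateBitsPerSecond)
{
    return bitrateBitsPerSecond / 8;
}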
The Network
Once the encoding side has produced a packet of encoded audio we need to send it to the other people in the session so they can listen to it. Dissonance doesn’t require any particular network architecture (there’s a basic interface which you can implement to plug in whatever networking you like), but there’s a default implementation of that interface for a basic client/server architecture. Here’s the pipeline for that architecture:
Pretty basic stuff. The sender transmits the packet to the server on the encoding thread (once again, minimising the work on the main game thread). The server determines which clients need to receive this packet and forwards it on to them.
When the client receives the packet we don’t really know which thread we’re on - the receive method is called by whatever user code integrates Dissonance with the game’s networking. We assume this is probably the main thread, so the packet is copied out of the receive buffer and handed over to the decoding thread as soon as possible.
Playback Pipeline
Finally, we have the playback pipeline, which once again operates on a separate thread from the main thread. It reads packets from the network “transfer buffer”, parses them and plays them back.
Jitter Buffer
We want voice to be played back as quickly as possible (low latency). However this presents a problem - if we play back each packet as soon as it arrives then the next packet must arrive exactly on time, otherwise we’ll have nothing to play! The pipeline can handle not having the next packet available - it invokes a part of Opus called Packet Loss Concealment and essentially just makes up some sound to fill the gap - but this doesn’t sound great and we don’t want to use it often.
The jitter buffer fixes this situation by storing enough packets to smooth out the jitter in arrival time. For example if we have 100ms of audio in the jitter buffer then the next packet can be up to 100ms later than normal and it’ll still sound ok. The jitter buffer in Dissonance attempts to size itself dynamically by starting out with a conservative size (100ms) and then shrinking itself (by playing back audio a tiny bit faster) as it measures the actual network jitter.
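A heavily simplified sketch of that sizing logic (the real buffer also has to deal with reordering and with shrinking by playing back slightly faster; `FrameLengthSeconds` and the constants here are illustrative):

float _targetDelaySeconds = 0.1f;    // start out conservative: 100ms of buffered audio
float _measuredJitterSeconds;

// Called whenever a packet arrives, with how far off its expected arrival time it was
void UpdateJitterEstimate(float latenessSeconds)
{
    // Exponential moving average of the observed jitter
    _measuredJitterSeconds = _measuredJitterSeconds * 0.95f + Mathf.Abs(latenessSeconds) * 0.05f;

    // Buffer a couple of "jitters" worth of audio, but never less than a single frame
    _targetDelaySeconds = Mathf.Max(FrameLengthSeconds, _measuredJitterSeconds * 2);
}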
Volume Ramping
When someone starts or stops talking it can often cause a nasty sounding click because the signal is discontinuous (suddenly skips from silence to speech in one instant). To mitigate this, volume ramping detects if this is a transition frame (first or last) and ramps the volume up or down over the length of the entire frame. This does not totally remove the discontinuity but it does reduce it below the audible threshold.
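The ramp itself can be as simple as a linear fade across the frame (a minimal sketch, not the exact curve Dissonance uses):

// Fade a frame in (first frame of speech) or out (last frame) so the
// transition to/from silence is no longer an instantaneous jump
void ApplyRamp(float[] frame, bool fadeIn)
{
    for (var i = 0; i < frame.Length; i++)
    {
        var t = i / (float)frame.Length;        // 0 → ~1 across the frame
        frame[i] *= fadeIn ? t : 1 - t;
    }
}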
Convert Frames To Samples
Up to this point (all the way from early in the encoding pipeline) we’ve been handling things one frame at a time. However the Unity audio system does not consume our fixed size frames - it comes along whenever it wants and demands however many samples it needs.
This stage of the pipeline unpacks the frames we’ve been handling so far and allows the rest of the pipeline to pull audio one sample at a time. Conceptually this is pretty simple - just read a frame and keep returning samples from it when asked, and when you run out read another frame and continue. In practice this turns out to be rather complex - reading one sample at a time is far too inefficient, so instead the converter has to read data in the largest blocks possible. Additionally, there is some metadata flowing through the pipeline alongside each frame - the converter has to keep track of which metadata belongs to which samples as it reads them and return the correct metadata alongside each block of samples.
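Stripped of the metadata handling, the converter looks something like this (`ReadNextFrame` is an illustrative stand-in for pulling the next decoded frame out of the earlier stages):

float[] _currentFrame;
int _positionInFrame;

// Fill `output` with `count` samples, pulling new frames from upstream as needed
void Read(float[] output, int offset, int count)
{
    while (count > 0)
    {
        if (_currentFrame == null || _positionInFrame >= _currentFrame.Length)
        {
            _currentFrame = ReadNextFrame();
            _positionInFrame = 0;
        }

        // Copy the largest contiguous block possible rather than one sample at a time
        var copy = Mathf.Min(count, _currentFrame.Length - _positionInFrame);
        System.Array.Copy(_currentFrame, _positionInFrame, output, offset, copy);

        _positionInFrame += copy;
        offset += copy;
        count -= copy;
    }
}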
Soft Clipping
It’s possible that the encoding pipeline produced a bad audio signal with clipping in it (the signal attempts to go louder than max and just tops out). This could be caused by overeager gain control, a badly configured microphone or just someone shouting loudly! Clipping is one of the worst sounding problems in an audio pipeline - in fact it can be downright painful if you’re using a loud headset.
Soft clipping (which is part of Opus) distorts the signal to remove the horrible clipping artifacts by slightly smoothing out the bits which clip. This isn’t a perfect solution because it introduces slightly incorrect harmonics, but they won’t even be perceptible unless there’s some major clipping going on.
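As an illustration of the idea, here’s a generic cubic soft clipper (this is not the implementation Opus uses, just the general shape of the technique):

// Samples near zero pass through almost unchanged; samples approaching the
// limits are smoothly squashed towards ±1 instead of being hard clipped
float SoftClip(float sample)
{
    // Keep the input inside the range where the cubic below is monotonic
    var x = Mathf.Clamp(sample, -1.5f, 1.5f);

    // x - (4/27)x³ maps ±1.5 to exactly ±1 with a smooth, flat approach
    return x - (4f / 27f) * x * x * x;
}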
Playback In AudioSource
Finally, we get to the end of the pipeline. This is implemented using the OnAudioFilterRead method in Unity. The Unity audio thread comes along whenever it needs new data to play, pulls data out of the decoding pipeline, copies it out so the same thing is playing on all channels (voice data is mono), and that’s it! From this point on the audio passes through the Unity audio system just as if it were a playing sound effect.
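A sketch of that final step (`OnAudioFilterRead` is the real Unity callback; `_pipeline.Read` stands in for pulling samples out of the decoding pipeline):

// Unity calls this on its audio thread whenever it needs more samples.
// `data` is interleaved, so each mono sample is copied to every channel.
void OnAudioFilterRead(float[] data, int channels)
{
    var samples = data.Length / channels;
    var mono = new float[samples];          // in practice this buffer would be reused
    _pipeline.Read(mono, 0, samples);

    for (var i = 0; i < samples; i++)
        for (var c = 0; c < channels; c++)
            data[i * channels + c] = mono[i];
}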