Success with Convolution and the Path to a Better 3D Mixer

I have spent my weekend working on a convolver, and have had a great deal of success. The motivation for this is to get away from the OpenAL API which, as described previously, is actually really poor in terms of getting things done. Given my success over roughly the last 28 hours, this looks like a doable goal.

A convolver, as I understand it, is a weighted average computed per sample. This is a correct explanation, but it misses the mathematical subtleties that are, in this case, important. I do not know why it works, but I do know what it does: when an input stream and an impulse response are convolved in the time domain, their frequency responses (their Fourier transforms) are multiplied. This makes an impulse response capable of representing a wide variety of filters. Most simply, the response [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1] is a lowpass filter, but determining its frequency response is not something I have yet looked into. In the case of 3D audio, your ears are the filter. MIT provides a dataset of responses which describe different positions and how your ears perceive them; you convolve a mono file twice, once each for the left and right channels, and play the result back as stereo. It then sounds 3D through headphones. This technique does not apply to surround sound systems and does not work through regular speakers.
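For reference, the standard way to write this down is the convolution sum together with the convolution theorem; the per-sample weighted average and the frequency-domain multiplication are the same operation seen two ways:

$$y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k] \qquad\Longleftrightarrow\qquad Y(\omega) = H(\omega)\, X(\omega)$$

where $h$ is the impulse response of length $N$, $x$ is the input signal, and the capital letters denote Fourier transforms.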

The most naive algorithm for convolution is two nested for loops; this is the algorithm I have currently implemented. I believe its complexity to be O(N*S), where N is the length of the impulse response and S is the length of the signal to be convolved. This is unimportant in the grand scheme of things; all we care about is whether it is fast enough.
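To make the shape of the algorithm concrete, here is a rough sketch in C. It is not the code linked below; the name convolution_kernel matches the function I describe later, but the exact signature and the history convention are assumptions for illustration:

```c
#include <stddef.h>

/* Naive O(N*S) convolution: for every output sample, compute a weighted
 * sum of the most recent N input samples using the impulse response.
 * This sketch assumes the caller has prepended N-1 samples of history,
 * so input[0..N-2] is history and the "real" signal starts at
 * input[N-1]. */
static void convolution_kernel(const float *input, size_t output_count,
                               const float *response, size_t response_length,
                               float *output)
{
    for (size_t i = 0; i < output_count; i++) {
        float sum = 0.0f;
        for (size_t j = 0; j < response_length; j++) {
            /* The index is offset so that output sample i sees the
             * response_length most recent input samples ending at i. */
            sum += response[j] * input[i + response_length - 1 - j];
        }
        output[i] = sum;
    }
}
```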

I am going to get to the point, and then talk about how the point is possible and why it's not quite as good as it looks. I am convolving approximately 225 times faster than realtime. This means something like 200 sources, in theory. In practice, this will be much less. The code is here.

Here's how I did it.

I began yesterday by implementing the algorithm for 3D audio in Numpy because it was the most expedient reference implementation I could find. This implementation is performant, but I did not collect actual statistics on it because it's not important in the long run. It gave me a valuable place to experiment with a few things that I didn't understand and allowed me to develop the click filter (discussed below) quickly.

I then turned to C. The supporting program is very straightforward and will not be discussed. I implemented the convolution in convolution_kernel, a function that knows how to convolve and nothing else. Most notably, history must be kept track of; this, however, is expected to be done by the caller before the function is called. The supporting program reads an entire file representing a sound, reads a stereo file representing impulse responses for the left and right ears, and performs the convolution and click filter (the latter is included in this profiling result).
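To show what "history handled by the caller" might mean in a streaming setting, here is one possible arrangement. It is a sketch of the idea, not how the supporting program works (which reads whole files at once), and the sizes and names are made up:

```c
#include <string.h>

/* One possible way for a caller to keep history between blocks, assuming
 * the convention from the kernel sketch above: the last
 * (response_length - 1) input samples of the previous block are kept in
 * front of the new one. */
#define BLOCK_SIZE       1024
#define RESPONSE_LENGTH  128
#define HISTORY          (RESPONSE_LENGTH - 1)

static float work_buffer[HISTORY + BLOCK_SIZE]; /* history + current block */

static void convolve_block(const float *block, const float *response,
                           float *output)
{
    /* Append the new block after the retained history. */
    memcpy(work_buffer + HISTORY, block, BLOCK_SIZE * sizeof(float));

    /* Convolve; output receives BLOCK_SIZE samples. */
    convolution_kernel(work_buffer, BLOCK_SIZE, response,
                       RESPONSE_LENGTH, output);

    /* Keep the tail of the input as history for the next call. */
    memmove(work_buffer, work_buffer + BLOCK_SIZE, HISTORY * sizeof(float));
}
```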

The next step was adding OpenMP. This was a slight clarity hit in terms of the code, but in cases where more than 1024 samples are to be computed by the kernel at once, it gives a 3-fold or 4-fold performance increase. Furthermore, it can scale to the number of cores you have. This basically means threads without the complexity: the outer for loop in the function is automatically split across multiple threads by the compiler and the runtime, they all do work, and then they all join back together at the end. Perhaps most beautifully, if you don't have OpenMP, the code still compiles: OpenMP is done through pragmas. This is not true of the supporting program, because it uses omp_get_wtime for profiling. I'd worry about it, but the supporting program will shortly no longer exist.
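Applied to the kernel sketched earlier, the pragma looks roughly like this; the exact form in my code may differ, but the 1024-sample threshold can be expressed directly with an if clause:

```c
/* Split the outer loop across the available cores, but only when the
 * block is large enough (more than 1024 output samples here) for the
 * threading overhead to pay off.  Without OpenMP the pragma is simply
 * ignored and the loop runs serially.  The index is signed because
 * MSVC's OpenMP support requires a signed loop variable. */
#pragma omp parallel for if(output_count > 1024)
for (long i = 0; i < (long)output_count; i++) {
    float sum = 0.0f;
    for (size_t j = 0; j < response_length; j++)
        sum += response[j] * input[i + response_length - 1 - j];
    output[i] = sum;
}
```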

This gave something that runs about 60 times faster than realtime on my machine.

And here's the real beauty of this. As soon as I specified /Ox on the MSVC++ command line, the convolution started completing in 1.5 seconds for a 6-minute file, including the second pass with the click filter.
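For reference, the whole build amounts to something like the following; the file names here are placeholders, not the actual ones in the repository:

```
cl /Ox /openmp convolution_kernel.c test_program.c
```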

And why the click filter? Because without it, the output clicks. This is extremely noticeable in music with high frequencies or a lot of pounding bass. The click filter is nothing more than a second convolution with an impulse response of {0.2, 0.2, 0.2, 0.2, -0.2}. As for how I came up with it, it was a combination of intuition and lucky guessing: it "felt" right, so I tried it. In reality, there is some distortion introduced, but that is unimportant to me for the moment, and it sounds good anyway.
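In terms of the kernel sketched above, applying it is just one more pass; again, this is a sketch of the idea with assumed names, not the actual code:

```c
#include <stddef.h>

/* The click filter: a second convolution pass over the already-convolved
 * signal, using the short fixed impulse response described above. */
static const float click_filter[5] = { 0.2f, 0.2f, 0.2f, 0.2f, -0.2f };

static void apply_click_filter(const float *convolved_with_history,
                               size_t output_count, float *output)
{
    /* Same history convention as the kernel sketch: the caller prepends
     * 4 samples of history (filter length minus one). */
    convolution_kernel(convolved_with_history, output_count,
                       click_filter, 5, output);
}
```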

Here are the remaining problems that make it less cool and fast in practice, but which may not end up mattering. If I had to guess, the current implementation will be sufficient for somewhere around 70 sources, and it may still be optimized further:

  • First, the test setup is favorable: there are no real memory copies going on, the history buffer does not need to be kept explicitly because the program reads the whole file at once, and the code is at least somewhat cache friendly. The convolver is not called multiple times, so there is no function call overhead.

  • Second, OpenMP has a performance overhead. At the moment, the code begins to self-parallelize when the number of samples to be computed by convolution_kernel is greater than 1024. The problem is that, for realtime usage, 1024 samples per call is approaching an impractical block size. In addition, the current test program is processing millions of samples in one OpenMP parallel for, so OpenMP isn't shutting itself down and starting itself up, and the threads only join with each other at the end. The upshot of this is that there will definitely be a performance drop when it is used in a streaming manner, and that OpenMP may actually hurt it.

  • Third, this is not a complete mixer. A complete mixer will require chains that copy buffers around like crazy, or some sort of trick. A complete mixer also has some other implementation challenges I haven't figured out how to overcome yet and which are beyond the scope of this post.

  • Finally, moving sources require some interpolation: either being temporarily treated as two sources, having the impulse response changed on a per-sample basis, or something else (a sketch of the two-source idea follows this list). A moving source can be thought of as at least two still sources, and it may be even more expensive. Moving the listener will be equivalent to moving all sources.
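To illustrate the first of those options, here is how a crossfade between the outputs of the old and new impulse responses might look. This is a sketch of one possible approach, not something I have implemented:

```c
#include <stddef.h>

/* Blend the block convolved with the old impulse response into the block
 * convolved with the new one.  Both inputs are already convolved; this
 * only crossfades between them across a single block. */
static void crossfade(const float *from_old, const float *from_new,
                      size_t count, float *output)
{
    for (size_t i = 0; i < count; i++) {
        /* t ramps from 0 to 1 across the block. */
        float t = (count > 1) ? (float)i / (float)(count - 1) : 1.0f;
        output[i] = (1.0f - t) * from_old[i] + t * from_new[i];
    }
}
```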

The general outlook is nevertheless good. The next step is writing a framework that knows how to connect DSP components together, a final mixer, and the ability to have realtime playback. Callbacks shouldn't be a major problem, and I have a few novel ideas that I might implement down the road.