LPCNet: DSP-Boosted Neural Speech Synthesis

Introducing LPCNet

This demo presents the LPCNet architecture that combines signal processing and deep learning to improve the efficiency of neural speech synthesis. Neural speech synthesis models like WaveNet have recently demonstrated impressive speech synthesis quality. Unfortunately, their computational complexity has made them hard to use in real-time, especially on phones. As was the case in the RNNoise project, one solution is to use a combination of deep learning and digital signal processing (DSP) techniques. This demo explains the motivations for LPCNet, shows what it can achieve, and explores its possible applications.

Speech Synthesis

Back in the 70s, some people started investigating how to model speech. They realized they could approximate vowels as a a series of regularly-spaced (glottal) impulses going through a relatively simple filter that represents the vocal tract. Similarly, unvoiced sounds can be approximated by white noise going through a filter.

A simple view of the source-filter model of speech production, as we've been using since the 70s.

With just that simple model, it's possible to either synthesize intelligible speech, or compress speech at very low bitrate. This is what such a simple model sounds like (LPC10 vocoder at 2.4 kb/s):

So intelligible speech, but by no means good quality. With a lot of effort, it's possible to create slightly more realistic models, leading to codecs like MELP and codec2. This is what codec2 sounds like on the same sample:

Codec2 is quite good considering it only uses 2.4 kb/s (or less), but it's still far from what would be considered high quality speech.

WaveNet

Despite small improvements, there was no major breakthrough in speech synthesis — until the deep learning revolution of the past few years. In 2016, the DeepMind folks presented WaveNet (demo, paper), a deep learning model that predicts speech one sample at a time, based on both the previous samples and some set of acoustic parameters (e.g. a spectrum). One key aspect to understanding WaveNet and similar architectures is that the network does not directly output sample values. Rather, the output is a probability distribution for the next sample value. From that distribution, there is a sampling step that generates a random value using the distribution and the result becomes the next sample value. That new sample is then fed back to the network so that the following values are consistent with all previous decisions. It's normal to wonder why we don't just pick the most likely value for the next sample, or why the network doesn't directly output the most likely value. The reason is that speech is inherently random. The best example is sounds like "s", which are essentially gaussian-distributed random noise. If we were to pick the most probable value for each sample, we would be generating all zeros, since each sample is just as likely to have a positive value as a negative value. In fact, even though vowels have less randomness than unvoiced sounds, all phonemes have some randomness, which is why we need to have this sampling operation.

WaveNet can synthesize speech with much higher quality than other vocoders, but it has one important drawback: complexity. Synthesizing speech requires tens of billions of floating-point operations per second (GFLOPS). This is too high for running on a CPU, but modern GPUs are able to achieve performance in the TFLOPS range, so no problem, right? Well, not exactly. GPUs are really good at executing lots of operations in parallel, but they're not very efficient on WaveNet for two reasons:

It is made of a stack of smaller convolutions, one on top of each other (see paper and demo above), which means many different steps.
The processing needs to be repeated sequentially — one sample at a time — rather than in batches/frames.

WaveRNN

The WaveRNN architecture (by the authors of WaveNet) comes in to solve problem 1) above. Rather than using a stack of convolutions, it uses a recurrent neural network (RNN) that can be computed in fewer steps. The use of an RNN with sparse matrices also reduces complexity, so that we can consider implementing it on a CPU. That being said, its complexity is still in the order of ~10 GFLOPS, which is about two orders of magnitude more than typical speech processing algorithms (e.g. codecs).

Simplified overview of WaveRNN and similar algorithms. For each time step, the RNN computes the probability distribution. The next value is then sampled from the distribution, and then both played out and fed back to the RNN. Move the mouse over this figure to see how this process expands through time.

Here comes LPCNet

One of the reasons "pure neuron" approaches are expensive is that the neurons have to do all the work. They not only have to generate plausible glottal pulses, but they also have to simulate the filtering in the vocal tract. That filtering is easy to do with 70s digital signal processing (DSP), but it's actually hard to do for a neural network. That's because the DSP operations needed for the filtering don't map very well to neural networks. First, computing prediction coefficients from a spectrum essentially requires solving a system of linear equations. There are efficient DSP algorithms to do that (e.g. Levinson-Durbin), but you can't really implement an equivalent with just a few layers of neurons. Even implementing the filtering itself — which is a linear operation — doesn't map well with the WaveNet or WaveRNN architectures. That's because the filtering involves multiplying different inputs together, whereas networks will typically compute linear combinations of inputs, with no multiplicative terms between the inputs. Of course, we could create a strange architecture that's able to learn the filtering, but at that point, why not just add some DSP blocks to directly help a more conventional neural network?

That's where LPCNet comes in. The approach can be summarized as "don't throw the DSP out with the bath water". LPCNet is a variant of WaveRNN with a few improvements, of which the most important is adding explicit LPC filtering. Instead of only giving the RNN the selected sample, we can also give it a prediction (i.e. an estimate) of the next sample it's going to pick. That prediction doesn't have to be perfect, and it won't include the random component, but it is very cheap to compute and ends up replacing a lot of neurons. We can also make the network predict the difference between the next sample and the prediction, a.k.a. the excitation. Now our network is no longer working hard on modeling the vocal tract (though it can fix any problem with the prediction).

High-level LPCNet overview. Compared to WaveRNN, we add the LPC block that uses the features to compute prediction coefficients (solving a linear system with Levinson-Durbin), and the Filter block that computes the prediction from past samples and from the prediction coefficients.

Beyond linear prediction, LPCNet also includes a few other tricks:

Pre-emphasis/de-emphasis filters: This allows shaping the noise caused by the μ-law quantization. Rather than having white quantization noise at the output (like for the original WaveNet) or having to increase the size of the network to output 16 bits (like WaveRNN), LPCNet is able to shape the μ-law quantization noise to be mostly inaudible.
Sparse matrices: Like WaveRNN, LPCNet uses sparse matrices in the main RNN, keeping only 10% of the weights (the others being zeros). The matrices are block-sparse, with blocks of size 16x1 to make it easier to vectorize the products. As a minor improvement, all the weights on the diagonal of the matrices are kept (instead of forcing many non-zero blocks along the diagonal).
Input embedding: Rather than feeding the inputs directly to the network, we use an embedding matrix. Embedding is most commonly used for words when doing natural language processing, but using it for μ-law values makes it possible to learn non-linear functions of the input. Interestingly, when looking at the embedding learned, one of the functions is very close to the μ-law expansion (decoding) function.

See below a more complete block diagram of LPCNet.

Looking into LPCNet in more details. The left part of the network (yellow) is computed once per frame and its output is used for the sample rate network on the right (blue). The compute prediction block predicts the sample at time t based on previous samples and on the linear prediction coefficients.

Results

We ran a MUSHRA-like listening test to see how LPCNet compares to WaveRNN+, a slightly improved version of WaveRNN (without the LPC part). We had 8 utterances (2 male and 2 female speakers), each evaluated by 100 participants. The results below show that the LPC filtering in LPCNet really helps improve quality for the same complexity. Alternatively, the same quality is possible at a much lower complexity.

Subjective quality (MUSHRA) results as a function of the dense equivalent number of units in the main GRU.

Complexity

In the results above, the complexity of the synthesis ranges between 1.5 GFLOPS for the network of size 61, and 6 GFLOPS for the network of size 203, with the medium size network of size 122 having a complexity around 3 GFLOPS. This is easily within the range of what a single smartphone core can do. Or course, the lower the better, so we're still looking at ways to further reduce the complexity for the same quality.

Hear For Yourself

Here are two of the samples that were used in the listening test above.

Select sample

Female
Male

Select algorithm

WaveRNN+61
WaveRNN+122
WaveRNN+203

LPCNet61
LPCNet122
LPCNet203
Reference

Select where to start playing when selecting a new sample

Player will continue when changing sample.

Comparing the speech synthesis quality of LPCNet with that of WaveRNN+. This demo will work best with a browser that supports Ogg/Opus in HTML5 (Firefox, Chrome and Opera do), but if Opus support is missing the file will be played as FLAC, WAV, or high bitrate MP3.

Source Code

The LPCNet source code is available under a BSD license. To train your own model, you need:

At least an hour of speech data (ideally more)
An nvidia GPU with CUDA and CUDNN (or alternatively a lot of patience), ideally at least a GTX 1080 Ti
Around 16 GB available RAM (amount will depend on how much data you have)

See the README.md file for complete instructions.

So what can we use this for?

Everything — and world peace! OK, maybe not quite, but there are many interesting applications once we get to the point where CPU consumption is no longer a problem. Here are just some of the possibilities.

Text-to-Speech (TTS)

TTS is the main reason people started looking into neural speech synthesis in the first place. They were able to effectively generate synthesis "features" (spectrum, prosody) with neural networks and other techniques, but they weren't able to actually synthesize good speech out of those features. Then, they used neural networks to do the synthesis and got good results. So LPCNet could be used within algorithms like Tacotron to build a complete, high-quality TTS system.

Speech Compression

It's also possible to see LPCNet as a low bitrate speech codec. Last year, Google demonstrated a codec that could use codec2's bitstream and achieve good quality wideband speech even at 2.4 kb/s (samples), which in part inspired this work on LPCNet. In fact, codec2 author David Rowe has already started a (very) new project to combine LPCNet with codec2.

Read David's latest blog post for samples of his LPCNet-based codec2 implementation. Note that since codec2 has not (yet) been expanded to wideband, his samples are still narrowband (instead of the wideband samples provided above). Expect many improvements in the future.

Time Stretching

Applications don't have to end with just TTS and compression though. With a trivial change, it's possible to turn LPCNet into a time stretching algorithm for speech. By time stretching, we mean being able to speed up or slow down a speech signal without changing the pitch. Algorithms already exist for doing that, but they often result in unnatural speech. That's where LPCNet can help. Time stretching using LPCNet is surprisingly simple: all that's needed is to tell the synthesis network to generate either more or fewer than 10 ms of speech for every 10-ms frame processed by the feature processing network.

Hear below a speech sampled slowed down by 50%.

Hear the same sample slowed down by a conventional algorithm (using SoundTouch).

The difference is especially noticeable when slowing down because then conventional pitch-based algorithms have to "make up" speech data (as opposed to discarding data when speeding up). That being said, LPCNet works quite well when speeding up too.

Noise Suppression

Another application where this work could be useful is noise suppression. Last year, we demonstrated RNNoise, which already showed pretty good results, despite its low complexity. The neural part of RNNoise works purely on the general shape of the spectrum. It can attenuate bands that have a lot of noise in them, but it doesn't know much about what speech signals are. A synthesis-based approach would be able to re-generate a plausible speech signal not only from the features, but also using the noisy signal to make sure the waveform of the denoised speech matches that of the noisy speech (minus the noise). This would also ensure that the denoiser doesn't reduce the quality of already clean speech.

Codec Post-Filtering

If we can remove background noise from speech, we should also be able to remove speech coding artifacts. And we already know how to synthesize speech directly from the spectral features that codecs use. So by combining the two, it should be possible to significantly enhance the quality of an existing codec (Opus comes to mind here) at low bitrate, without running into the artifacts. Of course, we haven't actually tested any of this, but it wouldn't be too surprising to be able to get near-perfect narrowband speech around 6 kb/s and near-perfect wideband speech around 9 kb/s.

Packet Loss Concealment (PLC)

The last logical application of this work is for packet loss concealment. When a packet goes missing during a call, we still need to play some audio, so the receiver has to "guess" what was in that packet. In most codecs, this is done by repeating pitch patterns for voiced sounds and extending noise for unvoiced sounds. Like for time stretching, we should be able to do better. LPCNet is already trained to predict future samples from past samples and current acoustic parameters, so all we need to do is train it to predict only from past samples (and possibly past parameters). In fact, if we're already building a codec using neural synthesis, it's not hard to simply train it with random "missing" frames in its training set.

What's left to do?

Still quite a lot. First, the code we have right now still generates samples using Keras Python code. This would not be a problem if the network produced all the samples at once, but because of the sampling process, speech synthesis requires running the network once for every sample we generate. Because of that, the overhead becomes quite large, making the synthesis quite slow even though it doesn't need to be. To fix that, what we need to do is translate the network implementation to C, just like we did for RNNoise. Once translated to C, and before any special optimizations, the performance should already be enough to easily run in real-time on x86 and should be close to real-time (or barely real-time) on a recent smartphone. From there, vectorization and other optimizations should make the code easily real-time on a smartphone and probably real-time on slower ARM devices (e.g. Raspberry Pi).

Updated: The C version of the LPCNet122 model now achieves real-time synthesis with 15-20% of a single x86 (Haswell or later) core. It also achieves real-time synthesis on a single core of an iPhone 6s. It is not (yet) real-time on a Raspberry Pi 3.

There are probably many ways of further improving the model itself and making it less complex. Some directions for future research include:

Making the model output a continuous probability distribution rather than discrete probabilities corresponding to all μ-law values (e.g. using a logistic mixture).
Adding pitch: LPCNet can take advantage of LPC-based short-term prediction, but it cannot yet use longer prediction based on pitch. Adding pitch prediction is a little trickier, but should be possible by introducing an attention mechanism that can "align" the signal with its past samples offset by the pitch period. ▶ Show nerdy details ▼ Hide nerdy details For anyone curious enough to try adding pitch to the LPCNet model, here are some tips. One of the first attempts at adding pitch to LPCNet consisted in using the pitch period feature T, and using it to look into the past of the signal and/or excitation. The network would then receive an extra input corresponding to the looked-up values at time (t-T) or (t-T+offset) to add a bit of lookahead to the process. That's all good in principle, except for the fact that pitch estimation is not an easy thing to do and tends to be unreliable (especially the one in LPCNet at the moment). A better way would be to have the neural network estimate the pitch by itself from the features. Then comes the problem of how to implement the look-up at time (t-T+offset) with neurons (the look-up operation is now part of the network we need to differentiate). The answer to that already exists, in the form of an attention mechanism. By using a softmax-like output for the pitch, it's possible to drive the attention mechanish to look-up the past values at the correct delay. The general concept of turning deterministic operations like reads and writes into neurons that can handle backpropagation is part of differentiable programming.
Improving the robustness to noise: Right now, synthesizing noisy speech creates additional artifacts due to the training on clean speech. Training on noisy speech helps, but requires more neurons to produce acceptable quality.

There's also a lot of work to do on applications themselves. As cool as we think LPCNet is, it doesn't actually do anything useful alone. It's more of a base technology that helps build many cool things on top of it.

If you would like to comment on this demo, you can do so on on my blog.

—Jean-Marc Valin (jmvalin@jmvalin.ca) November 20, 2018

Additional Resources

J.-M. Valin, J. Skoglund, LPCNet: Improving Neural Speech Synthesis Through Linear Prediction, Submitted for ICASSP 2019.
A. van den Oord and S. Dieleman and H. Zen and K. Simonyan and O. Vinyals and A. Graves and N. Kalchbrenner and A. Senior and K. Kavukcuoglu, WaveNet: A Generative Model for Raw Audio, 2016.
Kalchbrenner, N. and Elsen, E. and Simonyan, K. and Noury, S. and Casagrande, N. and Lockhart, E. and Stimberg, F. and van den Oord, A. and Dieleman, S. and Kavukcuoglu, K., Efficient Neural Audio Synthesis, 2018.
Kleijn, W. B. and Lim, F. SC and Luebs, A. and Skoglund, J. and Stimberg, F. and Wang, Q. and Walters, T. C., Wavenet based low rate speech coding, 2018
LPCNet source code.
David Rowe's LPCNet-Based Codec 2.
Join our development discussion in #opus at irc.freenode.net (→web interface)

Jean-Marc's Opus documentation work is sponsored by the Mozilla Corporation.
(C) Copyright 2018 Mozilla and Xiph.Org