Now, DeepMind researchers are expanding GANs to audio, with a new adversarial network approach for high fidelity speech synthesis.
One of the most commonly used TTS network architectures is WaveNet, a neural autoregressive model for generating raw audio waveforms.
Because WaveNet relies on the sequential generation of one audio sample at a time, it is poorly suited to today's massively parallel computers.
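The sequential bottleneck can be seen in a minimal sketch (not WaveNet itself, just an illustration of autoregressive generation): each sample depends on all previous samples, so the generation loop cannot be parallelised across time, whereas a feedforward generator emits the whole waveform in one pass.

```python
import numpy as np

def autoregressive_generate(step_fn, T):
    """Generate T samples one at a time; sample i depends on samples
    0..i-1, so the T steps must run sequentially."""
    samples = []
    for _ in range(T):
        samples.append(step_fn(np.array(samples)))
    return np.array(samples)

def feedforward_generate(gen_fn, z):
    """A parallelisable generator maps a conditioning vector to the
    whole waveform in a single forward pass."""
    return gen_fn(z)

# Toy step function (an assumption for illustration): each new sample
# is a damped copy of the previous one.
toy_step = lambda prev: 0.9 * prev[-1] if prev.size else 1.0
wave = autoregressive_generate(toy_step, 5)
```

The loop body is trivial here, but in WaveNet each step is a full network forward pass, which is why sample-by-sample generation is slow on parallel hardware.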
That's why GANs, as an effectively parallelisable class of models, are a viable option for more efficient TTS.

DeepMind explored raw waveform generation using GANs composed of a conditional generator that produces raw speech audio and an ensemble of discriminators that analyze the audio.
The generator learns how to convert the linguistic features and pitch information to raw audio.
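A toy numpy sketch of that mapping (the layer, frame rate, and upsampling factor are illustrative assumptions, not the paper's architecture): per-frame linguistic features and pitch are combined, transformed, and upsampled to many raw audio samples per frame.

```python
import numpy as np

def toy_conditional_generator(features, pitch, upsample=120, rng=None):
    """Toy conditional generator: concatenate linguistic features with
    per-frame pitch, apply a random linear layer (stand-in for a trained
    network), and emit `upsample` audio samples per input frame."""
    if rng is None:
        rng = np.random.default_rng(0)
    cond = np.concatenate([features, pitch[:, None]], axis=1)  # (frames, dims+1)
    w = rng.standard_normal((cond.shape[1], upsample))
    frames = np.tanh(cond @ w)        # (frames, upsample), bounded like audio
    return frames.reshape(-1)         # flatten frames into one raw waveform

feats = np.zeros((4, 8))   # 4 frames of 8 linguistic features (toy values)
f0 = np.ones(4)            # per-frame pitch
audio = toy_conditional_generator(feats, f0)   # 4 * 120 = 480 samples
```

The key structural point is the upsampling: conditioning arrives at a low frame rate, while the generator must emit audio at a much higher sample rate, all in one parallel pass.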
In addition to their data augmentation effect, these random window discriminators (RWDs), which operate on randomly sampled windows of the waveform rather than the full utterance, are better suited to judging both the realism of the audio and how well it corresponds to the target utterance.
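The random-window idea can be sketched as follows (a minimal illustration with stand-in scoring functions, not the paper's discriminator networks): each discriminator scores randomly cropped windows of the waveform, and the ensemble averages their scores, so every training pass sees different crops.

```python
import numpy as np

def random_windows(wave, window, k, rng):
    """Sample k random sub-windows of the waveform; scoring random
    crops instead of the full clip acts like data augmentation."""
    starts = rng.integers(0, len(wave) - window + 1, size=k)
    return np.stack([wave[s:s + window] for s in starts])

def ensemble_score(wave, score_fns, window=240, k=4, seed=0):
    """Average the scores of several window 'discriminators'; each
    score_fn here is a toy stand-in for a trained network."""
    rng = np.random.default_rng(seed)
    wins = random_windows(wave, window, k, rng)
    return float(np.mean([f(w) for f in score_fns for w in wins]))

# Stand-in discriminators: one scores window energy, one peak amplitude.
score = ensemble_score(np.sin(np.linspace(0, 100, 2400)),
                       [lambda w: np.mean(w**2), lambda w: np.abs(w).max()])
```

Because each window starts at a random offset, the discriminators effectively see many different views of the same utterance, which is the data augmentation effect noted above.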
The paper High Fidelity Speech Synthesis with Adversarial Networks is on arXiv.