VAELoopDemo

audio samples generated by VAE-Loop

Paper Info

Paper: Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder (submitted to Interspeech 2018; arXiv)
Authors: Kei Akuzawa, Yusuke Iwasawa, Yutaka Matsuo
Abstract: Recent advances in neural autoregressive models have improved the performance of speech synthesis (SS). However, as they lack the ability to model global characteristics of speech (such as speaker individualities or speaking styles), particularly when these characteristics have not been labeled, making neural autoregressive SS systems more expressive is still an open issue. In this paper, we propose to combine VoiceLoop, an autoregressive SS model, with Variational Autoencoder (VAE). This approach, unlike traditional autoregressive SS systems, uses VAE to model the global characteristics explicitly, enabling the expressiveness of the synthesized speech to be controlled in an unsupervised manner. Experiments using the VCTK and Blizzard2012 datasets show the VAE helps VoiceLoop to generate higher quality speech and to control the expressions in its synthesized speech by incorporating global characteristics into the speech generating process.
Reference: Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop,” in Proc. 6th International Conference on Learning Representations, 2018.
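Preliminary work (in Japanese): 変分自己符号化器を用いた表現の多様性のモデル化による表現豊かな音声合成 (Expressive speech synthesis by modeling the diversity of expressions with a variational autoencoder), 32nd Annual Conference of the Japanese Society for Artificial Intelligence, 2018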

Demo Description:

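Audio samples listed under the same z are generated from the same latent variable, which is sampled from a normal distribution. Samples labeled with a weighted sum, such as 0.5z_1 + 0.5z_2, use the corresponding linear combination of two sampled latent variables. Note that we did not use any speaker or speaking-style labels for either training or generation.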
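
The sample labels used on this page (z_1, z_2, and weighted sums such as 0.5z_1 + 0.5z_2) can be illustrated with a minimal sketch. It is not the authors' code. It assumes a standard normal prior (zero mean, unit variance), a hypothetical latent size `LATENT_DIM`, and hypothetical helper names `sample_latent` and `mix_latents`. The step that conditions the synthesizer on z is omitted.

```python
import random

# Hypothetical latent size; this page does not state the dimensionality.
LATENT_DIM = 64


def sample_latent(dim, rng):
    """Draw one latent vector, assuming a standard normal prior."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def mix_latents(z_a, z_b, weight_a):
    """Return weight_a * z_a + (1 - weight_a) * z_b, element by element."""
    weight_b = 1.0 - weight_a
    return [weight_a * a + weight_b * b for a, b in zip(z_a, z_b)]


# A fixed seed lets the same z be reused for every utterance in a group.
rng = random.Random(0)
z_1 = sample_latent(LATENT_DIM, rng)
z_2 = sample_latent(LATENT_DIM, rng)

# Weights on z_1 matching the VCTK labels: 0.5, 0.3 and 0.2.
mixtures = {w: mix_latents(z_1, z_2, w) for w in (0.5, 0.3, 0.2)}
```

Each entry of `mixtures` is a point on the straight line between z_1 and z_2, so moving the weight from 1 to 0 should move the synthesized voice gradually from the z_1 sample toward the z_2 sample.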

VCTK

z_1

0.5z_1 + 0.5z_2

0.3z_1 + 0.7z_2

0.2z_1 + 0.8z_2

z_2

Blizzard2012

z_1

z_2

0.5z_1 + 0.5z_2

z_3