



Journal of Information Science and Engineering, Vol. 40 No. 1, pp. 189-200


Improving Speech Synthesis by Automatic Speech Recognition and Speech Discriminator


LI-YU HUANG AND CHIA-PING CHEN
Department of Computer Science and Engineering
National Sun Yat-sen University
Kaohsiung, 804 Taiwan
E-mail: m093040070@nsysu.edu.tw; cpchen@cse.nsysu.edu.tw


Speech synthesis (text-to-speech, TTS) and automatic speech recognition (ASR) are opposite tasks, yet they can be complementary. In this work, we improve TTS by using ASR to verify the synthesizer's output: the recognized text is compared with the ground-truth text to compute an ASR loss that penalizes the TTS model. In our experiments, TTS without ASR scored a mean opinion score (MOS) of 3.96, while TTS with ASR achieved 4.21. We also enhance TTS with the discriminator architecture of generative adversarial networks (GANs). Adding a speech discriminator that judges the mel-spectrograms produced by the synthesizer alters the learning of the TTS model and improves the quality of the synthesized speech. In our experiments, TTS with the speech discriminator scored 4.26 MOS. Finally, our best TTS system combines both ASR and the speech discriminator in the synthesizer model and reaches 4.29 MOS.
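The abstract describes two auxiliary objectives added to TTS training: an ASR verification loss and a GAN-style speech-discriminator loss. The following Python (PyTorch) sketch illustrates one plausible way such a combined loss could be assembled; it is not the authors' implementation. The helper name total_tts_loss, the CTC formulation of the ASR loss, the padding/blank index of 0, and the weights w_asr and w_adv are all assumptions made for illustration.

import torch
import torch.nn.functional as F

def total_tts_loss(tts_model, asr_model, discriminator,
                   text_ids, target_mel,
                   w_asr=0.1, w_adv=0.05):
    # 1) Standard TTS reconstruction loss on the mel-spectrogram.
    pred_mel = tts_model(text_ids)                        # (B, T, n_mels)
    recon_loss = F.l1_loss(pred_mel, target_mel)

    # 2) ASR verification loss (assumed CTC): a frozen ASR model
    #    transcribes the synthesized mel, and the loss against the
    #    ground-truth text penalizes speech that is hard to recognize.
    log_probs = asr_model(pred_mel).log_softmax(dim=-1)   # (B, T, vocab)
    input_lens = torch.full((text_ids.size(0),), log_probs.size(1),
                            dtype=torch.long)
    target_lens = (text_ids != 0).sum(dim=1)              # assume 0 = padding
    asr_loss = F.ctc_loss(log_probs.transpose(0, 1), text_ids,
                          input_lens, target_lens, blank=0)

    # 3) Adversarial loss: the speech discriminator scores the synthesized
    #    mel, and the synthesizer is trained to make it look "real".
    fake_score = discriminator(pred_mel)
    adv_loss = F.binary_cross_entropy_with_logits(
        fake_score, torch.ones_like(fake_score))

    return recon_loss + w_asr * asr_loss + w_adv * adv_loss

In this sketch the discriminator would be updated separately with real and synthesized mel-spectrograms, as in standard GAN training; only the synthesizer-side terms are shown here.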


Keywords: text-to-speech, automatic speech recognition, discriminator, speech synthesis, generative adversarial network
