JISE


Journal of Information Science and Engineering, Vol. 40 No. 2, pp. 359-373

Training Speech Recognition Model with Speech Synthesis and Text Discriminator

HOU-AN LIN AND CHIA-PING CHEN
Department of Computer Science and Engineering
National Sun Yat-sen University
Kaohsiung, 804 Taiwan
E-mail: m093040066@nsysu.edu.tw; cpchen@cse.nsysu.edu.tw

In this paper, we incrementally build neural-network-based automatic speech recognition (ASR) systems to improve performance. First, we add an adversarial text discriminator module to ASR training so that the speech recognition model learns to correct typos in its recognition results. Experiments show that the resulting ASR system achieves a character error rate (CER) of 12.3% and a word error rate (WER) of 31.4%. Second, we insert a pre-trained speech synthesis (text-to-speech, TTS) module into the ASR model. When we exploit a pre-trained TTS in ASR training, the CER and WER are reduced from 12.6% and 31.7% to 10.8% and 24.4%, respectively, demonstrating that a pre-trained TTS can improve ASR. Finally, we include both the pre-trained TTS and the text discriminator in ASR training. The performance of this ASR system improves further, reaching a CER of 9.9% and a WER of 22.7%. On the Formosa Speech Recognition Challenge task using Taibun Han-jī transcription, the proposed method also achieves a better CER than a system based on the hybrid DNN-HMM chain model.

Keywords: automatic speech recognition, text-to-speech, adversarial text discriminator, DNN-HMM chain model, Formosa Speech Recognition Challenge
