JISE




Journal of Information Science and Engineering, Vol. 34 No. 6, pp. 1493-1516


Voice Conversion Based on Locally Linear Embedding


HSIN-TE HWANG1,3, YI-CHIAO WU1, YU-HUAI PENG1,4, CHIN-CHENG HSU1,
YU TSAO2, HSIN-MIN WANG1, YIH-RU WANG3 AND SIN-HORNG CHEN3
1Institute of Information Science
2Research Center for Information Technology Innovation
Academia Sinica
Taipei, 115 Taiwan

3Department of Electrical and Computer Engineering
National Chiao Tung University
Hsinchu, 300 Taiwan

4Department of Electrical Engineering
National Tsing Hua University
Hsinchu, 300 Taiwan
E-mail: {hwanght; tedwu; jeremycchsu; whm}@iis.sinica.edu.tw; roland19930601@gmail.com;
yu.tsao@citi.sinica.edu.tw; {yrwang; schen}@mail.nctu.edu.tw


This paper presents a novel locally linear embedding (LLE)-based framework for exemplar-based spectral conversion (SC). The key feature of the proposed framework is that it integrates the LLE algorithm, a manifold learning method, with the conventional exemplar-based SC method. An important advantage of the LLE-based SC framework is that it can be applied to either one-to-one or many-to-one SC. For one-to-one SC, a parallel speech corpus containing utterances from the pre-specified source and target speakers is used to construct paired source and target dictionaries in advance. During online conversion, the LLE-based SC method converts the source spectral features to target-like spectral features based on the paired dictionaries. When applied to many-to-one SC, the system can convert the voice of any unseen source speaker to that of a desired target speaker without requiring parallel training utterances from the two speakers to be collected beforehand. To further improve the quality of the converted speech, the maximum likelihood parameter generation (MLPG) and global variance (GV) methods are adopted in the proposed SC systems. Experimental results demonstrate that the proposed one-to-one SC system is comparable to the state-of-the-art Gaussian mixture model (GMM)-based one-to-one SC system in terms of speech quality and speaker similarity, and that the many-to-one SC system approaches the performance of the one-to-one SC system.


Keywords: voice conversion, locally linear embedding, exemplar-based, many-to-one, manifold learning
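
The one-to-one conversion step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the algorithmic details, so the code assumes the standard LLE weight computation (sum-to-one least-squares weights over the K nearest source exemplars) and then applies those weights to the paired target exemplars. The function names, the neighbor count `k`, and the regularization constant `reg` are hypothetical, and the MLPG and GV post-processing steps are omitted.

```python
import numpy as np


def lle_weights(x, neighbors, reg=1e-3):
    """Sum-to-one weights that approximately reconstruct x from its neighbors.

    x: (D,) source frame; neighbors: (K, D) nearest source exemplars.
    reg is an assumed regularization constant, not a value from the paper.
    """
    diff = neighbors - x                       # (K, D) offsets from the frame
    gram = diff @ diff.T                       # (K, K) local Gram matrix
    # Regularize: the Gram matrix is singular when K > D or neighbors coincide.
    gram = gram + reg * (np.trace(gram) + 1e-12) * np.eye(len(neighbors))
    w = np.linalg.solve(gram, np.ones(len(neighbors)))
    return w / w.sum()                         # enforce the sum-to-one constraint


def lle_convert(source_frames, src_dict, tgt_dict, k=4):
    """Convert each source frame using paired source/target dictionaries.

    src_dict: (N, D) source exemplars; tgt_dict: (N, D_t) paired target
    exemplars, where row i of each dictionary comes from the same aligned frame.
    """
    converted = np.empty((len(source_frames), tgt_dict.shape[1]))
    for t, x in enumerate(source_frames):
        dist = np.sum((src_dict - x) ** 2, axis=1)   # squared Euclidean distances
        idx = np.argsort(dist)[:k]                   # K nearest source exemplars
        w = lle_weights(x, src_dict[idx])
        converted[t] = w @ tgt_dict[idx]             # same weights, target side
    return converted
```

Because the weights sum to one, each converted frame is an affine combination of target exemplars, which is what lets local structure found in the source dictionary carry over to the paired target dictionary. With `k=1` the sketch reduces to copying the target exemplar paired with the nearest source exemplar.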
