JISE




Journal of Information Science and Engineering, Vol. 17, No. 5, pp. 805-824


Extracting Chinese Frequent Strings Without a Dictionary From a Chinese Corpus and its Applications


Yih-Jeng Lin and Ming-Shing Yu* 
Department of Applied Mathematics 
National Chung-Hsing University 
Taichung, 402 Taiwan 
E-mail: yclin@amath.nchu.edu.tw 
*E-mail: msyu@dragon.nchu.edu.tw


    This paper describes how to extract Chinese frequent strings without using a dictionary. We generalize the notions of words and unknown words to that of frequent strings. The Chinese frequent strings (CFSs) we define include words, unknown words, and other strings that are frequently used. Some examples of CFSs are “只得將 (can only let)”, “分分秒秒 (every minute and every second)”, “為對方著想 (bearing in mind the interest of each other)”, and “並沒有人 (and nobody)”. CFSs are very useful in Chinese natural language processing and its related applications. We show their application to the following three tasks: Chinese phoneme-to-character conversion, Chinese character-to-phoneme conversion, and the determination of prosodic segments in a Chinese sentence for text-to-speech output. We have also developed a simple method to extract CFSs from a corpus. The method we propose can automatically detect such strings without the use of any lexicon, and no word segmentation is needed. We can also extract unknown words in a corpus that consist of three or more words. Such words (e.g. 網際網路) usually cannot be extracted by using a concatenation approach.
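    As a rough illustration of what lexicon-free extraction without word segmentation can look like, the sketch below counts every character substring in a toy corpus and keeps those that recur. It is not the method of this paper: the function name `frequent_strings`, the length bounds, the count threshold, and the three sample sentences are all assumptions made for the example, and the paper's own selection criterion (the keywords point to normalized perplexity) is not reproduced here.

```python
from collections import Counter


def frequent_strings(sentences, min_len=2, max_len=4, min_count=2):
    """Return substrings of min_len..max_len characters seen at least min_count times.

    Hypothetical helper for illustration only: it uses raw frequency,
    with no dictionary lookup and no word segmentation.
    """
    counts = Counter()
    for sentence in sentences:
        for n in range(min_len, max_len + 1):
            # Slide a window of n characters over the raw sentence.
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


# Toy corpus (assumed data, not taken from the paper).
corpus = ["網際網路很方便", "我們使用網際網路", "網際網路的發展"]
candidates = frequent_strings(corpus)
```

    With this toy corpus, 網際網路 survives the threshold as a single four-character candidate, but so do fragments such as 際網. Raw counts alone therefore over-generate, which is why a further filtering step is needed in practice.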


Keywords: CFS, normalized perplexity, phoneme-to-character, character-to-phoneme, prosodic segment
