JISE


Journal of Information Science and Engineering, Vol. 35 No. 3, pp. 597-610

Combining Mutual Information and Entropy for Unknown Word Extraction from Multilingual Code-Switching Sentences

CHENG-WEI LEE, YI-LUN WU AND LIANG-CHIH YU
Department of Information Management
Yuan-Ze University
Taoyuan, 320 Taiwan
E-mail: {s989206; s986301}@mail.yzu.edu.tw; lcyu@saturn.yzu.edu.tw

In multilingual environments, a single statement may include content from more than one language, a phenomenon known as code-switching. Among speakers of Mandarin Chinese, code-switching occurs frequently in daily life, and this mixing of languages poses serious challenges for language processing. This paper collects text corpora containing Mandarin-English and Mandarin-Taiwanese code-switching, in which Mandarin is the dominant language. Mutual information and entropy then serve as the basis for an algorithm that identifies unknown words in multilingual texts; these words are then automatically referenced for multilingual inclusions. Experimental results show that the proposed method effectively filters out unrelated new words, thereby improving the accuracy of unknown word extraction.

Keywords: code-switching, unknown word extraction, mutual information, entropy, natural language processing
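
The abstract names mutual information and entropy as the two signals behind the extraction algorithm but does not give their exact formulation. As a rough illustration only, the sketch below uses the common textbook definitions: pointwise mutual information between two adjacent symbols, and the entropy of the symbols seen immediately to the left and right of the pair. The function names `entropy` and `score_pair`, the character-level counting, and the toy corpus are assumptions made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a Counter of neighbour frequencies."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def score_pair(sentences, x, y):
    """Return (PMI, left entropy, right entropy) for the adjacent pair x, y.

    Assumption: each sentence is a sequence of symbols (a str is treated as
    a sequence of characters). This is an illustrative scoring scheme, not
    the paper's algorithm.
    """
    unigrams, bigrams = Counter(), Counter()
    left, right = Counter(), Counter()
    for s in sentences:
        unigrams.update(s)
        for i in range(len(s) - 1):
            bigrams[(s[i], s[i + 1])] += 1
            if s[i] == x and s[i + 1] == y:
                if i > 0:
                    left[s[i - 1]] += 1       # symbol preceding the pair
                if i + 2 < len(s):
                    right[s[i + 2]] += 1      # symbol following the pair
    joint = bigrams[(x, y)]
    if joint == 0:
        return float("-inf"), 0.0, 0.0        # pair never observed
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_xy = joint / n_bi
    p_x = unigrams[x] / n_uni
    p_y = unigrams[y] / n_uni
    pmi = math.log2(p_xy / (p_x * p_y))
    return pmi, entropy(left), entropy(right)


# Toy corpus (hypothetical): "ab" recurs with varied neighbours on both sides.
corpus = ["xaby", "zabw", "cd"]
pmi, h_left, h_right = score_pair(corpus, "a", "b")
```

Under this reading, a high PMI suggests the two symbols tend to occur together, while high left and right entropy suggest the pair appears in varied contexts, so it behaves like a free-standing word. How the paper combines or thresholds these scores is not stated in the abstract.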
