JISE




Journal of Information Science and Engineering, Vol. 11 No. 1, pp. 35-49


Approximating False Hits of Disyllabic Terms in a Chinese Signature File


Tyne Liang, Suh-Yin Lee and Wei-Pang Yang*
Institute of Computer Science and Information Engineering 
National Chiao Tung University 
Hsinchu, Taiwan 300, R.O.C. 
*Institute of Computer and Information Science 
National Chiao Tung University 
Hsinchu, Taiwan 300, R.O.C.


    The signature access method is a well-proven technique in text retrieval systems. However, a drawback of signature files is the false hits inherent in the filtering process. In this paper, we discuss the problem of false hits for Chinese disyllabic queries. We identify two kinds of false hits. The first, which we call random false hits, is attributable to the accidental setting of signature bits. The second, which we call adjacency false hits, is due to the lack of character sequence information in signature files. Since many Chinese query terms are disyllabic (composed of two characters), we formulate the false hit probability specifically for disyllabic queries, based on statistical theory. Our theoretical model has been tested in experiments using a real corpus. Satisfactory agreement between the predictions for both kinds of false hits and the experimental results has been obtained.
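
    The two kinds of false hits described above can be illustrated with a minimal sketch. It assumes a simple superimposed-coding scheme in which each character sets a few bits of a fixed-width block signature. The hash function, the signature width M_BITS, the bits-per-character K_BITS, and the function names char_bits, block_signature, and classify are all illustrative assumptions, not the scheme or parameters used in the paper.

```python
import hashlib

M_BITS = 64  # signature width in bits (assumed value, not from the paper)
K_BITS = 3   # hash positions per character (assumed value, not from the paper)


def char_bits(ch, m=M_BITS, k=K_BITS):
    """Return a bitmask with at most k of m bits set for one character."""
    mask = 0
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{ch}".encode("utf-8")).digest()
        mask |= 1 << (int.from_bytes(digest[:4], "big") % m)
    return mask


def block_signature(text, m=M_BITS, k=K_BITS):
    """Superimpose (bitwise OR) the bitmasks of every character in a block."""
    sig = 0
    for ch in text:
        sig |= char_bits(ch, m, k)
    return sig


def classify(block, term, m=M_BITS, k=K_BITS):
    """Classify how a text block responds to a two-character query term."""
    if len(term) != 2:
        raise ValueError("term must consist of exactly two characters")
    query = char_bits(term[0], m, k) | char_bits(term[1], m, k)
    # Filtering step: the block qualifies only if every query bit is set.
    if block_signature(block, m, k) & query != query:
        return "rejected"
    if term in block:
        return "true hit"
    # Both characters occur, but not side by side in the queried order:
    # the signature carries no character sequence information.
    if term[0] in block and term[1] in block:
        return "adjacency false hit"
    # At least one character is absent; other characters happened to set
    # the same bits.
    return "random false hit"
```

    In this sketch, a block containing both query characters in a different order always passes the filter and is reported as an adjacency false hit, whereas a random false hit depends on bit collisions and becomes more likely as the signature width shrinks.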


Keywords: Chinese text retrieval, signature file, false hits, adjacency false hit, disyllabic terms
