JISE




Journal of Information Science and Engineering, Vol. 29 No. 1, pp. 99-114


A Text Categorization Method using Extended Vector Space Model by Frequent Term Sets


MAN YUAN1,2, YUAN XIN OUYANG1,2,+ AND ZHANG XIONG1,2
1School of Computer Science and Technology Beihang University
Beijing, 100191 P.R. China
2Research Institute of Beihang University in Shenzhen
VU Park, High-tech Industrial Estate
Shenzhen, 518057 P.R. China

 


    Text categorization is one of the most important research topics in Natural Language Processing and Information Retrieval because of the ever-increasing volume of electronic documents. This paper presents a new text categorization method using frequent term sets. A novel constraint measure, AD-Sup, is introduced to extract discriminative features from frequent term sets for the classification task. Text documents are then represented in a global feature space that contains both single terms and frequent term sets. To address the sparse instance problem, a term weighting strategy is implemented that assigns estimated weights using feature similarity and substantially reduces the sparsity rate. Through extensive experiments, the optimal proportion of single-term features to frequent term set features is determined empirically. Classification results on the Reuters-21578 and WebKB corpora demonstrate that the AD-Sup constraint is effective at extracting useful frequent features and that the combination strategy builds a better feature space and improves the SVM classifier.


Keywords: text categorization, text representation, frequent term sets, Apriori, SVM
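
The idea of a feature space that holds both single terms and frequent term sets can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: a plain document-frequency (support) threshold stands in for the AD-Sup constraint, which the abstract does not define; term sets are limited to pairs by default; feature values are binary, with no similarity-based weight estimation; and the function names `frequent_term_sets` and `extended_vectors` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Return {term_set: support} for term sets of size 2..max_size
    that occur in at least min_support documents.

    Assumption: plain support is used here, NOT the paper's AD-Sup measure.
    """
    doc_sets = [set(d) for d in docs]
    # Level 1: keep only terms that are frequent on their own, since a
    # term set cannot be frequent if one of its terms is not.
    term_counts = Counter(t for d in doc_sets for t in d)
    frequent_terms = {t for t, c in term_counts.items() if c >= min_support}

    result = {}
    for size in range(2, max_size + 1):
        counts = Counter()
        for d in doc_sets:
            candidates = sorted(d & frequent_terms)
            for combo in combinations(candidates, size):
                counts[combo] += 1
        for combo, c in counts.items():
            if c >= min_support:
                result[combo] = c
    return result


def extended_vectors(docs, min_support):
    """Build binary vectors over single terms plus frequent term sets."""
    doc_sets = [set(d) for d in docs]
    single = sorted({t for d in doc_sets for t in d})
    term_sets = sorted(frequent_term_sets(docs, min_support))
    # Each feature is a tuple of terms; single terms are 1-tuples.
    features = [(t,) for t in single] + term_sets
    vectors = [[1 if set(f) <= d else 0 for f in features] for d in doc_sets]
    return features, vectors


# Toy corpus (illustrative only, not from Reuters-21578 or WebKB).
docs = [
    ["oil", "price", "rise"],
    ["oil", "price", "fall"],
    ["web", "page", "course"],
    ["oil", "export"],
]
features, vectors = extended_vectors(docs, min_support=2)
```

In this toy corpus only the pair ("oil", "price") reaches the support threshold, so it is appended to the eight single-term features. The resulting vectors could then be passed to any SVM implementation.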
