JISE




Journal of Information Science and Engineering, Vol. 30 No. 5, pp. 1463-1481


Clustering-based Method for Positive and Unlabeled Text Categorization Enhanced by Improved TFIDF


LU LIU1,2 AND TAO PENG1,2,3,*
1College of Computer Science and Technology
Jilin University
Changchun, 130012 China
2Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, 61801 USA
3Key Laboratory of Symbol Computation and Knowledge Engineering
Ministry of Education
Changchun, 130012 China

 


    PU (Positive and Unlabeled) learning arises frequently in Web page classification and text retrieval applications because users may be interested in information on the same topic. Collecting reliable negative examples is a key step in PU text classification, which addresses a central machine learning problem: no labeled negative examples are available in the training set, or negative examples are difficult to collect. This paper therefore presents a novel clustering-based method for collecting reliable negative examples (C-CRNE). Unlike traditional methods, we remove as many probable positive examples from the unlabeled set as possible, so that more reliable negative examples are found. For building the classifier, we present a novel TFIDF-improved feature weighting approach, which reflects the importance of a term in the positive and negative training examples respectively, to describe documents in the Vector Space Model. We also build a weighted voting classifier (WVC) by iteratively applying the SVM algorithm, and implement the OCS (One-class SVM), PEBL (Positive Example Based Learning) and 1-DNFII (Constrained 1-DNF) methods for comparison. Experimental results on three real-world datasets (Reuters Corpus Volume 1 (RCV1), Reuters-21578 and 20 Newsgroups) show that the proposed C-CRNE extracts more reliable negative examples than the baseline algorithms, with very low error rates. Our classifier also outperforms other state-of-the-art classification methods on traditional performance metrics.


Keywords: text classification, reliable negative examples, clustering, C-CRNE, WVC
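
The abstract does not spell out the C-CRNE procedure, so the following is only a minimal sketch of the general idea it describes: cluster the unlabeled set, discard clusters that look like the positive class, and keep the rest as reliable negatives. Everything concrete here is an assumption made for illustration, not the authors' algorithm: the function names, the plain TF-IDF weighting (not the improved weighting the paper proposes), the tiny cosine k-means seeded with the first k documents, and the similarity threshold of 0.5. It also assumes k is no larger than the number of unlabeled documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Standard TF-IDF over tokenized docs (NOT the paper's improved weighting)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {t: tf * (math.log(n / df[t]) + 1.0) for t, tf in Counter(doc).items()}
        for doc in docs
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: w / len(vectors) for t, w in total.items()}


def cluster(vectors, k, iterations=10):
    """Tiny cosine k-means, seeded deterministically with the first k vectors."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = centroid(members)
    return labels, centroids


def reliable_negatives(positive_docs, unlabeled_docs, k=2, threshold=0.5):
    """Return indices of unlabeled docs outside positive-looking clusters.

    Hypothetical rule: a cluster counts as 'probably positive' when its
    centroid has cosine similarity >= threshold with the positive centroid.
    """
    vectors = tfidf_vectors(positive_docs + unlabeled_docs)
    pos_vecs = vectors[: len(positive_docs)]
    unl_vecs = vectors[len(positive_docs):]
    pos_centroid = centroid(pos_vecs)
    labels, centroids = cluster(unl_vecs, k)
    positive_like = {
        c for c in range(k) if cosine(centroids[c], pos_centroid) >= threshold
    }
    return [i for i, lab in enumerate(labels) if lab not in positive_like]
```

On a toy corpus with two finance-themed positive documents and an unlabeled set mixing finance and sports documents, this sketch drops the finance-like cluster and returns the indices of the sports documents. The threshold trades off how aggressively probable positives are removed against how many candidate negatives remain; the value used here is arbitrary.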
