JISE




Journal of Information Science and Engineering, Vol. 26 No. 6, pp. 1941-1956


Using Random Forest for Protein Fold Prediction Problem: An Empirical Study


ABDOLLAH DEHZANGI, SOMNUK PHON-AMNUAISUK AND OMID DEHZANGI*
Center of Artificial Intelligence and Intelligent Computing 
Faculty of Information Technology 
Multimedia University 
Cyberjaya, Selangor, 63100 Malaysia 
E-mail: {abdollah.dehzangi07; somnuk.amnuaisuk}@mmu.edu.my 
*School of Computer Engineering 
Nanyang Technological University 
Nanyang Avenue, 639798 Singapore 
E-mail: omid0002@ntu.edu.sg


    The function of a protein in biological reactions depends crucially on its three-dimensional structure. Predicting the three-dimensional (tertiary) structure of a protein from its amino acid sequence (primary structure) is considered a challenging task in bioinformatics and molecular biology. Recently, owing to tremendous advances in pattern recognition, there has been growing interest in applying classification approaches to the protein fold prediction problem. In this paper, Random Forest, an ensemble method, is employed to address this problem. Random Forest is a recently introduced method based on the bagging algorithm: it trains a group of base classifiers on randomly selected sets of features and then combines their outputs by majority voting. To investigate the effect of the number of base learners on the performance of the Random Forest, twelve different numbers of base classifiers (between 30 and 600) are tested. To study the performance of the Random Forest and compare it with previously reported results, the dataset produced by Ding and Dubchak is used. Our experimental results show that, compared with previous work in the literature, the Random Forest improves prediction accuracy (using the same set of features proposed by Dubchak et al.) and reduces the time consumed by the protein fold prediction task.


Keywords: protein fold prediction problem, classifier ensemble, random forest, bootstrap sampling, weak learner, feature selection, random sampling, bagging, prediction performance
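
The ensemble procedure summarized in the abstract (bootstrap sampling, random feature subsets, majority voting) can be illustrated with a toy sketch. This is not the authors' implementation: the base learners here are one-level decision stumps rather than full trees, the data are synthetic rather than the Ding and Dubchak dataset, and the names `train_stump`, `train_forest`, and `predict` are hypothetical.

```python
import random
from collections import Counter


def train_stump(rows, labels, features):
    """Fit a one-level tree: choose the (feature, threshold) pair with the
    fewest training errors, searching only the given feature subset."""
    best = None
    for f in features:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return f, t, left_label, right_label


def train_forest(rows, labels, n_learners, n_features, seed=0):
    """Bagging: each base learner sees a bootstrap sample of the rows
    and a random subset of the features."""
    rng = random.Random(seed)
    n = len(rows)
    all_features = list(range(len(rows[0])))
    forest = []
    for _ in range(n_learners):
        idx = [rng.randrange(n) for _ in range(n)]    # sample rows with replacement
        feats = rng.sample(all_features, n_features)  # random feature subset
        forest.append(train_stump([rows[i] for i in idx],
                                  [labels[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Combine the base learners' outputs by majority vote."""
    votes = Counter(ll if row[f] <= t else rl for f, t, ll, rl in forest)
    return votes.most_common(1)[0][0]


# Synthetic, well-separated two-class data (an assumption for illustration only).
rows_a = [(i * 0.05, 1 - i * 0.05, (i % 5) * 0.2) for i in range(20)]
rows_b = [(5 + x, 5 + y, 5 + z) for x, y, z in rows_a]
rows = rows_a + rows_b
labels = ["fold_a"] * 20 + ["fold_b"] * 20

# 30 base learners, the smallest ensemble size mentioned in the abstract.
forest = train_forest(rows, labels, n_learners=30, n_features=2)
print(predict(forest, (0.0, 0.0, 0.0)), predict(forest, (9.0, 9.0, 9.0)))
```

Varying `n_learners` in this sketch mirrors, in miniature, the experiment described above, in which the number of base classifiers ranges from 30 to 600.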
