JISE


  [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20]


Journal of Information Science and Engineering, Vol. 25 No. 2, pp. 435-464


Fast Extraction of Maximal Frequent Subtrees Using Bits Representation


Juryon Paik, Junghyun Nam+, Dongho Won and Ung Mo Kim
Department of Computer Engineering 
Sungkyunkwan University 
Gyeonggi-do, 440-746 Korea 
+Department of Computer Science 
Konkuk University 
Seoul, 143-701 Korea


    With the continuous growth in XML data sources over the Internet, the discovery of useful information from a collection of XML documents is currently one of the main research areas occupying the data mining community. The most commonly adopted approach to this task is to extract frequently occurring subtree patterns from XML trees. But, the number of frequent subtrees usually grows exponentially with the size of trees, and therefore, mining all frequent subtrees becomes infeasible for large size trees. A more practical and scalable alternative is to use maximal frequent subtrees, the number of which is much smaller than that of frequent subtrees. Handling the maximal frequent subtrees is an interesting challenge, though, and represents the core of this paper. We present a novel, conceptually simple, yet effective algorithm, called EXiT-B, that significantly simplifies the process of mining maximal frequent subtrees. This is achieved by two distinct features. First, EXiT-B represents all of string node labels of trees by some specified length of bits. Through fast bitwise operations, the process of deciding on which paths of trees contain a given node is accelerated. Second, EXiT-B avoids time-consuming tree join operations by using a specially devised data structure called PairSet. To the best of our knowledge, EXiT-B is the first algorithm that discovers maximal frequent subtrees adopting bits representation. We also demonstrate the performance of our algorithm through extensive experiments using synthetic datasets which were generated artificially by a randomized tree-structure generator.


Keywords: XML mining, tree mining, mining methods and algorithms, bits representation, maximal frequent subtree

  Retrieve PDF document (JISE_200902_07.pdf)