[ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ] [ 7 ] [ 8 ] [ 9 ] [ 10 ] [ 11 ] [ 12 ]

Journal of Information Science and Engineering, Vol. 35 No. 1, pp. 201-222

Cardinality Estimation Based on Cluster Analysis

1School of Mathematics and Information Science and Technology
Hebei Normal University of Science and Technology
Hebei, 066004 P.R. China
E-mail: {qhdzxn; peicaiyan78; caojxxf}@163.com

2Department of Information Engineering
Hebei University of Environmental Engineering
Hebei, 066004 P.R. China
E-mail: fydong_xl@126.com

In the cardinality estimation solutions based on multi-dimensional self-tuning histograms, periodical data scans are avoided and self-tuning histograms are constructed according to query feedback records (QFRs). We call this kind of cardinality estimation solutions the reactive solutions. In the existing reactive solutions, self-tuning histograms are constructed over the entire value ranges of the queried attributes and have large scales. The bucket number of a multi-dimensional self-tuning histogram increases exponentially with the dimension. That means the existing reactive solutions are stuck with the issue of "curse of dimension". Simultaneously, to construct and maintain a multi-dimensional self-tuning histogram over the entire value ranges of the queried attributes, a large number of QFRs must be accumulated at a long time span, and the cumbersome operations have to be executed repeatedly to meet a space budget, which makes the existing reactive solutions unpredictable and time-consuming. To address above issues, a new reactive solution is proposed in the paper. In the solution, the global self-tuning histogram covering the entire value range of the queried attributes is abandoned. When a new predicate p is executed, the Ward's minimum variance method is used to find k nearest QFRs with p from the QFR warehouse. Based on the found k QFRs, a micro self-tuning histogram only covering the neighborhood of p is constructed to help estimate the cardinality of p. The solution can be considered as a beneficial attempt to improve the cardinality estimation efficiency under high dimensions, and notably alleviate the issue of “curse of dimension”. Furthermore, old QFRs can be replaced rapidly by the new QFRs and the data and workload changes can be timely reflected by micro self-tuning histograms. The process of meeting a space budget is eliminated completely, which makes the whole solution reliable and dexterous.

Keywords: cardinality estimation, cluster analysis, ward’s minimum variance method, self-tuning, query feedback record

  Retrieve PDF document (JISE_201901_11.pdf)