Supervised Evaluation of Top-k Itemset Mining Algorithms
Accepted at DAWAK ’15: 17th International Conference on Data Warehousing and Knowledge Discovery .
Abstract. A major mining task for binary matrixes is the extraction of approximate top-k patterns that are able to concisely describe the input data. The top-k pattern discovery problem is commonly stated as an optimization one, where the goal is to minimize a given cost function, e.g., the accuracy of the data description. In this work, we review several greedy state-of-the-art algorithms, and propose a methodology to compare the patterns extracted. In evaluating the set of mined patterns, we aim at overcoming the usual assessment methodology, which only measures the given cost function to minimize. Thus, we evaluate how good are the models/patterns extracted in unveiling supervised knowledge on the data. To this end, we test algorithms and diverse cost functions on several datasets from the UCI repository. As contribution, we show that PaNDa+ performs best in the majority of the cases, since the classifiers built over the mined patterns used as dataset features are the most accurate.
We observe the lack of common real-world benchmarks, e.g., datasets for which the most important embedded patterns are known, and that can be used as a ground truth to evaluate and compare algorithms. We thus propose a supervised evaluation methodology of the extracted top-k patterns that measures and compares the ability of each algorithm in discovering interesting features that we can use to train accurate classifier.
Each binary UCI dataset, from which the class labels is removed in advance, is first subdivided into training and test set. We employ three state-of-the-art algorithms, to extract a distinct a pattern set Πk from the training set. Then, every transaction in the training and test sets is mapped into the pattern space by considering a binary feature for each approximate pattern in Πk indicating its presence/absence in the transaction. Note that the algorithms we evaluated produce, for each pattern ⟨PI,PT⟩∈ Πk, the set of transactions PT in the training where PI it occurs, and this set is used for the mapping. For what regards the test set, we say that an approximate pattern P ∈ Πk occurs in an unseen test transaction t iff |PI ∩ t|∕|PI|≥ η, where η is the minimum intersection ratio |PI ∩ t*|∕|PI| for every training transaction t*∈ PT. The rationale is to accept a pattern P for a test transaction t if it does not generate more noise than what it has been observed in the training set. After mapping the transactions into such pattern space, the class labels are restored in the transformed training and test sets. The training set is thus used to train an SVM classifier. Finally, the classifier is evaluated on the mapped test set.
Results show that PaNDa+ outperforms all the other algorithms with a statistically significant improvement.