## Supervised Evaluation of Top-k Itemset Mining Algorithms

Accepted at DaWaK 2015: 17th International Conference on Data Warehousing and Knowledge Discovery [1].

Abstract. A major mining task for binary matrices is the extraction of approximate top-k patterns that concisely describe the input data. The top-k pattern discovery problem is commonly stated as an optimization one, where the goal is to minimize a given cost function, e.g., based on the accuracy of the data description. In this work, we review several greedy state-of-the-art algorithms, and propose a methodology to compare the patterns they extract. In evaluating the set of mined patterns, we aim at overcoming the usual assessment methodology, which only measures the given cost function to be minimized. Instead, we evaluate how well the extracted models/patterns unveil supervised knowledge on the data. To this end, we test the algorithms and diverse cost functions on several datasets from the UCI repository. As a contribution, we show that PaNDa+ performs best in the majority of the cases, since the classifiers built over the mined patterns used as dataset features are the most accurate.

### Evaluation methodology

We observe the lack of common real-world benchmarks, e.g., datasets for which the most important embedded patterns are known, and that can be used as a ground truth to evaluate and compare algorithms. We thus propose a supervised evaluation methodology of the extracted top-k patterns that measures and compares the ability of each algorithm to discover interesting features that can be used to train accurate classifiers.

Each binary UCI dataset, from which the class labels are removed in advance, is
first subdivided into a training and a test set. We employ three state-of-the-art
algorithms to extract a distinct pattern set Π_{k} from the training set. Then, every
transaction in the training and test sets is mapped into the pattern space by
considering a binary feature for each approximate pattern in Π_{k} indicating its
presence/absence in the transaction. Note that the algorithms we evaluated produce,
for each pattern ⟨P_{I},P_{T}⟩∈ Π_{k}, the set of training transactions P_{T} in which P_{I}
occurs, and this set is used for the mapping. As for the test
set, we say that an approximate pattern P ∈ Π_{k} occurs in an unseen test
transaction t iff |P_{I} ∩ t|∕|P_{I}|≥ η, where η is the minimum intersection
ratio |P_{I} ∩ t^{*}|∕|P_{I}| over all training transactions t^{*}∈ P_{T}. The rationale is
to accept a pattern P for a test transaction t only if it does not generate more
noise than what has been observed in the training set. After mapping the
transactions into such pattern space, the class labels are restored in the
transformed training and test sets. The training set is thus used to train
an SVM classifier. Finally, the classifier is evaluated on the mapped test
set.
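The mapping step above can be sketched as follows. The function and variable names are illustrative assumptions, not taken from the paper, and the classifier training step is omitted; a pattern is modeled as a pair (P_I, P_T) of an itemset and the indices of the training transactions it covers:

```python
# Sketch of the pattern-space mapping (illustrative names, not from the paper).
# A pattern is a pair (P_I, P_T): an itemset P_I and the set of
# training-transaction indices P_T it covers.

def eta(P_I, P_T, train):
    """Minimum intersection ratio |P_I ∩ t*| / |P_I| over t* ∈ P_T."""
    return min(len(P_I & train[t]) / len(P_I) for t in P_T)

def map_train(patterns, train):
    """Feature j of training transaction t is 1 iff t ∈ P_T of pattern j."""
    return [[1 if t in P_T else 0 for _, P_T in patterns]
            for t in range(len(train))]

def map_test(patterns, train, test_txns):
    """A pattern occurs in an unseen transaction t iff
    |P_I ∩ t| / |P_I| >= eta, the pattern's training-noise threshold."""
    thresholds = [eta(P_I, P_T, train) for P_I, P_T in patterns]
    return [[1 if len(P_I & t) / len(P_I) >= th else 0
             for (P_I, _), th in zip(patterns, thresholds)]
            for t in test_txns]

# Toy run: two patterns, four training and two test transactions.
train = [{1, 2, 3}, {1, 2}, {2, 3, 4}, {1, 3, 4}]
patterns = [({1, 2}, {0, 1}), ({3, 4}, {2, 3})]
test_txns = [{1, 2, 4}, {3, 4}]

X_train = map_train(patterns, train)           # [[1,0],[1,0],[0,1],[0,1]]
X_test = map_test(patterns, train, test_txns)  # [[1,0],[0,1]]
```

After this mapping, the class labels would be restored and `X_train` fed to an SVM (e.g., any off-the-shelf implementation such as scikit-learn's `SVC`), which is then evaluated on `X_test`.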

Results show that PaNDa+ outperforms all the other algorithms with a statistically significant improvement.