
Claudio Lucchese


I’m a researcher at the I.S.T.I. “A. Faedo” in Pisa, working in the High Performance Computing Lab. Here you can find contact info, my résumé, and other details. Below are some news and notes on my research activities in the areas of data mining and information retrieval.

Sep. 20 2016

Best Student Paper Award at Document Engineering 2016

Our paper Ceccarelli, D., Lucchese, C., Orlando, S., Perego, R., and Trani, S., SEL: a unified algorithm for entity linking and saliency detection, accepted at DocEng ’16: Proceedings of the 2016 ACM Symposium on Document Engineering, was awarded Best Student Paper [1].

References

[1]   Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Salvatore Trani. SEL: a unified algorithm for entity linking and saliency detection. In DocEng ’16: Proceedings of the 2016 ACM Symposium on Document Engineering, 2016. (best student paper).


Aug. 11 2016

Fast ranking with additive ensembles of oblivious and non-oblivious regression trees

Journal paper accepted at ACM Transactions on Information Systems [1].

Abstract. Learning-to-Rank models based on additive ensembles of regression trees have been proven to be very effective for scoring query results returned by large-scale Web search engines. Unfortunately, the computational cost of scoring thousands of candidate documents by traversing large ensembles of trees is high. Thus, several works have investigated solutions aimed at improving the efficiency of document scoring by exploiting advanced features of modern CPUs and memory hierarchies. In this paper, we present QuickScorer, a new algorithm that adopts a novel cache-efficient representation of a given tree ensemble, performs an interleaved traversal by means of fast bitwise operations, and also supports ensembles of oblivious trees. An extensive and detailed test assessment is conducted on two standard Learning-to-Rank datasets and on a novel very-large dataset we made publicly available for conducting significant efficiency tests. The experiments show unprecedented speedups over the best state-of-the-art baselines, ranging from 1.9x to 6.6x. The analysis of low-level profiling traces shows that QuickScorer efficiency is due to its cache-aware approach, both in terms of data layout and access patterns, and to a control flow that entails very low branch mis-prediction rates.
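To give a flavor of the bitwise idea, here is a toy Python sketch (hypothetical names and data layout, not the authors' implementation): each internal node of a tree carries a bitmask of the leaves still reachable when its test evaluates to False, so ANDing the masks of all false nodes leaves the exit leaf as the lowest set bit. Note that the real QuickScorer interleaves this traversal across the whole ensemble, grouping node tests by feature; this sketch processes one tree at a time for clarity.

```python
def quickscorer_like(trees, doc):
    """Score a document with a bitvector-based tree traversal.

    trees: list of (nodes, leaf_values) pairs, where nodes is a list of
    (feature_idx, threshold, false_mask) and false_mask marks, over the
    tree's leaves (bit 0 = leftmost leaf), those still reachable when
    the node test doc[feature_idx] <= threshold is False.
    """
    score = 0.0
    for nodes, leaf_values in trees:
        n_leaves = len(leaf_values)
        mask = (1 << n_leaves) - 1           # initially all leaves reachable
        for feat, thr, false_mask in nodes:
            if doc[feat] > thr:              # node test is False
                mask &= false_mask           # drop leaves in the true subtree
        # the exit leaf is the leftmost (lowest-index) surviving leaf
        leaf = (mask & -mask).bit_length() - 1
        score += leaf_values[leaf]
    return score
```

A one-node tree with leaves [1.0, 2.0] and root test x[0] <= 0.5 then scores 1.0 for x[0] = 0.3 and 2.0 for x[0] = 0.9.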

References

[1]   Domenico Dato, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola Tonellotto, and Rossano Venturini. Fast ranking with additive ensembles of oblivious and non-oblivious regression trees. ACM Transactions on Information Systems, (in press), 2016.


Aug. 01 2016

Computing Reviews’ Notable Books and Articles 2015

Our paper Lucchese C., Nardini F.M., Orlando S., Perego R., Tonellotto N., Venturini R., QuickScorer: a Fast Algorithm to Rank Documents with Additive Ensembles of Regression Trees [1], best paper at ACM SIGIR 2015, was selected as an ACM Notable Article in Computing 2015.


This looks like a good reason to read our paper [1] and its recent improvement [2].

References

[1]   Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola Tonellotto, and Rossano Venturini. Quickscorer: a fast algorithm to rank documents with additive ensembles of regression trees. In SIGIR ’15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2015. (best paper) (ACM Notable Article).

[2]   Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola Tonellotto, and Rossano Venturini. Exploiting cpu simd extensions to speed-up document scoring with tree ensembles. In SIGIR ’16: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016.


July 12 2016

Sentiment-enhanced Multidimensional Analysis of Online Social Networks: Perception of the Mediterranean Refugees Crisis

Accepted at SNAST ’16: Workshop on Social Network Analysis Surveillance Technologies in conjunction with ASONAM ’16: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining [1].

Abstract. We propose an analytical framework able to investigate discussions about polarized topics in online social networks from many different angles. The framework supports the analysis of social networks along several dimensions: time, space and sentiment. We show that the proposed analytical framework and the methodology can be used to mine knowledge about the perception of complex social phenomena.

We selected the refugee crisis discussions over Twitter as a case study. This difficult and controversial topic is an increasingly important issue for the EU. The raw stream of tweets is enriched with space information (user and mentioned locations), and sentiment (positive vs. negative) w.r.t. refugees. Our study shows differences in positive and negative sentiment in EU countries, in particular in UK, and by matching events, locations and perception, it underlines opinion dynamics and common prejudices regarding the refugees.

References

[1]   Mauro Coletto, Andrea Esuli, Claudio Lucchese, Cristina Ioana Muntean, Franco Maria Nardini, Raffaele Perego, and Chiara Renso. Sentiment-enhanced multidimensional analysis of online social networks: Perception of the mediterranean refugees crisis. In SNAST ’16: Workshop on Social Network Analysis Surveillance Technologies, colocated with ASONAM ’16: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2016.


July 10 2016

Fast Connected Components Computation in Large Graphs by Vertex Pruning

Journal paper accepted at TPDS ’16: IEEE Transactions on Parallel and Distributed Systems [1].

Abstract. Finding connected components is a fundamental task in applications dealing with graph analytics, such as social network analysis, web graph mining and image processing. The exponentially growing size of today’s graphs has required the definition of new computational models and algorithms for their efficient processing on highly distributed architectures. In this paper we present cracker, an efficient iterative MapReduce-like algorithm to detect connected components in large graphs. The strategy of cracker is to transform the input graph in a set of trees, one for each connected component in the graph. Nodes are iteratively removed from the graph and added to the trees, reducing the amount of computation at each iteration. We prove the correctness of the algorithm, evaluate its computational cost and provide an extensive experimental evaluation considering a wide variety of synthetic and real-world graphs. The experimental results show that cracker consistently outperforms state-of-the-art approaches both in terms of total computation time and volume of messages exchanged.
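The following is not the cracker algorithm itself, but a minimal iterative min-label propagation sketch in the same MapReduce-like spirit: every vertex repeatedly adopts the smallest label in its neighborhood until a fixed point, so each component ends up labeled by its minimum vertex id (cracker improves on this scheme by pruning vertices into per-component trees as the iterations proceed).

```python
def connected_components(vertices, edges):
    """Label each vertex with the smallest vertex id in its component."""
    label = {v: v for v in vertices}
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    changed = True
    while changed:                      # iterate until a fixed point
        changed = False
        for v in vertices:
            best = min([label[v]] + [label[u] for u in adj[v]])
            if best < label[v]:         # adopt a smaller neighbor label
                label[v] = best
                changed = True
    return label
```

On the graph with edges {1-2, 2-3, 5-6} and an isolated vertex 4, this yields three components labeled 1, 4, and 5.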

References

[1]   Alessandro Lulli, Emanuele Carlini, Patrizio Dazzi, Claudio Lucchese, and Laura Ricci. Fast connected components computation in large graphs by vertex pruning. IEEE Transactions on Parallel and Distributed Systems, PP(99), 2016.


July 02 2016

Fast Feature Selection for Learning to Rank

Accepted at ICTIR ’16: International Conference on the Theory of Information Retrieval [1].

Abstract. An emerging research area named Learning-to-Rank (LtR) has shown that effective solutions to the ranking problem can leverage machine learning techniques applied to a large set of features capturing the relevance of a candidate document for the user query. Large-scale search systems must however answer user queries very fast, and the computation of the features for candidate documents must comply with strict back-end latency constraints. The number of features cannot thus grow beyond a given limit, and Feature Selection (FS) techniques have to be exploited to find a subset of features that both meets latency requirements and leads to high effectiveness of the trained models.

In this paper, we propose three new algorithms for FS specifically designed for the LtR context where hundreds of continuous or categorical features can be involved. We present a comprehensive experimental analysis conducted on publicly available LtR datasets and we show that the proposed strategies outperform a well-known state-of-the-art competitor.
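As a point of reference for the FS setting described above, here is a generic greedy forward-selection sketch (not one of the paper's three algorithms): features are added one at a time, each time picking the one that most improves a user-supplied quality function, until the latency budget (here, a maximum number of features) is reached.

```python
def greedy_feature_selection(features, quality, budget):
    """features: iterable of feature ids;
    quality: callable mapping a feature subset to a model-quality score;
    budget: maximum number of features allowed by latency constraints."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < budget:
        best = max(remaining, key=lambda f: quality(selected + [f]))
        if quality(selected + [best]) <= quality(selected):
            break  # no remaining feature improves the model
        selected.append(best)
        remaining.remove(best)
    return selected
```

With an additive quality function weighting features 0, 1, 2 as 3, 2, 0 and a budget of two features, the sketch selects features 0 and 1. In practice, quality() would wrap the training and validation of an LtR model on the candidate subset.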

References

[1]   Claudio Lucchese Andrea Gigli, Franco Maria Nardini, and Raffaele Perego. Fast feature selection for learning to rank. In ICTIR ’16: International Conference on the Theory of Information Retrieval, 2016.


July 02 2016

Learning to Rank User Queries to Detect Search Tasks

Accepted at ICTIR ’16: International Conference on the Theory of Information Retrieval [1].

Abstract. We present a framework for discovering sets of web queries having similar latent needs, called search tasks, from user queries stored in a search engine log. Our solution is made of two components: a Query Similarity Learning (QSL) module and a Graph-based Query Clustering (GQC) module. The former is devoted to learning a query similarity function from a ground truth of manually-labeled search tasks. The latter represents each user search log as a graph whose nodes are queries, and uses the learned similarity function to weight edges between query pairs. Finally, search tasks are detected by clustering the queries connected by the strongest links, i.e., by extracting the connected components induced by those links. To discriminate between “strong” and “weak” links, the GQC module also entails a learning phase whose goal is to estimate the best threshold for pruning the edges of the graph. We discuss how the QSL module can be effectively implemented using Learning to Rank (L2R) techniques. Experiments on a real-world search engine log show that query similarity functions learned using L2R lead to better performing GQC implementations when compared to similarity functions induced by other state-of-the-art machine learning solutions, such as logistic regression and decision trees.
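The GQC step described above can be sketched in a few lines, assuming a learned similarity function and threshold are already available (both names here are illustrative): edges below the threshold are pruned, and the surviving connected components of the query graph are returned as search tasks.

```python
def detect_search_tasks(queries, similarity, threshold):
    """Cluster queries into search tasks.

    similarity: learned function mapping a query pair to a score;
    threshold: learned cut-off below which edges are pruned."""
    # build the pruned adjacency lists
    adj = {q: [] for q in queries}
    for i, a in enumerate(queries):
        for b in queries[i + 1:]:
            if similarity(a, b) >= threshold:
                adj[a].append(b)
                adj[b].append(a)
    # extract connected components with an iterative DFS
    tasks, seen = [], set()
    for q in queries:
        if q in seen:
            continue
        stack, component = [q], []
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            component.append(cur)
            stack.extend(adj[cur])
        tasks.append(sorted(component))
    return tasks
```

With a toy similarity that scores 1.0 for queries starting with the same letter and 0.0 otherwise, the queries "apple", "avocado", "banana" split into two tasks.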

References

[1]   Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, and Gabriele Tolomei. Learning to rank user queries to detect search tasks. In ICTIR ’16: International Conference on the Theory of Information Retrieval, 2016.


May 20 2016

Evaluating Top-k Approximate Patterns via Text Clustering

Accepted at DAWAK ’16: 18th International Conference on Data Warehousing and Knowledge Discovery [1].

Abstract. This work investigates how approximate binary patterns can be objectively evaluated by using as a proxy measure the quality achieved by a text clustering algorithm, where the document features are derived from such patterns. Specifically, we exploit approximate patterns within the well-known FIHC (Frequent Itemset-based Hierarchical Clustering) algorithm, which was originally designed to employ exact frequent itemsets to achieve a concise and informative representation of text data. We analyze different state-of-the-art algorithms for approximate pattern mining, in particular we measure their ability in extracting patterns that well characterize the document topics in terms of the quality of clustering obtained by FIHC. Extensive and reproducible experiments, conducted on publicly available text corpora, show that approximate itemsets provide a better representation than exact ones.
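A small sketch of the feature construction implied by the abstract (function and parameter names are illustrative): each document is re-represented by which patterns it supports, where an approximate pattern may tolerate a few missing items, and the resulting binary vectors feed the clustering step.

```python
def pattern_features(docs, patterns, tolerance=0):
    """Map each document to a binary pattern-support vector.

    docs: list of item sets (e.g., the terms of each document);
    patterns: list of item sets mined from the corpus;
    tolerance: number of pattern items a document may miss and
    still count as supporting the pattern (0 = exact itemsets)."""
    return [
        [1 if len(p - d) <= tolerance else 0 for p in patterns]
        for d in docs
    ]
```

With tolerance 0 this reproduces exact itemset support; raising the tolerance lets approximate patterns match more documents, which is exactly the knob the paper's evaluation turns.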

References

[1]   Claudio Lucchese, Salvatore Orlando, and Raffaele Perego. Evaluating top-k approximate patterns via text clustering. In DAWAK ’16: 18th International Conference on Data Warehousing and Knowledge Discovery, 2016.


May 19 2016

SEL: a Unified Algorithm for Entity Linking and Saliency Detection

Accepted at DocEng ’16: Proceedings of the 2016 ACM Symposium on Document Engineering [1].

Abstract. The Entity Linking task consists in automatically identifying and linking the entities mentioned in a text to their URIs in a given Knowledge Base, e.g., Wikipedia. Entity Linking has a large impact in several text analysis and information retrieval related tasks. This task is very challenging due to natural language ambiguity. However, not all the entities mentioned in a document have the same relevance and utility in understanding the topics being discussed. Thus, the related problem of identifying the most relevant entities present in a document, also known as Salient Entities, is attracting increasing interest.

In this paper we propose SEL, a novel supervised two-step algorithm comprehensively addressing both entity linking and saliency detection. The first step is based on a classifier aimed at identifying a set of candidate entities that are likely to be mentioned in the document, thus maximizing the precision of the method without hindering its recall. The second step is still based on machine learning, and aims at choosing from the previous set the entities that actually occur in the document. Indeed, we tested two different versions of the second step, one aimed at solving only the entity linking task, and the other that, besides detecting linked entities, also scores them according to their saliency. Experiments conducted on two different datasets show that the proposed algorithm outperforms state-of-the-art competitors, and is able to detect salient entities with high accuracy.
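The two-step structure can be summarized schematically as follows, with hypothetical stand-ins for the two learned models (the real SEL uses trained classifiers over rich mention/entity features): step 1 is a high-recall candidate filter, step 2 links the surviving candidates and scores their saliency.

```python
def sel_pipeline(spots, candidate_clf, linking_model):
    """spots: candidate (mention, entity, features) triples;
    candidate_clf(features) -> bool, a high-recall filter (step 1);
    linking_model(features) -> saliency score, or None if the
    candidate is not actually linked in the document (step 2)."""
    # step 1: keep only entities likely to be mentioned
    candidates = [s for s in spots if candidate_clf(s[2])]
    # step 2: link the survivors and score their saliency
    linked = []
    for mention, entity, feats in candidates:
        saliency = linking_model(feats)
        if saliency is not None:
            linked.append((mention, entity, saliency))
    return sorted(linked, key=lambda t: -t[2])
```

Here the features are collapsed to a single confidence value for brevity; the output is the list of linked entities ranked by saliency.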

References

[1]   Diego Ceccarelli, Claudio Lucchese, Salvatore Orlando, Raffaele Perego, and Salvatore Trani. SEL: a unified algorithm for entity linking and saliency detection. In DocEng ’16: Proceedings of the 2016 ACM Symposium on Document Engineering, 2016. (best student paper).


May 18 2016

Quality versus efficiency in document scoring with learning-to-rank models

Journal paper accepted at Information Processing & Management [1].

Abstract. Learning-to-Rank (LtR) techniques leverage machine learning algorithms and large amounts of training data to induce high-quality ranking functions. Given a set of documents and a user query, these functions are able to precisely predict a score for each of the documents, in turn exploited to effectively rank them. Although the scoring efficiency of LtR models is critical in several applications (e.g., it directly impacts response time and throughput of Web query processing), it has received relatively little attention so far.

The goal of this work is to experimentally investigate the scoring efficiency of LtR models along with their ranking quality. Specifically, we show that machine-learned ranking models exhibit a quality versus efficiency trade-off. For example, each family of LtR algorithms has tuning parameters that can influence both effectiveness and efficiency, where higher ranking quality is generally obtained with more complex and expensive models. Moreover, LtR algorithms that learn complex models, such as those based on forests of regression trees, are generally more expensive and more effective than other algorithms that induce simpler models like linear combination of features.

We extensively analyze the quality versus efficiency trade-off of a wide spectrum of state-of-the-art LtR, and we propose a sound methodology to devise the most effective ranker given a time budget. To guarantee reproducibility, we used publicly available datasets and we contribute an open source C++ framework providing optimized, multi-threaded implementations of the most effective tree-based learners: Gradient Boosted Regression Trees (GBRT), Lambda-Mart (λ-MART), and the first public-domain implementation of Oblivious Lambda-Mart (Ωλ-MART), an algorithm that induces forests of oblivious regression trees.

We investigate how the different training parameters impact on the quality versus efficiency trade-off, and provide a thorough comparison of several algorithms in the quality-cost space. The experiments conducted show that there is not an overall best algorithm, but the optimal choice depends on the time budget.
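The budget-driven selection methodology can be reduced to a simple rule, sketched here with illustrative model names and numbers (not measurements from the paper): among candidate rankers with measured quality and per-document scoring cost, pick the most effective one whose cost fits the time budget.

```python
def best_ranker_under_budget(rankers, time_budget):
    """rankers: list of (name, quality, cost_us_per_doc) triples,
    with quality and cost measured on held-out data;
    time_budget: maximum scoring time allowed per document."""
    feasible = [r for r in rankers if r[2] <= time_budget]
    if not feasible:
        return None               # no ranker meets the budget
    return max(feasible, key=lambda r: r[1])
```

This captures the paper's observation that there is no overall best algorithm: loosening the budget lets more complex (and more effective) forests become feasible, while a tight budget forces the choice toward cheaper models.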

The source code of algorithms’ implementations is made available as part of QuickRank: http://quickrank.isti.cnr.it/.

References

[1]   Gabriele Capannini, Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, and Nicola Tonellotto. Quality versus efficiency in document scoring with learning-to-rank models. Information Processing & Management, 2016.
