Several of our research results have been turned into industry-level solutions. For example, we built the data cleaning (web-graph analysis, deduplication of 5 billion web documents, feature engineering and extraction) and learning-to-rank pipeline currently deployed in the Italian search engine istella.it. Our solution for fast prediction based on forests of decision trees is the state of the art in the field and is used in production by istella.it. Our partitioned Elias-Fano inverted index is used by many companies, including Facebook. Within the ASSETS and eCloud EU projects we designed the metadata ranking and entity recommendation solutions deployed in the Europeana portal. Our LSTM-based solution for anomaly prediction is used by CRF of the FCA group.
Visit our official GitHub account: https://github.com/hpclab
RankEval is an open-source tool for the analysis and evaluation of Learning-to-Rank models based on ensembles of regression trees. The success of ensembles of regression trees fostered the development of several open-source libraries targeting the efficiency of the learning phase and the effectiveness of the resulting models. RankEval aims to offer a common ground for several Learning-to-Rank libraries by providing useful and interoperable tools for a comprehensive comparison and in-depth analysis of ranking models. Clone and try it from GitHub!
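To give a flavour of what evaluating a ranking model involves, here is a minimal sketch of NDCG@k, a metric commonly used in Learning-to-Rank evaluation. It is plain Python and does not use RankEval's API; the function names `dcg_at_k` and `ndcg_at_k` and the exponential gain are assumptions chosen for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k results (exponential gain)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, so discount is log2(rank + 2)
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels of the documents in the order a hypothetical model ranked them.
perfect = ndcg_at_k([3, 2, 1, 0], k=4)
swapped = ndcg_at_k([0, 3], k=2)
```

A perfectly ordered list scores 1.0, while placing the only relevant document in second position lowers the score.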
QuickRank is an efficient Learning to Rank toolkit providing C++ implementations of several LtR algorithms. The algorithms currently implemented are: GBRT, LambdaMART, Oblivious GBRT and LambdaMART, CoordinateAscent, RankBoost. It has been designed to be efficient. It is available under the Reciprocal Public License 1.5. Clone and try it from GitHub!
Dexter is a framework for implementing and evaluating entity linking algorithms. The entity linking task aims at identifying all the small text fragments referring to entities contained in a knowledge base, e.g., Wikipedia. Many entity linking algorithms have been proposed, but unfortunately only a few authors have released the source code or some APIs. As a result, it is currently difficult to evaluate the performance of a method on a single subtask or to compare different techniques. Dexter is open source, since we believe that a shared framework is fundamental to perform fair comparisons and improve the state of the art. Visit the Dexter site!
TripBuilder is a user-friendly and interactive system for planning a time-budgeted sightseeing tour of a city on the basis of the points of interest and the patterns of movements of tourists mined from user-contributed data. The knowledge needed to build the recommendation model is entirely extracted in an unsupervised way from two popular collaborative platforms: Wikipedia and Flickr. TripBuilder interacts with the user by means of a friendly Web interface that allows her to easily specify personal interests and time budget. The proposed sightseeing tour can then be explored and modified. The TripBuilder demo won the best demo award at ECIR 2014. Plan your visit to Rome, Florence, Pisa and Amsterdam by clicking here.
SearchShortcuts is an efficient and effective solution to the problem of choosing the queries to suggest to web search engine users in order to help them in rapidly satisfying their information needs. SearchShortcuts is less affected by the data-sparsity problem than most state-of-the-art proposals. Thus, it is particularly effective in generating suggestions for rare queries occurring in the long tail of the query popularity distribution. The paper presenting our solution is here.
PaNDa. The discovery of patterns in binary datasets has many applications, e.g., in electronic commerce, TCP/IP networking, Web usage logging, and bioinformatics. Still, this is a very challenging task in many respects: overlapping vs. non-overlapping patterns, presence of noise, extraction of the most important patterns only. PaNDa is a greedy algorithm for the discovery of Patterns in Noisy Datasets. By exploiting the Minimum Description Length principle, the proposed algorithm extracts succinct pattern sets that approximately describe the input data. [source]
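The Minimum Description Length idea can be illustrated with a small sketch: a pattern set is good if describing the patterns plus the cells they get wrong is cheaper than listing the data directly. The cost function below (pattern size plus number of mismatching cells) and the names `covered_cells` and `description_cost` are assumptions made for illustration; this is not the PaNDa implementation and it omits the greedy search.

```python
from itertools import product


def covered_cells(patterns):
    """Cells (row, column) that the pattern set claims are 1."""
    cells = set()
    for items, rows in patterns:
        cells.update(product(rows, items))
    return cells


def description_cost(data, patterns):
    """Cost of the patterns plus the noise cells they fail to explain."""
    ones = {(r, c) for r, row in enumerate(data) for c, v in enumerate(row) if v}
    # Noise: cells where the pattern-based reconstruction disagrees with the data.
    noise = ones ^ covered_cells(patterns)
    pattern_cost = sum(len(items) + len(rows) for items, rows in patterns)
    return pattern_cost + len(noise)


# Toy binary dataset: rows are transactions, columns are items.
data = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]

# Each pattern is a pair (set of item columns, set of transaction rows).
no_patterns_cost = description_cost(data, [])
one_pattern_cost = description_cost(data, [({0, 1}, {0, 1, 2})])
```

On this toy dataset the single pattern covering items 0 and 1 in the first three rows lowers the cost from 9 to 8, so a greedy procedure driven by this cost would keep it.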
The CoPhIR (Content-based Photo Image Retrieval) Test Collection was the largest multimedia metadata collection ever made available to the scientific community. It contains five MPEG-7 visual descriptors of 100 million photographic images downloaded from Flickr®, as well as other interesting metadata such as tags, comments, and GPS coordinates. The collection is currently growing up to the size of 100 million images. This activity, jointly run with the NeMIS Lab of ISTI-CNR, is supported by the SAPIR project. Information for accessing the collection is available at the CoPhIR website.