The High Performance Computing Laboratory (HPC Lab for short) consists of about 20 people conducting research on algorithms and information systems dealing with computational and data-intensive problems in business, social, and knowledge-based applications. The HPC Lab is located in the CNR Research Area of Pisa and is part of the Istituto di Scienza e Tecnologie dell’Informazione “A. Faedo” (ISTI), the largest ICT research institute of the National Research Council of Italy (CNR). CNR is the main public research organisation in Italy and a top-level R&D performer in Europe, being the fourth-ranked beneficiary of the EU FP7 work programme.
HPC Lab research areas include large-scale distributed and cloud systems, efficient information indexing and retrieval, big data analytics, machine learning and artificial intelligence, mobility analysis, information extraction and semantic enrichment.
We aim to develop methods for the efficient and effective management, querying, retrieval and analysis of large amounts of data. The emphasis is on the design, implementation and deployment of (distributed) algorithms and information systems that scale far beyond most alternatives and can provide effective answers to complex queries in near real time, possibly using a limited amount of computational resources. We study space/time-efficient solutions to deal with scale and latency constraints, and quality/efficiency trade-offs that exploit the peculiarities of the specific application domain (e.g., the skewed distribution of terms and features in document collections and AI applications).
Main Research Topics
Our research follows a rigorous experimental methodology: newly developed solutions are tested against competitive state-of-the-art baselines using widely accepted evaluation measures and protocols. In most cases we base our experiments on publicly available benchmarks, and we make our source code and experimental settings available to the scientific community to permit reproducibility of the results. The main research topics addressed are detailed below.
Efficiency in Information Retrieval
Information retrieval systems and Web search engines are fundamental tools for accessing information in today’s world. In satisfying the information needs of millions of users, the quality of the search results and the speed at which the results are returned are two goals that form a natural trade-off, as techniques that improve the former tend to degrade the latter. Meanwhile, search engines continue to evolve rapidly, with larger indexes, more complex retrieval strategies and growing query volumes. Hence, there is a need for efficient query processing infrastructures that make appropriate sacrifices in effectiveness in order to gain efficiency. We address problems related to the design, implementation and deployment of scalable and interactive information systems that can answer complex queries while satisfying strict latency constraints. We develop new approaches for ensuring scalability beyond the current limits of data processing infrastructures: compression techniques for in-memory data management and for indexes that support a broad range of aggregation functions over arbitrary data dimensions; query processing strategies for improving the throughput and reducing the latency of search systems; and new solutions for the efficient deployment and use of innovative search components such as learning-to-rank algorithms and neural re-ranking systems.
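To give a flavour of the index compression techniques mentioned above, the following minimal sketch (all function names are illustrative, not taken from our systems) stores a sorted posting list as gaps between consecutive document IDs and packs each gap with variable-byte coding, a classic space/time trade-off for in-memory inverted indexes:

```python
def vbyte_encode(nums):
    """Variable-byte encode a list of non-negative integers."""
    out = bytearray()
    for n in nums:
        while n >= 128:
            out.append(n & 0x7F)   # emit lower 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)       # final byte flagged with the high bit
    return bytes(out)

def vbyte_decode(data):
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:               # high bit set: last byte of this integer
            nums.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= (b & 0x7F) << shift
            shift += 7
    return nums

def compress_postings(doc_ids):
    """Delta-encode sorted doc IDs, then variable-byte encode the gaps."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

def decompress_postings(data):
    doc_ids, total = [], 0
    for g in vbyte_decode(data):
        total += g
        doc_ids.append(total)
    return doc_ids
```

Because gaps are much smaller than absolute IDs, most of them fit in a single byte, which is where the space saving comes from; production systems use more sophisticated codecs, but the delta-then-compress structure is the same.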
Highly Distributed Platforms
The wide diffusion of mobile devices enabling ubiquitous access to web- or app-based services imposes very challenging performance requirements on content and service providers. The computing infrastructures hosting the backends of such services must be scalable and elastic, to sustain dynamically changing peak workloads, and pervasive, to bring the computation close to users and reduce network latency, possibly limiting the amount of data exchanged over geographic network links. Such complex and distributed infrastructures have to be managed adaptively. We study solutions for the efficient and flexible management of complex platforms and the applications they host. We deal with heterogeneity in computational infrastructures in the context of the (multi-)cloud-fog-edge continuum. We develop meta-heuristic algorithms supporting the smart mapping of applications onto multi-clouds and federated environments, and we provide approaches that leverage information about users’ footprints (e.g., presence, social media activities) to derive adaptive mapping models. We study highly distributed approaches (e.g., orchestration solutions) that drive point-to-point interactions between clouds to match application instances with the resources available in the hosting platforms. We investigate how to optimize distributed stream processing architectures to maximize the throughput achieved within a user-defined budget.
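The mapping of application components onto heterogeneous cloud/edge nodes can be sketched with a simple greedy heuristic (a toy baseline under an assumed cost model, not one of our published meta-heuristics): place the most demanding components first, always on the feasible node with the lowest latency.

```python
def greedy_map(components, nodes):
    """Assign each component (largest CPU demand first) to the feasible
    node with the lowest network latency. Components and nodes are plain
    dicts; the field names are purely illustrative."""
    placement = {}
    for comp in sorted(components, key=lambda c: -c["cpu"]):
        candidates = [n for n in nodes if n["free_cpu"] >= comp["cpu"]]
        if not candidates:
            raise RuntimeError(f"no capacity left for {comp['name']}")
        best = min(candidates, key=lambda n: n["latency_ms"])
        best["free_cpu"] -= comp["cpu"]   # reserve capacity on that node
        placement[comp["name"]] = best["name"]
    return placement
```

Such a greedy pass typically serves as the starting point or baseline that meta-heuristics (e.g., local search or evolutionary strategies) then improve upon.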
Efficient AI Solutions
An increasing number of information systems leverage AI techniques to provide effective services that cope with the complexity and scale of data. We gather large volumes of direct or indirect evidence of relationships of interest, and we design learning algorithms to extract accurate models that rank query results, predict the evolution of complex phenomena, classify data items, and provide recommendations to users. Our focus is on efficient and effective AI models that can be deployed in real-world scenarios. Hard latency constraints are often imposed by large-scale interactive systems in application fields such as web search, social networks, e-commerce, manufacturing, automotive, healthcare, and the creative industries. In addition, data is often naturally decentralized, and users’ data privacy is at stake when cyber-attacks and data breaches occur. To mitigate these risks, we investigate resource-aware learning algorithms and decentralized AI solutions based on the federated learning paradigm. We study novel learning algorithms that keep data and most of the processing local to users’ devices, and we study ad-hoc orchestration solutions for deploying advanced federated learning solutions on cloud-edge platforms.
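The federated learning loop described above can be sketched in a few lines. In this toy example (a 1-D linear model trained with federated averaging; the data and hyper-parameters are made up for illustration) each client computes a gradient step on data that never leaves the device, and the server aggregates only the resulting model weights, weighted by local dataset size:

```python
def local_update(w, data, lr=0.1):
    """One gradient step of least-squares fitting y = w * x on a
    client's private (x, y) pairs; the data stays on the device."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def federated_averaging(global_w, clients, rounds=10):
    """FedAvg sketch: per round, every client trains locally, then the
    server averages the returned weights (never the raw data),
    weighting each client by its number of examples."""
    for _ in range(rounds):
        updates = [(local_update(global_w, d), len(d)) for d in clients]
        total = sum(n for _, n in updates)
        global_w = sum(w * n for w, n in updates) / total
    return global_w
```

Only scalar weights cross the network here; real deployments exchange full model tensors and add safeguards such as secure aggregation, but the privacy rationale is the same: raw data stays local.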
Semantics in Mobility and Social Data
We study methods to represent, manage and analyse large amounts of mobility and social data. We pursue the idea that raw movement data can be enriched with multiple, heterogeneous, contextual, social and semantic aspects. We propose efficient methods to semantically enrich and analyze trajectories and social data, inspired by human mobility applications in the fields of tourism and public transportation. Moreover, we deal with the problem of identifying, in fragments of text documents and in posts on online social networks, the mentions that refer to an entity belonging to a knowledge base, e.g., Wikipedia. We provide effective entity linking solutions that formalize entity relatedness as a learning-to-rank problem, and we use supervised machine learning techniques to address the novel problem of labelling the entities mentioned in a text according to a notion of saliency, where the most salient entities are those with the highest utility in understanding the topics discussed in the text.
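The saliency-ranking idea can be illustrated with a toy pointwise model (the features and weights below are invented for the example; in practice the weights would be learned from labelled data, and mentions would come from an entity linker rather than string matching):

```python
def entity_features(text, entity):
    """Toy saliency features for an entity mentioned in a text:
    mention frequency, how early the first mention occurs,
    and whether it appears in the opening sentence."""
    lower, name = text.lower(), entity.lower()
    freq = lower.count(name)
    first = lower.find(name)
    first_pos = 1 - first / len(lower) if first >= 0 else 0.0
    in_first_sentence = 1.0 if name in lower.split(".")[0] else 0.0
    return [freq, first_pos, in_first_sentence]

def rank_by_saliency(text, entities, weights=(1.0, 2.0, 3.0)):
    """Score each candidate entity with a linear model over its features
    and return the entities from most to least salient."""
    def score(e):
        return sum(w * f for w, f in zip(weights, entity_features(text, e)))
    return sorted(entities, key=score, reverse=True)
```

For instance, in a news paragraph about Rome that mentions Italy once and Florence in passing, the model ranks Rome first because it is mentioned more often, earlier, and in the opening sentence.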