A Search Engine Architecture Based on Collection Selection

Viagiar descanta, ma chi parte mona, torna mona. (Old Venetian proverb.)

Author: Diego Puppin
Supervisors: Domenico Laforenza and Marco Vanneschi
Referees: Ophir Frieder and Gerhard Weikum

Material for the presentation on Sept. 11, 2007

Abstract

In this thesis, we present a distributed architecture for a Web search engine, based on the concept of collection selection. We introduce a novel approach to partition the collection of documents, able to greatly improve the effectiveness of standard collection selection techniques (CORI), and a new selection function outperforming the state of the art. Our technique is based on the novel query-vector (QV) document model, built from the analysis of query logs, and on our strategy of co-clustering queries and documents at the same time.

Incidentally, our partitioning strategy is able to identify documents that can be safely moved out of the main index (into a supplemental index), with a minimal loss in result accuracy. In our test, we could move 50% of the collection to the supplemental index with a minimal loss in recall.

By suitably partitioning the documents in the collection, our system is able to select the subset of servers containing the most relevant documents for each query. Instead of broadcasting the query to every server in the computing platform, only the most relevant will be polled, this way reducing the average computing cost to solve a query.

We introduce a novel strategy to use the instant load at each server to drive the query routing. Also, we describe a new approach to caching, able to incrementally improve the quality of the stored results. Our caching strategy is effectively both in reducing computing load and in improving result quality.

By combining these innovations, we can achieve extremely high figures of precision, with a reduced load w.r.t. full query broadcasting. Our system can cover 65% results offered by a centralized reference index (competitive recall at 5), with a computing load of only 15.6%, i.e. a peak of 156 queries out of a shifting window of 1000 queries. This means about 1/4 of the peak load reached when broadcasting queries. The system, with a slightly higher load (24.6%), can cover 78% of the reference results.

The proposed architecture, overall, presents a trade-off between computing cost and result quality, and we show how to guarantee very precise results in face of a dramatic reduction to computing load. This means that, with the same computing infrastructure, our system can serve more users, more queries and more documents.

Reviews and Comments to This Thesis

Per your invitation to assess the quality of Diego Puppin's thesis draft, I provide the following comments. Let me begin by applauding Diego Puppin for a solid, innovative, and interesting thesis, one that advance significantly a domain of distributed architectures for search engines. By exploiting a novel collection selection approach, his system efficiently and effectively answer queries including under heavy load. His approach sustains sufficient computational power economically (requiring fewer servers), while preserving search accuracy.

Diego outlines the benefits of a trade-off between result quality and computing costs. Having worked closely with leading industrial people in the Web search community, I am well aware of the need to balance search accuracy and cost particularly to meet scalability concerns. The solution proposed in this thesis can be used to absorb unexpected query peaks, or to reduce the computing load in case of failure. At the same time, it can be used to offer even better results via his novel caching approach.

These results are reached through some novel contributions: a new document model, justified by information theory, which leads to a more effective document clustering; a new collection selection function, able to outperform the established CORI algorithm; a new vision of collection selection, called collection prioritization, that permits the system to stop contacting IR cores, if they are overloaded and not relevant for a query; a novel incremental caching technique, able to effectively exploit prioritization.

In his work, Diego outlines the main elements of a distributed search architecture based on collection selection, addressing some of the most critical problems that IR literature is facing today (load balancing, caching, updating) and identifies original solutions. For this reasons, I believe that Diego's work presents a significant and critical result in the field of IR, which is worth a doctoral degree.

I enthusiastically endorse this dissertation.

Ophir Frieder (July 27, 2007)

The thesis addresses a very important and timely topic: distributed indexing of Web pages with the goal of scalable and efficient search and index maintenance. The thesis presents a novel way of data partitioning, by clustering the pages in a specific way and assigning each cluster to one of the servers in the distributed system. At query time, it routes the query to the most promising servers, based on a specific similarity measure between queries and page clusters. This novel architecture is further enhanced by heuristics for load balancing, heuristics for maintaining a cache of query results, and a folding-in technique for updating the distributed index. The thesis emphasizes experimental studies as the means for evaluating the relative performance of different models and algorithms. Overall, this is a nice and insightful piece of original research.

Gerhard Weikum (July 11, 2007)

Diego's work is in the line of current research on high performance architecture for web search engine. I think that his work has improved current schemes for partitioning indices. In fact, the work that he did during his visit to Yahoo! Research was published at Infoscale 2007 and I was one of the co-authors.

His latest results addressed one of the issues that I mentioned in my earlier mail, that is, a detailed analysis on load balancing: it is fundamental to prove that the suggested technique can actually reduce the overall load of a search engine, in a balanced way, while maintaining competitiveness in other issues such as performance, space usage, etc. This issue is one of the main results of the thesis.

Considering the comments above, Diego's thesis is a good contribution to the field of distributed information retrieval.

Ricardo Baeza-Yates (July 10, 2007)

Acknowledgments

Many people have been important to this work, with their guidance, suggestions and corrections. First of all, I would like to thank my advisors. Domenico Laforenza has encouraged, motivated and supported my research during my years at CNR. He has been a great travel mate, and an important guide in my growth as a researcher. Marco Vanneschi, at University of Pisa, has been an important reference since we first collaborated ten years ago.

The HPC lab has been a very nice environment to work at, and I would like to thank everybody there: those guys have been able to bear with me for 4 years! I had a splendid time. They all have been a source of fun, scientific advice, and personal growth.

In particular, I wouldn't have studied Information Retrieval, and probably I wouldn't have been hired by Google, if it wasn't for Fabrizio Silvestri. His support, help, suggestions and ideas have been invaluable to complete this work. He is one of the brightest minds at CNR, and I wish him a most successful career. Raffaele Perego has also been very important for the second part of this research, with his precious comments. My biggest thanks to both of them.

I would like also to greatly thank Ricardo Baeza-Yates, for inviting me at Yahoo! Research in Barcelona and for supporting (and co-authoring) my recent results. His collaboration has been very fruitful and I hope to keep in touch with such a great researcher and person.

Moreover, thanks to the external reviewers, Ophir Frieder and Gerhard Weikum, for taking the effort to contribute to this work by reviewing it: I am really honored of having their endorsement. I am also grateful to Abdur Chowdury and all the colleagues worldwide that contributed to this thesis with their comments on my papers and my presentations.

History is repeating. Like seven years ago, when completing my thesis for my \emph{Laurea}, I am ready to pack to move to Boston. Again! So, as I did seven years ago, I would like to thank all the long-time friends in Pisa. I have known some of you for about 12 years now, and your friendship has been very valuable to me: this period in Pisa has been really enjoyable. I will never forget you.

This thesis is dedicated to my family. So many thanks to my parents, my brother Cristiano, and Elisa, for their unconditional support: I would have never reached this goal without their help. It has always been important to me to know that there is a safe harbor where I can always rest. Their trust in me has been really motivating. Also, thanks a lot to the Bottinelli family, for welcoming me in their house.

But, most important, I would like to thank Silvia and Giulia for being such a great company to explore the world with! I love you so much.

Coclustering

Our implementation of the co-clustering algorithm by Dhillon et al. is available for download.

TGZ archive TGZ