Maintaining Retrieval Effectiveness in Distributed, Dynamic Information Retrieval Systems.
Charles L. Viles
James C. French, Advisor
Department of Computer Science, University of Virginia
May, 1996

Abstract
Traditional information retrieval (IR) techniques were developed under the tacit assumptions of static, centralized archives of documents. Advanced techniques invariably use information derived from the entire collection in an effort to produce high-quality responses to user queries. In dynamic, distributed information environments these assumptions are clearly not met. Collection-wide information (CWI) that was previously easy to obtain may be unavailable to some or all member sites in a distributed document archive, so some degree of incompleteness or inconsistency must be tolerated.

In this dissertation, we present a rigorous empirical study of how retrieval effectiveness is influenced when a site's view of CWI is allowed to drift from its rigorously defined values. We give a generic model for searching a document collection that allows for the use of CWI derived from a subset of the collection. Within this model, we identify two realistic scenarios where the use of subset-derived collection statistics is likely. The first scenario involves distributed document databases and the second involves ad-hoc search in dynamic document databases.

We view the distributed document archive as a set of collections, each of which knows about some fraction of the other collections in the system. We build document collections empirically using standard IR test collections as document sources and parametrically assign these documents to a collection in the system. Our results show that content-skew has a pronounced negative effect on retrieval effectiveness. Content-skew is the degree to which the holdings at a particular site differ from those at another site or a globally defined "central" site made up of the holdings of all members in the system. Document collections that are highly content-skewed require more knowledge about the global collection than those that are content-uniform. However, even in highly skewed systems, sites can know about a relatively small fraction of the holdings at other sites without pronounced degradations in search quality.

We model the dynamic document archive as two collections, an "old" collection with complete CWI available, and a "new" collection composed of recently inserted documents that have not yet been incorporated into a document index and for which CWI may not be available. Our results show that retrieval effectiveness is maintained for "new" collections of realistic size when CWI from the "old" collection is used. The only problematic situation is when terms are introduced into the "new" collection that are not contained in the "old" collection. In these cases, we observed reduced effectiveness for queries containing these terms.
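
The following is a minimal sketch of the dynamic scenario, not the dissertation's actual experimental code. It assumes a simple tf-idf weighting with idf = log(N / df), and the names idf_table and score are hypothetical. Documents in the new collection are scored with idf values taken only from the old collection, and one possible fallback is assumed for terms the old collection has never seen: they contribute a weight of zero.

```python
import math
from collections import Counter

def idf_table(docs):
    """Map each term to log(N / df) over a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def score(query, doc, idf):
    """Sum tf * idf over the query terms.

    Assumption: a term with no idf entry (absent from the old
    collection) contributes 0 to the score.
    """
    tf = Counter(doc)
    return sum(tf[term] * idf.get(term, 0.0) for term in query)

# Hypothetical "old" collection with complete CWI.
old_docs = [
    ["retrieval", "index"],
    ["retrieval", "query"],
    ["index", "site"],
    ["query", "site"],
]
old_idf = idf_table(old_docs)

# A newly inserted document containing a term ("skew") unseen in old_docs.
new_doc = ["retrieval", "skew", "skew"]

known = score(["retrieval"], new_doc, old_idf)   # uses the old idf value
unseen = score(["skew"], new_doc, old_idf)       # no idf available
```

Under these assumptions, a query on a term known to the old collection still receives a meaningful weight, while a query on the unseen term cannot distinguish the new document at all, which is one way the reduced effectiveness described above could arise.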

Having defined the notion of content-skew intuitively, we also give two possible methods for measuring it directly. One measure is topic-based and depends upon the availability of representative topic descriptions of the distributed archive. The second measure is statistically based and involves comparing the well-known inverse document frequency statistics between sites. We use one or both of these methods to measure the content-skew of three kinds of document archives: our empirically defined collections, the TREC collection, and the Networked Computer Science Technical Report Library (NCSTRL), an operational distributed archive.
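
To illustrate the idea behind the statistically based measure, the sketch below compares inverse document frequency values between two sites. The abstract does not give the exact formula, so the choice made here (mean absolute idf difference over the vocabulary the two sites share) is an assumption for illustration only, and the names idf_table and idf_skew are hypothetical.

```python
import math
from collections import Counter

def idf_table(docs):
    """Map each term to log(N / df) over a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}

def idf_skew(docs_a, docs_b):
    """Mean absolute idf difference over terms present at both sites.

    Assumption: this is one plausible idf-based comparison, not
    necessarily the measure defined in the dissertation.
    Returns None when the sites share no vocabulary.
    """
    idf_a, idf_b = idf_table(docs_a), idf_table(docs_b)
    shared = idf_a.keys() & idf_b.keys()
    if not shared:
        return None
    return sum(abs(idf_a[t] - idf_b[t]) for t in shared) / len(shared)

# Two hypothetical sites with different holdings.
site_a = [["db", "query"], ["db", "index"]]
site_b = [["db", "query"], ["net", "query"], ["net", "index"], ["net", "db"]]

same = idf_skew(site_a, site_a)      # identical holdings
different = idf_skew(site_a, site_b) # differing holdings
```

In this sketch a site compared with itself yields zero, and the value grows as term distributions at the two sites diverge. Restricting the comparison to shared terms sidesteps the question of how to treat terms held by only one site, which a fuller measure would need to address.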


Last modified: Mon May 13 08:41:50 1996