When do information retrieval systems that use two document clusters provide better retrieval performance than systems that use no clustering (essentially one large cluster)? We answer this question for one set of assumptions and suggest how it may be studied under other assumptions. The Cluster Hypothesis asks an empirical question about the relationship between documents and user-supplied relevance judgments, while the Cluster Performance Question proposed here focuses on when and why information retrieval or digital library performance differs between clustered and unclustered text databases. The question may be generalized to study the relative performance of m versus n clusters, as well as text categorization systems and models.
This may be viewed as examining the conditions under which decision-theoretic or probabilistic information retrieval (using the binary independence model) performs better, the same, or worse when documents are in a single cluster (i.e., unclustered) than when they are in two clusters, with the documents that have the query feature in one cluster and those without it in the other. The analysis uses the analytic model for information retrieval. The work may be seen as related to the analytic model of distributed information retrieval performance.
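The two-cluster setup can be illustrated with a small sketch. It is not the analytic model used in this work. It is a toy calculation under assumptions introduced here: documents inside a cluster are examined in random order, the unclustered baseline is one randomly ordered list, and performance is the expected position of a relevant document (lower is better). The function names and the example counts are hypothetical.

```python
from fractions import Fraction


def expected_rank_single(n_docs):
    """Expected position of any document in one randomly ordered list."""
    return Fraction(n_docs + 1, 2)


def expected_rank_two_clusters(first, second):
    """Expected position of a relevant document when two clusters are
    searched one after the other.

    Each argument is a (n_docs, n_relevant) tuple; `first` is searched
    before `second`. Order within a cluster is assumed to be random.
    """
    n1, r1 = first
    n2, r2 = second
    if r1 + r2 == 0:
        raise ValueError("at least one relevant document is required")
    pos_first = Fraction(n1 + 1, 2)        # mean position inside cluster 1
    pos_second = n1 + Fraction(n2 + 1, 2)  # cluster 2 starts after cluster 1
    return (r1 * pos_first + r2 * pos_second) / (r1 + r2)


# Hypothetical collection: 100 documents, 10 of them relevant.
with_feature = (20, 8)      # documents containing the query feature
without_feature = (80, 2)   # documents lacking the query feature

single = expected_rank_single(100)
feature_first = expected_rank_two_clusters(with_feature, without_feature)
feature_last = expected_rank_two_clusters(without_feature, with_feature)

# If relevant documents are spread evenly (10% in each cluster),
# clustering on the feature gives no advantage over the single list.
uninformative = expected_rank_two_clusters((20, 2), (80, 8))
```

Under these assumptions, searching the feature cluster first beats the single list when relevant documents are concentrated in it, searching it last does worse, and an uninformative feature gives the same result as no clustering, which mirrors the better, same, or worse cases described above.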
Return to Losee home page at http://www.ils.unc.edu/~losee