Robert M. Losee,
Preface of Text Retrieval and Filtering: Analytic Models of Performance, Boston: Kluwer, 1998.

Information retrieval scholars have long studied retrieval performance experimentally. While document ranking procedures consistent with formal models have been developed, little work has been done showing how combining a given ranking procedure, query, and database results in a particular level of performance. In this book we describe means by which information retrieval may be studied analytically, allowing one to describe current performance, to predict future retrieval performance, and to understand why systems perform as they do. Specific means for computing the expected performance are developed, having at their core the average search length performance measure: the expected position of a relevant document in a ranked list of documents. This measure has an advantage over many other performance measures in that it is relatively easy to predict.
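As a rough illustration of the average search length just described, the sketch below measures it on a single ranked list whose relevance judgments are already known. It is only a minimal empirical reading of the definition, the mean 1-based position of the relevant documents, and not the analytic prediction method developed in the book. The function name, the boolean encoding of relevance, and the assumption that no documents are tied in rank are all choices made for this example.

```python
from typing import Sequence


def average_search_length(relevance: Sequence[bool]) -> float:
    """Mean 1-based rank position of the relevant documents in a ranked list.

    `relevance[i]` is True when the document at rank i + 1 is relevant.
    Assumes a strict ranking (no ties), which is a simplification made here.
    """
    positions = [
        rank
        for rank, is_relevant in enumerate(relevance, start=1)
        if is_relevant
    ]
    if not positions:
        raise ValueError("ranked list contains no relevant documents")
    return sum(positions) / len(positions)


# Hypothetical ranking: relevant documents sit at ranks 1 and 3.
ranking = [True, False, True, False, False]
print(average_search_length(ranking))  # 2.0
```

A smaller value means that relevant documents tend to appear nearer the front of the list, so a searcher reads fewer documents before reaching them.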

This work focuses on the performance of systems that retrieve natural language text, considering full sentences as well as phrases and individual words. The last chapter explicitly addresses how grammatical constructs and methods may be studied in the context of retrieval or filtering system performance. The book builds toward solving this problem, although the material in earlier chapters is as useful to those addressing non-linguistic, statistical concerns as it is to linguists. Those interested in grammatical information are advised to examine the earlier chapters carefully, especially Chapters 7 and 8, which discuss purely statistical relationships between terms, before moving on to Chapter 10, which explicitly addresses linguistic issues.

Unambiguous statements of the conditions under which one method or system will be more effective than another are developed. After a simple measure of performance is developed, techniques for predicting performance are proposed, first for single term queries, and then for multiple term queries, with and without linguistic knowledge. Many of the most important results are presented in the text as theorems, with supporting arguments that the author hopes will be convincing to the reader, while avoiding some details one would find in more formal mathematical proofs. Corollaries provide assertions that may be less important than the theorems on which they are based. We also provide conjectures that the author believes to be true but that may need both theoretical and experimental results if they are to be generally accepted. Together, these techniques and results provide tools for studying individual systems or comparing the performance of different systems.

The best way to learn the analytic techniques that serve as the basis for these text filtering and retrieval models is to manipulate the models directly, which teaches both the techniques and the assumptions underlying them. This may be done through the book's exercises and by paper-and-pencil work with the models. Symbolic math packages, such as Mathematica, Maple, and Macsyma, can be used to develop both symbolic and numeric solutions to more complex equations, enabling one to expand a model or to verify it using an experimental database. Multi-term analytic models can become quite large, and coding them using symbolic mathematics packages on a computer or a high-end calculator may lead the reader to a deeper understanding of these models than might be obtained through arduous and error-prone paper-and-pencil manipulations. These computer packages are useful learning and computing tools and have been used in the production of a few of the more complex analytic formulae in the book, as well as all of the graphics.

We provide a number of exercises of varying degrees of difficulty. Those marked easy normally require only a few minutes' worth of effort to answer, and all of them should be completed by the reader who wishes to fully integrate the material in the preceding section. Exercises marked moderate might take ten minutes to an hour or two for most people to solve. More difficult questions are marked difficult and may take significantly more effort and require more insight than most homework problems. Research problems are marked research. These research exercises are significant problems that can be answered analytically using methods described here and whose answers would advance knowledge about text filtering and retrieval. Scholars reading this work should examine each research problem and consider it, however briefly, before moving on. We have also included some exercises that examine experimentally those aspects of retrieval and filtering where knowledge can be gained only through the gathering of empirical data.
