
Abstract:
A probabilistic document retrieval system or search engine may be seen as consistent with a sequential learning process, in which the system learns the characteristics of relevant documents, or, more formally, it infers the parameters of probability distributions describing the frequencies of feature occurrences in relevant and nonrelevant documents. Probability distributions that may be used to describe the distribution of features include binary and Poisson distributions. Techniques for estimating the parameters of distributions are suggested and are used in the production of term weights. We have tested a proposal that parameters of distributions describing the distribution of features in nonrelevant documents be estimated from the parameters of the corresponding distribution for the entire database; the confidence parameters of such an estimate, resulting in the highest average precision, are given. Tests of several methods for estimating the parameters of distributions describing the distribution of features in relevant documents suggest that it is best to use small values of the confidence parameters in initial estimates of parameters for relevant documents.
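As a rough illustration of the kind of estimate described above, the following sketch combines observed feature counts with an initial estimate weighted by a confidence parameter, then forms a log-odds term weight for a binary feature. It assumes a Beta-style prior in which the confidence parameter acts as a count of pseudo-documents; the function names, the specific formula, and all numbers are hypothetical choices made for this example and are not taken from the paper.

```python
import math


def estimate_p(occurrences, n_docs, prior_mean, confidence):
    """Posterior-mean estimate of a binary feature probability.

    Assumption of this sketch: the prior behaves like `confidence`
    pseudo-documents in which the feature occurs at rate `prior_mean`.
    """
    return (occurrences + confidence * prior_mean) / (n_docs + confidence)


def term_weight(p, q):
    """Log-odds weight for a binary feature.

    p: probability the feature occurs in a relevant document.
    q: probability the feature occurs in a nonrelevant document.
    """
    return math.log((p * (1.0 - q)) / (q * (1.0 - p)))


# Hypothetical values, for illustration only.
db_freq = 0.05  # fraction of all database documents containing the term

# With no nonrelevant judgments yet, q falls back to the database-wide rate.
q = estimate_p(0, 0, db_freq, 10.0)

# 3 of 4 judged relevant documents contain the term; a small confidence
# value lets the observed counts dominate the initial estimate.
p = estimate_p(3, 4, db_freq, 1.0)

w = term_weight(p, q)
```

With a small confidence value, a handful of relevance judgments quickly moves the estimate for relevant documents away from its starting point, which is the behaviour the abstract reports as preferable for initial estimates.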
For related work comparing term independence and term dependence, see: Losee, Bookstein, and Yu, "Probabilistic Models for Document Retrieval: A Comparison of Performance on Experimental and Synthetic Databases." Proceedings of the ACM SIGIR Conference (Pisa, Italy), New York: ACM Press, 1986, pp. 258-264.