Robert M. Losee,
Term Dependence: A Basis for Luhn and Zipf Models,
Journal of the American Society for Information Science and Technology,
52 (12), pp. 1019-1025, 2001.
Text of full article (in pdf)

Abstract:

There are regularities in the statistical information provided by natural language terms about neighboring terms. We find that when phrase rank increases, moving from common to less common phrases, the value of the expected mutual information measure (EMIM) between the terms regularly decreases. Luhn's model suggests that mid-range terms are the best index terms and relevance discriminators. We suggest reasons for this principle based on the empirical relationships shown here between the rank of terms within phrases and the average mutual information between terms, which we refer to as the Inverse Representation--EMIM principle. We also suggest an Inverse EMIM term weight for indexing or retrieval applications that is consistent with Luhn's distribution. An information theoretic interpretation of Zipf's Law is provided. Using the regularity noted above, we suggest that Zipf's Law, a power law, is a consequence of the statistical dependencies that exist between terms, described here using information theoretic concepts.
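
As a rough illustration of the quantity the abstract refers to, the sketch below computes the expected mutual information between two binary term-occurrence variables using the standard textbook definition (the sum over all four presence/absence combinations of P(x,y) log [P(x,y) / (P(x) P(y))]). The abstract does not say how the article estimates EMIM from text, so the function name `emim`, the 2x2 table of counts, and the choice of base-2 logarithms are assumptions made here for illustration only, not the article's procedure.

```python
import math


def emim(n11, n10, n01, n00):
    """Expected mutual information (in bits) between two binary term variables.

    Hypothetical inputs (not taken from the article): counts of text windows in which
      n11: both terms occur       n10: only the first term occurs
      n01: only the second occurs n00: neither occurs
    """
    total = n11 + n10 + n01 + n00
    if total == 0:
        raise ValueError("the count table is empty")

    # Joint probabilities P(x, y), where x and y are 1 (present) or 0 (absent).
    joint = {
        (1, 1): n11 / total,
        (1, 0): n10 / total,
        (0, 1): n01 / total,
        (0, 0): n00 / total,
    }
    # Marginal probabilities for each term.
    p_first = {1: (n11 + n10) / total, 0: (n01 + n00) / total}
    p_second = {1: (n11 + n01) / total, 0: (n10 + n00) / total}

    score = 0.0
    for (x, y), p in joint.items():
        if p > 0:  # empty cells contribute nothing to the sum
            score += p * math.log2(p / (p_first[x] * p_second[y]))
    return score
```

Under these assumptions, terms that occur independently of each other give a value of zero, and terms that always occur together (or are always absent together) give the largest value available for that pair. The abstract reports that EMIM decreases regularly as phrase rank increases.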

Return to Losee home page at http://www.ils.unc.edu/~losee