Next: Experimental Rankings
Up: Measuring Search Engine Quality
Previous: Analytic Models of Performance
Any experiments involving Freestyle and Target must of necessity be "black
box" experiments [RHB92], since the algorithms
used in these retrieval systems are trade secrets. Based on system
documentation, however, we can conclude that both systems employ algorithms
based on the vector space and probabilistic models, although the exact
values used to calculate relevance remain a mystery. In their evaluation of
Target, Tenopir and Cahn [TC94] state that document weights are adjusted for
document length, but Keen [Kee94] asserts that he did not detect any clear
evidence of such adjustment. Ingwersen [Ing96] suggests that "Target is
applying quorum logic (in the traditional way), document term frequencies
and collection term frequencies as elements of its ranking algorithm," (p. 45) but
provides no evidence for this claim.
We do know, however, that
Target's ranking algorithm includes at least four variables [Kee94]:
1. number of search terms in each record,
2. proximity of search terms to each other in a record,
3. frequency of a term in the database, and
4. length of the document.
Freestyle, on the other hand, reveals somewhat more about the retrieval
process it uses. For example, the .WHERE and .WHY screens
in Freestyle show that a term's weight is inversely proportional to its
frequency in the database. In fact, the Freestyle HELP explanation about
query term weights states that "term importance is based on how frequently
the term appears in the file(s) you are searching. The more often a term
occurs, the lower its term importance." These facts, then, suggest that the
system employs some version of inverse document frequency to calculate term
weights [SJ72]. Under inverse document frequency (IDF) weighting,
"terms with medium to low collection frequencies are
assigned high weights as good discriminators, while frequent terms have low
weights" [RSJ76, pp. 129-30].
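The inverse relationship between collection frequency and term weight can be illustrated with a minimal sketch. It assumes the common logarithmic form log(N/n), where N is the number of records in the database and n is the number of records containing the term. The function name `idf_weight` and its parameters are hypothetical, and nothing here is claimed to be the formula Freestyle actually uses, which is not published.

```python
import math


def idf_weight(num_records: int, records_with_term: int) -> float:
    """Illustrative IDF weight, log(N / n).

    Assumption: this is the textbook logarithmic form, not a formula
    documented for Freestyle. Rare terms receive high weights and
    frequent terms receive low weights.
    """
    if records_with_term <= 0 or records_with_term > num_records:
        raise ValueError("records_with_term must be between 1 and num_records")
    return math.log(num_records / records_with_term)


# A term found in 10 of 1000 records outweighs one found in 500 of 1000.
rare = idf_weight(1000, 10)
common = idf_weight(1000, 500)
print(rare > common)
```

A term that occurs in every record receives a weight of zero under this form, which matches the intuition that such a term cannot discriminate between records.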
The matching
algorithm for Freestyle appears to be derived from the vector
space and probabilistic models, where the weight of each document is the sum of
the products of term weights and frequencies of the terms in the document.
The ranking algorithm for Freestyle, then, appears to involve a minimum of three
variables:
1. frequency of a search term within the database,
2. frequency of a search term in a record, and
3. number of search terms in a record.
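The sum-of-products matching rule described above can be sketched as follows. This is only an illustration under stated assumptions: the term weight is taken to be log(N/n), and the names `record_score`, `record_tokens`, and `database_freq` are hypothetical. The actual weights and any normalization applied by Freestyle are not known.

```python
import math
from collections import Counter


def record_score(query_terms, record_tokens, database_freq, num_records):
    """Sum over matched query terms of (term weight * frequency in record).

    Assumptions (not taken from Freestyle documentation):
    - term weight is log(num_records / number of records containing the term)
    - database_freq maps each term to the number of records containing it
    """
    in_record = Counter(record_tokens)
    score = 0.0
    for term in set(query_terms):
        n = database_freq.get(term, 0)
        if n == 0 or in_record[term] == 0:
            continue  # term absent from the database or from this record
        score += math.log(num_records / n) * in_record[term]
    return score


# Hypothetical data: two records scored against a two-term query.
database_freq = {"merger": 10, "bank": 500}
query = ["merger", "bank"]
both_terms = record_score(query, ["bank", "merger", "bank"], database_freq, 1000)
one_term = record_score(query, ["bank", "bank"], database_freq, 1000)
print(both_terms > one_term)
```

All three variables listed above enter the score: the database frequency sets each term's weight, the frequency in the record multiplies it, and each additional matched search term adds another product to the sum.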
Bob Losee
1999-07-29