Next: Discussion and Recommendations
Up: Measuring Search Engine Quality
Previous: Performance Superiority over a
The difficulty associated with an individual query, A, may be compared to other query-specific performance figures in an effort to validate the use of the A measure.
A strong correlation between the measures would support the validity of the proposed A measure.
While a correlation between A and a measure M
may show a relationship, it does not necessarily imply that A is measuring the same phenomenon as M.
Paris and Tibbo [PT98] report a set of E values,
which we correlated with the A values in our study.
The E values were obtained at the highest recall available for that particular query from the CF database.
No E value was reported for query number 2, which had no relevant documents.
We conservatively assigned this query the worst-case E value (1)
in this study.
The Spearman rank correlation between the A values and the E values is .523, and the Pearson product moment correlation is .407.
We may interpret these moderate positive correlations as indicating the degree to which the value of a traditional performance measure such as E is due to the difficulty of the individual queries.
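The two correlation statistics reported above can be computed as in the following minimal sketch. It is an illustration only, not the procedure used in the study: the function names, the `WORST_CASE_E` constant, and the sample values are all hypothetical, and the sample values are not data from this study or from [PT98]. A missing E value is represented as `None` and replaced by the worst-case value of 1, as was done for query number 2.

```python
import math

# Hypothetical constant: worst-case E value substituted for unreported queries.
WORST_CASE_E = 1.0


def fill_missing_e(e_values):
    """Replace unreported E values (None) with the worst-case value."""
    return [WORST_CASE_E if e is None else e for e in e_values]


def pearson(xs, ys):
    """Pearson product moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Made-up illustrative values, one entry per query.
    a_values = [0.2, 0.5, 0.1, 0.9, 0.4]
    e_values = fill_missing_e([0.3, None, 0.2, 0.8, 0.6])
    print("Pearson:", pearson(a_values, e_values))
    print("Spearman:", spearman(a_values, e_values))
```

Computing Spearman as the Pearson correlation of average ranks handles tied values, which are likely when several queries share the same E value.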
There is a weak positive correlation between the A values and the number of natural language terms in the query,
with a Pearson correlation of .172 and a Spearman rank correlation
of .126.
Because A measures query difficulty, this suggests that shorter queries produce somewhat better results than do longer
queries, which is contrary to the idea that the increased richness obtained with longer queries makes up for the additional noise created by adding terms.
Several factors may be at work here.
Some of the longer queries include details about what the searcher wants.
For example, query 34 ends with the clause "... what are their relative
advantages and disadvantages?"
Query 37 adds a second question, "... and what factors contribute to
erroneous results of these tests?"
These longer queries express information needs that are inherently
more abstract and less topical.
The additional clauses add little to the performance of a term-matching or weighting search
engine, although they are certainly helpful to human
searchers in developing queries and evaluating documents.
The correlation between the A values and the number
of query terms found in a public domain medical dictionary
was negligible, suggesting that query difficulty is not
simply a matter of adding or deleting sublanguage terms from natural
language queries.
The unnamed machine-readable medical dictionary
was obtained from the PC-SIG library of
public domain software
(Disk 4160, 13th edition, CD-ROM version) and was manually supplemented to
include most of the specialized medical terms
found in the CF database.
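The two per-query counts discussed above, the number of natural language terms and the number of terms found in the medical dictionary, might be obtained as in the sketch below. The tokenization rule (lowercased runs of letters and digits) is an assumption; the text does not specify how terms were counted. The function names, the small dictionary, and the second sample query are hypothetical.

```python
import re


def tokenize(query):
    """Split a query into lowercase alphanumeric terms (assumed rule)."""
    return re.findall(r"[a-z0-9]+", query.lower())


def count_terms(query):
    """Number of natural language terms in the query."""
    return len(tokenize(query))


def count_dictionary_terms(query, dictionary):
    """Number of query terms that also appear in the dictionary."""
    return sum(1 for term in tokenize(query) if term in dictionary)


if __name__ == "__main__":
    # Hypothetical stand-in for the supplemented medical dictionary.
    medical_terms = {"pseudomonas", "sputum"}
    query = "Pseudomonas in sputum cultures"
    print(count_terms(query), count_dictionary_terms(query, medical_terms))
```

Either count can then be correlated with the A values, one pair of numbers per query.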
Bob Losee
1999-07-29