
Measuring Retrieval Performance

There are a number of ways in which retrieval performance may be evaluated. A wide range of performance measurement techniques has recently been summarized by Harter and Hert [HH97] and by Burgin [Bur99]. Most of the popular measures assume that documents are either relevant or non-relevant, referred to as binary relevance. There is continuing research on types of relevance [Bar94,Sch94,SW95], on how individuals use the concept of relevance [TS98], and on performance measures that explicitly allow for continuous relevance [Los98]. We assume here that documents belonging to a certain set may be referred to as relevant and that all documents outside this set are non-relevant. We consider the relevance judgments in experimental databases to be approximations of the relevance judgments that might be provided by an actual user.

Performance is most frequently described in terms of precision, the probability that a retrieved document is relevant, and recall, the probability that a relevant document has been retrieved. The performance of a search as it progresses may be shown with a precision-recall curve, which displays the quality of the retrieved set at successive points in the search. Precision and recall have been combined into two measures used primarily in the research community, the E and F measures, where E = 1 - F and F is the harmonic mean of precision and recall [VR74,Sha86,SBH97].

Another measure, the average search length (ASL), is the average position of the relevant documents in the ranked list of documents. A small ASL represents superior performance, with the relevant documents moved toward the front of the ranked list. Conversely, a large ASL, more than N/2, where N is the number of documents in the database, represents worse-than-random performance; when ASL = N/2, performance is random. Related to the ASL is Cooper's expected search length (ESL), which counts only the non-relevant documents [Coo68]. The ESL has an economic interpretation, in which each non-relevant document retrieved is treated as having a cost; the ESL is then the average cost associated with retrieving documents, a number that a retrieval system should minimize.

In addition to computing the ASL over the full database, we may compute it up to points in the search other than the end of the database. Computed as above, the ASL is the average position of a relevant document in the ranked list of the first N documents; other cutoffs may be used to study retrieval performance up to specific points in the ordered set of documents. A small cutoff might be used to study what is often referred to as a high-precision search, while the traditional ASL essentially measures performance over a large part of the full database, a high-recall search.
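For reference, these measures can be written compactly as follows. The notation, P for precision, R for recall, and \mathcal{R} for the set of relevant documents with ranked positions rank(d), is introduced here for convenience and is not taken from the cited sources:

\[
F \;=\; \frac{2PR}{P+R}, \qquad E \;=\; 1-F, \qquad
\mathrm{ASL} \;=\; \frac{1}{|\mathcal{R}|} \sum_{d \in \mathcal{R}} \mathrm{rank}(d).
\]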
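As a concrete sketch of how these quantities might be computed from ranked output, the following Python fragment assumes a list of binary relevance judgments in rank order. The function names and the simple unit-cost form of the ESL (counting the non-relevant documents seen before a desired number of relevant documents has been found) are illustrative assumptions rather than the exact formulations in the cited work.

# Illustrative sketch: computing the measures discussed above from a
# ranked list of binary relevance judgments. ranking[i] is True if the
# document at rank i+1 is relevant.

def precision_recall(ranking, cutoff, total_relevant):
    """Precision and recall after examining the first `cutoff` documents."""
    retrieved = ranking[:cutoff]
    relevant_retrieved = sum(retrieved)
    precision = relevant_retrieved / cutoff
    recall = relevant_retrieved / total_relevant
    return precision, recall

def f_and_e(precision, recall):
    """F is the harmonic mean of precision and recall; E = 1 - F."""
    if precision + recall == 0:
        return 0.0, 1.0
    f = 2 * precision * recall / (precision + recall)
    return f, 1 - f

def average_search_length(ranking):
    """ASL: the average (1-based) position of the relevant documents."""
    positions = [i + 1 for i, rel in enumerate(ranking) if rel]
    return sum(positions) / len(positions)

def expected_search_length(ranking, wanted):
    """A simple form of Cooper's ESL: the number of non-relevant documents
    examined before `wanted` relevant documents have been found."""
    found = non_relevant_seen = 0
    for rel in ranking:
        if rel:
            found += 1
            if found == wanted:
                break
        else:
            non_relevant_seen += 1
    return non_relevant_seen

# Example: a database of N = 8 documents, 3 of them relevant.
ranking = [True, False, True, False, False, True, False, False]
p, r = precision_recall(ranking, cutoff=4, total_relevant=3)
f, e = f_and_e(p, r)
print(p, r, f, e)                                 # 0.5 0.667 0.571 0.429
print(average_search_length(ranking))             # (1 + 3 + 6) / 3 = 3.33
print(expected_search_length(ranking, wanted=2))  # 1 non-relevant document seen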
Bob Losee
1999-07-29