A document classifier places documents together in a linear arrangement for browsing or high-speed access by human or computerized information retrieval systems. Requirements for document classification and browsing systems are developed from similarity measures, distance measures, and the notion of subject aboutness. A requirement that documents be arranged so that their similarity to a given document decreases as their distance from it increases often cannot be met. Based on these requirements, information-theoretic considerations, and the Gray code, a classification system is proposed that can classify documents without human intervention. The proposed system provides a theoretical justification for classification numbers that move from broad to narrow topics when read from left to right. A general measure of classifier performance is developed and used to evaluate experimental results that compare the distances between subject headings assigned to documents classified by the proposed system and by the Library of Congress Classification (LCC) system. Browsing in libraries, hypertext, and databases is usually considered to be the domain of subject searches. The proposed system can incorporate classification both by subject and by other forms of bibliographic information, generalizing browsing to include all features of an information-carrying unit. One can similarly browse through tables of data.
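As a rough illustration of how a Gray code can induce a shelf order, the following minimal sketch assumes each document is described by a vector of binary features (1 if a feature is present, 0 if not), with the broadest feature leftmost. That representation, and the function names gray_encode, gray_decode, shelf_position, and hamming, are assumptions made for this example and are not taken from the paper. The sketch uses the standard binary reflected Gray code: a document's feature vector is read as a Gray codeword, and its shelf position is the rank of that codeword in the Gray sequence.

```python
from itertools import product


def gray_encode(n: int) -> int:
    """Return the binary reflected Gray codeword for rank n."""
    return n ^ (n >> 1)


def gray_decode(g: int) -> int:
    """Return the rank of Gray codeword g (inverse of gray_encode)."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def shelf_position(features) -> int:
    """Map a tuple of 0/1 features (broadest first) to a shelf rank.

    Assumption for this sketch: the feature vector itself is treated
    as a Gray codeword, so the shelf rank is its decoded value.
    """
    g = 0
    for bit in features:
        g = (g << 1) | bit
    return gray_decode(g)


def hamming(a, b) -> int:
    """Count the features on which two documents differ."""
    return sum(x != y for x, y in zip(a, b))


# Every possible three-feature document, arranged in shelf order.
shelf = sorted(product((0, 1), repeat=3), key=shelf_position)

# Number of differing features between the first document and each
# document further along the shelf.
distances_from_first = [hamming(shelf[0], doc) for doc in shelf]

if __name__ == "__main__":
    for doc, dist in zip(shelf, distances_from_first):
        print(doc, shelf_position(doc), dist)
```

Under these assumptions, neighbouring shelf positions always differ in exactly one feature, and the leftmost feature splits the shelf into two contiguous halves, which is one way to read the broad-to-narrow ordering of classification numbers. The same ordering also shows why strictly decreasing similarity cannot always be achieved: counting differing features from the first document along the shelf gives 0, 1, 2, 1, 2, 3, 2, 1, so a more similar document can sit further away than a less similar one.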