Introduction: |
Since the arrival of the Internet, librarians have been getting quite a bit of attention. Claims have flown that the library is dead; that all information can be, or will be, available via electronic means. Burn the books; throw away the keys. Others claim that the future will be some new super cyberlibrary.
I hold with the viewpoint that libraries are changing; that they're a long way from dead. In fact, when all of the dust settles from the present Internet romance, it will be discovered that the public needs librarians more than ever. Any data base is only as good as its organization. The librarians are the ones who understand the users and their needs and habits. In the future, librarians will find themselves being teachers, explainers, guides to the Information Superhighway.
In the fall of 1997, all incoming medical students at the University of North Carolina in Chapel Hill will be issued laptop computers. They are to use these computers not only to capture lecture notes, but to become part of the wired community. They will use these laptops as research links: to run MEDLINE searches, to connect to the electronic reference desk at UNC, and to connect to the Internet. They will also be using them to connect when they are doing their third-year rotations, out in the mountains of Western North Carolina or on the Outer Banks. These students will be doing medical research when there is no library and no librarian. They will be uncovering some pretty questionable material. Most of these students are used to having their information prepared for them. Libraries, school boards, and faculty have all researched and selected the traditional educational materials. But on the Net it is a free-for-all. How to access and evaluate information on the Web will become an essential skill for librarians to teach.
Unlike the orderly world of the library collection, the World Wide Web is dynamic, chaotic, often disorganized and includes information of dubious quality. The useful and the useless are lumped together in this huge collection. Academic information is combined with humor, or advertisements, or with personal home pages. To be of any information value, the data must first be organized, be retrievable and be evaluated.
Information Quality: |
Information quality is a slippery subject. Although many might disagree, there is rarely a single absolute truth. In many cases, what is truth to me, may be nonsense to you. The best resources for a medical researcher are useless to the elementary school student and vice versa. However, there are hallmarks of what is consistently "good" information. The most basic requirements of good information are:
To achieve quality in electronic information, it is necessary to be sure that one is retrieving all of the relevant information, and then to determine what of the retrieved information is valuable; what information is free of bias, propaganda, or omissions. This is a two step process:
To have quality information, three things are necessary:
The World Wide Web holds the potential for becoming the greatest repository of knowledge ever created. Different from the traditional library, material on the Web is frequently self-published, stored in quasi-secured repositories, and often of unknown validity. The government, and it would seem a majority of the American population, favor public access to the Web through public libraries and public schools. Librarians are facing a new set of challenges in helping patrons access and utilize this new medium. Schools and public libraries face three main challenges:
Estimates hold that there are over 50 million HTML pages posted on the Web, and the number of entries is multiplying every day. The untrustworthiness and mediocrity of information resources on the World Wide Web is a well-recognized and often discussed dilemma. Problems in using the Internet for academic research are many. These problems are not unique to the Web; they are common issues of poor scholarship in any medium. But the ability to publish without a review process has vastly increased the number of occurrences. The most common problems are:
Links to other Web resources may be established without adequate thought for the target's relevance or quality. There is pronounced circularity of links. As with traditional research writing, there is a tendency to list far too many resources. But when this is done in a hyperlinked form, it results in web pages which have too many links to be adequately validated and maintained. The end result is scholarship with broken (rotted) links.
Information Retrieval (or what's this search engine doing?) |
With the explosive growth of the Internet in recent years, a number of services have arisen to help users search and retrieve documents from servers around the world. There are basically two types of retrieval occurring on the Web: indexes and search engines. To understand the differences between the two, it helps to know some of the terms used within the field of Information Retrieval:
Recall: recall is defined as the number of relevant documents retrieved divided by the total number of relevant documents in the collection. For example, suppose there are 80 documents relevant to widgets in the collection. System X returns 60 documents, 40 of which are about widgets. Then X's recall is 40/80 = 50%. In an ideal world, recall is 100%. However, since this is trivial to achieve (by retrieving all of the documents), a system attempts to maximize both recall and precision simultaneously. [2]
Precision: precision is defined as the number of relevant documents retrieved divided by the total number of documents retrieved. For example, suppose there are 80 documents relevant to widgets in the collection. System X returns 60 documents, 40 of which are about widgets. Then X's precision is 40/60 = 67%. In an ideal world, precision is 100%. Since this is easy to achieve (by returning just one document), a system attempts to maximize both precision and recall simultaneously. [2]
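The two definitions above can be checked with a few lines of Python. This is only a minimal sketch: the document IDs are made up so that the counts match the widget example (80 relevant documents in the collection, 60 retrieved, 40 of them relevant), and the function names are illustrative.

```python
def recall(retrieved: set, relevant: set) -> float:
    """Relevant documents retrieved / all relevant documents in the collection."""
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved: set, relevant: set) -> float:
    """Relevant documents retrieved / all documents retrieved."""
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical document IDs chosen to reproduce the widget example:
# IDs 0-79 are the 80 documents relevant to widgets.
relevant_docs = set(range(80))
# System X returns 60 documents: 40 relevant ones plus 20 irrelevant ones.
retrieved_docs = set(range(40)) | set(range(100, 120))

widget_recall = recall(retrieved_docs, relevant_docs)        # 40 / 80
widget_precision = precision(retrieved_docs, relevant_docs)  # 40 / 60

print(f"recall = {widget_recall:.0%}, precision = {widget_precision:.0%}")
```

Note that both measures need the set of relevant documents; recall additionally needs to know every relevant document in the whole collection, not just those that came back.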
Trying to accomplish a high recall rate on the Internet is difficult, due to the enormous volume of information which must be searched. Recall is virtually impossible to calculate for the Web, since it is impossible to determine the total number of relevant documents which could be retrieved. Using an information retrieval system within a fixed system (with a known limit of data sets), a good search engine should be able to return a high percentage score in both recall and precision, though to some degree one measurement will serve as the inverse function of the other. This inverse relationship is exacerbated on the Web. One can have quick retrieval, or relevance, but not both.
There is a constant competition to be the "best search engine". Multimedia Magazine runs an annual "Search Robot Test" to determine which is the best. "Best" is defined as "Effectiveness = Precision*Recall." All of the engines listed reported recall numbers between 80% and 100%, with precision ranging from 45% to 70%. [3] As was discussed earlier, these numbers appear to be a bit arbitrary, since it is impossible to truly quantify the number of relevant documents on the Web.
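To get a feel for the "Effectiveness = Precision*Recall" score, the short sketch below simply multiplies the two measures. The inputs are the endpoints of the ranges quoted above, not the scores of any particular engine, and the function name is an assumption made for illustration.

```python
def effectiveness(precision: float, recall: float) -> float:
    """Score used in the test described above: precision multiplied by recall."""
    return precision * recall


# Endpoints of the reported ranges (recall 80-100%, precision 45-70%).
lowest = effectiveness(precision=0.45, recall=0.80)
highest = effectiveness(precision=0.70, recall=1.00)

print(f"effectiveness ranges from about {lowest:.2f} to {highest:.2f}")
```

Because the score is a product, a weak value on either measure drags the whole score down, which is why neither the "return everything" nor the "return one document" strategy scores well.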


These computations contain one very subjective term: "relevant". Only the user will know fully what is relevant, and relevance can vary from one query to the next with the same user. Relevance can be improved through feedback. The process of Relevance Feedback refines the results of a retrieval through asking a further query. The user indicates the most relevant documents from those returned. The system then attempts to find terms common to that subset, and adds them to the old query. More documents are returned using the revised query. On some Web search engines, this can be done by clicking the hyperlink, "find similar documents".
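The relevance feedback loop just described can be sketched in a few lines. This is a simplified illustration, not how any particular Web search engine implements it: the four documents, the stopword list and the ranking rule (count of matching query terms) are all assumptions made for the example.

```python
# Hypothetical stopword list and document collection, for illustration only.
STOPWORDS = {"the", "a", "of", "in", "for", "and", "to", "is", "with"}

DOCS = {
    "d1": "asthma treatment in children inhaled steroids",
    "d2": "asthma in children and school attendance",
    "d3": "treatment of adult hypertension",
    "d4": "inhaled steroids for children with wheezing",
}


def tokenize(text):
    """Lower-case a text and return its set of non-stopword terms."""
    return {word for word in text.lower().split() if word not in STOPWORDS}


def search(query_terms, docs):
    """Return IDs of documents sharing a term with the query, best match first."""
    scored = []
    for doc_id, text in docs.items():
        overlap = len(query_terms & tokenize(text))
        if overlap:
            scored.append((overlap, doc_id))
    return [doc_id for overlap, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


def expand_query(query_terms, relevant_ids, docs):
    """Add the terms common to all documents the user marked as relevant."""
    if not relevant_ids:
        return set(query_terms)
    common = set.intersection(*(tokenize(docs[d]) for d in relevant_ids))
    return set(query_terms) | common


first_results = search({"asthma"}, DOCS)
# The user marks both returned documents as relevant.
revised_query = expand_query({"asthma"}, first_results, DOCS)
second_results = search(revised_query, DOCS)
```

In this toy collection the revised query picks up the shared term "children", so the second pass also returns the document on wheezing, which never mentions asthma by name.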
The Directory and the Search Engine: |
There are two major categories of searching tools on the Web: directories (what we usually think of as an index) and search engines. Both require an underlying indexing system for the information retrieval. Building an index can be done by either human or computer. Back in the beginning of Web time, say two or three years ago, it was possible to tell the difference between a directory and a search engine. And it mattered. But today, most searching uses a combination of the two. However, it is good to be aware of the differences so as to understand and evaluate the results.
The directory is what we usually think of as an index. The index is the format most familiar to library users. The traditional card catalog is an index. Our textbooks and cookbooks and yearbooks have indexes. To create an index, an item is located, evaluated, categorized, and listed. There are a few indexes on the Web created through human evaluation. The majority of these are smaller and fairly subject specific. Yahoo is the best known of the rated indexes. It claims to have a team of indexers who surf, record and index 1,500 sites daily. Since 1996, a large number of rated or reviewed indexes have appeared in the medical/health science field. Some are associated with a university or other academic institution; others are advertising efforts created by the drug companies.
Indexes typically start with a very broad subject heading, then narrow it down. For example, to look at immunization schedules in Emory University's MedWeb, it would involve moving through these subheadings:
Currently a war is being waged among search engines over speed, accuracy and size of index. Excite's contribution to this battle is fairly typical, making the claim, "With 50 million full-text URLs, Excite has indexed the most web pages of any navigation service. We have millions more pages than Infoseek, Alta Vista, Lycos, or Inktomi. Which means that no matter what you're looking for, you've got the best chance of finding it here.
Of course, other services will claim that their index is larger. That's because some companies use misleading methods to count the number of URLs in their index." [5]
The section below, "Health Science Links Worth a Look" lists several indexes with a health sciences focus.
Search Engines use computer-based devices (called 'spiders' or 'robots') to automatically log on to Web pages and index their contents. When one page is complete, these robots follow the links to new Web pages and repeat the process. Typically, robots work in either of two ways:
Using this technology, a search service such as Lycos or Alta Vista builds a proprietary index or database of Web documents. Some search services will scan the entire body of a document; others use hidden tags, created by the Web page's author, called 'meta-tags'. Search services then provide a search engine on the Web. This allows users to input a series of terms or topics, then searches through its database for documents containing these terms. The user receives a ranked list of documents which the search engine has determined match the specified criteria.
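The robot-and-index cycle described above can be illustrated with a toy example. The sketch below does not touch the network: the "Web" is a small hypothetical dictionary of pages, and the ranking rule (number of query terms found on a page) is an assumption for illustration, not the method used by Lycos, Alta Vista or any other service.

```python
from collections import defaultdict, deque

# Hypothetical miniature "Web": page address -> (page text, outgoing links).
TOY_WEB = {
    "a.html": ("asthma treatment guidelines", ["b.html", "c.html"]),
    "b.html": ("childhood asthma research", ["a.html"]),
    "c.html": ("immunization schedules", ["d.html"]),
    "d.html": ("asthma and allergy clinic", []),
    "e.html": ("unlinked page about asthma", []),
}


def crawl(seed, web):
    """Visit pages reachable from the seed and build a word -> pages index."""
    index = defaultdict(set)
    queue = deque([seed])
    seen = {seed}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        # Index the full text of the page.
        for word in text.lower().split():
            index[word].add(url)
        # Follow the links to pages not yet visited and repeat the process.
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, query):
    """Return pages containing any query term, ranked by number of terms matched."""
    scores = defaultdict(int)
    for term in query.lower().split():
        for url in index.get(term, ()):
            scores[url] += 1
    return sorted(scores, key=lambda u: (-scores[u], u))


index = crawl("a.html", TOY_WEB)
results = search(index, "asthma treatment")
```

The page e.html mentions asthma but is never returned, because no crawled page links to it. A robot can only index what it can reach, which is one reason no single search service finds everything.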
Search Engines all have one common goal: to provide the searcher with a fast, relevant retrieval response. To achieve this, the software must be able to "think", one way or another. In response, a number of logic systems have arisen to enable computers to do so.
Boolean Logic: Named after the nineteenth-century mathematician George Boole, Boolean logic is a form of algebra in which all values are reduced to either TRUE or FALSE. Boolean logic is especially important for computer science because it fits nicely with the binary numbering system, in which each bit has a value of either 1 or 0. Another way of looking at it is that each bit has a value of either TRUE or FALSE. [8]
Most librarians are very familiar with Boolean logic. With the Boolean operators, AND, OR and NOT, it is possible to combine terms and establish a relationship between the terms. Traditional information retrieval algorithms have focused on the use of Boolean logic. These function extremely effectively when used within a limited data set, indexed using a controlled vocabulary. MEDLINE is an example of this type of system. MeSH rigidly specifies the vocabulary and combination patterns. All articles are indexed according to this pattern.
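The three Boolean operators map directly onto set operations, which is why they work so well on a fixed, consistently indexed collection. The sketch below uses a tiny hypothetical index (the terms and document numbers are invented and are not MeSH headings) to show how AND, OR and NOT combine sets of matching documents.

```python
# Hypothetical controlled-vocabulary index: term -> set of document numbers.
INDEX = {
    "asthma": {1, 2, 4},
    "child": {2, 3},
    "adult": {1, 5},
}
ALL_DOCS = {1, 2, 3, 4, 5}


def boolean_and(a, b):
    """Documents indexed under both terms."""
    return a & b


def boolean_or(a, b):
    """Documents indexed under either term."""
    return a | b


def boolean_not(a, universe=ALL_DOCS):
    """Documents in the collection that are not indexed under the term."""
    return universe - a


asthma_and_child = boolean_and(INDEX["asthma"], INDEX["child"])
asthma_or_adult = boolean_or(INDEX["asthma"], INDEX["adult"])
asthma_not_adult = boolean_and(INDEX["asthma"], boolean_not(INDEX["adult"]))

print(sorted(asthma_and_child), sorted(asthma_or_adult), sorted(asthma_not_adult))
```

Every document is either in a result set or out of it; there is no notion of a document being a partial or probable match.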
While MEDLINE is a highly effective system for an experienced searcher to pull information out of, it can be extremely frustrating for the untrained searcher. Searches can be sloppy and crucial information missed. The rapid growth of electronic information and retrieval systems has pushed researchers to develop a number of new, more flexible searching systems. These are all part of the larger field of artificial intelligence, which is the process of creating software which will allow a computer to think like a person.
Fuzzy logic is the most basic of these newer search systems. In traditional Boolean systems, each term is either part of a set or not. Fuzzy logic is a logical pattern which recognizes more than just true and false values. With fuzzy logic, there is a rendering of true and false probabilities. Fuzzy logic is readily observable in the use of a spell-checker program, which will make probabilistic projections in the form of a list of words to replace the misspelled one.
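The spell-checker illustration can be sketched with Python's standard difflib module, which scores how similar two strings are on a scale from 0 to 1 rather than declaring them simply equal or not equal. The vocabulary, the misspelled word and the 0.6 cutoff are assumptions chosen for the example; real spell-checkers use their own dictionaries and scoring methods.

```python
from difflib import SequenceMatcher, get_close_matches

# Hypothetical dictionary of correctly spelled terms.
VOCABULARY = ["asthma", "anemia", "aspirin", "angina", "eczema"]


def similarity(a, b):
    """Graded match between two words: 1.0 is identical, 0.0 shares nothing."""
    return SequenceMatcher(None, a, b).ratio()


def suggest(word, vocabulary=VOCABULARY, cutoff=0.6):
    """Return up to three vocabulary words that resemble the input closely enough."""
    return get_close_matches(word, vocabulary, n=3, cutoff=cutoff)


suggestions = suggest("asthsma")
for candidate in VOCABULARY:
    print(f"{candidate:8s} {similarity('asthsma', candidate):.2f}")
```

A strict Boolean match would treat the misspelling as simply "not in the dictionary". The graded score lets the program rank candidates and offer the closest ones.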
Artificial Intelligence is the construction of a system for information management which can think in the same way that a human does. Expert systems attempt to perform a task that would otherwise be performed by a human expert. Some expert systems are designed to take the place of human experts, while others are designed to assist them. These work by studying how human experts make decisions and then translating the rules into terms that a computer can understand.
Natural Language Processing is the process of training computers to understand natural human language. Rather than focusing on matching terms or forming logical sets, Natural Language Processing (NLP) involves using a set of concepts to sort out the interrelationships of words. The computer breaks apart the sentence into its grammatical parts (nouns, verbs, adjectives, etc.) and then creates links between them. Since language can be ambiguous, vague, or metaphorical, NLP seeks to compute the relationships between words, giving each a correlate to the words around it. Put into a formula, the computer then makes assumptions based on its logic. [7]
Concept Based Searching is similar to natural language searching in that the software attempts to formulate a concept grasped from a conglomeration of terms within the query. This is related to the field of Thesaurus or Rich Aliasing, where a thesaurus is utilized to build upon a keyword. Connections are assumed between the query word and a keyword. Documents can then be retrieved using the keyword, even though that keyword may not have been present in the initial query.
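A thesaurus-based expansion of this kind might look like the sketch below. The thesaurus entries and the three documents are hypothetical, and real concept-based engines build their associations statistically or from much larger vocabularies; the point is only to show a document being retrieved through a keyword the searcher never typed.

```python
# Hypothetical thesaurus: query word (alias) -> preferred keyword.
THESAURUS = {
    "shots": "immunization",
    "vaccination": "immunization",
    "vaccine": "immunization",
    "wheezing": "asthma",
}

# Hypothetical document collection.
DOCUMENTS = {
    "doc1": "immunization schedules for infants",
    "doc2": "managing asthma at school",
    "doc3": "flu shots for seniors",
}


def expand(query):
    """Add the thesaurus keyword for every query word that has one."""
    terms = set(query.lower().split())
    keywords = {THESAURUS[term] for term in terms if term in THESAURUS}
    return terms | keywords


def search(query):
    """Return documents containing any of the original or added terms."""
    terms = expand(query)
    return sorted(
        doc_id
        for doc_id, text in DOCUMENTS.items()
        if terms & set(text.lower().split())
    )


shots_results = search("shots")
wheezing_results = search("wheezing")
```

A search on "shots" returns the immunization schedule as well as the flu shot page, because the thesaurus connects the query word to the keyword "immunization".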
MEDLINE search - traditional vs Natural Language |
To gain insight into the differences between search strategies, below is a search run in traditional MEDLINE and in a new natural language interface for MEDLINE. The search query was asthma treatment for an eight year old girl. I ran the first query with the University of North Carolina Ovid MEDLINE search interface. Both returned high quantities of hits, but the relevance varied widely. The first four articles retrieved through the Ovid MEDLINE interface were:
By comparison, the first four articles retrieved using the natural language interface were:
It would seem that the natural language interface is not able to discern the main concept (asthma) from the limiter (8 year old girl). The results show significant emphasis being placed upon the limiter instead of the prime topic. This and other similar problems will most likely be resolved in the near future, and we will see more use of natural language search engines.
Selecting a search engine |
Below is a quick overview of the major search engines and their capabilities. The importance of running a search on more than one engine cannot be stressed enough. Because of the magnitude of resources available on the Web and the difficulties in indexing and retrieving them, no one search engine can consistently find all the relevant documents. By running the same search on several platforms, it is possible to locate a higher percentage of relevant documents.
Additionally, it is important to become familiar with a few of your most consistently used search engines to know how their search strategies are managed and how to formulate your query so as to maximize effectiveness. [Table information synopsis 10]
Alta Vista |
http://www.altavista.com |
| Indexes | Web documents and Usenet newsgroups |
| Boolean |
|
| Truncation | An asterisk (*) can be used after three letters for variations of a word |
| Phrase Searching | Enclose phrase in quotes (" "). Capitalize proper names. |
| Other | Ranking done by word count:
|
Excite |
http://www.excite.com |
| Indexes | Web documents, Usenet newsgroups |
| Boolean |
|
| Truncation | Not available |
| Phrase Searching | Enclose phrase in quotes (" ") |
| Search Specific Fields | Not available |
| Fields Indexed | Entire page |
| Other | Relevance marked with a red X |
HotBot |
http://www.HotBot.com |
| Indexes | Web documents, Usenet Newsgroups |
| Boolean |
|
| Truncation | Not available |
| Phrase Searching | Select "phrase search" or enclose phrase in quotes (" ") |
| Search Specific Fields | links, location (domain or region of the world), media type, date. |
| Fields Indexed | Entire page. |
Infoseek |
http://infoseek.com |
| Indexes | Web documents, gopher, FTP, Usenet groups, FAQ, e-mail addresses |
| Boolean |
A pipe (|) narrows the search to a specific aspect of the word to the left of the symbol (e.g., cats|Persians) |
| Truncation | Automatic |
| Phrase Searching | Enclose phrase in quotes (" ") or place hyphens between words. Capitalize names |
| Other | Allows narrowing of query by term searches of retrieved set.
Accepts natural language queries. |
Lycos |
http://lycos.com/ |
| Indexes | Web documents, gopher and ftp sites. |
| Boolean |
A minus sign (-) before a word indicates that a word must NOT appear. Custom search: OR, AND, match # of terms |
| Truncation | Automatic. Use a period at the end of the word for an exact match |
| Phrase Searching | Not available |
| Search Specific Fields | Not available |
| Fields Indexed | Titles, headings, subheadings, first 20 lines, 100 most "weighty" words in document. Stopwords ignored. |
Open Text |
http://index.opentext.net/ |
| Indexes | Web Documents, Usenet groups, email, current events |
| Boolean | Power search: AND, OR, BUT NOT, NEAR, FOLLOWED BY (within 80 words) |
| Truncation | Not available |
| Phrase Searching | Default of simple search; Capitalize proper names |
| Search Specific Fields | Summary, title, first level headings, URL |
| Fields Indexed | Entire page, including stopwords. |
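Three of the operators that recur in the tables above (truncation with an asterisk, phrase searching in quotes, and proximity operators such as NEAR) can be sketched as simple tests on a page's words. This is an illustrative approximation only: the function names are invented, and each engine applies its own rules (the 80-word window, for instance, is the figure quoted for Open Text and is used here merely as a default).

```python
import re


def tokens(text):
    """Split a text into lower-case words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matches_truncation(pattern, text):
    """'immun*' matches any word beginning with 'immun'."""
    stem = pattern.rstrip("*").lower()
    return any(word.startswith(stem) for word in tokens(text))


def matches_phrase(phrase, text):
    """True when the words of the phrase appear together, in order."""
    words, target = tokens(text), tokens(phrase)
    n = len(target)
    return n > 0 and any(
        words[i:i + n] == target for i in range(len(words) - n + 1)
    )


def matches_near(term_a, term_b, text, window=80):
    """True when the two terms occur within `window` words of each other."""
    words = tokens(text)
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(i - j) <= window for i in positions_a for j in positions_b)


page = "Immunization schedules for children with asthma"
print(matches_truncation("immun*", page))
print(matches_phrase("immunization schedules", page))
print(matches_near("immunization", "asthma", page, window=3))
```

Knowing which of these tests an engine supports, and what syntax triggers each one, is much of what it means to become familiar with a search engine.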
Functional comparison [11]
| | Lycos | Excite | Webcrawler | Infoseek | Alta Vista |
| Domain | Web | Web, usenet | Web, gopher | Web, usenet, email | Web, usenet |
| Help | Sufficient | Good | Sufficient | Sufficient | Good |
| Redundancy check | Yes | No | Yes | No | Yes |
| Boolean logic | Limited | Limited | Limited | Limited | Yes |
| Concept based search | No | Yes | No | No | No |
| Ranking | Yes | Yes | Yes | Yes | Yes |
| Proximity search | No | No | Yes | Yes | Yes |
| Advanced search screen | Yes, hard | No | No | No | Yes |
| Multilingual thesaurus | No | No | No | No | No |
| Natural language interface | No | No | No | No | No |
For the searcher who does not necessarily believe in precision, there are meta-search engines. These allow one to search simultaneously using several different engines. The results can be viewed as one large file, or separated by search engine. The advantage of this approach is that obscure documents are not missed. The price is a heavy return rate and a lack of precision. Custom search operators generally won't work with the meta-engines, except the basic Boolean. [12]
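The way a meta-search engine combines results can be sketched as follows. The two "engines" here are hypothetical stand-ins that return canned lists of made-up addresses, and the merging rule (pages found by more engines first, then by best rank) is one plausible choice, not the method used by any of the services listed below.

```python
# Hypothetical stand-ins for real search services; each returns a ranked list.
def engine_one(query):
    return [
        "http://example.com/asthma",
        "http://example.com/kids",
        "http://example.com/clinic",
    ]


def engine_two(query):
    return ["http://example.com/kids", "http://example.com/news"]


ENGINES = {"EngineOne": engine_one, "EngineTwo": engine_two}


def metasearch(query, engines, limit=None):
    """Send one query to every engine; return a merged list and per-engine lists."""
    by_engine = {}
    for name, engine in engines.items():
        hits = engine(query)
        # Optionally cap the number of hits taken from each engine.
        by_engine[name] = hits if limit is None else hits[:limit]

    found_by = {}   # address -> names of the engines that returned it
    best_rank = {}  # address -> best (lowest) position in any engine's list
    for name, hits in by_engine.items():
        for rank, url in enumerate(hits, start=1):
            found_by.setdefault(url, []).append(name)
            best_rank[url] = min(rank, best_rank.get(url, rank))

    # Duplicates are merged; pages found by more engines come first.
    combined = sorted(found_by, key=lambda u: (-len(found_by[u]), best_rank[u], u))
    return combined, by_engine


combined, by_engine = metasearch("asthma treatment", ENGINES)
```

The merged list is longer than either engine's own list, which illustrates both the advantage (fewer missed documents) and the price (more to sift through).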
Savvy Search |
http://www.cs.colostate.edu/~dreiling/smartform.html |
| Indexes | WWW Resources, Software, People, Reference, Commercial, Academic, Technical, Reports, Images, News, Entertainment |
| Search Engines Used | WebCrawler, Lycos, Yahoo (top three used - others available from the bottom of the results page.) |
| Other | Specify: all terms(AND); terms as a phrase; any phrase (OR);
Can limit number of hits per search engine |
MetaCrawler |
http://www.metacrawler.com |
| Indexes | Web Documents |
| Search Engines Used | Lycos, WebCrawler, Excite, Alta Vista, Yahoo, HotBot, Galaxy |
| Other | Sort results by relevance or location
Specialized limiters available under configure |
Highway 61 |
http://www.highway61.com |
| Indexes | Web documents |
| Search Engines Used | Yahoo, Alta Vista, Lycos, WebCrawler, InfoSeek, Excite |
| Other | Boolean search options |
To use a search engine to the greatest effect, there are a few guidelines:
Health Sciences Links worth a look |
A large scale site patterned on Yahoo; focuses on medical information; over 5000 sites
Clinical information on the Web with both a table of contents and an index; utilizes the MeSH hierarchy.
Resources and links for reference and reference, issues and policies
Resources relating to the health industry. Categories cover most areas of medicine and allied health.
Peer-reviewed directory of Internet health resources; a collaborative effort of health science librarians at the CIC institutions (the Big Ten schools plus U. Chicago).
A reasonably comprehensive collection of nursing resources.
This comprehensive health sciences site organizes information by medical specialty.
An outstanding guide to Internet clinical medicine resources. Maintained by SLACK, Inc., which also produces informational journals for pharmaceutical companies.
This is a highly comprehensive collection of biomedical resources on the Internet, developed by Emory University.
Links to specific publications, organized alphabetically by title.
Links to Internet resources for individuals with disability, their caregivers, and other health care professionals as well.
Compiled and maintained by the Oregon Health Sciences University; contains links to major medical indexes and subject guides on the Internet, and also lists Internet resources organized by provider.
Evaluating Web Information |
Information quality is a relative term. It's another case of one man's treasure is another man's trash. What is wonderful information for a ten year old is useless to the doctoral student, and vice versa.
The Mankato, Minnesota homepage and Mankato University are the whimsical creations of Don E. Descy, a professor in Library Media Education. At first glance, this appears to be a valid page about Mankato University. Backtracking through the address reveals that the pages are on the server for Mankato State University. Clearly this is incorrect and humorous information, yet not all information on the Internet will be so clear about its inaccuracy.
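"Backtracking through the address" means trimming a page's URL one level at a time to see which directory, and ultimately which server, it lives on. The sketch below generates those parent addresses with Python's standard urllib.parse module. The sample address is hypothetical and is not the real location of the Mankato pages.

```python
from urllib.parse import urlsplit, urlunsplit


def backtrack(url):
    """Return the parent addresses of a URL, from nearest folder up to the server root."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    ancestors = []
    while segments:
        segments.pop()  # drop the last file or folder name
        path = "/" + "/".join(segments)
        if segments:
            path += "/"
        ancestors.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return ancestors


# Hypothetical address, used only to show the technique.
page = "http://www.example.edu/~jdoe/city/university.html"
parents = backtrack(page)
for address in parents:
    print(address)
print("server:", urlsplit(page).hostname)
```

Visiting each parent address in turn often turns up a personal home page or a departmental index that explains who is really behind a document.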
The Free Internet Encyclopedia publishes this caution: "The vast majority of information accessible through the FREE Internet Encyclopedia is written and maintained by other people. It might be more accurate to call it a Free Internet Encyclopedic Index (but who can pronounce FIEI?). What this means is that those people should receive full credit for the usefulness of their information. It also means that we can't be responsible for the accuracy of their information.
We search for useful information and will attempt to avoid linking to a document that claims (for example) that Smirnoff is the capital of Russia, but among other things, this involves us knowing that Smirnoff is not the capital of Russia, an assumption which is true in this instance but which is a pretty dicey proposition in general. This means that you have to decide on the accuracy and appropriateness of any information you may access here for your own uses and purposes and the responsibility for such decisions is completely your own."
Identifying information quality is an essential aspect of using the Internet for research. Librarians have developed rules for what constitutes "quality information" in a traditional resource. A commonly accepted text is Reference and Information Services by Richard E. Bopp and Linda C. Smith. On page 297 of the second edition (1995, Libraries Unlimited, Englewood, Colorado) they list the following evaluation criteria for reference materials:
Information Quality Checklist
These are questions that users might ask themselves to evaluate a piece of information on the Internet. Of particular importance is the need for outside verification. Does similar information appear elsewhere, outside this one Web site? Although repeating misinformation will not make it into quality information, the inverse (finding a "fact" in only one place) can be a pointer to inaccuracies or untruths in the information.
1. Scope:
2. Audience:
Audience is a key factor in evaluating site information. Information needs to be at a level that the user can understand and assimilate. Information which is too complex or too simple is often useless.
3. Author:
If the author is not a name you are familiar with, there are unique verification tools available using the Web:
4. Authority or publishing body:
As commercial activity has increased on the Web, marketing has also increased. Many "information" sites are thinly disguised marketing or public relations efforts created by interested corporations. Identifying the publishing body can go a long way toward understanding the bias (if any) present in the creation of the site.
5. Currency:
Currency is of vital importance in science, medicine and similar fields. It might not be so important in languages or some of the other social sciences.
6. Treatment:
7. Arrangement/ Ease of Use:
The above list is extensive and best suited for a fairly advanced user. The questions will not be applicable to all users in all settings and should be modified to reflect the audience.
The World Wide Web is an extremely dynamic entity. In existence for only a few years, it has substantially affected our society, changing many of our traditional notions about information access and information use. For librarians, the challenges have just begun. These are challenges which are going to evolve over the next few years, as the Web itself evolves. One challenge will be to teach the patron how to effectively locate the relevant resources on the Internet; how to conduct a thorough search. While patrons' searches may be nowhere near as effective as those of a trained MEDLINE searcher, they will be conducting the searches now that the tools are in their hands. It will be a growing responsibility of the librarian to teach adequate Web skills. Probably the greatest challenge will be to instill in the patron the need to question information: to look at it and really evaluate it. In the past, librarians, editors and publishers have sheltered the public from dubious information through the selection and evaluation process. Now we are going to have to become teachers, to train patrons to become savvy information users themselves.
References and Links |
Links to Information Retrieval:
Spinning a Web Search by Mark Lager. A good introductory paper explaining the concepts of Information Retrieval, followed by an evaluation of the major search engines.
Beyond Cool: Analog Models for Reviewing Digital Resources by James Rettig, in the September/October 1996 issue of Online. This article contains an overview of web information indexing and retrieval issues. The conclusion is Rettig's list of information quality evaluation criteria. This list looks at traditional criteria, and then tries to adapt them to Internet information. Written in early 1996, it looks at the glib criteria for web site evaluations used by such indexes as Point. Rettig proposes a scholarly approach to site information evaluation.
Evaluating Quality on the Net by Hope N. Tillman, Director of Libraries, Babson College, Babson Park, MA. This is an "evolving paper", which is an information quality issue endemic on the Internet. The information was first presented in September of 1995, then revised and re-presented in February of 1996. This version was presented in February of 1997.
Critical Evaluation Surveys. Kathy Schrock has designed web evaluation tools for students at the elementary, middle and high school levels.
Teaching Critical Evaluation Skills for World Wide Web Resources, Widener University. Links to Web evaluation checklists in several disciplines, examples of excellence, and a bibliography of Web evaluation materials. Useful for information, such as business information, which is not purely scholarly.
Information Quality; WWW Virtual Library. A good index to papers on information quality issues, but some resources are a bit dated (from 1995).
Anyone can (and probably will) put up anything on the Internet; a list of reference evaluation questions adapted from the book, The Savvy Student's Guide to Library Research by Judy Pask, Roberta Kramer, Scott Mandernack.
Evaluating World Wide Web Information; an evaluation checklist form designed for library users. Purdue University Libraries.
Bibliography |
Send mail to:
fents@ils.unc.edu
Last updated: May 27, 1997