SILS, U. of North
Carolina, Chapel Hill
INLS-509
(Old INLS-172) -- Information Retrieval
Bob Losee
Manning
302
962-7150
losee
at unc dot edu
Fall 2007
Brief Description:
An introductory survey of information
filtering and retrieval, with an emphasis on developing the student's
understanding of the relationship between the algorithms used by search
engines, the query and document, and system performance. This is an
information science course, not an information technology course. The course
is required for students in the School’s Master’s in Information Science
program and will emphasize basic knowledge useful for those who will be in leadership
positions in a wide range of information professions.
Course WWW links:
http://InformationRetrieval.US
(if you forget, there is a link from my home
page)
Course
Outline
Readings below are required except
for those preceded by an asterisk (*). Note that students are never expected
to absorb all the material or understand all the mathematics in
the articles.
Introduction:
Retrieval and Filtering
Losee, Lecture Notes
(available in bookstore), Chapter 1.
Sparck-Jones and Willett, Readings in Information Retrieval ("RIR" below), Morgan
Kaufmann Publishers, 1997. Chapter 1.
* Baeza-Yates and
Ribeiro-Neto, Chapters 4, 10
* Case, Donald, Looking
for Information: A Survey of Research on Information Seeking, Needs, and
Behavior, Academic Press, 2002.
* Sugar, “User-centered
Perspectives of Information Retrieval Research and Analysis Methods,” Annual
Review of Information Science and Technology, 1995, 77-109.
Probability
Losee, Lecture Notes,
Chapter 2.
Students may wish to
consult one or more of the "management science" books in the UNC
libraries.
Indexing,
Document, and Media Representation
Losee, Lecture Notes,
Chapter 3
RIR, Chapter 2, articles
by Joyce and Needham (p. 15); Luhn (p. 21); Doyle (p. 25); Cleverdon (p. 47);
Salton and Lesk (p. 60.)
* Iivonen and Sonnenwald,
“From Translation to Navigation of Different Discourses: a Model of Search Term
Selection during the Pre-online Stage of the Search Process,” Journal of
the American Society for Information Science, 49 (Apr. 1 '98), 312-26.
* Svenonius, "Access
to Nonbook Materials: The Limits of Subject Indexing for Visual and Aural
Languages," Journal of the American Society for Information Science,
45(8) Sept. 94, 600-606.
* Salton and McGill, Introduction
to Modern Information Retrieval, McGraw-Hill, 1983, Chapter 3.
* Salton, Automatic
Text Processing, Addison-Wesley, 1989, Chapter 9.
Retrieval
Performance
RIR, Chapter 3, article by
Saracevic (p. 143.)
RIR, Chapter 4, articles
by Saracevic, Kantor, Chamis, and Trivison (p. 175); Cooper (p. 191);
Tague-Sutcliffe (p. 205); Keen (p. 217.)
* Baeza-Yates and
Ribeiro-Neto, Chapter 3.
Losee, Lecture Notes,
Chapter 4.
* Losee, Lecture Notes,
Chapter 6.
* Van Rijsbergen, Information
Retrieval, 2nd ed., Butterworths, 1979, Chapter 7.
Similarity
and Retrieval Decisions
RIR, Chapter 5, articles
by Cooper (p. 265); Belkin, Oddy, and Brooks (p. 299.)
RIR, Chapter
6, articles by Salton and Buckley (p. 355); Croft and Harper (p. 339.)
RIR, Chapter
7, article by Tenopir and Cahn (p. 446.)
Losee, Lecture Notes,
Chapter 5
* Van
Rijsbergen, Chapters 5 & 6.
Relationships
between Terms, Natural Language Processing
Losee, Lecture Notes,
Chapters 8, 9, 11.
RIR, Chapter 5, article by
Turtle and Croft (p. 287.)
RIR, Chapter 6, article by
Porter (p. 313.)
RIR, Chapter 8, articles
by Salton, Allan, Buckley and Singhal (p. 478); Rau (p. 527); Johnson, Paice,
Black, and Neal (p. 538.)
* Chowdhury, “Natural
Language Processing,” in Annual Review of Information Science and Technology,
2003.
Rule Based
and Logical Systems
Losee, Lecture Notes,
Chapter 10.
* Forsyth and Rada, Machine
Learning: Applications in Expert Systems and Information Retrieval, Wiley,
1986, Chapters 6-14.
Coding and
Compression
* Salton,
1989, Chapters 5 & 6.
* Losee, Science
of Information, 1990, Chapter 2.
Course Evaluation:
Quality of class participation 40%
Critiques of readings 30%
Other homework 30%
Critiques of Readings:
For some articles listed on the course
schedule, students are expected to write a critique of the article of 5
to 10 sentences in length (maximum ¾ page single spaced, 1 page double spaced) and
hand in the critique (on paper, not via email, and use serif fonts for the body
of the text) by the beginning of the class on the due date listed on the schedule. The critiques
should be constructive, emphasizing ways that the research could be
improved or expanded, and might include questions that arose as you read the
article whose answers would be useful, possible research questions that could
be turned into (and are focused enough and small enough to be) SILS Master’s
papers, along with methodologies for addressing these questions. Do not
criticize the author’s writing style or the choice of topic; emphasize how you
personally might be able to expand on the article. The one lowest critique
grade will be dropped, to cover “bad days,” critiques that don’t get handed in
on time, or sickness.
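The weights listed under Course Evaluation and the drop-the-lowest rule for critiques can be combined as in the minimal sketch below. It assumes, purely for illustration, that each component is scored on a 0 to 100 scale; the syllabus does not state a numeric scale, and the function names are hypothetical.

```python
# Illustrative sketch only: the 0-100 scale and these function names are
# assumptions, not part of the syllabus.

def critique_average(scores):
    """Mean of the critique scores after dropping the single lowest one."""
    if len(scores) < 2:
        raise ValueError("need at least two critique scores to drop one")
    kept = sorted(scores)[1:]  # discard the one lowest score
    return sum(kept) / len(kept)


def course_score(participation, critiques, homework):
    """Weighted total: 40% participation, 30% critiques, 30% other homework."""
    return (0.40 * participation
            + 0.30 * critique_average(critiques)
            + 0.30 * homework)


if __name__ == "__main__":
    # Hypothetical scores: the critique scored 60 is dropped before averaging.
    print(course_score(90, [80, 60, 100], 85))
```

A missed or late critique would enter the list as a low score and be the one dropped, which is how the rule covers a single bad day.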
End of the Semester Proposal
By noon December 7, the Friday
of the last week of class, students will hand in a printed research proposal
based upon one of the critiques they wrote during the semester. The proposal
should contain a clearly stated research hypothesis in the first paragraph.
This research proposal should be 3 to 6 pages, single spaced, and should
include enough discussion of the related literature to ground each aspect
or component of your research hypothesis and to show how the hypothesis would
be tested in the context of the research literature.
Information
Retrieval Leadership Proposals:
Each student will develop
three Information Retrieval Leadership Proposals. The Leadership Proposal
areas (due dates for printed proposals are on the class schedule) are
Proposal 1: Expressions
of information needs as queries by individuals or groups; query languages;
means for eliciting information needs.
Proposal 2: Univariate (statistically independent) feature, document, and
query matching and similarities, assuming term independence; indexing (as
viewed from retrieval).
Proposal 3: Multivariate similarity or matching systems; multivariate
reasoning systems; natural language processing.
The proposals are due at
the start of class on the day indicated. Each proposal should be a total of 2
to 4 pages, single spaced. Do not use a sans serif font; these fonts (e.g.
Helvetica or Arial) are designed for headlines and captions, not the body of
text in a paper. As the title for each paper, state clearly what question you
are asking, formulated as an English language question with a question
mark at the end. The proposal should address the nature of the problem, a
discussion of how results and theory in the literature "support" the
problem, methodology, the kinds of results you expect to find, and the usefulness
of the answer to your question. The question and its answer should address
issues bigger than those found at one site, one system, or one language; the most
useful questions are generic questions that are of the form “is X better than
Y?” Select a question whose answer would make you a leader in IR by suggesting
ways people should make decisions differently or better. Descriptive studies
are acceptable but always considered less useful than constructive studies that
make concrete recommendations. The focus of each proposal needs to be on a
question closely related to the topic for the date, with other information
retrieval system considerations being secondary. Grading will be based upon
how well the proposal addresses the question related to the topic, the
usefulness of the proposed analysis, how feasible answering the question is as
a student 3-credit project or master's paper, and the quality of the proposed
methodology for answering the question. Proposing a small project that leads to
definite knowledge and possible improvement of practice is always better than a
larger project which just amasses data but doesn’t lead to much understanding
and the improvement of practice.
For the first proposal,
your question should not discuss or evaluate a particular information system or
information resource, or the use of a system or systems by users. Propose a
study of information needs independent of how the need might be satisfied
or how searching for an answer takes place. You may look at information use,
but only as a way to study the focus of this proposal, information need. You
might want to think about psychological studies of individuals, to learn how
needs are formulated, felt, or expressed, or you might wish to focus on a
particular functional group and their particularly different needs or
expressions of needs. If you start writing about how a system serves people or
how people search for information, stop.
For the second proposal,
your question should address matters associated with individual terms, either
in the area of indexing or retrieval. You can address multiple term systems;
however, the terms should be treated as independent of each other (as they are in most
of the retrieval models discussed up to this point in the course).
For the third proposal,
your question should explicitly address systems using the relationships that
exist between document features and consider how this would impact retrieval
performance. Methods of looking at these relationships might include
statistical dependencies, multivariate machine learning techniques, linguistic
(syntactic or semantic) information, or a logical system based on a thesaurus.
Warning: Don’t write on
a topic. You should be writing to show how the methodology will answer the
question you provide. If your methodology won’t provide a definitive (or at
least solid) answer to the question, the question may be too broad and might be
narrowed further. Doing a good job on a professionally relevant but narrow
question is always better than a much weaker answer to a broader question.
Each question-answer combination should show how to lead the field of
information retrieval.
Each student is expected to
conduct a small research project and write up the project in a paper of 4 to 10
pages of text, single spaced, to be handed in on paper. You may use any widely
accepted paper style (e.g., Chicago, APA, MLA). The project should begin with
a question whose answer would be of value to the information retrieval
community. The question is best phrased in the form “Is X better than Y for
Z?” rather than “How and why does Z work?” or “How does X impact Z?” There
should be a brief discussion of the literature addressing areas around the
question, possibly citing 3 to 6 related articles. The question should be
clearly stated in the paper and the paper should focus on answering this
question by drawing conclusions based primarily on the data collected and
analyzed. The research should involve either the manual or automated analysis
of data to be gathered by the student (not from the literature), and it may be
either quantitative or qualitative. Studies must focus on more than one system
(or multiple distributed systems) or more than one user; the focus should be on
knowledge and techniques applicable to a wide range of systems and/or users.
Do not base your data analysis primarily on published data. Implementing a
system or software, or planning to implement such a system, is not acceptable
as the course project; you may wish to perform a study to gain knowledge that
might help outside the course to develop a system, or you might use software
you have developed to test out a hypothesis. The paper should describe and
analyze the results, with an emphasis on interpretation (“why”) leading to an
understanding of the results. Insight into the strengths and weaknesses of the
different techniques or situations is more important than raw performance
improvement. The last paragraph of the paper should contain specific
recommendations for professional practice, as well as summaries of the reasons
for these recommendations.
Criteria for Leadership
Proposals (and Class Participation) Evaluation
This is a required course for the SILS Master’s degree in
Information Science. You are here to learn, not to worry. Anyone who puts in
a reasonable effort should expect to pass the course.
An H paper includes a question whose answer will improve
the operation of more than one information retrieval system. The paper should
include strong reasons for considering the problem important to ILS
professionals, a brief literature review, and a methods section, as well as a
clear explanation or argument about why these results occurred. The
question to be answered should be topically similar to those questions
addressed in journals such as JASIS and IP&M. An H
course grade indicates clear excellence and leadership in the course.
A P paper is a good solid piece of work, at the normal
graduate level, that may be less effective in explaining why the question’s
answer would be useful or in connecting it to central issues in the field; or
it may lack references to relevant literature; or it may lack an obvious
connection between the question and the methods to be used; or it may not
describe the question or the methodology precisely; or it may overlook some
minor methodological problems or fail to discuss or resolve them
satisfactorily. There may be little explanation about why these particular
results occurred. P is the most commonly awarded course grade in
graduate level courses such as this.
An L paper may fail to explain the utility of the research
or it may fail to connect the question to the methods to be used or the
different aspects of methods to each other. Major methodological problems may
have been overlooked. There may be little or no understanding provided as to
the cause of the results.
An F paper is lacking a required element (the question,
relevant literature, research site and/or sources and/or subjects, data
collection and analysis). Any plagiarism or other violation of the Honor Code
will also result in an F and the likelihood of further action.
Each student will develop
three informal IR Leadership proposals. The Leadership proposals areas and due
dates (late proposals penalized!) are
Wed. Oct. 11 Individual
users' information needs, expressions of needs as queries.
Wed. Nov. 8 Univariate feature matching and term independence, indexing.
Wed. Dec. 13 Multivariate systems, reasoning systems, natural language
processing.
The first 2 proposals are
due at the start of class on the day indicated, and the last proposal is due at
noon. Each proposal should be a total of 2 to 4 pages, single spaced. State
clearly what question you are asking, formulated as an English language question
with a question mark at the end. The proposal should address the nature of the
problem, a discussion of how results and theory in the literature
"support" the problem, methodology, the kinds of results you expect
to find, and the importance of your question and approach. The focus of each
proposal needs to be on a question closely related to the topic for the date,
with other information retrieval system considerations being secondary.
Grading will be based upon how well the proposal addresses the question related
to the topic, the usefulness of the proposed research, its feasibility as a
student 3 credit project or master's paper, and the quality of the proposed
methodology. Proposing a small project that leads to definite knowledge and
possible improvement of practice is always better than a larger project which
just amasses data but doesn’t lead to much understanding or the improvement of
practice.
For the first proposal,
your question should not discuss or evaluate a particular information system or
information resource. Propose a study of information needs independent of how
the need might be satisfied or how searching for an answer takes place. You
might want to think about psychological studies of individuals, to learn how
needs are formulated, felt, or expressed, or you might wish to focus on a
particular functional group and their particularly different needs or
expressions of needs. If you start writing about how a system serves people,
stop.
For the second proposal,
your question should address matters associated with individual terms, either
in the area of indexing or retrieval. You can address multiple term systems;
however, the terms should be treated as independent of each other (as they are in most
of the retrieval models discussed up to this point in the course).
For the third proposal,
your question should explicitly address systems using the relationships that
exist between document features and consider how this would impact retrieval
performance. Methods of looking at these relationships might include
statistical dependencies, linguistic (syntactic or semantic) information, or a
logical system based on a thesaurus.
Warning: Don’t write on
a topic. You should be writing to show how the methodology will answer the
question you provide. If your methodology won’t provide a definitive (or at
least solid) answer to the question, the question may be too broad and might be
narrowed further. Doing a good job on a professionally relevant but narrow
question is always better than a much weaker answer to a broader question.
Sources of Information
on Information Filtering & Retrieval
Serials:
The major serials covering IR include Information
Processing and Management (formerly Information Storage and Retrieval),
Journal of the American Society for Information Science and Technology (formerly
JASIS and before that American Documentation), Journal of
Documentation, IEEE Trans on Pattern Analysis and Machine Intelligence,
IEEE Trans on Data and Knowledge Engineering, ACM Transactions on
Information Systems, and Information Retrieval.
Conference
Proceedings:
The ACM Special Interest
Group in Information Retrieval (SIGIR) has held annual conferences since 1980.
The conference is usually held in North America in odd years and outside North America in even years. Some European conferences have been published as
"books." Most of the ACM SIGIR conference proceedings are in the
ACM Digital Library and can be accessed through the library web page.
Monographs:
(** Best works or classics marked with
asterisks)
Baldi and Brunak, Bioinformatics:
The Machine Learning Approach, MIT, 2001.
Baldi, Frasconi, and Smyth, Modeling
the Internet and the Web, Wiley, 2003.
Baeza-Yates and Ribeiro-Neto, Modern
Information Retrieval, Addison Wesley, 1999.
Case, Donald, Looking for
Information: A Survey of Research on Information Seeking, Needs, and Behavior,
Academic Press, 2002.
Chen, Li, and Wang, Machine Learning
and Statistical Modeling Approaches to Image Retrieval, Kluwer, 2004.
Chu,
Heting, Information Representation and Retrieval in the Digital Age,
ASIS, 2003.
Feldman, R. and Sanger, J. The Text
Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge U. Press, 2006.
** Foskett, A. C., The Subject
Approach to Information, London, Lib. Assoc. Publ, 1996.
Forsyth and Rada, Machine Learning:
Applications in Expert Systems and Information Retrieval, Wiley, 1986.
** Frakes and Baeza-Yates, eds., Information
Retrieval: Data Structures & Algorithms, Prentice Hall, 1992.
Frants, Shapiro, and Voiskunskii, Automated
Information Retrieval, Academic Press, 1997.
Grossman and Frieder, Information
Retrieval: Algorithms and Heuristics, Second edition, Springer-Verlag,
2004.
Grefenstette, Cross-Language
Information Retrieval, Kluwer, 1998.
Korfhage, Information Storage and
Retrieval, Wiley, 1997.
Kowalski and Maybury, Information
Storage and Retrieval Systems, Kluwer, 2000.
Langville and Meyer, Google’s
PageRank and Beyond: The Science of Search Engine Rankings, Princeton, 2006.
Losee, Text Retrieval and Filtering,
Kluwer, 1998.
Manning, Raghavan, and Schütze, Introduction
to Information Retrieval, Cambridge, 2008.
** Manning and Schütze, Foundations of
Statistical Natural Language Processing, MIT Press, 1999.
Maybury, M., Ed., Intelligent
Multimedia Information Retrieval, AAAI/MIT Press, 1997.
Salton, Automatic Text Processing,
Addison-Wesley, 1989.
** Salton and McGill, Introduction
to Modern Information Retrieval, McGraw-Hill, 1983.
Sparck Jones and Willett, Readings in Information
Retrieval, Morgan Kaufmann Publishers, 1997.
Van Rijsbergen, Geometry of
Information Retrieval, Cambridge, 2004.
** Van Rijsbergen, Information
Retrieval, Second Edition, Butterworths, 1979.
Wu, Xiong, and Shekhar, Clustering
and Information Retrieval, Kluwer, 2004.
Honor Code:
Students should familiarize themselves
with the University of North Carolina at Chapel Hill Honor Code that is
described in University publications. It should be noted that in this course,
students are expected to receive (and provide) some assistance regarding the
use of hardware and software in the laboratories and general problem solving
techniques for homework assignments. Students should NOT receive (or provide)
major creative assistance or continuing minor support for projects.
Plagiarism:
Student assignments that contain more than 5 consecutive words that the
instructor feels were taken from another source without proper attribution
(without proper quotation marks and citations) will definitely be referred
to the appropriate administrative authorities who address issues of
academic integrity (e.g., the Honor Court). I assume that all students are
equally likely to be honest and will put an equal amount of effort into
considering the possibility of plagiarism for each student’s paper.
Classroom Behavior:
Separate from the Honor Code but
related to respect for classmates is classroom behavior, which will be a factor
in your class participation grade. Students are expected to behave in a
professional manner in class. Students in class are expected to focus on
classroom materials. Students are expected to avoid student-to-student
conversations during class. Use of laptop computers should be limited to
taking notes for class and to using class related materials. Similarly,
materials being read should be limited to those appropriate for the classroom
lecture or discussion. Students who appear to be involved in non-class related
activities during class time will be graded as not participating in class. Cellular
telephones and computers should have speakers or other audio devices muted
before class begins so as to not disturb others.