Retrieval in Digital Libraries
What is Retrieved?
Well-structured data
  Strings, values
  DBMS
Text documents
  Surrogates
  Full text
Other media
  Images, audio, video, statistics, code
  Multimedia/mixed media
Answers
What are the ‘Sources of Evidence’ for Retrieval?
Words
  Author’s words
  Indexer’s words (metadata from a controlled vocabulary)
Actions
  Human (e.g., publisher metadata, signatures)
  System (e.g., date, watermarks)
Links (citations)
  Human (citations, links)
  System (computed from events or relationships, recommenders)
Retrieval Processes (how are needs expressed?)
Content-Centered: Retrieval as Matching Document Representations to Query Representations
User-Centered: Information-Seeking Process
Text Retrieval
1. Text retrieval is more complex than data retrieval from a DBMS.
2. Distinguish searching for word matches from searching for concept matches.
3. Distinguish subject search from keyword search:
  Subject --> Search on a controlled vocabulary (e.g., LC subject headings). The results point to documents.
  Keyword --> Search all words in particular fields/text fragments. The results point to documents.
4. Distinguish exact-match retrieval from partial-match (ranked) retrieval.
5. Distinguish ASCII/Unicode objects from bit maps.
Approaches to Text Retrieval
1. Surrogate Search: Search a set of predefined words that point to related documents. Requires indexing via some controlled vocabulary.
  pros: natural transition from paper systems
  cons: limited access; human indexing required
2. Full-Text Search: Search every word in every document.
  pros: broaden access; possible to automate indexing
  cons: words rather than concepts
3. Knowledge-Based Search: Search a set of concepts that are related to concepts in documents.
  pros: improved retrieval
  cons: computationally expensive; theoretical at present
Full-Text Search
Full-Text Search:
  Search every word (or variant) in the document except stop words.
  Use stemming?
Methods:
  Text Scanning
  Signatures
  Indexes (inverted files)
  Vectors (term-document matrix)
  Linkages (link analysis)
  Recommendations (explicit or implicit)
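The stop-word and stemming choices on the Full-Text Search slide can be made concrete with a small sketch. Everything here is an assumption for illustration: the stop list is deliberately tiny, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's, not an implementation of one.

```python
import re

# Assumed, deliberately tiny stop list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

def naive_stem(word):
    """Strip a few common English suffixes (a stand-in, not the Porter stemmer)."""
    for suffix in ("ing", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text, stem=False):
    """Lowercase the text, split it into words, drop stop words, optionally stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    terms = [w for w in words if w not in STOP_WORDS]
    return [naive_stem(t) for t in terms] if stem else terms

print(index_terms("Searching the indexes and documents"))
print(index_terms("Searching the indexes and documents", stem=True))
```

With stemming on, a query for "index" can match a document that only says "indexes", at the cost of occasionally conflating unrelated words.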
Inverted File
Assumption: related objects use the same words
Words point to word number, offset, surrogate, or document:
  aardvark  *Doc 3, Doc 7, Doc 45, Doc 67, ...
  abacus    Doc 2, Doc 16, Doc 33, Doc 45, Doc 67, ...
  .
  .
  .
  zygote    Doc 7, Doc 33, Doc 67, Doc 123, ...
Find all documents for each query word, then apply logical operators to combine the lists
Query either matches or does not match
* actually Doc 3, Para 5, Word 45
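A minimal sketch of an inverted file with Boolean combination, assuming whitespace tokenization and document-level postings (no paragraph or word offsets). The document texts are invented so that the postings loosely echo the aardvark/abacus/zygote example; the function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

def boolean_and(index, *words):
    """Documents that contain every query word."""
    sets = [set(index.get(w, [])) for w in words]
    return sorted(set.intersection(*sets)) if sets else []

def boolean_or(index, *words):
    """Documents that contain at least one query word."""
    found = set()
    for w in words:
        found |= set(index.get(w, []))
    return sorted(found)

# Toy collection (assumed contents).
docs = {
    3: "aardvark",
    7: "aardvark zygote",
    33: "abacus zygote",
    67: "aardvark abacus zygote",
}
index = build_inverted_index(docs)
print(index["zygote"])
print(boolean_and(index, "aardvark", "zygote"))
print(boolean_or(index, "aardvark", "abacus"))
```

Each document is either in the result list or not, which is the exact-match behavior the slide describes; no ranking falls out of this structure by itself.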
Vectors
Each document (or surrogate) is represented by a vector defined by every word in the collection.
  Doc 1    0 0 1 1 0 0 ..... 0
  Doc 2    0 0 0 0 1 1 ..... 0
  .
  Doc 7    1 0 0 1 0 0 ..... 1  (has aardvark and zygote)
  .
  Doc 33   0 1 0 0 0 0 ..... 1  (has abacus and zygote)
  .
  Doc 67   1 1 0 0 0 0 ..... 1  (has aardvark, abacus and zygote)
  .
  Doc N
Queries are expressed as vectors and matched to document vectors. Degrees of matching are possible.
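To show what "degrees of matching" can look like, here is a sketch that ranks binary document vectors against a query vector by cosine similarity. Cosine is one common choice and is an assumption here, as is the three-word vocabulary; the slides do not name a particular matching function.

```python
import math

VOCAB = ["aardvark", "abacus", "zygote"]  # assumed three-word vocabulary

def to_vector(words):
    """Binary vector: 1 where the vocabulary word is present, else 0."""
    present = set(words)
    return [1 if term in present else 0 for term in VOCAB]

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = {
    7: to_vector(["aardvark", "zygote"]),
    33: to_vector(["abacus", "zygote"]),
    67: to_vector(["aardvark", "abacus", "zygote"]),
}

def rank(query_words):
    """Return (doc_id, score) pairs, best match first."""
    q = to_vector(query_words)
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

for doc_id, score in rank(["aardvark", "zygote"]):
    print(doc_id, round(score, 3))
```

Unlike the Boolean case, Doc 33 still receives a nonzero score for the query "aardvark zygote" because it shares one of the two terms.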
Latent Semantic Indexing
Like the vector model, uses a document-term matrix.
Apply singular value decomposition (SVD) to produce a set of ranked singular values and their associated singular vectors. These represent abstract concepts in the document space.
Select the top singular values (e.g., first 200) and apply them to query-document matching (see Efron dissertation)
Retrieves some documents that may not use the query term
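The last point, retrieving a document that never uses the query term, can be seen in a toy example. This is a sketch under stated assumptions: NumPy is available, the matrix is laid out terms-by-documents (the transpose of a document-term matrix), the five terms and three documents are invented, and k=2 stands in for the "first 200" dimensions mentioned above.

```python
import numpy as np

TERMS = ["car", "automobile", "engine", "flower", "petal"]
# Rows are terms, columns are documents (toy data, not from the slides).
A = np.array([
    [1, 0, 0],  # car
    [0, 1, 0],  # automobile
    [1, 1, 0],  # engine
    [0, 0, 1],  # flower
    [0, 0, 1],  # petal
], dtype=float)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def lsi_similarities(matrix, query, k):
    """Project documents and query onto the top-k left singular vectors, then compare."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    Uk = U[:, :k]              # term-to-concept map for the k largest singular values
    docs_k = Uk.T @ matrix     # documents in concept space, shape (k, n_docs)
    q_k = Uk.T @ query         # query in concept space
    return [cosine(q_k, docs_k[:, j]) for j in range(matrix.shape[1])]

query = np.array([1, 0, 0, 0, 0], dtype=float)  # the single term "car"
raw = [cosine(query, A[:, j]) for j in range(A.shape[1])]
lsi = lsi_similarities(A, query, k=2)
print("raw term matching:", [round(x, 3) for x in raw])
print("concept matching: ", [round(x, 3) for x in lsi])
```

The second document contains "automobile" and "engine" but not "car", so plain term matching scores it zero; in the reduced space it lands on the same concept as the first document because both share "engine".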
Link Analysis: Citations (relevance based on author judgments)
Some citation assumptions
  If A cites (is linked to) B, then it is more likely that A is related to B than to an arbitrary C.
  If A and B are cited by (linked from) C, then A and B are more likely to be related to each other than A or B is to an arbitrary D (co-citation).
  If more objects cite A than cite B, then A is more ‘valued’ than B (citation value).
  If highly ‘valued’ object A cites C and less ‘valued’ object B cites C, then A’s citation is more valuable.
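Two of these assumptions, citation value and co-citation, reduce to simple counting. The sketch below uses a hypothetical citation graph and hypothetical function names; it covers only raw counts, not the recursive weighting implied by the last assumption.

```python
from collections import Counter
from itertools import combinations

# Hypothetical citation graph: each key cites the objects in its list.
cites = {
    "C": ["A", "B"],
    "D": ["A", "B"],
    "E": ["A"],
}

def citation_counts(graph):
    """How many distinct objects cite each target (citation value)."""
    return Counter(t for targets in graph.values() for t in set(targets))

def cocitation_counts(graph):
    """How many objects cite both members of each pair (co-citation)."""
    pairs = Counter()
    for targets in graph.values():
        for pair in combinations(sorted(set(targets)), 2):
            pairs[pair] += 1
    return pairs

print(citation_counts(cites))
print(cocitation_counts(cites))
```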
Link Analysis
Assumption: related objects are linked
  A→B or B→A ⇒ A~B
  A→B, C→B ⇒ A~C
In links, out links
  Hubs (lots of out links)
  Authorities (lots of in links)
Au→B more important than A→B
Link parameters
‘In links’ (aka backlinks) are citations to the object; ‘out links’ are references to other objects (How to incorporate these distinctions?)
Link distance (number of hops? how to dampen? when to stop?)
Link traversal (number of times selected?)
Text window for a link (How much of the text around a link to consider in algorithms?)
Algorithms
HyperSearch: use links to get text from related objects and enrich the text models
PageRank: uses only ‘in links’ and the link matrix of the entire web. Links are weighted recursively, so a link from an object with many ‘in links’ counts for more. (Google)
Hyperlink-Induced Topic Search (HITS): uses both in and out links. Recursively defines ‘hubs’ (objects that point to many good authorities) and ‘authorities’ (objects that have many good hubs pointing to them). Uses a portion of the web. (Clever)
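The recursive definitions in PageRank and HITS are usually computed by repeated iteration. The following is a simplified sketch of both, not the production algorithms: the five-node graph is invented, the damping factor of 0.85 and the fixed iteration count are assumptions, and rank held by pages without out links is spread evenly over all pages.

```python
import math

def all_nodes(links):
    return sorted(set(links) | {t for targets in links.values() for t in targets})

def pagerank(links, damping=0.85, iterations=50):
    """Iteratively pass each page's rank along its out links."""
    nodes = all_nodes(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no out links is shared by every page.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

def hits(links, iterations=50):
    """Alternate authority and hub updates, normalizing after each round."""
    nodes = all_nodes(links)
    hub = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking in.
        auth = {node: 0.0 for node in nodes}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub score: sum of authority scores of pages linked to.
        hub = {node: sum(auth[t] for t in links.get(node, [])) for node in nodes}
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {node: v / a_norm for node, v in auth.items()}
        hub = {node: v / h_norm for node, v in hub.items()}
    return hub, auth

# Hypothetical graph: H1 and H2 link to both A and B; X links only to A.
links = {"H1": ["A", "B"], "H2": ["A", "B"], "X": ["A"]}
rank = pagerank(links)
hub, auth = hits(links)
print(sorted(rank, key=rank.get, reverse=True))
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this graph A comes out ahead of B under both schemes because it has one more in link, and H1 and H2 score as better hubs than X because they point to both authorities.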
Combining Multiple Sources of Evidence
Text analysis + Indexing + Link analysis
  Kiduk Yang’s dissertation work
Add alternative document parameters
Adaptable fusion based on context?
Adaptable fusion based on user feedback and active engagement?
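One simple way to picture fusion is a weighted sum of normalized scores from each evidence source, with the weights as the knob that context or user feedback could adjust. This is a generic illustration and is not the method from the dissertation mentioned above; the function names, scores, and weights are all hypothetical.

```python
def min_max(scores):
    """Rescale one source's scores to the range 0..1 so sources are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(sources, weights):
    """Rank documents by the weighted sum of their normalized scores."""
    fused = {}
    for name, scores in sources.items():
        for doc, s in min_max(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weights[name] * s
    return sorted(fused, key=fused.get, reverse=True)

# Made-up scores from a text matcher and a link-based scorer.
sources = {
    "text": {"d1": 2.0, "d2": 1.0, "d3": 0.0},
    "links": {"d1": 0.0, "d2": 10.0, "d3": 5.0},
}
print(fuse(sources, {"text": 0.5, "links": 0.5}))
print(fuse(sources, {"text": 0.9, "links": 0.1}))
```

Shifting weight toward the text source changes which document ranks first, which is the kind of adaptation the questions on this slide point at.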
Document Alternatives
Paragraphs, passages
SGML/HTML/XML codes
‘Shape’ of text
Related problems:
  text summarization/auto abstracting
  auto categorization
  question answering
Multimedia: Features for Indexing
Linguistic surrogates
Images
  color, texture, luminosity, shape
Video
  same as stills but add motion (e.g., optical flow)
Sound
  speaker attributes, pitch, duration
Example: Color Histogram
Each pixel has color ‘depth’ (e.g., 16 bits)
Divide image into regions (e.g., 8x8 pixels)
Create a histogram for each region (amount of red, cyan, etc.)
The set of histograms serves as a quantitative representation for the image, allowing comparisons and rankings
Querying is awkward (use QBE, query by example)
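The steps above can be sketched in plain Python. The assumptions: an image is a list of rows of (r, g, b) tuples with 0-255 channels, each channel is quantized to two levels (giving 8 color bins), and images are compared by summing absolute bin differences. The function names and the tiny single-color test images are hypothetical.

```python
def region_histograms(image, region=8, levels=2):
    """Return one color histogram per region x region block of the image."""
    height, width = len(image), len(image[0])
    histograms = []
    for top in range(0, height, region):
        for left in range(0, width, region):
            hist = [0] * (levels ** 3)
            for row in image[top:top + region]:
                for r, g, b in row[left:left + region]:
                    # Quantize each channel to `levels` values, then combine into one bin.
                    rq, gq, bq = (c * levels // 256 for c in (r, g, b))
                    hist[rq * levels * levels + gq * levels + bq] += 1
            histograms.append(hist)
    return histograms

def distance(hists_a, hists_b):
    """Sum of absolute bin differences; assumes both images have the same size."""
    return sum(abs(x - y)
               for ha, hb in zip(hists_a, hists_b)
               for x, y in zip(ha, hb))

RED, CYAN = (255, 0, 0), (0, 255, 255)
red = [[RED] * 8 for _ in range(8)]
cyan = [[CYAN] * 8 for _ in range(8)]
half = [[RED] * 8 for _ in range(4)] + [[CYAN] * 8 for _ in range(4)]

# Query by example: rank candidate images by distance to the all-red example.
query = region_histograms(red)
for name, img in (("half", half), ("cyan", cyan)):
    print(name, distance(query, region_histograms(img)))
```

Supplying an example image sidesteps the awkwardness of typing a histogram as a query: the system computes the histograms for the example and ranks the collection by distance to it.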
Information-Seeking Process Revisited
Interactive Systems: Agile Views
Overviews
Previews
Shared views
History views
Dynamic queries
Interplay between analytical search and interactive browsing
Digital Library IR Challenges
Across collections
Object granularities (image, collection, finding aid, etc.)?
‘Sub’ controlled vocabularies? (or metadata)
Multimedia features?
Interoperation of metadata?
Diverse user communities (known and unexpected)
Links outside the DL (branding, familiarity)
DL IR Challenges (cont’d)
Search and Browse functionality balance
  Keyword vs directory searching (the Bruza et al. reading)
Collection maintenance (reindexing)
  additions & deletions
  Corrections (e.g., BLS)
User interfaces
Evaluation
Resources
Search Engine Watch: http://searchenginewatch.com/
Keith van Rijsbergen’s book, Information Retrieval: http://www.dcs.gla.ac.uk/Keith/Preface.html
SIGIR: http://www.acm.org/sigir