Retrieval in Digital Libraries

INLS 235

What is Retrieved?

Well-structured data

Strings, values

DBMS

Text Documents

Surrogates

Full text

Other media

Images, audio, video, statistics, code

Multimedia/mixed media

Answers

What are the ‘Sources of Evidence’ for Retrieval?

Words

Author’s words

Indexer’s words (metadata from a controlled vocabulary)

Actions

Human (e.g., publisher metadata, signatures)

System (e.g., date, watermarks)

Links (citations)

Human (citations, links)

System (computed from events or relationships, recommenders)

Retrieval Processes (how are needs expressed?)

Content-Centered Retrieval as Matching Document Representations to Query Representations

User-Centered Information-Seeking Process

Text Retrieval

1. Text retrieval is more complex than data retrieval from DBMS.

2. Distinguish searching for word matches from concept matches.

3. Distinguish subject from keyword search:

Subject: Search on a controlled vocabulary (e.g., LC subject headings). The results point to documents.

Keyword: Search all words in particular fields/text fragments. The results point to documents.

4. Distinguish exact match from partial match (ranked) retrieval

5. Distinguish ASCII/Unicode objects from bitmaps

Approaches to Text Retrieval

1. Surrogate Search: Search a set of predefined words that point to related documents. Requires indexing via some controlled vocabulary.

pros: natural transition from paper systems

cons: limited access; human indexing required

2. Full-Text Search: Search every word in every document.

pros: broaden access; possible to automate indexing

cons: words rather than concepts

3. Knowledge-Based Search: Search a set of concepts that are related to concepts in documents.

pros: improved retrieval

cons: computationally expensive; theoretical at present

Full-Text Search

Full-Text Search:

Search every word (or variant) in the document except stop words.

Use stemming?

Methods:

Text Scanning

Signatures

Indexes (inverted files)

Vectors (term document matrix)

Linkages (link analysis)

Recommendations (explicit or implicit)
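A minimal sketch of the "every word except stop words, with stemming" step above. The stop list and the suffix-stripping rule are illustrative assumptions, not a standard stemmer such as Porter's; the names STOP_WORDS, SUFFIXES, crude_stem and index_terms are hypothetical.

```python
import re

# Assumed, deliberately tiny stop list; real systems use longer lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to"}

# Assumed suffix list for a crude stemmer (not the Porter algorithm).
SUFFIXES = ("ing", "es", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, split on non-letters, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


terms = index_terms("The indexing of digital libraries is searched")
# terms == ["index", "digital", "librari", "search"]
```

Stemming conflates variants ("searched", "searching", "searches" all become "search"), at the cost of occasional odd stems such as "librari".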

Inverted File

Assumption: related objects use same words

Words point to word number, offset, surrogate, or document:

aardvark    *Doc 3, Doc 7, Doc 45, Doc 67, ...

abacus      Doc 2, Doc 16, Doc 33, Doc 45, Doc 67, ...

.

.

.

zygote      Doc 7, Doc 33, Doc 67, Doc 123, ...

Find all documents for each query word, then apply logical (Boolean) operators to combine the lists

Query either matches or does not match

* A posting can actually record position, e.g., Doc 3, Para 5, Word 45
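A minimal sketch of an inverted file with Boolean combination, assuming postings are plain sets of document ids (no paragraph or word offsets). The three toy documents and the function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_and(index, *words):
    """Documents that contain every query word."""
    postings = [index.get(w, set()) for w in words]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, *words):
    """Documents that contain at least one query word."""
    return set().union(*(index.get(w, set()) for w in words))


# Toy collection (assumed contents, for illustration only).
docs = {
    "Doc7": "aardvark zygote",
    "Doc33": "abacus zygote",
    "Doc67": "aardvark abacus zygote",
}
index = build_inverted_index(docs)
both = boolean_and(index, "aardvark", "abacus")    # {"Doc67"}
either = boolean_or(index, "aardvark", "abacus")   # all three documents
```

The result is a set, not a ranking: a document is either in it or not, which is the exact-match behavior noted above.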

Vectors

Each document (or surrogate) is represented by a vector defined by every word in the collection.

Doc 1 0 0 1 1 0 0 ..... 0

Doc 2 0 0 0 0 1 1 ..... 0

.

Doc 7 1 0 0 1 0 0 ..... 1 (has aardvark and zygote)

.

Doc 33 0 1 0 0 0 0 ..... 1 (has abacus and zygote)

.

Doc 67 1 1 0 0 0 0 ..... 1 (has aardvark, abacus and zygote)

.

Doc N

Queries are expressed as vectors and matched to document vectors. Degrees of matching are possible.
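A minimal sketch of degree-of-match ranking, assuming binary term vectors over a three-word vocabulary and cosine similarity as the matching function (one common choice; the slides do not name a specific measure). The vectors and names are hypothetical.

```python
import math

# Assumed vocabulary order for the vector positions.
VOCAB = ["aardvark", "abacus", "zygote"]

# Toy binary document vectors (1 = the document contains the word).
doc_vectors = {
    "Doc7": [1, 0, 1],
    "Doc33": [0, 1, 1],
    "Doc67": [1, 1, 1],
}


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_vector):
    """Document ids ordered by decreasing similarity (ties broken by id)."""
    scored = [(cosine(query_vector, vec), doc_id)
              for doc_id, vec in doc_vectors.items()]
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


ranking = rank([1, 1, 0])  # query: "aardvark abacus"
```

Unlike the Boolean case, Doc7 and Doc33 still receive a partial score (0.5) for matching one of the two query words, while Doc67 ranks first.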

Latent Semantic Indexing

Like the vector model, uses a document-term matrix.

Apply singular value decomposition (SVD) to produce a set of ranked singular values with their singular vectors. The singular vectors represent abstract concepts in the document space.

Select the top singular values (e.g., first 200) and apply to query-document matching (See Efron dissertation)

Retrieves some documents that may not use the query term
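A sketch of the usual formulation, under the assumption that A is the term-by-document matrix (terms as rows, documents as columns), k is the number of retained concepts, and q is the query's term vector. The symbols are introduced here for illustration and do not come from the slides.

```latex
% Rank-k truncated SVD of the term-by-document matrix A
A \approx A_k = U_k \Sigma_k V_k^{T},
\qquad
\Sigma_k = \mathrm{diag}(\sigma_1, \dots, \sigma_k),
\quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k > 0

% Fold the query into the k-dimensional concept space
\hat{q} = \Sigma_k^{-1} U_k^{T} q

% Document j is represented by the j-th row of V_k (written as a column vector)
\hat{d}_j = \Sigma_k^{-1} U_k^{T} d_j = (V_k^{T})_{\cdot j},
\qquad
\mathrm{sim}(q, d_j) = \cos\left(\hat{q}, \hat{d}_j\right)
```

Because matching happens in concept space rather than term space, a document can score well without containing the literal query term.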

Link Analysis: Citations
(relevance based on author judgments)

Some citation assumptions

If A cites (is linked to) B, then more likely that A is related to B than to arbitrary C.

If A and B are cited by (linked from) C, then A and B are more likely to be related than A or B to arbitrary D. (co-citation)

If more objects cite A than cite B, then A is more ‘valued’ than B (citation value)

If highly ‘valued’ object A cites C and less ‘valued’ object B cites C, then A’s citation is more valuable.
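A minimal sketch of two of these assumptions as counts: citation value (how many objects cite A) and co-citation (how many objects cite both A and B). The citation data and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def citation_counts(citations):
    """Number of citing objects for each cited object."""
    counts = Counter()
    for cited in citations.values():
        counts.update(cited)
    return counts


def cocitation_counts(citations):
    """Number of objects that cite both members of each pair."""
    counts = Counter()
    for cited in citations.values():
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


# citing object -> set of objects it cites (toy data)
citations = {
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"A"},
}
value = citation_counts(citations)        # A cited 3 times, B twice
together = cocitation_counts(citations)   # (A, B) co-cited twice
```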

Link Analysis

Assumption: related objects are linked

A→B or B→A ⇒ A~B

A→B, C→B ⇒ A~C

In links, out links

Hubs (lots of out links)

Authorities (lots of in links)

Au→B more important than A→B
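A minimal sketch that counts in links and out links to pick hub and authority candidates. Raw degree counts are a simplifying assumption; the toy link list and names are hypothetical.

```python
def link_degrees(links):
    """Return (in_degree, out_degree) dicts for a list of (source, target) links."""
    in_deg, out_deg = {}, {}
    for src, dst in links:
        for node in (src, dst):
            in_deg.setdefault(node, 0)
            out_deg.setdefault(node, 0)
        out_deg[src] += 1
        in_deg[dst] += 1
    return in_deg, out_deg


# Toy link graph: H points to three objects; A is pointed to by three.
links = [("H", "A"), ("H", "B"), ("H", "C"), ("X", "A"), ("Y", "A")]
in_deg, out_deg = link_degrees(links)
hub = max(out_deg, key=out_deg.get)         # most out links
authority = max(in_deg, key=in_deg.get)     # most in links
```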

Link parameters

‘In links’ (aka backlinks) are citations to the object, ‘out links’ are references to other objects (How to incorporate these distinctions?)

Link distance (number of hops? how to dampen? when to stop?)

Link traversal (number of times selected?)

Text window for a link (How much of the text around a link to consider in algorithms?)

Algorithms

HyperSearch: use links to get text from related objects and enrich the text models

PageRank: use only ‘in links.’ Use the link matrix of the entire web. Weight links recursively, so that links from objects with many ‘in links’ count for more. (Google)

Hyperlink-Induced Topic Search (HITS): use both in and out links. Recursively define ‘hubs’ (objects that point to many good authorities) and ‘authorities’ (objects that have many good hubs pointing to them). Use a portion of the web. (Clever)
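A minimal sketch of the two recursive weighting schemes as simple fixed-count iterations on a toy graph. The damping value of 0.85, the iteration count, the handling of objects with no out links, and the link list are all assumptions made for illustration; production systems differ in many details.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) links."""
    nodes = sorted({n for edge in links for n in edge})
    out = {n: [dst for src, dst in links if src == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            # An object with no out links spreads its rank over all objects.
            targets = out[n] or nodes
            share = damping * rank[n] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank


def hits(links, iterations=50):
    """Mutually reinforcing hub and authority scores (L2-normalized each round)."""
    nodes = sorted({n for edge in links for n in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the objects linking in.
        auth = {n: sum(hub[s] for s, d in links if d == n) for n in nodes}
        # Hub score: sum of authority scores of the objects linked to.
        hub = {n: sum(auth[d] for s, d in links if s == n) for n in nodes}
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


links = [("H", "A"), ("H", "B"), ("X", "A"), ("Y", "A"), ("A", "B")]
rank = pagerank(links)
hub, auth = hits(links)
```

On this toy graph B outranks A under PageRank even though A has more in links, because B is cited by the highly ranked A. HITS instead picks A as the top authority and H as the top hub.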

Combining Multiple Sources of Evidence

Text analysis+Indexing+Link analysis

Kiduk Yang’s dissertation work

Add alternative document parameters

Adaptable fusion based on context?

Adaptable fusion based on user feedback and active engagement?
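A minimal sketch of one simple way to fuse evidence: min-max normalize each source's scores and take a weighted sum. This is a generic illustration and not the method of the dissertation mentioned above; the scores, weights, and names are hypothetical.

```python
def min_max(scores):
    """Rescale one source's scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(sources, weights):
    """Rank documents by the weighted sum of normalized per-source scores."""
    fused = {}
    for name, scores in sources.items():
        for doc, s in min_max(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weights[name] * s
    return sorted(fused, key=lambda d: (-fused[d], d))


# Toy scores from two sources of evidence on different scales.
sources = {
    "text": {"Doc7": 2.0, "Doc33": 1.0, "Doc67": 3.0},
    "links": {"Doc7": 10, "Doc33": 40, "Doc67": 20},
}
ranking = fuse(sources, {"text": 0.7, "links": 0.3})
link_heavy = fuse(sources, {"text": 0.3, "links": 0.7})
```

Changing the weights changes the ranking, which is where context-based or feedback-based adaptation could plug in.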

Document Alternatives

Paragraphs, passages

SGML/HTML/XML codes

‘Shape’ of text

Related problems:

text summarization/auto abstracting

auto categorization

question answering

Multimedia: Features for Indexing

Linguistic surrogates

Images

color, texture, luminosity, shape

Video

same as stills but add motion (e.g., optical flow)

Sound

speaker attributes, pitch, duration

Example: Color Histogram

Each pixel has color ‘depth’ (e.g., 16 bits)

Divide image into regions (e.g., 8x8 pixels)

Create a histogram for each region (amount of red, cyan, etc.)

The set of histograms serves as a quantitative representation for the image, allowing comparisons and rankings

Querying is awkward (use query by example, QBE)
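A minimal sketch of the region-histogram idea with a query-by-example comparison. For brevity it assumes 8-bit RGB pixels, 2x2-pixel regions, and 2 bins per channel rather than the 16-bit depth and 8x8 regions mentioned above; histogram intersection is one assumed choice of comparison, and all names are hypothetical.

```python
def region_histograms(image, region=2, bins=2):
    """One color histogram per region; image is rows of (r, g, b) 8-bit tuples."""
    height, width = len(image), len(image[0])
    step = 256 // bins
    histograms = []
    for top in range(0, height, region):
        for left in range(0, width, region):
            hist = [0] * (bins ** 3)
            for y in range(top, min(top + region, height)):
                for x in range(left, min(left + region, width)):
                    r, g, b = image[y][x]
                    bucket = (r // step) * bins * bins + (g // step) * bins + (b // step)
                    hist[bucket] += 1
            histograms.append(hist)
    return histograms


def similarity(hists_a, hists_b):
    """Histogram intersection averaged over regions, in [0, 1]."""
    total = 0.0
    for ha, hb in zip(hists_a, hists_b):
        total += sum(min(a, b) for a, b in zip(ha, hb)) / max(sum(ha), 1)
    return total / len(hists_a)


RED, CYAN = (255, 0, 0), (0, 255, 255)
query = [[RED] * 4, [RED] * 4]                      # the example image
half_red = [[RED, RED, CYAN, CYAN]] * 2             # left region red, right cyan
all_cyan = [[CYAN] * 4, [CYAN] * 4]

q = region_histograms(query)
scores = {
    "half_red": similarity(q, region_histograms(half_red)),
    "all_cyan": similarity(q, region_histograms(all_cyan)),
}
best = max(scores, key=scores.get)
```

The user supplies an example image instead of typing a query, and candidates are ranked by how closely their region histograms match it.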

Information-Seeking Process Revisited

Interactive Systems: Agile Views

Overviews

Previews

Shared views

History views

Dynamic queries

Interplay between analytical search and interactive browsing

Digital Library IR Challenges

Across collections

Object granularities (image, collection, finding aid, etc.)?

‘Sub’ controlled vocabularies? (or metadata)

Multimedia features?

Interoperation of metadata?

Diverse user communities (known and unexpected)

Links outside the DL (branding, familiarity)

DL IR Challenges (cont’d)

Search and Browse functionality balance

Keyword vs. directory searching (the Bruza et al. reading)

Collection maintenance (reindexing)

additions & deletions

Corrections (e.g., BLS)

User interfaces

Evaluation

Resources

Search engine watch http://searchenginewatch.com/

Keith van Rijsbergen’s book: Information Retrieval

http://www.dcs.gla.ac.uk/Keith/Preface.html

SIGIR http://www.acm.org/sigir

