In this paper, we focused on the task of automatically reducing
verbose spoken queries using query performance predictors as input
features for a regression model. The model was trained to predict the
difference in performance between a candidate sub-query and the
original query.
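The idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two toy feature columns stand in for query performance predictors, the training targets are made-up performance differences, ordinary least squares stands in for whatever regression model was actually used, and the rule "keep the original unless some candidate has a positive predicted gain" is just one plausible way to use the predictions. The names `fit_regressor`, `predict_gain`, and `pick_reduction` are hypothetical.

```python
import numpy as np


def _with_bias(features):
    """Append a constant column so the linear model has an intercept."""
    features = np.asarray(features, dtype=float)
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_regressor(features, targets):
    """Least-squares fit of: performance difference ~ predictor features."""
    weights, *_ = np.linalg.lstsq(
        _with_bias(features), np.asarray(targets, dtype=float), rcond=None
    )
    return weights


def predict_gain(weights, features):
    """Predicted difference in performance (sub-query minus original)."""
    return _with_bias(features) @ weights


def pick_reduction(original, candidates, features, weights):
    """Return the candidate with the highest predicted gain.

    Falls back to the original query when no candidate is predicted
    to improve on it (assumed selection rule, not from the paper).
    """
    gains = predict_gain(weights, features)
    best = int(np.argmax(gains))
    return candidates[best] if gains[best] > 0 else original


# Toy training data: one row per candidate sub-query, two hypothetical
# predictor values per row, and a made-up performance difference as target.
train_features = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
train_targets = [-0.05, 0.15, -0.15, 0.05, 0.25]
weights = fit_regressor(train_features, train_targets)

# Choose among two hypothetical sub-queries of a verbose spoken query.
original = "find information about locations that grew their tourism industry"
candidates = ["information about locations", "locations tourism industry"]
chosen = pick_reduction(original, candidates, [[0, 1], [1, 0]], weights)
```

Predicting the difference relative to the original, rather than absolute performance, makes the fallback straightforward: a non-positive prediction means the reduction is not expected to help.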
The spoken queries were gathered using Amazon Mechanical Turk (MTurk)
and are based on the 250 TREC topics used in the TREC 2004 Robust
Track.
Our study participants were given search task descriptions that were slightly
modified from the original TREC topic description/narrative. Our goal was
to give participants search tasks that were situated in a
“real world” scenario.
For example, TREC topic 395 is associated with the following TREC
description, narrative, and search task description:

TREC description: Provide examples of successful attempts to attract
tourism as a means to improve a local economy.

TREC narrative: To be relevant, a selected document will specify the
entity (city, state, country, governmental unit) which has achieved an
economic increase due to the entity's efforts at boosting tourism.
Documents which only concern plans for increasing tourism are not
relevant, only documents which detail an actual increase are
relevant.

Search Task: You were recently in Costa Rica and were surprised by the
amount of tourists you saw. Now you are curious about other locations
(cities, states, or countries) that have also managed to boost their
tourism industry. Find information about locations that have recently
managed to grow their tourism industry.

Our 250 search task descriptions are provided here.
We gathered 20 spoken queries per search task, for a total
of 5,000 spoken queries. Queries were automatically transcribed using the
AT&T, IBM, and WIT.AI speech-to-text APIs.
The query transcriptions are provided in transcriptions.xlm.
Each query has an ID made of two parts separated by a period; one part is
a number in the 1-20 range, matching the 20 spoken queries gathered per
search task.
A few additional details:
- The transcriptions contain speech recognition errors.
- In some cases, the API interpreted a long pause as
“end of speech”, so the transcription may be cut short.
- In some cases, the API was not able to resolve the spoken query and
returned a NULL transcription. These are marked as NULL in the XML
file.