In today's networked
information environment, tools to support information retrieval and filtering have become
common. Despite the general utility and popularity of these tools, in many important
respects their performance is mediocre. Text search engines and agent-based filtering
systems make mistakes that are obvious and aggravating to users, and relevant documents
are usually mixed with many others that are totally unrelated. These problems
significantly lower the productivity and effectiveness of people using the tools, whether
in education, science, business, or government. We believe that the fundamental issue that
underlies all of these problems is the lack of adequate models of the user and the domain.
In order to achieve breakthroughs in retrieval and filtering accuracy, the tools need to
be able to use more information about the context of the query, better models of the user,
and more knowledge about the domain.
User models and
models of topics or domains are not new. A number of studies in the past 20 years have
examined different approaches and implementations. In general, these studies did not have
a significant impact on the design of retrieval and filtering systems, despite the obvious
relevance of user modeling to such systems. We believe that some reasons for this lack of
impact are that previous studies were unable to specify precisely how such models would be
used to affect performance, that there were severe problems with how the data for such
models would be elicited, and that there was no well-defined structure within which such
models could be implemented.
In this proposal, we describe a new
approach to user and domain or topic modeling that has the potential to significantly
improve the effectiveness of information access and filtering. This approach is based on
recent research on language models for information retrieval. This research assumes
that every document or group of documents has one or more associated
probability distributions that model how the text in the document can be generated. This
generative model is quite different from the standard probabilistic retrieval models and
has a number of advantages. The key advantages for this project are that language models
appear to capture the important aspects of user and domain modeling that have been
observed in earlier experiments, and that retrieval techniques based on document language
models have been shown to be very effective.
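
The generative view described above can be illustrated with a minimal sketch of query-likelihood ranking, in which each document is scored by the probability that its language model would generate the query. This is an illustration only, not the project's implementation: the unigram model, the linear interpolation with a collection-wide model, the weight lam, and all function names are assumptions introduced here.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


def query_log_likelihood(query, doc_tokens, coll_counts, coll_len, lam=0.5):
    """Log-probability that a smoothed unigram document model generates the query."""
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in tokenize(query):
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts[term] / coll_len
        # Interpolate the document model with the collection model so that
        # a query term missing from the document does not zero out the score.
        p = (1.0 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # The term occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


def rank(query, docs, lam=0.5):
    """Return document names ordered from most to least likely to generate the query."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    coll_counts = Counter(t for toks in tokenized.values() for t in toks)
    coll_len = sum(coll_counts.values())
    if coll_len == 0:
        return []
    scored = [
        (query_log_likelihood(query, toks, coll_counts, coll_len, lam), name)
        for name, toks in tokenized.items()
    ]
    return [name for _, name in sorted(scored, key=lambda pair: pair[0], reverse=True)]


if __name__ == "__main__":
    docs = {
        "a": "language models for information retrieval",
        "b": "user studies in interactive systems",
        "c": "cooking pasta at home",
    }
    print(rank("language models", docs))
```

In the same spirit, a user or topic model could be represented as one more probability distribution over terms, estimated from text associated with the user or topic, although how such models are built and combined is the research question the proposal poses.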
The project we propose combines the
expertise and experience of one group in the development and testing of information
retrieval models and systems, with that of another in user modeling and user studies in
interactive systems. These two groups have a history of successful collaboration in
related domains, which provides a solid basis for the proposed collaborative project. We
describe a number of research issues, potential solutions, and a comprehensive
experimental program that will establish the impact of the proposed approach. The
evaluation of the new techniques can be done partly using standard collections like TREC,
but will also involve a number of user studies in a laboratory setting, and studies of the
impact on an operational Web search application with large numbers of users.
Slide show (PowerPoint) of the Mongrel Project.

This material is based upon work supported by the National Science Foundation under Grant No. 9911942. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).