Date of Award

5-2010

Document Type

Thesis

Degree Name

Master of Science (MS)

College/School

College of Science and Mathematics

Department/Program

Computer Science

Thesis Sponsor/Dissertation Chair/Project Chair

Aparna Varde

Committee Member

Anna Feldman

Committee Member

Jing Peng

Abstract

With the exponential growth in archives of time-stamped documents such as newswire articles, blog posts, and other web pages, information retrieval (IR) has become a challenging task. The complexity of this task increases when the archives cover long time spans and their terminology has changed significantly. When users pose queries about historical information over such document collections, the queries need to be translated to account for temporal changes so that the responses are accurate. For example, a query on Sri Lanka should automatically retrieve documents that use its former name, Ceylon. We call such concepts SITACs, i.e., Semantically Identical Temporally Altering Concepts. To discover SITACs in a given corpus, we propose a methodology that integrates natural language processing, association rule mining, and contextual similarity. Using the discovered SITACs, historical queries over text corpora can be addressed effectively. The proposed methodology was tested on the Gutenberg corpus, which contains speeches of American presidents from the first speech of George Washington in 1795 to the speech of George W. Bush in 2006. Search engines and IR systems can benefit from the techniques provided in this research.
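
The query-translation idea in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it covers only a simplified stand-in for the contextual-similarity step (cosine similarity over co-occurrence counts) and omits the natural language processing and association rule mining stages. The function names, the toy documents, the window size, and the threshold are all hypothetical, and multi-word names are assumed to be pre-joined into single tokens (`sri_lanka`).

```python
from collections import Counter
import math


def context_vector(term, docs, window=3):
    """Count the words that occur within `window` tokens of `term`."""
    counts = Counter()
    for text in docs:
        tokens = text.lower().split()
        for i, tok in enumerate(tokens):
            if tok == term:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for t in tokens[lo:hi] if t != term)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def expand_query(term, candidates, docs, threshold=0.5):
    """Return `term` plus every candidate whose context resembles it."""
    base = context_vector(term, docs)
    similar = [
        c for c in candidates
        if c != term and cosine(base, context_vector(c, docs)) >= threshold
    ]
    return [term] + similar


# Toy corpus (hypothetical); real input would be time-stamped documents.
docs = [
    "ceylon exports tea from the port of colombo",
    "sri_lanka exports tea from the port of colombo",
    "france produces wine in the region of bordeaux",
]

expanded = expand_query("sri_lanka", ["ceylon", "france"], docs)
print(expanded)
```

In this toy corpus, `ceylon` shares its surrounding words with `sri_lanka` and is added to the query, while `france` is not. A real system would also need the time stamps to confirm that the two names are used in different periods.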

File Format

PDF
