Date of Award
Master of Science (MS)
College of Science and Mathematics
Thesis Sponsor/Dissertation Chair/Project Chair
With the exponential growth in the archiving of time-stamped documents such as newswire articles, blog posts, and other web pages, information retrieval (IR) has become a challenging task. The task grows more complex when these archives cover long time spans and their terminology has changed significantly over time. When users pose queries about historical information over such document collections, the queries need to be translated to account for these temporal changes so that accurate responses can be returned. For example, a query on Sri Lanka should automatically retrieve documents that use its former name, Ceylon. We call such concepts SITACs, i.e., Semantically Identical Temporally Altering Concepts. To discover SITACs in a given corpus, we propose a methodology that integrates natural language processing, association rule mining, and contextual similarity. Using the discovered SITACs, historical queries over text corpora can be addressed effectively. The proposed methodology was evaluated on the Gutenberg corpus, which contains speeches of American presidents from the first speech of George Washington in 1795 to the speech of George W. Bush in 2006. Search engines and IR systems can benefit from the techniques presented in this research.
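As an illustration only, the query translation step described in the abstract might be sketched as follows. This is a minimal sketch and not the thesis implementation: it assumes the SITACs have already been discovered and stored as groups of equivalent terms. The names `SITAC_GROUPS` and `expand_query` are hypothetical, and only the Sri Lanka / Ceylon pair is taken from the abstract.

```python
import re

# Hypothetical SITAC table: each set holds the names used for one concept
# at different points in time. Only the Sri Lanka / Ceylon pair comes from
# the abstract; how such groups are discovered is not shown here.
SITAC_GROUPS = [
    {"sri lanka", "ceylon"},
]


def expand_query(query, groups=SITAC_GROUPS):
    """Return the original query plus one variant per equivalent SITAC term."""
    variants = [query]
    for group in groups:
        for term in group:
            # Match the term as a whole phrase, ignoring case.
            pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
            if pattern.search(query):
                # Substitute every other name of the same concept.
                for other in sorted(group - {term}):
                    variants.append(pattern.sub(other.title(), query))
    return variants


if __name__ == "__main__":
    for variant in expand_query("tea exports of Sri Lanka"):
        print(variant)
```

A retrieval system could then run every returned variant against the archive and merge the results, so that a query phrased with the current name also reaches documents written under the former one.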
Kaluarachchi, Amal, "The SITAC Approach for Time-Aware Query Translation in Text Archives" (2010). Theses, Dissertations and Culminating Projects. 894.