Date of Award
Master of Science (MS)
College of Science and Mathematics
Thesis Sponsor/Dissertation Chair/Project Chair
Eileen M. Fitzpatrick
Collocations are words that frequently occur together in English. Non-native speakers of English tend to confuse a term with a similar one, substituting a word that is not commonly used with its partner. “Powerful tea” is an example of such an odd collocation; the commonly used form is “strong tea”.
This paper proposes an approach called CollOrder to detect such odd collocations in written English. CollOrder also suggests corrections for the odd collocate; the suggestions are filtered and ranked, and the top-k are returned.
We use large text corpora such as the American National Corpus (ANC) to identify common collocations, and we use search heuristics to speed up both the search for collocations and the preparation of a list of suggestions. We have created a Collocation Frequency Knowledge Base (CFKB). Using a machine learning classifier, we combine various measures of similarity and frequency of usage into a formula that filters and ranks the top-k suggestions.
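The idea of ranking suggestions by combining frequency and similarity can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy corpus stands in for a corpus such as the ANC, adjacent word pairs stand in for the CFKB, the `SIMILARITY` table is hand-written, and the equal weights in `suggest` are placeholders rather than the formula learned by the classifier.

```python
from collections import Counter
import math

# Toy corpus standing in for a large corpus such as the ANC (assumption).
CORPUS = [
    "she drank strong tea every morning",
    "a cup of strong tea helps",
    "he prefers strong tea",
    "a powerful engine drives the car",
    "hot tea with lemon",
]

def build_cfkb(sentences):
    """Count adjacent word pairs; a simplified stand-in for the CFKB."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        counts.update(zip(words, words[1:]))
    return counts

# Hypothetical similarity scores between the odd collocate and candidates.
SIMILARITY = {("powerful", "strong"): 0.9, ("powerful", "hot"): 0.1}

def suggest(cfkb, odd, base, candidates, k=3, w_freq=0.5, w_sim=0.5):
    """Rank candidate replacements for `odd` in the collocation (odd, base).

    The weights are illustrative placeholders, not learned values.
    """
    max_count = max(cfkb.values())
    scored = []
    for cand in candidates:
        freq = cfkb[(cand, base)]
        if freq == 0:
            continue  # filter out candidates never seen with the base word
        freq_score = math.log1p(freq) / math.log1p(max_count)
        sim = SIMILARITY.get((odd, cand), 0.0)
        scored.append((w_freq * freq_score + w_sim * sim, cand))
    scored.sort(reverse=True)
    return [cand for _, cand in scored[:k]]

cfkb = build_cfkb(CORPUS)
suggestions = suggest(cfkb, "powerful", "tea", ["strong", "hot", "mighty"])
```

In this toy setting “powerful tea” never occurs in the corpus, so it would be flagged as odd, while “strong” ranks first among the candidates because it is both frequent with “tea” and similar to “powerful”.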
We have implemented a web-based solution to evaluate the approach and have considered factors such as caching of intermediate results so that it can be used in real time.
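One common way to cache intermediate results is to memoize repeated lookups. The sketch below uses Python's standard `functools.lru_cache`; the `FREQ` table, the `lookup_frequency` function, and the `CALLS` counter are hypothetical names introduced for illustration and do not describe the caching scheme actually used in the implementation.

```python
from functools import lru_cache

# Hypothetical frequency table standing in for a slow corpus or database query.
FREQ = {("strong", "tea"): 3, ("hot", "tea"): 1}

# Counts how often the underlying lookup really runs (for demonstration only).
CALLS = {"count": 0}

@lru_cache(maxsize=1024)
def lookup_frequency(collocate, base):
    """Return the frequency of (collocate, base); repeated calls hit the cache."""
    CALLS["count"] += 1
    return FREQ.get((collocate, base), 0)
```

Calling `lookup_frequency("strong", "tea")` twice runs the underlying lookup only once; the second call is served from the cache, which is the kind of saving that matters when the same collocation is queried repeatedly in a web setting.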
We claim that our approach would be useful for semantically enhancing Web information retrieval, providing automated error correction in machine-translated documents, and offering assistance to students using ESL tools.
Varghese, Alan T., "Collocation Error Correction in Web Queries and Text Documents" (2013). Theses, Dissertations and Culminating Projects. 1076.