"Tagset Design, Inflected Languages, and N-gram Tagging" by Anna Feldman

Department of Linguistics Faculty Scholarship and Creative Works

Title

Tagset Design, Inflected Languages, and N-gram Tagging

Authors

Anna Feldman, Montclair State University

Document Type

Article

Publication Date

2008

Journal / Book Title

Linguistics Journal

Abstract

This paper explores the relationship between tagset design and the linguistic properties of inflected languages for the task of morphosyntactic tagging. Some information-theoretic measures and statistics on these languages are reported, which show, unsurprisingly, that the tagsets for morphologically rich languages are larger than tagsets for English and that the average tag/token ambiguity is higher. The surprising outcome of the experiments is that for Catalan, Czech, Polish, Portuguese, and Russian, which are considered to be “free word order” languages (to various degrees), knowledge of the preceding tag reduces the uncertainty about the tag in question if the detailed tagset is used, but when the tagset is reduced to the size of the English tagset (eliminating the detailed information), two adjacent tags are relatively independent of each other. The experiments provide additional support for the results of Elworthy (1995).
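The abstract does not say which information-theoretic measures were computed, so the following is only a minimal sketch under the assumption that they resemble tag entropy, the conditional entropy of a tag given the preceding tag, and their mutual information, plus a simple per-token ambiguity count. The function names, the `coarsen` mapping, and the toy data are hypothetical illustrations, not the paper's code or corpora.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy (bits) of the empirical distribution in a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def tag_context_stats(tags):
    """Return (H(T), H(T | T_prev), I(T; T_prev)) in bits for a tag sequence.

    Assumption: uncertainty about a tag is measured over adjacent tag
    pairs, using H(T | T_prev) = H(T_prev, T) - H(T_prev).
    """
    prev_counts = Counter(tags[:-1])            # tags seen as left context
    cur_counts = Counter(tags[1:])              # tags seen as the predicted tag
    pair_counts = Counter(zip(tags, tags[1:]))  # (previous tag, current tag)
    h_cur = entropy(cur_counts)
    h_cond = entropy(pair_counts) - entropy(prev_counts)
    return h_cur, h_cond, h_cur - h_cond


def average_ambiguity(tagged_tokens):
    """Average number of distinct tags observed for each token's word form."""
    tags_by_form = defaultdict(set)
    for form, tag in tagged_tokens:
        tags_by_form[form].add(tag)
    return sum(len(tags_by_form[form]) for form, _ in tagged_tokens) / len(tagged_tokens)


def coarsen(tag):
    """Hypothetical tagset reduction: keep only the part-of-speech prefix."""
    return tag.split("-")[0]


# Toy, made-up tagged text with a "detailed" tagset (POS plus case).
toy = [("novaya", "A-nom"), ("kniga", "N-nom"), ("lezhit", "V"),
       ("novuyu", "A-acc"), ("knigu", "N-acc"), ("vizhu", "V")]
detailed = [tag for _, tag in toy]
reduced = [coarsen(tag) for tag in detailed]

print("detailed:", tag_context_stats(detailed))
print("reduced: ", tag_context_stats(reduced))
print("ambiguity:", average_ambiguity(toy))
```

Comparing the mutual information returned for a detailed tag sequence with that of its reduced counterpart is one way to frame the contrast the abstract describes; on real corpora the counts would come from much larger samples than this toy sequence.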

Moreover, even though the word order of richly inflected languages is considered to be relatively free, such languages seem to behave like English with respect to context, and it is therefore concluded that n-gram tagging techniques are well justified for such languages. Experiments with cross-lingual projection of morphosyntax described in Hana et al. (2004), Feldman et al. (2006a,b), and Hana et al. (2006) provide additional empirical evidence for this claim.

Journal ISSN / Book ISBN

1718-2298

MSU Digital Commons Citation

Feldman, Anna, "Tagset Design, Inflected Languages, and N-gram Tagging" (2008). Department of Linguistics Faculty Scholarship and Creative Works. 8.
https://digitalcommons.montclair.edu/linguistics-facpubs/8
