Document Type

Article

Publication Date

2008

Journal / Book Title

Linguistics Journal

Abstract

This paper explores the relationship between tagset design and the linguistic properties of inflected languages for the task of morphosyntactic tagging. It reports information-theoretic measures and statistics for these languages which show, unsurprisingly, that tagsets for morphologically rich languages are larger than tagsets for English and that the average tag/token ambiguity is higher. The surprising outcome of the experiments concerns Catalan, Czech, Polish, Portuguese, and Russian, which are considered (to various degrees) to be “free word order” languages: knowledge of the preceding tag reduces the uncertainty about the tag in question if the detailed tagset is used, but when the tagset is reduced to the size of the English tagset (eliminating the detailed information), two adjacent tags are relatively independent of each other. The experiments provide additional support for the results of Elworthy (1995).

Moreover, even though the word order of richly inflected languages is considered to be relatively free, such languages seem to behave like English with respect to context; it is therefore concluded that n-gram tagging techniques are well justified for them. Experiments with cross-lingual projection of morphosyntax described in Hana et al. (2004), Feldman et al. (2006a,b), and Hana et al. (2006) provide additional empirical evidence for this claim.
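The abstract's claim that the preceding tag "reduces the uncertainty" about the current tag can be illustrated with standard information-theoretic quantities. The sketch below is not taken from the paper: it assumes the relevant measures are the entropy of the current tag, its conditional entropy given the preceding tag, and their difference (the mutual information between adjacent tags). The function names and the `reduce_tags` helper, which treats the first character of each tag as a coarse part of speech, are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(counts):
    """Shannon entropy in bits of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


def adjacent_tag_stats(tags):
    """Return (H(T), H(T | T_prev), I(T; T_prev)) in bits for a tag sequence.

    Probabilities are plain relative frequencies over adjacent tag pairs,
    with no smoothing (an assumption made to keep the sketch short).
    """
    bigrams = Counter(zip(tags, tags[1:]))  # (previous tag, current tag)
    prev_tags = Counter(tags[:-1])
    curr_tags = Counter(tags[1:])
    h_curr = entropy(curr_tags)
    # Chain rule: H(T | T_prev) = H(T_prev, T) - H(T_prev)
    h_cond = entropy(bigrams) - entropy(prev_tags)
    # Mutual information: how much the preceding tag reduces uncertainty
    mutual_info = h_curr - h_cond
    return h_curr, h_cond, mutual_info


def reduce_tags(tags):
    """Hypothetical tagset reduction: keep only the first character of each tag."""
    return [t[0] for t in tags]
```

Under these assumptions, one would call `adjacent_tag_stats` once on a corpus tagged with the detailed tagset and once on `reduce_tags` of the same sequence, then compare the mutual information. A value near zero for the reduced tagset would correspond to adjacent tags being "relatively independent of each other".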

Journal ISSN / Book ISBN

ISSN: 1718-2298

MSU Digital Commons Citation

Feldman, Anna, "Tagset Design, Inflected Languages, and N-gram Tagging" (2008). Department of Linguistics Faculty Scholarship and Creative Works. 8.
https://digitalcommons.montclair.edu/linguistics-facpubs/8

Rights

This article is under copyright by the author and made available for educational use only.

Published Citation

Feldman, A. (2008). Tagset Design, Inflected Languages, and N-gram Tagging. Linguistics Journal, 3(1), 151–173.

Included in

Computational Linguistics Commons

Department of Linguistics Faculty Scholarship and Creative Works

Tagset Design, Inflected Languages, and N-gram Tagging
