Tagset Design, Inflected Languages, and N-gram Tagging

Journal / Book Title

Linguistics Journal


This paper explores the relationship between tagset design and the linguistic properties of inflected languages for the task of morphosyntactic tagging. Several information-theoretic measures and statistics on these languages are reported; they show, unsurprisingly, that tagsets for morphologically rich languages are larger than tagsets for English and that the average tag/token ambiguity is higher. The surprising outcome of the experiments concerns Catalan, Czech, Polish, Portuguese, and Russian, which are considered to be “free word order” languages (to various degrees). For these languages, knowledge of the preceding tag reduces the uncertainty about the tag in question when the detailed tagset is used, but when the tagset is reduced to the size of the English tagset (eliminating the detailed information), two adjacent tags are relatively independent of each other. The experiments provide additional support for the results of Elworthy (1995).
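As an illustration of what "knowledge of the preceding tag reduces the uncertainty about the tag in question" can mean, the following minimal sketch estimates the entropy of a tag, its conditional entropy given the preceding tag, and their mutual information from a tag sequence. The abstract does not specify which measures or estimators the paper uses, so the function names, the maximum-likelihood estimates, and the toy tag sequences here are assumptions made purely for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of the distribution given by raw counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def tag_bigram_stats(tags):
    """Return (H(T_i), H(T_i | T_{i-1}), I(T_{i-1}; T_i)) in bits.

    Estimated by maximum likelihood from adjacent tag pairs in `tags`.
    This is an illustrative estimator, not necessarily the paper's.
    """
    bigrams = Counter(zip(tags, tags[1:]))
    prev, curr = Counter(), Counter()
    for (p, c), n in bigrams.items():
        prev[p] += n
        curr[c] += n
    h_curr = entropy(curr)
    # Chain rule: H(T_i | T_{i-1}) = H(T_{i-1}, T_i) - H(T_{i-1})
    h_cond = entropy(bigrams) - entropy(prev)
    return h_curr, h_cond, h_curr - h_cond


# Toy sequences (invented): in the first, each tag fully determines the next;
# in the second, adjacent tags are independent.
dependent = ["A", "B", "A", "B", "A"]
independent = ["A", "A", "B", "B", "A"]
print(tag_bigram_stats(dependent))
print(tag_bigram_stats(independent))
```

A mutual information near zero corresponds to adjacent tags being "relatively independent", while a value close to the tag entropy means the preceding tag removes most of the uncertainty.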

Moreover, even though the word order of richly inflected languages is considered to be relatively free, such languages seem to behave like English with respect to context; it is therefore concluded that n-gram tagging techniques are well justified for such languages. Experiments with cross-lingual projection of morphosyntax described in Hana et al. (2004), Feldman et al. (2006a,b), and Hana et al. (2006) provide additional empirical evidence for this claim.
