PET crossing: Investigating cross-lingual generalization of euphemisms in LLMs
Presentation Type: Abstract
Faculty Advisor: Anna Feldman
Access Type: Event
Start Date: April 25, 2025, 9:00 AM
End Date: April 25, 2025, 9:59 AM
Description
Euphemisms replace harsh, impolite, or taboo expressions while preserving the intended meaning. Their figurative and context-dependent nature makes them challenging for natural language processing (NLP), a subfield of artificial intelligence that enables computers to understand, process, and generate human language using machine learning techniques. In this study, we investigate the transfer learning abilities of large language models (LLMs) for euphemism detection across languages, examining how models learn from one language and apply that knowledge to another. Specifically, we explore whether lexical overlap influences transfer performance. We categorize potentially euphemistic terms (PETs) as overlapping (shared across languages as direct or near-equivalents) or non-overlapping (lacking equivalents in the other language). We expand existing English and Turkish PET datasets with additional annotations and fine-tune XLM-RoBERTa (XLM-R) on each subset using consistent hyperparameters. We then evaluate model performance on both within-language and cross-lingual test splits. We show that the model struggles to generalize between overlapping and non-overlapping PETs within a language. Moreover, while models fine-tuned on Turkish PETs transfer relatively well to English, the reverse is not true, likely due to typological differences between the languages. This is further influenced by differences in category distributions: one language may have fewer PETs in a category but many examples per term, while the other has more PETs but fewer examples per term. These results show that cultural and linguistic differences affect how well LLMs detect euphemisms. Our work provides new datasets of overlapping and non-overlapping PETs and offers insights to improve cross-lingual NLP tasks.
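The overlapping/non-overlapping categorization described above can be sketched as a simple partition: a PET counts as overlapping if any of its translation equivalents appears in the other language's PET list. This is a minimal illustrative sketch, not the study's actual annotation procedure; the term lists and the translation mapping below are hypothetical examples.

```python
# Sketch of partitioning potentially euphemistic terms (PETs) into
# "overlapping" (a direct or near-equivalent exists in the other
# language's PET list) and "non-overlapping" subsets.

def split_pets(pets, other_language_pets, translations):
    """Partition `pets` by whether any translation equivalent
    appears in `other_language_pets`."""
    overlapping, non_overlapping = [], []
    for pet in pets:
        equivalents = translations.get(pet, set())
        if equivalents & other_language_pets:
            overlapping.append(pet)
        else:
            non_overlapping.append(pet)
    return overlapping, non_overlapping

# Hypothetical English PETs and Turkish equivalents, for illustration only.
english_pets = ["pass away", "let go", "senior citizen"]
turkish_pets = {"vefat etmek", "hayata gözlerini yummak"}
translations = {
    "pass away": {"vefat etmek"},  # shared euphemism for dying
    "let go": set(),               # no Turkish PET equivalent listed
    "senior citizen": set(),
}

overlap, non_overlap = split_pets(english_pets, turkish_pets, translations)
print(overlap)      # PETs with a Turkish equivalent
print(non_overlap)  # PETs without one
```

In the actual study each subset would then serve as a separate fine-tuning split for XLM-R, with evaluation on within-language and cross-lingual test sets.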
Comments
Poster presentation at the 2025 Student Research Symposium.