Using Machine Learning to Classify Discipline and Text Type in University Writing
Presentation Type
Poster
Faculty Advisor
Larissa Goulart da Silva
Access Type
Event
Start Date
26-4-2024 2:15 PM
End Date
26-4-2024 3:15 PM
Description
This project explores the extent to which a machine learning model can accurately identify the discipline and communicative purpose of undergraduate student writing. To this end, we built a logistic regression model that takes linguistic annotations of grammatical features as its input. Our data consist of a corpus of 180 written assignments by undergraduate students, divided into four disciplinary groups (arts and humanities, social sciences, life sciences, and physical sciences) and three communicative purposes (to argue, to explain, and to give a procedural recount). Each text was tagged for grammatical features with the Biber Tagger. Performance was evaluated with classification reports, which provide metrics such as precision, recall, and F1 score for each class. Preliminary results show that precision and recall are higher for the physical sciences than for the other disciplines in the corpus. In addition, both argumentative and procedural recount assignments appear to be classified with a fair degree of accuracy, whereas explanation texts are classified less accurately. These results suggest that there is more linguistic variation within explanation assignments than within the other assignment types. By the same logic, the physical sciences appear to be more stable in their linguistic features than the other disciplines investigated. The classification reports make it possible to assess how effectively the model predicts the discipline and communicative purpose of university student writing from linguistic features.
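To make the evaluation metrics named above concrete, the following is a minimal sketch of how per-class precision, recall, and F1 can be computed from true and predicted labels. It is not the project's code: the function name `per_class_report` and the six toy labels are hypothetical and chosen only for illustration, and they do not reflect the corpus or its results. Libraries such as scikit-learn offer a comparable table through `sklearn.metrics.classification_report`.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return precision, recall, F1, and support for each label.

    Hypothetical helper for illustration; not taken from the project.
    """
    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # how many texts truly belong to each class
    pred_count = Counter(y_pred)   # how many texts were assigned to each class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    report = {}
    for label in labels:
        # Precision: share of texts assigned to this class that belong to it.
        precision = correct[label] / pred_count[label] if pred_count[label] else 0.0
        # Recall: share of texts in this class that the model found.
        recall = correct[label] / true_count[label] if true_count[label] else 0.0
        # F1: harmonic mean of precision and recall.
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": true_count[label],
        }
    return report


# Toy labels, invented for this example (assumption, not corpus data).
y_true = ["argue", "argue", "explain", "explain", "recount", "recount"]
y_pred = ["argue", "argue", "argue", "explain", "recount", "recount"]

report = per_class_report(y_true, y_pred)
for label, scores in report.items():
    print(label, {k: round(v, 2) for k, v in scores.items()})
```

In this toy case one explanation text is misclassified as an argument, which lowers recall for "explain" and precision for "argue" while leaving "recount" unaffected. Reading the two metrics side by side in this way shows which classes are being confused with one another.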