Date of Award

5-2021

Document Type

Thesis

Degree Name

Master of Science (MS)

College/School

College of Science and Mathematics

Department/Program

Computer Science

Thesis Sponsor/Dissertation Chair/Project Chair

Christopher Leberknight

Committee Member

Bharath Samanthula

Committee Member

Boxiang Dong

Abstract

Artificial intelligence (AI) remains a crucial aspect of improving our modern lives, but it also raises several social and ethical issues. One issue of major concern, investigated in this research, is the amount of content users consume that is generated by a form of AI known as bots (automated software programs). With the rise of social bots and the spread of fake news, more research is required to understand how much bot-generated content is being consumed. This research investigates the amount of bot-generated content relating to COVID-19. While research continues to uncover the extent to which our social media platforms are being used as a terrain for spreading information and misinformation, issues remain when it comes to distinguishing between social bots and humans who spread misinformation. Since online platforms have become a center for spreading fake information, often accelerated by bots, this research examines the amount of bot-generated COVID-19 content on Twitter. A hybrid approach is presented to detect bots using a COVID-19 dataset of 71,908 tweets collected between January 22nd, 2020 and April 2020, when the total reported cases of COVID-19 were below 600 globally. Three experiments were conducted using user account features, topic analysis, and sentiment features to detect bots and misinformation relating to the COVID-19 pandemic. Using the Weka machine learning tool, Experiment I investigates the optimal algorithms for detecting bots on Twitter. We used 10-fold cross-validation to test prediction accuracy on two labelled datasets, each containing a different set (category 1 and category 2) of four features. Results from Experiment I show that the category 1 features (favorite count, listed count, name length, and number of tweets) combined with the random forest algorithm produced the best prediction accuracy and performed better than the category 2 features (follower count, following count, screen name length, and description length). The best feature was listed count, followed by favorite count. It was also observed that using the category 2 features on the two labelled datasets produced the same prediction accuracy (100%) when tree-based classifiers were used.
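As a rough illustration of the Experiment I protocol (not the thesis's actual code), the sketch below runs 10-fold cross-validation of a random forest over a labelled account dataset using Weka's Java API. The file name bots_category1.arff, the class name BotDetectionCV, and the assumption that the bot/human label is the last attribute are all hypothetical.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BotDetectionCV {
        public static void main(String[] args) throws Exception {
            // Hypothetical ARFF file holding the category 1 features
            // (favorite count, listed count, name length, number of tweets)
            // with a nominal bot/human class as the last attribute.
            Instances data = DataSource.read("bots_category1.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Random forest, the best-performing algorithm in Experiment I.
            RandomForest rf = new RandomForest();

            // 10-fold cross-validation, matching the evaluation protocol.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(rf, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.printf("Accuracy: %.2f%%%n", eval.pctCorrect());
        }
    }

The same run can be reproduced without code through the Weka Explorer by loading the ARFF file and selecting 10-fold cross-validation with the trees.RandomForest classifier.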

To further investigate the validity of the features used in the two labelled datasets, in Experiment II each labelled dataset from Experiment I was used as a training sample to classify two different labelled datasets. Results show that the category 1 features generated 94% prediction accuracy, compared to the 60% accuracy generated by the category 2 features, using the random forest algorithm. Experiment III applies the results from Experiments I and II to classify 39,091 accounts that posted coronavirus-related content. Using the random forest algorithm and the features identified in Experiments I and II, our classification framework detected 5,867 of the 39,091 accounts (15%) as bots and 33,224 (85%) as humans. Further analysis revealed that bot accounts generated 30% (1,949/6,446) of the coronavirus misinformation, compared to 70% created by human accounts. Closer examination showed that about 30% of the misinformation created by humans consisted of retweets of bot content. In addition, the results suggest that bot accounts posted content on fewer topics than humans. Our results also show that bots generated more negative sentiment than humans on COVID-19-related issues. Consequently, topic distribution and sentiment may further improve the ability to distinguish between bot and human accounts.
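Along the same lines, here is a minimal sketch of the cross-dataset evaluation in Experiment II and the bulk classification step of Experiment III, again via Weka's Java API. All file names are hypothetical, as is the assumption that every ARFF file shares the category 1 feature layout with the class attribute last (missing, i.e. '?', for the unlabelled accounts); the thesis may prepare and apply its datasets differently.

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BotDetectionTransfer {
        public static void main(String[] args) throws Exception {
            // Experiment II: train on one labelled dataset, test on another.
            Instances train = DataSource.read("train_category1.arff");
            Instances test = DataSource.read("test_category1.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            RandomForest rf = new RandomForest();
            rf.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(rf, test);
            System.out.printf("Cross-dataset accuracy: %.2f%%%n", eval.pctCorrect());

            // Experiment III: label each unlabelled COVID-19 account bot/human.
            Instances unlabelled = DataSource.read("covid_accounts.arff");
            unlabelled.setClassIndex(unlabelled.numAttributes() - 1);
            for (int i = 0; i < unlabelled.numInstances(); i++) {
                double cls = rf.classifyInstance(unlabelled.instance(i));
                System.out.println(unlabelled.classAttribute().value((int) cls));
            }
        }
    }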

File Format

PDF
