Date of Award
Spring 5-2021
Degree Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
School
Computing Sciences and Computer Engineering
Committee Chair
Dr. Andrew Sung
Committee Chair School
Computing Sciences and Computer Engineering
Committee Member 2
Dr. Iliyan Iliev
Committee Member 2 School
Social Science and Global Studies
Committee Member 3
Dr. Bo Li
Committee Member 3 School
Computing Sciences and Computer Engineering
Committee Member 4
Dr. Ramakalavathi Marapareddy
Committee Member 4 School
Computing Sciences and Computer Engineering
Committee Member 5
Dr. Chaoyang Zhang
Committee Member 5 School
Computing Sciences and Computer Engineering
Abstract
Arabic is one of the most widely used languages in the world, but due in part to its morphological and syntactic richness, resources for automated processing of Arabic are relatively rare. Arabic takes three primary forms: Classical Arabic as seen in the Qur’an and other classical texts; Modern Standard Arabic (MSA) as seen in newspapers, formal documents, and other written text intended for widespread distribution; and dialectal Arabic as used in common speech and informal communication. Social media posts are often written in informal language and may include non-standard spellings, abbreviations, emoticons, hashtags, and emojis. Dialectal Arabic is commonly used in social media.
Semantic classification is the task of assigning a label to a text based on its primary semantic content. Given the increased use of dialectal Arabic on social media platforms in recent years, there is an urgent need for semantic classification of dialectal Arabic. Even compared to MSA, there are few resources for automated processing of dialectal Arabic, and prior work on such processing is limited to only one or two dialects. One of the major obstacles to semantic classification of multi-dialectal Arabic is the lack of a large, multi-dialectal, tagged corpus. To the best of our knowledge, there are no automated processes for semantic classification of multi-dialectal Arabic social media texts.
We gather a data set of more than one million tweets collected from 449 accounts located in 12 Arabic-speaking countries. We group those tweets into 21,791 documents by country, account, and month. We first construct a query to represent a particular semantic concept. Then, using Latent Semantic Analysis (LSA), we rank the documents by semantic similarity to the query. Next, we use that ranking to train a deep neural network classifier to identify documents whose text is semantically similar to the query. Experiments demonstrate that this approach to semantic classification of multi-dialectal Arabic achieves an overall accuracy of 98.075% and a positive accuracy of 88.178%. The source code and the data set are provided on GitHub at https://github.com/therishel/ArabLeader.
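The grouping step described above (tweets combined into documents by country, account, and month) might look roughly like the following Python sketch. It is illustrative only and is not taken from the ArabLeader repository: the function name `group_tweets`, the dictionary field names (`country`, `account`, `created_at`, `text`), and the choice to join tweets with a single space are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime


def group_tweets(tweets):
    """Group tweet texts into documents keyed by (country, account, month).

    Hypothetical helper: the field names used here are assumptions,
    not the schema of the dissertation's actual data set.
    """
    groups = defaultdict(list)
    for tweet in tweets:
        # Reduce the timestamp to a year-month string such as "2019-03".
        month = tweet["created_at"].strftime("%Y-%m")
        key = (tweet["country"], tweet["account"], month)
        groups[key].append(tweet["text"])
    # Concatenate each group's tweets into a single document string.
    return {key: " ".join(texts) for key, texts in groups.items()}


# Made-up sample records, for illustration only.
sample = [
    {"country": "EG", "account": "user_a",
     "created_at": datetime(2019, 3, 1), "text": "t1"},
    {"country": "EG", "account": "user_a",
     "created_at": datetime(2019, 3, 20), "text": "t2"},
    {"country": "EG", "account": "user_a",
     "created_at": datetime(2019, 4, 2), "text": "t3"},
    {"country": "SA", "account": "user_b",
     "created_at": datetime(2019, 3, 5), "text": "t4"},
]
documents = group_tweets(sample)
```

In this sketch, two tweets from the same account in the same month collapse into one document, while a tweet from a later month or a different account starts a new one. Each resulting document would then be the unit that is ranked against the query and passed to the classifier.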
Copyright
Tom Rishel, 2021
Recommended Citation
Rishel, Tom, "Semantic Classification of Multidialectal Arabic Social Media" (2021). Dissertations. 1892.
https://aquila.usm.edu/dissertations/1892