Date of Award

Spring 5-2021

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

School

Computing Sciences and Computer Engineering

Committee Chair

Dr. Andrew Sung

Committee Chair School

Computing Sciences and Computer Engineering

Committee Member 2

Dr. Iliyan Iliev

Committee Member 2 School

Social Science and Global Studies

Committee Member 3

Dr. Bo Li

Committee Member 3 School

Computing Sciences and Computer Engineering

Committee Member 4

Dr. Ramakalavathi Marapareddy

Committee Member 4 School

Computing Sciences and Computer Engineering

Committee Member 5

Dr. Chaoyang Zhang

Committee Member 5 School

Computing Sciences and Computer Engineering

Abstract

Arabic is one of the most widely used languages in the world, but due in part to its morphological and syntactic richness, resources for automated processing of Arabic are relatively rare. Arabic takes three primary forms: Classical Arabic as seen in the Qur’an and other classical texts; Modern Standard Arabic (MSA) as seen in newspapers, formal documents, and other written text intended for widespread distribution; and dialectal Arabic as used in common speech and informal communication. Social media posts are often written in informal language and may include non-standard spellings, abbreviations, emoticons, hashtags, and emojis. Dialectal Arabic is commonly used in social media.

Semantic classification is the task of assigning a label to a text based on its primary semantic content. Given the increased use of dialectal Arabic on social media platforms in recent years, there is an urgent need for semantic classification of dialectal Arabic. Even compared to MSA, few resources exist for automated processing of dialectal Arabic, and prior work on such processing is limited to only one or two dialects. One of the major obstacles to semantic classification of multi-dialectal Arabic is the lack of a large, multi-dialectal, tagged corpus. To the best of our knowledge, there are no automated processes for semantic classification of multi-dialectal Arabic social media texts.

We gather a data set of more than one million tweets collected from 449 accounts located in 12 Arabic-speaking countries. We group those tweets into 21,791 documents by country, account, and month. We first construct a query to represent a particular semantic concept. Then, using Latent Semantic Analysis (LSA), we rank the documents by semantic similarity to the query. Next, we use that ranking to train a deep neural network classifier to identify documents whose text is semantically similar to the query. In our experiments, this approach to semantic classification of multi-dialectal Arabic achieves an overall accuracy of 98.075% and a positive accuracy of 88.178%. The source code and the data set are provided on GitHub at https://github.com/therishel/ArabLeader.
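
The LSA ranking step described above can be illustrated with a minimal sketch. This is not the dissertation's implementation: the whitespace tokenization, raw term counts, the number of latent dimensions `k`, and the function names `build_term_document_matrix` and `lsa_rank` are all assumptions made for illustration, and the toy documents are English placeholders rather than tweets from the data set.

```python
import numpy as np


def build_term_document_matrix(docs):
    """Build a raw-count term-document matrix (rows: terms, columns: documents).

    Assumption: whitespace tokenization and unweighted counts; the
    dissertation may use different preprocessing and term weighting.
    """
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    index = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc.split():
            matrix[index[tok], j] += 1.0
    return matrix, index


def lsa_rank(docs, query, k=2):
    """Rank documents by cosine similarity to the query in a k-dim latent space."""
    matrix, index = build_term_document_matrix(docs)

    # Truncated SVD: keep the k largest singular values and vectors.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    u_k = u[:, :k]

    # Each document column projected onto the k latent term directions.
    doc_vecs = matrix.T @ u_k

    # Fold the query into the same latent space with the same projection.
    q = np.zeros(len(index))
    for tok in query.split():
        if tok in index:
            q[index[tok]] += 1.0
    q_vec = q @ u_k

    # Cosine similarity; the small epsilon guards against zero-length vectors.
    eps = 1e-12
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec) + eps
    scores = (doc_vecs @ q_vec) / norms

    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order]


# Hypothetical toy corpus: two documents on one topic, two on another.
docs = [
    "election vote president",
    "vote parliament election",
    "football match goal",
    "goal team match",
]
ranked = lsa_rank(docs, "election vote", k=2)
for doc_id, score in ranked:
    print(doc_id, round(score, 3))
```

Here the two election-related documents rank above the two football-related ones. How the resulting ranking is converted into training labels for the neural network classifier is not specified in this abstract, so that step is left out of the sketch.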

Included in

Data Science Commons
