Date of Award
Summer 2020
Degree Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
School
Computing Sciences and Computer Engineering
Committee Chair
Dr. Chaoyang Zhang
Committee Chair School
Computing Sciences and Computer Engineering
Committee Member 2
Dr. Ping Gong
Committee Member 3
Dr. Dia Ali
Committee Member 3 School
Computing Sciences and Computer Engineering
Committee Member 4
Dr. Zhaoxian Zhou
Committee Member 4 School
Computing Sciences and Computer Engineering
Committee Member 5
Dr. Weihua Zhou
Abstract
In silico bioactivity prediction studies are designed to complement in vivo and in vitro efforts to assess the activity and properties of small molecules. In silico methods such as Quantitative Structure-Activity/Property Relationship (QSAR) modeling are used to correlate the structure of a molecule with its biological properties in drug design and toxicological studies. In this body of work, I started with two in-depth reviews of the application of machine learning-based approaches and feature reduction methods to QSAR, and then investigated solutions to three common challenges faced in machine learning-based QSAR studies.
First, to improve the prediction accuracy of learning from imbalanced data, the Synthetic Minority Over-sampling Technique (SMOTE) and Edited Nearest Neighbor (ENN) algorithms, combined with bagging as an ensemble strategy, were evaluated. Friedman’s aligned ranks test and the subsequent Bergmann-Hommel post hoc test showed that this method significantly outperformed other conventional methods. However, SMOTEENN with bagging became less effective when the imbalance ratio (IR) exceeded a certain threshold (e.g., >40). The ability to separate the few active compounds from the vast number of inactive ones is of great importance in computational toxicology.
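As a rough illustration of the over-sampling idea described above, the following minimal sketch computes an imbalance ratio and generates SMOTE-style synthetic minority samples by interpolation. It is not the dissertation's implementation: the function names, the toy data, and the definition of IR as majority count divided by minority count are assumptions made for illustration, and the ENN cleaning step and the bagging ensemble are omitted for brevity.

```python
import math
import random


def imbalance_ratio(labels):
    """Assumed definition: (# majority-class samples) / (# minority-class samples)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return max(counts.values()) / min(counts.values())


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy data: 4 "active" (minority) and 40 "inactive" compounds,
# each active compound described by two made-up descriptor values.
actives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
labels = [1] * len(actives) + [0] * 40

ir_before = imbalance_ratio(labels)
new_points = smote_oversample(actives, n_new=36)
ir_after = imbalance_ratio(labels + [1] * len(new_points))
print(ir_before, ir_after)
```

In practice one would more likely rely on an existing library; for example, the imbalanced-learn package provides a `SMOTEENN` class that combines both steps. Whether the dissertation used that library is not stated here.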
Second, deep neural networks (DNN) and random forest (RF), representing deep and shallow learning algorithms, respectively, were chosen to carry out structure-activity relationship-based chemical toxicity prediction. Results suggest that DNN significantly outperformed RF (p < 0.001, ANOVA) by 22-27% for four metrics (precision, recall, F-measure, and AUPRC) and by 11% for another (AUROC).
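To make the threshold-based metrics named above concrete, here is a small worked sketch that computes precision, recall, and F-measure from binary predictions. The labels and predictions are invented for illustration and are unrelated to the dissertation's data; the function name is hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F-measure for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Made-up example: 3 toxic (1) and 7 non-toxic (0) compounds.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_model = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]   # a model with 2 TP, 1 FP, 1 FN
y_trivial = [0] * 10                        # always predicts "non-toxic"

print(accuracy(y_true, y_model), precision_recall_f1(y_true, y_model))
print(accuracy(y_true, y_trivial), precision_recall_f1(y_true, y_trivial))
```

The trivial classifier reaches 70% accuracy on this toy set while its recall is zero, which is one reason precision, recall, and F-measure are often preferred over accuracy for imbalanced toxicity data.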
Lastly, current features used for QSAR-based machine learning are often very sparse and limited by the logic and mathematical processes used to compute them. Transformer embedding features (TEF) were developed as new continuous vector descriptors/features using the latent space embedding from a multi-head self-attention mechanism. The significance of TEF as new descriptors was evaluated by applying them to tasks such as predictive modeling, clustering, and similarity search. An accuracy of 84% on the Ames mutagenicity test indicates that these new features are correlated with biological activity.
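One way continuous vector descriptors can support similarity search is by ranking molecules by the cosine similarity of their embeddings. The sketch below assumes this approach; the three-dimensional vectors and molecule names are hypothetical placeholders rather than actual TEF values, and the dissertation may use a different similarity measure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, library, top_k=2):
    """Return the names of the top_k library entries closest to `query`."""
    ranked = sorted(
        library.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Hypothetical embedding vectors; real embeddings would have many more dimensions.
library = {
    "mol_a": (1.0, 0.0, 0.0),
    "mol_b": (0.9, 0.1, 0.0),
    "mol_c": (0.0, 1.0, 0.0),
}
query = (2.0, 0.0, 0.0)  # same direction as mol_a, different magnitude

hits = most_similar(query, library)
print(hits)
```

Cosine similarity ignores vector magnitude, so the query above matches `mol_a` exactly even though its length differs.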
Overall, the findings in this study can be applied to improve the performance of machine learning based Quantitative Structure-Activity/Property Relationship (QSAR) efforts for enhanced drug discovery and toxicology assessments.
Copyright
Idakwo, 2020
Recommended Citation
Idakwo, Gabriel, "Machine Learning Approaches for Improving Prediction Performance of Structure-Activity Relationship Models" (2020). Dissertations. 1826.
https://aquila.usm.edu/dissertations/1826
Included in
Data Science Commons, Medicinal-Pharmaceutical Chemistry Commons, Other Chemistry Commons, Statistical Models Commons