Date of Award
Summer 7-2023
Degree Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
School
Computing Sciences and Computer Engineering
Committee Chair
Dr. Chaoyang Zhang
Committee Chair School
Computing Sciences and Computer Engineering
Committee Member 2
Dr. Hong-Wen Deng
Committee Member 2 School
Biological, Environmental, and Earth Sciences
Committee Member 3
Dr. Zhaoxian Zhou
Committee Member 3 School
Computing Sciences and Computer Engineering
Committee Member 4
Dr. Bikramjit Banerjee
Committee Member 4 School
Computing Sciences and Computer Engineering
Committee Member 5
Dr. Bo Li
Committee Member 5 School
Computing Sciences and Computer Engineering
Abstract
Integrative analysis of multi-omics data extends our understanding of biological systems across multiple omics layers, helps unravel the functional mechanisms of complex disease development, and refines the discovery of novel drug targets. However, multi-omics studies often face challenges such as data heterogeneity, missing values, limited interpretability, and class imbalance. Among these challenges, missing values are a critical issue for large cohort studies, because not every sample receives a complete measurement for every omics layer. To address the problem of missing values in multi-omics data, I focused on the imputation of completely missing gene expression data from known genotype data. I first gave a comprehensive review of both single-omics and multi-omics data imputation and grouped the combinations of different omics data into strategies according to the central dogma of molecular biology. Then I implemented a one-dimensional convolutional autoencoder model for genotype imputation and improved the training process with a custom-defined training loop that uses the single-batch loss rather than the average loss over batches. Next, I developed a data preprocessing pipeline for both genotype and gene expression data from the Louisiana Osteoporosis Study (LOS) cohort and predicted gene expression from genotype data with the existing PrediXcan method. To build a custom-defined prediction model, I trained the PrediXcan model following the PredictDB pipeline. Lastly, I adapted a transformer (TF)-based sequence-to-expression prediction model, originally developed for yeast, to the human genome. This modified TF model was used to predict gene expression values from known genotype data in the LOS and GEUVADIS (Genetic European Variation in Health and Disease) data.
To compare the results of the TF model and the PrediXcan model, I selected evaluation metrics including the Pearson correlation coefficient (PCC) and used 5-fold cross-validation to compare imputation performance on the LOS and GEUVADIS data.
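As a rough illustration of the evaluation protocol named in the abstract, the sketch below computes a per-fold Pearson correlation coefficient between observed and predicted expression under 5-fold cross-validation. It is not the dissertation's code: the data are synthetic, a one-predictor least-squares fit stands in for the PrediXcan and TF models, and every function and variable name here is hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when either input is constant
    return cov / sqrt(vx * vy)


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_simple_linear(x, y):
    """Least-squares slope and intercept for one predictor (stand-in model)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = 0.0 if sxx == 0 else sxy / sxx
    return slope, my - slope * mx


def cross_validated_pcc(dosage, expression, k=5, seed=0):
    """Return the PCC between held-out observed and predicted values per fold."""
    scores = []
    for held_out in k_fold_indices(len(dosage), k, seed):
        test = set(held_out)
        train = [i for i in range(len(dosage)) if i not in test]
        slope, intercept = fit_simple_linear(
            [dosage[i] for i in train], [expression[i] for i in train]
        )
        predicted = [slope * dosage[i] + intercept for i in held_out]
        observed = [expression[i] for i in held_out]
        scores.append(pearson(observed, predicted))
    return scores


# Synthetic toy data (assumption): one variant with dosage in {0, 1, 2}
# and expression equal to a linear signal plus Gaussian noise.
rng = random.Random(42)
dosage = [rng.choice([0, 1, 2]) for _ in range(200)]
expression = [0.8 * d + rng.gauss(0, 0.5) for d in dosage]

fold_scores = cross_validated_pcc(dosage, expression)
mean_pcc = sum(fold_scores) / len(fold_scores)
print("per-fold PCC:", [round(s, 3) for s in fold_scores])
print("mean PCC:", round(mean_pcc, 3))
```

Scoring each fold separately and then averaging keeps every prediction strictly out-of-sample, which is what allows two models to be compared on the same held-out samples.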
ORCID ID
0000-0003-4169-6684
Copyright
Meng Song
Recommended Citation
Song, Meng, "MISSING VALUE IMPUTATION FOR SINGLE OMICS AND MULTI-OMICS DATA" (2023). Dissertations. 2166.
https://aquila.usm.edu/dissertations/2166
Included in
Bioinformatics Commons, Computational Biology Commons, Computational Engineering Commons, Data Science Commons