Date of Award

Summer 7-30-2023

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

School

Computing Sciences and Computer Engineering

Committee Chair

Dr. Chaoyang Zhang

Committee Chair School

Computing Sciences and Computer Engineering

Committee Member 2

Dr. Hong-Wen Deng

Committee Member 2 School

Biological, Environmental, and Earth Sciences

Committee Member 3

Dr. Zhaoxian Zhou

Committee Member 3 School

Computing Sciences and Computer Engineering

Committee Member 4

Dr. Bikramjit Banerjee

Committee Member 4 School

Computing Sciences and Computer Engineering

Committee Member 5

Dr. Bo Li

Committee Member 5 School

Computing Sciences and Computer Engineering

Abstract

Integrative analysis of multi-omics data extends our understanding of biological systems across multiple omics layers, helps unravel the functional mechanisms of complex disease development, and refines the discovery of novel drug targets. However, multi-omics studies often face challenges such as data heterogeneity, missing values, limited interpretability, and class imbalance. Among these challenges, missing values are a critical issue for large cohort studies because not every sample is measured completely across all omics layers. To address missing values in multi-omics data, I focused on imputing completely missing gene expression data from known genotype data. I first gave a comprehensive review of both single-omics and multi-omics data imputation and grouped the combinations of different omics data into strategies according to the central dogma of molecular biology. I then implemented a one-dimensional convolutional autoencoder model for genotype imputation and improved the training process with a custom-defined training loop that uses the single-batch loss rather than the average loss over batches. Next, I developed a data preprocessing pipeline for both genotype and gene expression data from the Louisiana Osteoporosis Study (LOS) cohort and predicted gene expression from genotype data with the existing PrediXcan method. To build a custom-defined prediction model, I trained the PrediXcan model following the PredictDB pipeline. Lastly, I adapted a transformer (TF)-based sequence-to-expression prediction model, originally developed for yeast, to the human genome. This modified TF model was used to predict gene expression values from known genotype data on the LOS and GEUVADIS (Genetic European Variation in Health and Disease) data.
To compare the TF model with the PrediXcan model, I selected the Pearson correlation coefficient (PCC) as an evaluation metric and used 5-fold cross-validation to compare imputation performance on the LOS and GEUVADIS data.
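The evaluation protocol described above (held-out PCC under 5-fold cross-validation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the function names (`pearson`, `fit_line`, `kfold_pcc`) are hypothetical, and a single-variant least-squares line stands in for the predictor. It is not the PrediXcan or TF model from the dissertation.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # PCC is undefined when either vector is constant
    return cov / math.sqrt(var_x * var_y)


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept (placeholder predictor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x)
    if var_x == 0:
        return 0.0, mean_y
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / var_x
    return slope, mean_y - slope * mean_x


def kfold_pcc(dosage, expression, k=5, seed=0):
    """Return the PCC between predicted and observed values for each held-out fold."""
    idx = list(range(len(dosage)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all samples
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        slope, intercept = fit_line([dosage[i] for i in train],
                                    [expression[i] for i in train])
        predicted = [slope * dosage[i] + intercept for i in held_out]
        observed = [expression[i] for i in held_out]
        scores.append(pearson(predicted, observed))
    return scores


# Synthetic example: one variant coded as allele dosage (0, 1, 2) and one gene.
rng = random.Random(42)
dosage = [rng.choice([0, 1, 2]) for _ in range(200)]
expression = [0.8 * d + rng.gauss(0, 0.5) for d in dosage]

fold_scores = kfold_pcc(dosage, expression)
mean_pcc = sum(fold_scores) / len(fold_scores)
```

In a comparison of two models, the same fold assignment would typically be reused for both so that the per-fold PCC values are directly comparable.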

ORCID ID

0000-0003-4169-6684

Available for download on Friday, December 20, 2024
