Date of Award

Summer 8-2008

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Biological Sciences

Committee Chair

Mohammed Elasri

Committee Chair Department

Biological Sciences

Committee Member 2

Mac Alford

Committee Member 2 Department

Biological Sciences

Committee Member 3

Joe Zhang

Committee Member 3 Department

Biological Sciences

Committee Member 4

Jonathan Sun

Abstract

With more and more biological information being generated, the most pressing task of bioinformatics has become the analysis and interpretation of various types of data, including nucleotide and amino acid sequences, protein structures, and gene expression profiles. In this dissertation, we apply the data mining techniques of feature generation, feature selection, and feature integration, together with learning algorithms, to the problems of disease phenotype classification and of clinical outcome and patient survival prediction from gene expression profiles.

We analyzed the effect of batch noise in microarray data on classification performance. Batchmatch, a batch-adjustment algorithm based on a double scaling method, proved advantageous over ComBat, a batch-correction algorithm based on the empirical Bayes framework. To identify genes associated with disease phenotype classification or patient survival prediction from gene expression data, we compared and analyzed the performance of five feature selection algorithms. Our observations from these studies indicate that the Gainratio algorithm performs better and more consistently than the other algorithms studied.
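As an illustration of the kind of score a gain-ratio feature selector ranks genes by, the following is a minimal sketch using the textbook definition (information gain divided by split information). It is not the dissertation's implementation: the function names `entropy` and `gain_ratio` are hypothetical, and the sketch assumes each gene's expression values have already been discretized into bins such as "low" and "high".

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Textbook gain ratio of a discretized feature with respect to class labels.

    Assumption: `feature` holds already-binned expression values; continuous
    data would need a discretization step that is not shown here.
    """
    n = len(labels)
    # Group the class labels by feature value.
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    # Expected class entropy after splitting on the feature.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    # Split information penalizes features with many distinct values.
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: four samples, two classes.
labels = ["tumor", "tumor", "normal", "normal"]
informative_gene = ["high", "high", "low", "low"]    # separates the classes
uninformative_gene = ["high", "low", "high", "low"]  # independent of the class

print(gain_ratio(informative_gene, labels))
print(gain_ratio(uninformative_gene, labels))
```

Ranking genes by this score and keeping the highest-scoring ones is one common way such a selector is used; the number of genes to keep is a separate choice not addressed by this sketch.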

As a performance metric for choosing the best classifiers, the Matthews correlation coefficient (MCC) gives less biased results than accuracy in some endpoints where class imbalance is greater. Among the classification algorithms, no single algorithm is absolutely superior to all others, though SVM achieved fairly good results in most endpoints. The Naive Bayes algorithm also performed well in some endpoints. Overall, among the 65 models we reported (the top 5 models for each of 13 endpoints), SVM and SMO (a variant of SVM) dominate, and the linear kernel outperformed RBF in our binary classifications.
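The difference between accuracy and MCC under class imbalance can be seen in a small worked example. The sketch below uses the standard confusion-matrix definitions of both metrics; the endpoint with 5 positive and 95 negative samples is hypothetical and is not taken from the dissertation's data. Returning 0 when the MCC denominator is zero is a common convention, adopted here as an assumption.

```python
from math import sqrt


def accuracy(tp, tn, fp, fn):
    """Fraction of samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Assumption: returns 0.0 when the denominator is zero (a common convention).
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0


# Hypothetical imbalanced endpoint: 5 positives, 95 negatives.
# A classifier that predicts "negative" for every sample:
majority = dict(tp=0, tn=95, fp=0, fn=5)

print(accuracy(**majority))  # high, despite finding no positives
print(mcc(**majority))       # zero: no better than chance
```

Here the always-negative classifier reaches an accuracy of 0.95 while its MCC is 0, which is the kind of discrepancy that makes MCC the more informative metric when one class dominates.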
