Date of Award

Fall 12-1-2017

Degree Type

Dissertation

Degree Name

Doctor of Philosophy (PhD)

Department

Computing

School

Computing Sciences and Computer Engineering

Committee Chair

Chaoyang Zhang

Committee Chair Department

Computing

Committee Member 2

Ping Gong

Committee Member 2 Department

Computing

Committee Member 3

Alex Flynt

Committee Member 3 Department

Biological Sciences

Committee Member 4

Nan Wang

Committee Member 4 Department

Computing

Committee Member 5

Zheng Wang

Committee Member 5 Department

Computing

Committee Member 6

Wonryull Koh

Committee Member 6 Department

Computing

Abstract

Rapid advances in sequencing technologies and the wide availability of data, driven by the decreasing cost of next-generation sequencing (NGS), have given scientists the opportunity to address a wide variety of evolutionary and biological questions. NGS uses massively parallel technology to accelerate sequencing at the expense of accuracy and read length compared with the earlier Sanger method. Consequently, the amount of analysis and information that can be gleaned from the data is limited unless some form of error correction is performed.

The error correction process is laborious and consumes substantial computational resources. Despite the existence of many NGS error correction methods, the false positive rate of correction remains high, and the computational resources consumed have not declined even with improved algorithms. Many error correction algorithms still use a Bloom filter as their underlying data structure, and no comprehensive downstream analysis of a novel organism following error correction currently exists.

Because Illumina sequencing is the most widely used sequencing technique, this dissertation focuses mostly on correcting Illumina-based data. We first describe the characteristics of errors in NGS data and the algorithms implemented so far to mitigate them. We then present a methodology for investigating error correction across a range of both real and experimental NGS data, with specific attention to substitution, insertion, and deletion errors.

Second, a comprehensive comparative and statistical analysis of these error correction methods was conducted to discern the effects of NGS data properties, such as genome size, read length, and genome coverage depth, and of the correction algorithm on the number of errors that can be corrected. Based on the results of this investigation, we developed a web-based workflow called BECOW (Bioinformatics Error Correction Workflow), which allows error correction of NGS data over the internet without prior knowledge of the command line.

Third, a novel error correction algorithm, Cuckoo Filter-based Error Correction of Next-generation Data (CECOND), which uses a cuckoo filter as its underlying data structure, is introduced. A cuckoo filter is based on a cuckoo hash table and is used to dynamically test approximate set membership in O(1) time. Because it stores only item fingerprints, it uses space efficiently, reducing computational resource consumption. It also yields low false positive rates (>3%) after error correction, better than the >4% reported by existing methods.
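
To make the data structure concrete, the following is a minimal Python sketch of a generic cuckoo filter storing k-mers from a read. It is not the CECOND implementation: the class name `CuckooFilter`, the helper `kmers`, the 8-bit fingerprint, the bucket size of 4, and the use of SHA-256 are all illustrative assumptions. It only shows how fingerprints, two candidate buckets, and eviction ("kicking") give approximate membership tests that inspect a constant number of slots.

```python
import hashlib
import random


class CuckooFilter:
    """Illustrative cuckoo filter (assumed parameters, not CECOND's)."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500, seed=0):
        # A power-of-two bucket count keeps the XOR-based alternate index
        # in range and makes it its own inverse.
        if num_buckets < 1 or num_buckets & (num_buckets - 1):
            raise ValueError("num_buckets must be a power of two")
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self._rng = random.Random(seed)

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item: str) -> int:
        # Short fingerprint in 1..255; only this value is stored.
        return self._hash(b"fp" + item.encode()) % 255 + 1

    def _index(self, item: str) -> int:
        return self._hash(item.encode()) % self.num_buckets

    def _alt_index(self, index: int, fp: int) -> int:
        # Partial-key cuckoo hashing: the other bucket depends only on the
        # current bucket and the fingerprint, not on the original item.
        return (index ^ self._hash(fp.to_bytes(2, "big"))) % self.num_buckets

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self._rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self._rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # The filter is treated as full; the last evicted fingerprint is lost.
        return False

    def __contains__(self, item: str) -> bool:
        # At most two buckets are inspected, so lookups take constant time.
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def delete(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False


def kmers(read: str, k: int) -> list:
    """Return all overlapping substrings of length k in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]
```

A caller might insert every k-mer of trusted reads with `insert` and later test a candidate k-mer with `kmer in cuckoo_filter`. Unlike a standard Bloom filter, this structure also supports `delete`, at the cost of possible false positives when two different k-mers share a fingerprint and a bucket.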

Finally, error-corrected timber rattlesnake (Crotalus horridus) data was used to generate a de novo draft genome assembly, which was compared with assemblies generated using other methods. The comparison showed that error-corrected data is desirable for achieving a high-quality draft genome assembly.
