Date of Award
Fall 12-2017
Degree Type
Dissertation
Degree Name
Doctor of Philosophy (PhD)
Department
Computing
School
Computing Sciences and Computer Engineering
Committee Chair
Chaoyang Zhang
Committee Chair Department
Computing
Committee Member 2
Ping Gong
Committee Member 2 Department
Computing
Committee Member 3
Alex Flynt
Committee Member 3 Department
Biological Sciences
Committee Member 4
Nan Wang
Committee Member 4 Department
Computing
Committee Member 5
Zheng Wang
Committee Member 5 Department
Computing
Committee Member 6
Wonryull Koh
Committee Member 6 Department
Computing
Abstract
Rapid advances in sequencing technologies, together with the wide availability of data resulting from the decreasing cost of next-generation sequencing (NGS), have given scientists the opportunity to address a wide variety of evolutionary and biological questions. NGS uses massively parallel technology to accelerate sequencing at the expense of accuracy and read length compared with earlier Sanger methods. Consequently, there are limits on how much analysis and information can be gleaned from the data without some form of error correction.
The error correction process is laborious and consumes substantial computational resources. Despite the existence of many NGS error correction methods, the false positive rate of correction remains high, and the computational resources consumed have not declined even with improved algorithms. Many error correction algorithms still use a Bloom filter as their underlying data structure, and a comprehensive downstream analysis of a novel organism following error correction does not currently exist.
Because Illumina sequencing is the most popular and widely used sequencing technique, this dissertation focuses mostly on correcting Illumina-based data. We first describe the characteristics of errors in NGS data and the algorithms implemented so far to mitigate them. We present a methodology for investigating error correction across a range of both real and experimental NGS data, with specific attention to substitution, insertion, and deletion errors.
Second, we conducted a comprehensive comparative and statistical analysis of these error correction methods to discern how NGS data properties such as genome size, read length, and genome coverage depth, along with the choice of correction algorithm, affect the number of errors that can be corrected. Based on the results of this investigation, we developed a web-based workflow called BECOW, a Bioinformatics Error Correction Workflow, which allows error correction of NGS data over the internet without prior knowledge of the command line.
Third, we introduce a novel error correction algorithm, Cuckoo Filter-based Error Correction of Next-generation Data (CECOND), which uses a cuckoo filter as its underlying data structure. A cuckoo filter is built on a cuckoo hash table and supports dynamic approximate set-membership tests in O(1) time. Because it stores only item fingerprints, it uses space efficiently, which reduces computational resource consumption. It also yields low false positive rates after error correction (>3%, compared with the >4% reported by existing methods).
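To illustrate the data structure described above, the following is a minimal sketch of a generic cuckoo filter applied to k-mers. It is not the CECOND implementation: the class name `CuckooFilter`, the helper `kmers`, the 8-bit fingerprint, the SHA-256-based hashing, and all parameter defaults are assumptions chosen for illustration. Each item is reduced to a short fingerprint that can live in one of two candidate buckets, so a lookup inspects at most two fixed-size buckets regardless of how many items are stored.

```python
import hashlib
import random


class CuckooFilter:
    """Illustrative cuckoo filter storing 8-bit fingerprints (hypothetical parameters)."""

    def __init__(self, num_buckets=1024, bucket_size=4, max_kicks=500, seed=0):
        # A power-of-two bucket count keeps the XOR-based alternate index in range
        # and makes it reversible: alt(alt(i, fp), fp) == i.
        if num_buckets <= 0 or num_buckets & (num_buckets - 1):
            raise ValueError("num_buckets must be a power of two")
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self._rng = random.Random(seed)

    @staticmethod
    def _hash(data):
        # 64-bit integer derived from SHA-256 (an assumption; any good hash works).
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item):
        # Fingerprint in the range 1..255.
        return self._hash(b"fp:" + item.encode()) % 255 + 1

    def _index1(self, item):
        return self._hash(item.encode()) % self.num_buckets

    def _alt_index(self, index, fp):
        # Partial-key cuckoo hashing: the other bucket depends only on the
        # current bucket index and the fingerprint, not on the original item.
        return (index ^ self._hash(fp.to_bytes(2, "big"))) % self.num_buckets

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index1(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self._rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self._rng.randrange(len(self.buckets[i]))
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # The filter is treated as full; the last evicted fingerprint is dropped.
        return False

    def contains(self, item):
        # Checks at most two buckets of fixed size, hence constant time.
        fp = self._fingerprint(item)
        i1 = self._index1(item)
        i2 = self._alt_index(i1, fp)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item):
        # Unlike a standard Bloom filter, a stored fingerprint can be removed.
        fp = self._fingerprint(item)
        i1 = self._index1(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False


def kmers(read, k):
    """Yield every overlapping substring of length k from a read."""
    for start in range(len(read) - k + 1):
        yield read[start:start + k]


if __name__ == "__main__":
    cf = CuckooFilter(num_buckets=64)
    for kmer in kmers("ACGTACGTGA", 4):
        cf.insert(kmer)
    print(cf.contains("ACGT"))  # an inserted k-mer is always reported as present
```

A query for an item that was never inserted can still return true when its fingerprint collides with a stored one, which is the source of the filter's false positives. Longer fingerprints lower that rate at the cost of more space per item.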
Finally, error-corrected timber rattlesnake (Crotalus horridus) data was used to generate a de novo draft genome assembly, which was compared with assemblies generated using other methods. The comparison showed that error-corrected data is desirable for achieving a high-quality draft genome assembly.
Copyright
2017, Isaac Akogwu
Recommended Citation
Akogwu, Isaac, "Development, Evaluation, and Application of a Novel Error Correction Method for Next Generation Sequencing Data" (2017). Dissertations. 1453.
https://aquila.usm.edu/dissertations/1453
Included in
Bioinformatics Commons, Biotechnology Commons, Computational Biology Commons, Genomics Commons, Other Genetics and Genomics Commons, Systems Biology Commons