Factorial Analysis of Error Correction Performance Using Simulated Next-Generation Sequencing Data

Document Type

Conference Proceeding

Publication Date

1-17-2017

School

Computing Sciences and Computer Engineering

Abstract

© 2016 IEEE. Error correction is a critical initial step in next-generation sequencing (NGS) data analysis. Although more than 60 tools have been developed, there is no systematic, evidence-based comparison of their strengths and weaknesses, especially in terms of correction accuracy. Here we report a full factorial simulation study that examines how NGS dataset characteristics (genome size, coverage depth, and read length in particular) affect error correction performance (precision and F-score), and that compares the performance sensitivity/resistance of six k-mer spectrum-based methods to variations in dataset characteristics. Multi-way ANOVA tests indicate that the choice of correction method and the dataset characteristics had significant effects on performance metrics. Overall, BFC, Bless, Bloocoo, and Musket performed better than Lighter and Trowel on 27 synthetic datasets. For each chosen method, read length and coverage depth had a more pronounced impact on performance than genome size. This study sheds light on the performance behavior of error correction methods in response to the common variables one would encounter in real-world NGS datasets. It also warrants further studies with wet lab-generated experimental NGS data to validate the findings obtained from this simulation study.
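As an illustrative aside (not the paper's own evaluation code), the sketch below shows how the precision and F-score metrics named in the abstract are commonly computed for read error correction: a true positive is a sequencing error that was correctly fixed, a false positive is a correct base that was erroneously changed, and a false negative is an error left uncorrected. The function name and the counts in the usage example are hypothetical.

    def correction_metrics(tp, fp, fn):
        """Precision, recall, and F-score for read error correction.

        tp: sequencing errors correctly fixed
        fp: correct bases erroneously changed
        fn: sequencing errors left uncorrected
        """
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if (precision + recall) else 0.0)
        return precision, recall, f_score

    # Hypothetical counts for one synthetic dataset (illustration only).
    p, r, f = correction_metrics(tp=9500, fp=300, fn=500)
    print(f"precision={p:.3f} recall={r:.3f} F-score={f:.3f}")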

Publication Title

Proceedings - 2016 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2016

First Page

1164

Last Page

1169
