A statistical appraisal of disproportional versus proportional microbial source tracking libraries
Library-based microbial source tracking (MST) can assist in reducing or eliminating fecal pollution in waters by predicting sources of fecal-associated bacteria. Library-based MST relies on an assembly of genetic or phenotypic "fingerprints" from pollution-indicative bacteria cultivated from known sources to compare with and identify fingerprints of unknown origin. The success of the library-based approach depends on how well each source candidate is represented in the library and which statistical algorithm or matching criterion is used to match unknowns. Because known source libraries are often built based on convenience or cost, some library sources may contain more representation than others. Depending on the statistical algorithm or matching criteria, predictions may become severely biased toward classifying unknowns into the library's dominant source category. We examined prediction bias for four of the most commonly used statistical matching algorithms in library-based MST when applied to disproportionately-represented known source libraries; maximum similarity (MS), average similarity (AS), discriminant analyses (DA), and k-means nearest neighbor (k-NN). MS was particularly sensitive to disproportionate source representation. AS and DA were more robust. k-NN provided a compromise between correct prediction and sensitivity to disproportional libraries including increased matching success and stability that should be considered when matching to disproportionally-represented libraries.