Critical Feature Selection and Critical Sampling for Data Mining
Computing Sciences and Computer Engineering
© 2018, Springer Nature Singapore Pte Ltd. The rapidly growing volume of big data generated by connected sensors, devices, the web, and social network platforms has stimulated the advancement of data science, which holds tremendous potential for problem solving in various domains. How to properly utilize the data in model building to obtain accurate analytics and knowledge discovery is a topic of great importance in data mining, and two issues arise from it: how to select a critical subset of features and how to select a critical subset of data points for sampling. This paper presents ongoing research that suggests: 1. the critical feature dimension problem is theoretically intractable, but simple heuristic methods may well be sufficient for practical purposes; 2. there are big data analytic problems where evidence suggests that the success of data mining depends more on the critical feature dimension than on the specific features selected, so a random selection of features based on the dataset’s critical feature dimension will prove sufficient; and 3. the problem of critical sampling has the same intractable complexity as the critical feature dimension problem, but again simple heuristic methods may well be practicable in most applications. Experimental results with several versions of the heuristic method are presented and discussed. Finally, a set of metrics for data quality is proposed based on the concepts of critical features and critical sampling.
Communications in Computer and Information Science
(2018). Critical Feature Selection and Critical Sampling for Data Mining. Communications in Computer and Information Science, 844, 13-24.
Available at: https://aquila.usm.edu/fac_pubs/18176