Sampling and Evaluating the Big Data for Knowledge Discovery
Document Type
Conference Proceeding
Publication Date
1-1-2016
School
Computing Sciences and Computer Engineering
Abstract
The era of the Internet of Things and big data has seen individuals, businesses, and organizations increasingly rely on data for routine operations, decision making, intelligence gathering, and knowledge discovery. As big data is generated by all sorts of sources at accelerated velocity, in increasing volumes, and with unprecedented variety, it is also increasingly traded as a commodity in the new "data economy." With regard to data analytics for knowledge discovery, this raises the question, among others, of how much data is really necessary and/or sufficient to obtain analytic results that reasonably satisfy the requirements of an application. In this work-in-progress paper, we address the sampling problem in big data analytics and propose that (1) the problem of sampling big data for analytics is "hard": specifically, it is theoretically intractable when formal measures are incorporated into performance evaluation; therefore, (2) heuristic, rather than algorithmic, methods are necessarily needed for data sampling, and a plausible heuristic method is proposed; and (3) a measure of dataset quality is proposed to facilitate evaluating the worthiness of datasets with respect to model building and knowledge discovery in big data analytics.
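The abstract does not detail the proposed heuristic; as a point of reference only, the sketch below illustrates one common heuristic for deciding "how much data is enough": progressive sampling, which grows the sample until held-out model performance plateaus. This is a stand-in, not the authors' method, and all names, sizes, and thresholds here are hypothetical.

```python
# Illustrative sketch only: progressive sampling as one heuristic for
# "how much data is enough?". Not the method proposed in the paper;
# dataset, model, and thresholds are hypothetical placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-in for a large dataset that is too big to use in full.
X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

def progressive_sample(X_pool, y_pool, X_test, y_test,
                       start=500, growth=2.0, tol=0.005):
    """Grow the sample geometrically until held-out accuracy
    improves by less than tol (a plateau heuristic)."""
    n, prev_acc = start, -np.inf
    while n <= len(X_pool):
        # Draw a random sample of size n from the pool.
        idx = rng.choice(len(X_pool), size=n, replace=False)
        model = LogisticRegression(max_iter=1000).fit(X_pool[idx], y_pool[idx])
        acc = accuracy_score(y_test, model.predict(X_test))
        if acc - prev_acc < tol:  # plateau: more data buys little
            return idx, acc
        prev_acc, n = acc, int(n * growth)
    return idx, acc

sample_idx, acc = progressive_sample(X_pool, y_pool, X_test, y_test)
print(f"stopped at {len(sample_idx)} samples, test accuracy {acc:.3f}")
```

The stopping rule here is the heuristic element: since no tractable formal criterion is assumed to be available (consistent with claim (1) in the abstract), the sample is deemed sufficient when additional data stops improving the model measurably.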
Publication Title
IoTBD 2016 - Proceedings of the International Conference on Internet of Things and Big Data
First Page
378
Last Page
382
Recommended Citation
Sung, A., Ribeiro, B., & Liu, Q. (2016). Sampling and Evaluating the Big Data for Knowledge Discovery. IoTBD 2016 - Proceedings of the International Conference on Internet of Things and Big Data, 378-382.
Available at: https://aquila.usm.edu/fac_pubs/19738