Sampling and Evaluating the Big Data for Knowledge Discovery

Document Type

Conference Proceeding

Publication Date

1-1-2016

School

Computing Sciences and Computer Engineering

Abstract

The era of the Internet of Things and big data has seen individuals, businesses, and organizations increasingly rely on data for routine operations, decision making, intelligence gathering, and knowledge discovery. As big data is generated by all sorts of sources at accelerated velocity, in increasing volumes, and with unprecedented variety, it is also increasingly traded as a commodity in the new "data economy." With regard to data analytics for knowledge discovery, this raises the question, among others, of how much data is necessary and/or sufficient to obtain analytic results that reasonably satisfy the requirements of an application. In this work-in-progress paper, we address the sampling problem in big data analytics and propose that (1) the problem of sampling big data for analytics is "hard"; specifically, it is theoretically intractable when formal measures are incorporated into performance evaluation; (2) heuristic, rather than algorithmic, methods are therefore needed for data sampling, and a plausible heuristic method is proposed; and (3) a measure of dataset quality is proposed to facilitate evaluating the worthiness of datasets with respect to model building and knowledge discovery in big data analytics.
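
To give a concrete sense of the kind of heuristic sampling the abstract alludes to, the following is a minimal progressive-sampling sketch in Python. It is not the method proposed in the paper: the function name `progressive_sample`, the least-squares model, and the R²-based stopping rule are all hypothetical choices, shown only to illustrate growing a sample until a hold-out quality score stops improving.

```python
import numpy as np

def progressive_sample(X, y, start=500, growth=2.0, tol=1e-3, seed=0):
    """Grow a random sample until a hold-out quality score stops improving.

    Generic progressive-sampling illustration (an assumption, not the
    paper's method): the quality score is R^2 of an ordinary least-squares
    fit evaluated on a fixed hold-out set.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    # Hold out 20% of the rows once, so successive samples are scored consistently.
    holdout = rng.choice(n, size=n // 5, replace=False)
    pool = np.setdiff1d(np.arange(n), holdout)
    X_hold, y_hold = X[holdout], y[holdout]

    prev_score, size = -np.inf, start
    while size <= len(pool):
        idx = rng.choice(pool, size=size, replace=False)
        # Least-squares fit on the current sample (with an intercept column).
        A = np.column_stack([np.ones(size), X[idx]])
        coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
        pred = np.column_stack([np.ones(len(X_hold)), X_hold]) @ coef
        ss_res = np.sum((y_hold - pred) ** 2)
        ss_tot = np.sum((y_hold - y_hold.mean()) ** 2)
        score = 1.0 - ss_res / ss_tot
        if score - prev_score < tol:   # negligible gain from more data: stop
            return idx, score
        prev_score, size = score, int(size * growth)
    return pool, prev_score

# Example usage on synthetic data.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100_000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=100_000)
    sample_idx, r2 = progressive_sample(X, y)
    print(f"sample size: {len(sample_idx)}, hold-out R^2: {r2:.4f}")
```

In this sketch the hold-out R² plays the role of a dataset-quality measure: sampling stops once additional data no longer improves the fitted model appreciably.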

Publication Title

IoTBD 2016 - Proceedings of the International Conference on Internet of Things and Big Data

First Page

378

Last Page

382
