Finding the Critical Sampling of Big Datasets

Document Type

Conference Proceeding

Publication Date

5-15-2017

School

Computing Sciences and Computer Engineering

Abstract

© 2017 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery. Big Data, allied to the Internet of Things, provides a powerful resource that organizations increasingly exploit for applications ranging from decision support and predictive and prescriptive analytics to knowledge extraction and intelligence discovery. In analytics and data mining it is usually desirable to have as much data as possible, but it is often more important that the data be of high quality; this raises two of the most important problems in handling large datasets: sampling and feature selection. This paper addresses the sampling problem and presents a heuristic method to find the "critical sampling" of big datasets. The critical sampling size of a dataset D is the minimum number of samples of D required for a given data analytic task to achieve satisfactory performance. The problem is important in data mining because the size of a dataset directly determines the cost of executing the mining task. Since determining the critical sampling size exactly is intractable, we study heuristic methods to find the critical sampling. Experiments were conducted on several datasets using three versions of the heuristic sampling method. Preliminary results show an apparent critical sampling size for every dataset tested, generally much smaller than the size of the whole dataset. Further, the proposed heuristic method provides a practical way to find a useful critical sampling for data mining tasks.
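To make the abstract's notion concrete, the following is a minimal sketch of the critical-sampling idea, not the paper's actual heuristic: grow a random sample of the dataset until a task's held-out performance reaches a satisfactory level, and report the smallest size that sufficed. The function name `critical_sampling_size`, the decision-tree task, the step schedule, and the 0.95 accuracy target are all illustrative assumptions.

```python
# Sketch: find the smallest tested sample size of (X, y) whose held-out
# accuracy reaches `target`. The growth schedule, tolerance, and classifier
# are illustrative choices, not the authors' exact method.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def critical_sampling_size(X, y, target=0.95, step=500, seed=0):
    """Return (sample_size, score) for the smallest tested random sample
    whose held-out accuracy reaches `target`, or the full size otherwise."""
    rng = np.random.default_rng(seed)
    n = len(X)
    score = 0.0
    for size in range(step, n + 1, step):
        idx = rng.choice(n, size=size, replace=False)   # random sample of D
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[idx], y[idx], test_size=0.3, random_state=seed)
        model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
        score = accuracy_score(y_te, model.predict(X_te))
        if score >= target:                             # satisfactory performance
            return size, score
    return n, score
```

In this sketch the returned size plays the role of the critical sampling size: beyond it, adding more data would not be needed for the task to perform satisfactorily, which is why such a size can be much smaller than the whole dataset.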

Publication Title

ACM International Conference on Computing Frontiers 2017, CF 2017

First Page

355

Last Page

360
