DATA PREPARATION ON LARGE DATASETS FOR DATA SCIENCE

Darshan Barapatre; Vijayalakshmi A

doi:10.22159/ajpcr.2017.v10s1.20526

Authors

Darshan Barapatre, School of Computing Sciences and Engineering, VIT University, Chennai, India.
Vijayalakshmi A, School of Computing Sciences and Engineering, VIT University, Chennai, India.

Keywords:

Data preparation, Data mining, Machine learning, MapReduce, Spark, Apache Pig, Apache Oozie

Abstract

According to interviews and expert estimates, data scientists spend 50-80% of their time on the mundane task of collecting and preparing structured or unstructured data before it can be explored for useful analysis. Restructuring and refining the data into more meaningful datasets that can be used for further analytics is therefore very valuable to a data scientist. Hence, the idea is to build a tool that contains all the required data preparation techniques for making data well structured, while providing greater flexibility and an easy-to-use UI. The tool will include techniques for data cleaning, data structuring, data transformation, data compression, and data profiling, along with implementations of related machine learning algorithms.
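
As a rough illustration of three of the preparation steps named above (cleaning, transformation, and profiling), the following minimal sketch uses plain Python on a tiny in-memory dataset. It is not the authors' tool, which the keywords suggest targets MapReduce and Spark. The function names (`clean`, `to_number`, `profile`), the list of missing-value tokens, and the sample records are all assumptions made for illustration.

```python
# Hypothetical set of tokens treated as "missing"; adjust for real data.
MISSING = {"", "na", "n/a", "null", "none"}


def clean(rows):
    """Trim whitespace, lower-case column names, map missing-value
    tokens to None, and drop exact duplicate records."""
    seen, out = set(), []
    for row in rows:
        cleaned = {}
        for key, value in row.items():
            text = str(value).strip() if value is not None else ""
            cleaned[key.strip().lower()] = (
                None if text.lower() in MISSING else text
            )
        # Sort by column name only, so None and str values are never compared.
        fingerprint = tuple(sorted(cleaned.items(), key=lambda kv: kv[0]))
        if fingerprint not in seen:
            seen.add(fingerprint)
            out.append(cleaned)
    return out


def to_number(rows, column):
    """Cast one column to float; unparseable or missing values become None."""
    out = []
    for row in rows:
        new_row = dict(row)
        try:
            new_row[column] = float(row.get(column))
        except (TypeError, ValueError):
            new_row[column] = None
        out.append(new_row)
    return out


def profile(rows):
    """Per column: number of rows, missing values, and distinct present values."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        report[col] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
        }
    return report


# Made-up sample records with stray whitespace, a missing-value token,
# a duplicate, and a non-numeric age.
raw = [
    {"Name": " Alice ", "Age": "34"},
    {"Name": "Bob", "Age": "N/A"},
    {"Name": "Alice", "Age": "34 "},
    {"Name": "Carol", "Age": "forty"},
]

prepared = to_number(clean(raw), "age")
summary = profile(prepared)
```

Here `prepared` holds the deduplicated records with a numeric `age` column, and `summary` reports how many values per column are missing or distinct. On large datasets the same steps would typically be expressed as distributed operations rather than Python loops.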
