Data Science

Data Preparation

What is Data Preparation?

Data preparation is the processing of raw data into a refined state in which the information is ready for further analysis. It is an extremely important stage, because the accuracy of every later result depends on it. It is also one of the most time-consuming stages of machine learning and business analysis.

There are four parts of data preparation: collection, cleansing, labelling, and visualising.

Collection

The first stage deals with collecting data. Data is everywhere, and it is often stored in different locations (sources) and in different formats. Sources can include personal laptops, cloud storage, data warehouses and data lakes, applications, devices, and many more.
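
As a rough illustration of collecting data from different sources and formats, the sketch below reads one CSV source and one JSON source into a single list of records. The inline strings, the field names (name, age), and the collect function are assumptions made for this example; in practice the sources would be files, databases, or applications.

```python
import csv
import io
import json

# Hypothetical sources held as inline text so the example is self-contained.
CSV_SOURCE = "name,age\nAda,36\nGrace,45\n"
JSON_SOURCE = '[{"name": "Alan", "age": 41}]'


def collect(csv_text, json_text):
    """Read records from a CSV source and a JSON source into one list of dicts."""
    records = []
    # CSV values arrive as strings, so the age is converted to an integer.
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append({"name": row["name"], "age": int(row["age"]), "source": "csv"})
    # JSON values are already typed, but int() keeps both sources consistent.
    for row in json.loads(json_text):
        records.append({"name": row["name"], "age": int(row["age"]), "source": "json"})
    return records


records = collect(CSV_SOURCE, JSON_SOURCE)
for record in records:
    print(record)
```

Keeping a source field on each record is one possible design choice; it makes it easier to trace a problem back to where the data came from.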

Cleansing

The second part of data preparation focuses on data quality. This process handles errors, missing values, and inconsistent formats: fixing spelling errors, removing or filling missing values, and transforming values into readable and consistent formats (such as dates, measuring units, and currencies).
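
The three cleansing tasks named above can be sketched on a tiny made-up dataset. Everything here is an assumption for illustration: the records, the lookup table of known misspellings, the two accepted date formats, and the choice to fill a missing temperature with the mean of the known values (removing the row would be an equally valid choice).

```python
from datetime import datetime

# Hypothetical raw records: one misspelt city, one missing value, mixed date formats.
raw = [
    {"city": "London", "temp_c": "18.5", "date": "2023-03-01"},
    {"city": "Lodnon", "temp_c": "", "date": "01/03/2023"},
    {"city": "Paris", "temp_c": "21.5", "date": "2023-03-02"},
]

SPELLING_FIXES = {"Lodnon": "London"}    # assumed table of known misspellings
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input formats


def parse_date(text):
    """Try each known format and return the date as an ISO 8601 string."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")


def cleanse(rows):
    """Fix spelling, fill missing temperatures with the mean, and unify dates."""
    known = [float(r["temp_c"]) for r in rows if r["temp_c"] != ""]
    mean_temp = sum(known) / len(known) if known else None
    cleaned = []
    for r in rows:
        cleaned.append({
            "city": SPELLING_FIXES.get(r["city"], r["city"]),
            "temp_c": float(r["temp_c"]) if r["temp_c"] != "" else mean_temp,
            "date": parse_date(r["date"]),
        })
    return cleaned


clean = cleanse(raw)
for row in clean:
    print(row)
```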

Labelling

Labelling is the act of identifying relevant information (labels) and attaching it to the data. The additional information gives the data points valuable context, which helps with further analysis or modelling on the dataset.
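
A minimal sketch of attaching labels to data points is shown below. Labelling is often done by people; the keyword rules here are only a stand-in, and the label names, keyword lists, and review texts are all hypothetical.

```python
# Assumed keyword lists; a real project would use human annotators or a richer rule set.
POSITIVE_WORDS = ("great", "love", "excellent")
NEGATIVE_WORDS = ("bad", "poor", "terrible")


def label_review(text):
    """Return a sentiment label based on simple substring matching."""
    lowered = text.lower()
    if any(word in lowered for word in POSITIVE_WORDS):
        return "positive"
    if any(word in lowered for word in NEGATIVE_WORDS):
        return "negative"
    # No rule matched, so the data point is left for manual review.
    return "unlabelled"


reviews = ["Great battery life", "Terrible screen", "Arrived on Tuesday"]
labelled = [{"text": text, "label": label_review(text)} for text in reviews]
for item in labelled:
    print(item)
```

Substring matching is crude (for example, "bad" would also match "badge"), so this is only meant to show the shape of the result: each data point paired with a label that later analysis or modelling can use.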

Visualising

The final part is visualisation, in which the data is validated: it is inspected to decide whether it is correct and ready for analysis. Visualisation tools help by showing the data from a different perspective. Common visualisation tools are histograms, pie charts, scatter plots, bar charts, boxplots, and line plots.
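
To show how a histogram can support validation, the sketch below counts values per bin and prints a simple text histogram. It uses only the standard library rather than a plotting package, and the ages list (including the deliberate data-entry error of 250) is made up for this example.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count integer values per bin; each bin covers start to start + bin_width - 1."""
    counts = Counter((value // bin_width) * bin_width for value in values)
    return dict(sorted(counts.items()))


def render(counts, bin_width):
    """Draw one row of '#' characters per non-empty bin."""
    rows = []
    for start, n in counts.items():
        rows.append(f"{start:>3}-{start + bin_width - 1:<3} {'#' * n}")
    return "\n".join(rows)


# Hypothetical ages; 250 is a deliberate data-entry error.
ages = [23, 25, 31, 38, 35, 41, 29, 250]
counts = histogram(ages, 10)
print(render(counts, 10))
```

The isolated bar for the 250-259 bin stands apart from the rest of the distribution, which is the kind of signal that suggests the value should be checked before the data is treated as ready.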

