Data preprocessing is one of the most fundamental and important parts of machine learning, and often one of the most time-consuming. The data we usually get is raw and cannot be used directly to train a model; we need to preprocess it so the model can predict the output accurately.
Real-world data is generally noisy (it contains errors and outliers), inconsistent, and incomplete. For these reasons, data preprocessing is an important part of machine learning.
Types of data preprocessing:
1. Data cleaning
2. Data reduction
Data cleaning:
1. Fill in missing values:
a. Ignore the tuple.
b. Use the mean of the attribute to fill in the missing value.
c. Predict the missing values using various learning algorithms.
2. Smooth out noisy data (using binning). For example, partition the sorted values 3, 6, 15, 19, 20, 24, 29, 34, 42 into equal-depth bins:
Bin1: 3, 6, 15
Bin2: 19, 20, 24
Bin3: 29, 34, 42
Smoothing by bin means (replace each value with its bin's mean): 8, 8, 8; 21, 21, 21; 35, 35, 35.
Smoothing by bin boundaries (replace each value with the nearest of its bin's minimum and maximum): 3, 3, 15; 19, 19, 24; 29, 29, 42.
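The two smoothing schemes can be sketched in plain Python, using the bins above:

```python
# Equal-depth binning followed by two smoothing schemes.
data = [3, 6, 15, 19, 20, 24, 29, 34, 42]
depth = 3
bins = [data[i:i + depth] for i in range(0, len(data), depth)]

# Smoothing by bin means: replace every value with its bin's mean.
by_means = [[sum(b) // len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with the
# nearest of the bin's minimum and maximum.
by_boundaries = [
    [min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
    for b in bins
]

print(by_means)       # [[8, 8, 8], [21, 21, 21], [35, 35, 35]]
print(by_boundaries)  # [[3, 3, 15], [19, 19, 24], [29, 29, 42]]
```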
3. Remove outliers, for example by hypothesis testing given a model of the data.
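One common hypothesis-testing-style approach is the z-score test: assume the data are roughly normal and flag values far from the mean as outliers. The threshold below is a convention, not a fixed rule; a single extreme value inflates the standard deviation, so a tighter threshold is used here for illustration:

```python
import statistics

def remove_outliers(values, z_threshold=3.0):
    # Keep only values within z_threshold standard deviations of the mean.
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= z_threshold * stdev]

data = [3, 6, 15, 19, 20, 24, 29, 34, 42, 500]  # 500 looks like an outlier
print(remove_outliers(data, z_threshold=2.0))   # the extreme value 500 is dropped
```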
In Python, we can use the Pandas library for this kind of data manipulation.
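As a minimal sketch, the first two missing-value options map directly onto Pandas calls (the column names here are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [22, np.nan, 35, 41],
                   "salary": [30000, 42000, np.nan, 56000]})

# a. Ignore the tuple: drop rows that contain any missing value.
dropped = df.dropna()

# b. Use the column mean to fill in the missing value.
filled = df.fillna(df.mean())

print(filled)
```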
Data reduction:
Sometimes the data we get is larger than required, so we need to apply different techniques to reduce its size.
1. Reducing the number of attributes:
a. Data cube aggregation: apply roll-up, slice, and dice operations.
b. Remove irrelevant attributes.
2. Reducing the number of attribute values:
a. Clustering: group values into clusters.
b. Aggregation or generalization.
3. Reducing the number of tuples.
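Two of these reduction techniques can be sketched with Pandas; the table and column names ("store", "month", "sales", "clerk_name") are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "sales": [100, 120, 90, 110],
    "clerk_name": ["x", "y", "z", "w"],
})

# 1b. Remove an attribute judged irrelevant to the task.
reduced = df.drop(columns=["clerk_name"])

# Aggregation (a roll-up): fewer tuples, one row per store.
rolled_up = reduced.groupby("store", as_index=False)["sales"].sum()
print(rolled_up)
```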
Once we have clean data with concise attributes, it is ready to be used directly in a machine learning model.
As you can now see, the performance, accuracy, and generalization of a machine learning model depend heavily on the input data. As discussed at the beginning, data preprocessing is very important and should not be overlooked when building a machine learning program.