Overfitting and Underfitting With Machine Learning Algorithms

Posted By : Pawan Sunar | 27-Aug-2019


 

A model is considered a good machine learning model if it generalizes well to new input data from the problem domain. This lets us make accurate predictions on future data that the model has never seen.

Now, suppose we want to examine how well our machine learning model learns and generalizes to new data. The two concepts that describe failures to do so are overfitting and underfitting, which are the major causes of poor performance in machine learning algorithms.

 

Underfitting:


A statistical model or a machine learning algorithm is said to be underfitting when it cannot capture the underlying trend of the data. Underfitting degrades the accuracy of our machine learning model; its occurrence simply means that the model does not fit the data well enough. It usually happens when we have too little data to build an accurate model, or when we try to fit a linear model to non-linear data. In such cases, the model is too simple to capture the structure of the data, so it is likely to make many wrong predictions. Underfitting can be avoided by gathering more data, adding relevant features, or choosing a more expressive model.
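To make this concrete, here is a small illustrative sketch (using NumPy, with made-up quadratic data) of a linear model underfitting a non-linear trend, while a model of matching complexity fits it well:

```python
import numpy as np

# Made-up non-linear data: a quadratic trend with a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = x**2 + rng.normal(scale=0.1, size=x.size)

def mse(degree):
    """Fit a polynomial of the given degree and return its training MSE."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

print(f"degree 1 (underfit): MSE {mse(1):.3f}")  # large: a line cannot follow the curve
print(f"degree 2 (good fit): MSE {mse(2):.3f}")  # small: matches the true trend
```

No matter how well the straight line is fit, it cannot follow the curve, so its training error stays high — the signature of underfitting.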


Overfitting:


A statistical model is said to be overfitted when it fits the training data too closely. Trained this way, the model starts learning from the noise and inaccurate data entries in our dataset, and then fails to categorize new data properly because it has absorbed too much detail and noise. Overfitting occurs more readily with non-parametric and non-linear methods, because these kinds of machine learning algorithms have more freedom in building the model from the dataset and can therefore build quite unrealistic models. Ways to avoid overfitting include using a linear algorithm if we have linear data, or constraining the model with parameters such as the maximal depth if we are using decision trees.
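As an illustration (a sketch with synthetic noisy data, not a definitive recipe), a high-degree polynomial has enough freedom to memorize the noise in a small training set: its training error collapses while its error on fresh data from the same source grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n=20):
    # Hypothetical noisy non-linear data the model must learn.
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = make_data()
x_test, y_test = make_data()

def train_test_mse(degree):
    """Fit a polynomial on the training split; report MSE on both splits."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train, test

# A degree-15 polynomial memorizes the 20 noisy points: training error
# collapses while test error grows relative to a modest degree-3 fit.
for degree in (3, 15):
    train, test = train_test_mse(degree)
    print(f"degree {degree:2d}: train MSE {train:.4f}, test MSE {test:.4f}")
```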

 

How to avoid Overfitting:

 

The commonly used methodologies are:

Cross-Validation: A standard way to estimate out-of-sample prediction error is k-fold cross-validation, with 5 folds being a common choice.
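A minimal sketch of k-fold cross-validation, assuming NumPy and a simple polynomial model (the `k_fold_mse` helper and the linear data below are made up for illustration):

```python
import numpy as np

def k_fold_mse(x, y, degree, k=5, seed=0):
    """Estimate out-of-sample MSE of a polynomial fit via k-fold CV:
    each fold is held out once while the model trains on the rest."""
    idx = np.random.default_rng(seed).permutation(x.size)
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        held_out = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)
        preds = np.polyval(coeffs, x[held_out])
        errors.append(np.mean((preds - y[held_out]) ** 2))
    return float(np.mean(errors))

# Hypothetical linear data: the CV estimate reflects generalization error,
# not training error, so it can be used to compare model complexities.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
y = 2 * x + rng.normal(scale=0.1, size=100)
print(f"degree 1 CV MSE: {k_fold_mse(x, y, 1):.4f}")
print(f"degree 9 CV MSE: {k_fold_mse(x, y, 9):.4f}")
```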

Early Stopping: Early stopping rules tell us how many iterations can be run before the learner begins to overfit, and halt training at that point.
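The idea can be sketched as a simple rule applied to a recorded validation-loss curve (the `early_stopping` helper and the loss values below are hypothetical):

```python
def early_stopping(val_losses, patience=3):
    """Return the index of the best iteration: scan the validation-loss
    curve and stop once the loss has failed to improve for `patience` steps."""
    best, best_i, wait = float("inf"), 0, 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_i, wait = loss, i, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_i

# Hypothetical validation-loss curve: it improves, then overfitting sets in.
losses = [1.0, 0.7, 0.5, 0.45, 0.46, 0.48, 0.55, 0.60]
print(early_stopping(losses))  # → 3, the iteration where validation loss bottomed out
```

The `patience` parameter trades robustness to noisy loss curves against how long training continues past the true minimum.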

Pruning: Pruning is a technique associated with decision trees and is used extensively when building tree-based models. It simply removes nodes that add little predictive power for classifying instances.

Regularization: It penalizes large weights attached to a feature or a neuron so that the machine learning algorithm does not rely on just a few features or neurons to predict the result.
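A compact sketch of one common form, L2 regularization (ridge regression, written in closed form with NumPy; the data and `alpha` values are illustrative):

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: an L2 penalty on the weights shrinks
    them toward zero, so no single feature can dominate the prediction.
    alpha = 0 recovers ordinary least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Illustrative data: a few informative features plus observation noise.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 3.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=50)

w_ols = ridge_fit(X, y, alpha=0.0)
w_reg = ridge_fit(X, y, alpha=10.0)
print("OLS weight norm:  ", np.linalg.norm(w_ols))
print("Ridge weight norm:", np.linalg.norm(w_reg))  # smaller: weights are shrunk
```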

 

How to tackle Underfitting:

Although underfitting is observed less often than overfitting in machine learning models, it shouldn't be overlooked.

The following ways can be used to tackle underfitting:

  • Increase the number or range of parameters in the machine learning model.
  • Increase the complexity or change the type of the model.
  • Increase the training time, until the cost function in machine learning is minimized.
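The third point can be sketched with plain gradient descent on a least-squares cost (synthetic data; the learning rate and iteration counts are illustrative): stopping after a handful of iterations leaves the cost far from its minimum, i.e. the model underfits:

```python
import numpy as np

# Synthetic linear data; the cost can be driven close to the noise floor.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

def cost_after(iters, lr=0.01):
    """Run `iters` steps of gradient descent on the least-squares cost
    and return the final mean squared error on the training data."""
    w = np.zeros(3)
    for _ in range(iters):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return float(np.mean((X @ w - y) ** 2))

print(f"cost after   5 iterations: {cost_after(5):.4f}")    # stopped too soon: underfit
print(f"cost after 500 iterations: {cost_after(500):.4f}")  # near the noise floor
```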


Good Fit in a Statistical Model:

Ideally, a model that makes predictions with zero error is said to have a good fit on the data. This is achievable at a point between overfitting and underfitting. To find it, we need to examine the performance of our model over time as it learns from the training dataset.

As training continues, our model keeps learning and the error on both the training and testing data keeps decreasing. If it learns for too long, the model becomes more prone to overfitting because of the presence of noise and less useful details, and its performance decreases. To get a good fit, we stop at a point just before the error on the testing data starts increasing. At this point, the model performs well on the training dataset as well as on the testing dataset.

 

