Evaluating a machine learning model in Python

Posted By : Mohit Sharma | 10-Sep-2018

Hi guys, as we all know, machine learning is a trending technology with wide applications in business and education.
Here is a simple example of how to write and evaluate a simple classifier that classifies flower species.

The dataset we are going to use is the Iris dataset, and our programming language will be Python.
Accuracy is a crucial factor in model evaluation, because it helps us choose a specific model for a specific type of problem. In my previous post I used the Iris dataset, so we'll continue with it.

Evaluation matters because in machine learning our goal is always to create a model that generalizes, that is, one that performs well on data it has not seen before.
Link to the dataset: https://archive.ics.uci.edu/ml/datasets/iris

Now we'll write our Python code.

# import Iris dataset from sklearn 
from sklearn.datasets import load_iris
# import KNN classifier
from sklearn.neighbors import KNeighborsClassifier
# import model_selection to split dataset into multiple parts
from sklearn.model_selection import train_test_split
# import LogisticRegression from linear_model
from sklearn.linear_model import LogisticRegression
# import metrics from sklearn
from sklearn import metrics
# import accuracy_score metrics from sklearn
from sklearn.metrics import accuracy_score
# import Matplotlib (scientific plotting library)
import matplotlib.pyplot as plt



# load the Iris dataset into a Python variable
iris = load_iris()

# print dataset related information (number of samples, number of features)
print(iris.data.shape)

# split the data into two sets to guard against overfitting, i.e. learning noise in the data
# Our goal in ML is always to build a model that generalizes
# 1. train split
# 2. test split
X_data = iris.data
y_target = iris.target
# hold out 25% of the samples for testing (a 1:3 test-to-train ratio)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_target, test_size=0.25, random_state=3)

# Check the shapes of both the train and test datasets
#print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)


# Now let's try out different types of models and
# evaluate them by their training accuracy


#1) KNN (K Nearest Neighbors): n_neighbors ranges from 1 to 25
k_range = list(range(1, 26))
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Train knn on our training dataset
    knn.fit(X_train, y_train)
    # Predict on the training data and record the training accuracy for this k
    y_knn = knn.predict(X_train)
    scores.append(metrics.accuracy_score(y_train, y_knn))


    
#2) Logistic Regression
logReg = LogisticRegression()
# Train LogReg on training dataset
logReg.fit(X_train,y_train)


# Let's measure the training accuracy for both models
y_logReg = logReg.predict(X_train)
print("Logistic Reg Accuracy: ", metrics.accuracy_score(y_train, y_logReg))

# knn is the last model fitted in the loop above (n_neighbors=25)
y_knn = knn.predict(X_train)
print("KNN Accuracy: ", metrics.accuracy_score(y_train, y_knn))


# plot the relationship between K and training accuracy
# (scores were computed on X_train, so this is training accuracy)
plt.plot(k_range, scores)
plt.xlabel('Value of K for KNN')
plt.ylabel('Training Accuracy')
plt.show()

After the evaluation, you'll find that a model's performance varies with the dataset and the algorithm you use.
In our case, each value of K produces a different accuracy, and logistic regression produces yet another.
Our goal is to build a model that generalizes, not one that is overtrained. Sometimes a model can't differentiate between the actual signal and the noise in the data. This is over-fitting, which is the opposite of generalizing.
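One way to see over-fitting is to compare accuracy on the training data with accuracy on data the model has not seen. The snippet below is only a minimal sketch of that idea, not part of the script above: it assumes a 25% test split (test_size=0.25) and uses the classifier's score() method, which returns mean accuracy. The names train_scores, test_scores and best_k are made up for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# Assumption: hold out 25% of the samples as unseen test data
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=3
)

k_range = list(range(1, 26))
train_scores = []  # accuracy on the data the model was fitted on
test_scores = []   # accuracy on the held-out data

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

# Pick the K with the highest test accuracy (the first one if there is a tie)
best_k = k_range[test_scores.index(max(test_scores))]
print("K=1 training accuracy:", train_scores[0])
print("K=1 testing accuracy: ", test_scores[0])
print("Best K on the test split:", best_k)
```

With K=1 every training point is its own nearest neighbour, so the training accuracy looks perfect even though the model has simply memorised the data. The testing accuracy is the number that tells you how well the model generalizes.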
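To detect this problem we use a train and test split: we split the data in a 1:3 ratio (test to train, a common default), train on the larger part, and measure accuracy on the part the model has not seen.
The ratio can be tuned according to your requirements and by examining the resulting performance.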
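As a rough sketch of what tuning the ratio could look like, the snippet below refits logistic regression with a few different test sizes and prints the accuracy on the held-out part each time. The candidate sizes, the results dictionary and max_iter=1000 (set only to give the solver room to converge) are assumptions made for this illustration, not values from the script above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()

results = {}  # maps test_size -> accuracy on the held-out test split
for test_size in (0.2, 0.25, 0.3, 0.5):
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=test_size, random_state=3
    )
    # Assumption: a generous iteration limit so the solver converges
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    results[test_size] = model.score(X_test, y_test)

for test_size, accuracy in results.items():
    print(f"test_size={test_size}: testing accuracy = {accuracy:.3f}")
```

On a dataset as small as Iris the numbers will move around with the split, so treat them as a guide rather than a final answer.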

Making the above-mentioned decisions requires observation, and you need to know your data.