Friday, January 27, 2017

While training a model on your data set, there are two failure modes to watch for: the model may learn the details and noise in the training data to such an extent that it stops generalizing to new data, or it may fail to capture the structure of the training data at all. Both situations hurt the model's ability to generalize to unseen data, and they are called overfitting and underfitting. Let's look at each of them in detail.

A. Overfitting: A hypothesis h is said to overfit the training data if there is another hypothesis h′ such that h has smaller error than h′ on the training data, but h′ has smaller error than h on the test data.
  Overfitting can result in a tree that is more complex than necessary, and the training error then no longer provides a good estimate of how well the tree will perform on previously unseen records. It happens when the model captures the idiosyncrasies of the training data rather than its generalities, which is typically caused by having too many parameters relative to the amount of training data (a small scikit-learn sketch below shows the symptom).

To avoid this condition we do pruning, which reduces the size of the decision tree by removing sections of the tree that provide little or insignificant predictive power. Pruning reduces the complexity of the final classifier and thereby improves its predictive accuracy.
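To make the symptom concrete, here is a minimal sketch using scikit-learn on a synthetic, deliberately noisy dataset (the dataset and all parameter values are illustrative choices of mine, not part of any standard recipe). The unconstrained tree memorizes the noise and scores far better on the training set than on the test set:

# Overfitting demo: an unconstrained tree vs. a size-limited tree.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 deliberately injects label noise for the tree to memorize.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree grows until every training example is classified,
# memorizing the noise: near-perfect train accuracy, much lower test accuracy.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("deep tree  train:", deep.score(X_train, y_train),
      "test:", deep.score(X_test, y_test))

# A size-limited tree fits the training data less perfectly but
# typically generalizes better.
small = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
print("small tree train:", small.score(X_train, y_train),
      "test:", small.score(X_test, y_test))

Limiting the tree's size by hand is a crude fix; the pruning methods below formalize the idea.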

   There are methods by which we can evaluate which subtrees to prune:

1. Minimum Description Length (MDL): here we minimize the combined cost of encoding the tree itself and encoding the training examples it misclassifies, so a larger tree is only kept if its extra complexity pays for itself in fewer errors (a toy sketch of this trade-off follows the list).

2. Cross validation: here the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data; this is repeated so that each subsample serves as validation data exactly once (see the scikit-learn sketch below).
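To illustrate the MDL trade-off, here is a toy scoring function; the per-node and per-error bit costs are arbitrary assumptions chosen for illustration, not a standard encoding:

# Toy MDL score: description length = cost of encoding the tree
# + cost of encoding the training examples it misclassifies.
def mdl_score(n_nodes, n_errors, node_cost=4.0, error_cost=8.0):
    return n_nodes * node_cost + n_errors * error_cost

# A big tree with few errors vs. a small tree with a few more errors:
print(mdl_score(n_nodes=41, n_errors=2))   # 180.0
print(mdl_score(n_nodes=9,  n_errors=10))  # 116.0 -> the smaller tree wins

And a minimal k-fold cross-validation sketch with scikit-learn (k = 5 is a common but arbitrary choice, and the dataset is synthetic):

# k-fold cross validation: each of the 5 folds serves once as the
# validation set while the other 4 are used for training.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0)

scores = cross_val_score(tree, X, y, cv=5)
print(scores, scores.mean())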

There are two ways to prune a decision tree:

1. Prepruning: under this method we stop growing the tree when a data split is not statistically significant.
      We evaluate each candidate split before installing it, skip the splits that do not look worthwhile, and stop when no worthwhile split remains.
      At a minimum, we stop expanding a node when all of its instances belong to the same class or when all of their attribute values are identical.
      More restrictive stopping conditions can also be used (a scikit-learn sketch follows this list):
       i)  Stop if the number of instances is less than some user-specified threshold.
       ii) Stop if the class distribution of the instances is independent of the available features.
       iii) Stop if expanding the current node does not improve the impurity measure.
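Here is how conditions i) and iii) map onto scikit-learn's pre-pruning parameters; condition ii), a statistical independence test as in chi-squared pre-pruning, has no direct scikit-learn knob. The threshold values are illustrative assumptions, not recommendations:

# Pre-pruning via stopping criteria: the tree simply refuses to
# install splits that fail any of these tests while growing.
from sklearn.tree import DecisionTreeClassifier

pre_pruned = DecisionTreeClassifier(
    min_samples_split=20,        # i)   stop if a node has fewer than 20 instances
    min_impurity_decrease=0.01,  # iii) stop if the best split barely reduces impurity
    max_depth=8,                 # an extra global cap on tree size
    random_state=0,
)
# pre_pruned.fit(X_train, y_train) would then grow a tree under these limits.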


2. Postpruning: under this method we grow the full tree first and then prune it back.
       It is a cross-validation-style approach (a scikit-learn sketch follows these steps), in which:
       i) We partition the data set into a grow set and a validation set.
       ii) We build a complete tree on the grow set.
       iii) We loop until accuracy on the validation set starts decreasing: for each leaf node in the tree, we test the accuracy of the hypothesis on the validation set, then permanently prune the node whose removal gives the greatest increase in validation accuracy.
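Here is a grow-then-prune sketch along those lines, with one substitution: instead of the reduced-error pruning loop described above, it uses scikit-learn's built-in cost-complexity post-pruning (ccp_alpha), with the validation set picking the best pruned tree. The dataset and split are synthetic and illustrative:

# Post-pruning: grow a full tree on the grow set, then choose the pruned
# candidate that scores best on the held-out validation set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)
X_grow, X_val, y_grow, y_val = train_test_split(X, y, random_state=0)

# i) + ii) grow a complete tree on the grow set
full = DecisionTreeClassifier(random_state=0).fit(X_grow, y_grow)

# iii) walk the cost-complexity pruning path and keep the candidate
# with the best validation accuracy
path = full.cost_complexity_pruning_path(X_grow, y_grow)
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_grow, y_grow)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print("leaves:", best.get_n_leaves(), "val acc:", best.score(X_val, y_val))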

B. Underfitting: under this condition the model is not capable of generalizing to new data, and it also has poor performance on the training data itself. It is easy to detect given a good performance metric, and it can often be remedied simply by trying a more flexible model or another machine learning algorithm.
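For example, a depth-1 tree (a decision stump) is too simple for a problem with several informative features, so it scores poorly on the training and test data alike; this sketch uses a synthetic dataset and illustrative parameters:

# Underfitting demo: a decision stump lacks the capacity for this problem,
# so both training and test accuracy come out low.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)
print("train:", stump.score(X_train, y_train),
      "test:", stump.score(X_test, y_test))  # both low -> underfitting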

Given below is the video in which I have explained underfitting:

Hope you have enjoyed reading this article. In the next article I will be discussing instance-based learning. Till then, enjoy learning!!!
