Monday, February 20, 2017



Instance-based learning is one way of approximating a discrete- or real-valued target function.


* The key idea here is to:
     1. Just store the training examples.
     2. When a test example is given, find its closest matches.
To do this we make an inductive assumption:
   similar inputs map to similar outputs.
     If this assumption is not true, then learning is impossible.
     If it is true, then learning reduces to defining what "similar" means.


To implement instance-based learning we use k-nearest-neighbour classification. The approach is (a minimal sketch follows the lists below):
1. Save the training examples.
2. At prediction time, find the k training examples (x1,y1),…,(xk,yk) that are closest to the test example x.
3. Predict the most frequent class among the yi's.
We can improve on this in various ways:
- Weighting the examples from the neighborhood
- Measuring the "closeness"
- Finding the "close" examples in a large training set quickly
 
Mathematically, we use the following formulation:
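For a real-valued target, a standard way to write the kNN prediction (reconstructed here for reference) is the average of the target values of the k nearest neighbours; for a discrete class it is the majority vote among them:

$$\hat{f}(x) = \frac{1}{k}\sum_{i=1}^{k} y_i$$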
 


here "k" is the number of nearest points for the given input.
Average of k points more reliable when:
noise in attributes
noise in class labels
classes partially overlap

The main issue in k-nearest neighbours is how to choose the value of k. We consider two situations:
          1. When "k" is large:
             - less sensitive to noise (particularly class noise)
             - better probability estimates for discrete classes
             - larger training sets allow larger values of k

          2. When "k" is small:
             - captures the fine structure of the problem space better
             - may be necessary with small training sets

* A balance must be struck between large and small values of k.
* As the training set approaches infinity and k grows large, kNN becomes Bayes optimal.

But the tradeoff between large and small "k" can be difficult to get right.
To overcome this difficulty, use a large k but put more emphasis on nearer neighbours.

There are two ways of weighting in kNN:

1. Distance-Weighted kNN:
   The formulation of this approach is:
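For reference, a standard distance-weighted kNN prediction weights each neighbour by the inverse of its squared distance to the query point:

$$\hat{f}(x) = \frac{\sum_{i=1}^{k} w_i\, y_i}{\sum_{i=1}^{k} w_i}, \qquad w_i = \frac{1}{d(x, x_i)^2}$$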

2. Locally weighted averaging:
     The formulation of this approach is:
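For reference, a common form of locally weighted averaging uses a kernel of width K_w over the training points instead of a hard cutoff at k neighbours (the Gaussian kernel below is one standard choice, not the only one):

$$\hat{f}(x) = \frac{\sum_{i} K\!\left(\tfrac{d(x, x_i)}{K_w}\right) y_i}{\sum_{i} K\!\left(\tfrac{d(x, x_i)}{K_w}\right)}, \qquad K(u) = e^{-u^2}$$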

    
The algorithms we have seen so far are strict averagers, i.e. they can interpolate but they cannot extrapolate.
For extrapolation we use locally weighted regression: the regression is centred at the test point, and the weights are controlled by the distance to it and by the kernel width. The local regressor can be linear, quadratic, an n-th degree polynomial, a neural net, …

In instance-based learning we usually come across a situation known as the Curse of Dimensionality. It arises for the following reasons:
- as the number of dimensions increases, the distance between points becomes larger and more uniform
- if the number of relevant attributes is fixed, increasing the number of less relevant attributes may swamp the distance
- when there are more irrelevant than relevant dimensions, the distance becomes less reliable
The solutions for this are a larger k or kernel width, feature selection, feature weighting, or more complex distance functions, such as the weighted distance given by the following formula.
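For reference, a typical feature-weighted Euclidean distance, with one weight w_j per attribute, is:

$$d(x, x') = \sqrt{\sum_{j} w_j \,(x_j - x'_j)^2}$$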


In the video given below you can learn more about instance based learning



Now we come to the end of this article. In the next article I will discuss feature selection. Till then, enjoy learning!!!



Friday, January 27, 2017

While training a machine learning model there are two problematic situations: one where the model learns the details and noise in the training data to an excessive extent, and another where the model neither models the training data well nor generalizes to new data. Both situations negatively impact the model's ability to generalize. They are called overfitting and underfitting.
Let's see in details about them.

A. Overfitting: A hypothesis h is said to overfit the training data if there is another hypothesis h′ such that h has smaller error than h′ on the training data, but h has larger error than h′ on the test data.
  Overfitting can result in a tree that is more complex than necessary. Such a tree no longer provides a good estimate of how well it will perform on previously unseen records. It happens when the model captures the idiosyncrasies of the training data rather than its generalities, which is caused by having too many parameters relative to the amount of training data.





* To avoid this condition we do pruning, which reduces the size of the decision tree by removing sections of the tree that provide little or insignificant information. It reduces the complexity of the final classifier and hence increases the predictive accuracy.

There are methods by which we can evaluate which subtrees to prune:

1. Minimum Description Length: In this we minimize the size of the tree plus the size of its misclassifications.

2. Cross validation: In this the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data (a small sketch of this partitioning is shown below).
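As an illustration of this partitioning, here is a minimal sketch using scikit-learn's KFold; the toy data and the choice of k = 5 are assumptions made for the example.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)   # 10 toy examples with 2 features each
y = np.arange(10) % 2              # toy labels

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # k-1 folds are used for training, the remaining fold for validation
    print('Fold %d: train size %d, validation size %d' % (fold, len(train_idx), len(val_idx)))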

There are two ways to prune a decision tree:

1. Prepruning: Under this method we stop growing the tree when a data split is not statistically significant.
      We evaluate the splits before installing them, and do not install splits that don't look worthwhile. When there is no worthwhile split left, we are done.
      We stop at a node if all the instances belong to the same class or if all the attribute values are the same.
      More restrictive conditions can be added, such as:
       i)  Stop if the number of instances is less than some user-specified threshold.
       ii) Stop if the class distribution of the instances is independent of the available features.
       iii) Stop if expanding the current node does not improve the impurity measure.


2. Postpruning: Under this method we grow the full tree and then prune it afterwards.
       It is a cross-validation approach in which:
       i) We partition the data set into a grow set and a validation set.
       ii) We build a complete tree on the grow set.
       iii) Until the accuracy on the validation set starts decreasing, for each leaf node in the tree we test the accuracy of the hypothesis with that node pruned on the validation set, and permanently prune the node that gives the greatest increase in validation accuracy.

B. Underfitting: Under this condition the model is not capable of generalizing to new data and it also has poor performance on the training data. It is easy to detect given a good performance metric. It can often be avoided by simply trying another machine learning algorithm.

Given below is the video in which I have explained underfitting:

Hope you have enjoyed reading this article. In the next article I will be discussing instance-based learning. Till then, enjoy learning!!!

Saturday, January 21, 2017

For constructing a good decision tree we need to be familiar with various terms like Information Gain, Entropy, Gain etc. These are useful in classifying the examples and choosing the next best attribute.
Let's see these terms in details:


Information Gain: It measures how well a given attribute separates the training examples according to their target classification. This measure is used to select among the candidate attributes at each step while growing the tree.

Gain: It is a measurement of how well we can reduce the uncertainty (its value lies between 0 and 1).

Entropy: It is a measure of uncertainty, purity and information content.

Information Theory: An optimal-length code assigns −log2(p) bits to a message having probability p.

S is a sample of training examples in which:
     - p+ is the proportion of positive examples in S.
     - p− is the proportion of negative examples in S.

Entropy of S: It is the optimal number of bits needed to encode information about the certainty/uncertainty of S. It is given by the following formula:
          $$Entropy(S) = p_+(-\log_2 p_+) + p_-(-\log_2 p_-) = -p_+\log_2 p_+ - p_-\log_2 p_-$$

Gain(S,A): It is the expected reduction in entropy due to partitioning S on attribute A.
It is given by this following formula:
$$Gain(S,A) = Entropy(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$$
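As an illustration, here is a minimal sketch of computing entropy and information gain with NumPy; the tiny example split at the end is made up for the demonstration.

import numpy as np

def entropy(labels):
    """Entropy of a list/array of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / float(counts.sum())
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute_values):
    """Gain(S, A): entropy of S minus the weighted entropy of the partitions induced by A."""
    labels = np.asarray(labels)
    attribute_values = np.asarray(attribute_values)
    remainder = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        remainder += (len(subset) / float(len(labels))) * entropy(subset)
    return entropy(labels) - remainder

# Toy example: 9 "yes" / 5 "no" labels split by a binary attribute
y = ['yes']*9 + ['no']*5
a = ['high']*7 + ['low']*7
print(entropy(y))              # about 0.940 bits
print(information_gain(y, a))  # gain from splitting on the attribute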
  

Those are the terms useful for finding the attribute of the next node in a decision tree. Let's now see a splitting rule which uses the GINI index, given by the following formula:
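For reference, the standard GINI index of a node t, where p(j|t) is the relative frequency of class j at t, is:

$$GINI(t) = 1 - \sum_{j} \big[p(j\,|\,t)\big]^2$$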



There are two types of splitting based on continuous attributes:


For a continuous attribute:
1. We partition the continuous values of attribute A into a discrete set of intervals.
2. We create a new attribute Ac by looking for a threshold c.

We choose the value of c by finding the best cut among all possible splits (a small sketch is given below).
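Here is a minimal sketch of choosing the threshold c for a continuous attribute by trying the midpoints between consecutive sorted values and keeping the cut with the highest information gain; it reuses the entropy/information-gain helpers sketched earlier, and the toy data is made up for illustration.

import numpy as np
# uses entropy() and information_gain() from the sketch above

def best_threshold(values, labels):
    """Return the cut point c (and its gain) that best splits a continuous attribute."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    best_c, best_gain = None, -1.0
    # Candidate thresholds: midpoints between consecutive distinct values
    for i in range(len(values) - 1):
        if values[i] == values[i + 1]:
            continue
        c = (values[i] + values[i + 1]) / 2.0
        gain = information_gain(labels, values > c)  # binary attribute Ac: value > c or not
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

# Toy usage: temperature-like values with yes/no labels
print(best_threshold([64, 65, 68, 69, 70, 71], ['yes', 'no', 'yes', 'yes', 'yes', 'no']))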

In the video given below I have explained more about decision trees:

Hope you enjoyed reading this article. In the next post I will be explaining overfitting in machine learning. Till then, enjoy learning!!!


Introduction to Decision Tree

A Decision Tree is a classifier in the form of a tree structure with two types of nodes:
     • Decision node: Specifies a choice or test of some attribute, with one branch for each outcome.
     • Leaf node: Indicates the classification of an example.

The problem that occurs in constructing a decision tree is:
  • Given the training examples, what type of decision tree should be generated?
One proposal to overcome this problem is to prefer the smallest tree that is consistent with the data (i.e. a bias).


A possible method for doing this is to search the space of decision trees for the smallest decision tree that fits the data.

Let's see an example of constructing a decision tree for playing tennis:

For this we have:

Attributes and their values:
    1. Outlook : Sunny, Overcast, Rain
    2. Humidity : High, Normal
    3. Wind : Strong, Weak
    4. Temperature : Hot, Mild, Cold


Target Concept : Play Tennis- yes,no


The Decision Tree for playing Tennis :


In this decision tree :


If the outlook is sunny, the temperature is hot, the humidity is high and the wind is weak, then the tree classifies the example as No (tennis will not be played).




A decision tree is a representation of a disjunction of conjunctions.
In the tree above, if we want to classify the target concept as YES, the corresponding disjunction of conjunctions is:
(Outlook = Sunny ∧ Humidity = Normal)  ∨  (Outlook = Overcast)  ∨  (Outlook = Rain ∧ Wind = Weak)


To construct a decision tree, at each node we either:
1. Stop, and
      i)  return a value of the target feature, or
      ii) return a distribution over the target feature's values; or
2. Choose a test (e.g. an input feature) to split on, and
      i) for each value of the test, build a subtree on those examples that have this value of the test.

We can use the top-down induction of decision trees (ID3) algorithm for constructing a good decision tree. This algorithm proceeds as follows (a small library-based sketch follows the list):
1. We pick A ← the "best" decision attribute for the next node.
2. We assign A as the decision attribute of that node.
3. For each value of A we create a new descendant.
4. We sort the training examples to the leaf nodes according to the attribute value of the branch.
5. If all the training examples are properly classified (i.e. have the same value of the target attribute), stop; else iterate over the new leaf nodes.
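ID3 itself works directly on categorical attributes; as a rough, library-based sketch of the same idea, here is scikit-learn's DecisionTreeClassifier trained on a tiny one-hot-encoded version of the tennis data. The encoding and the four example rows are assumptions made for illustration, not the full table from this post.

import pandas as pd
from sklearn import tree

# A few made-up rows in the spirit of the play-tennis data
data = pd.DataFrame({
    'Outlook':    ['Sunny', 'Overcast', 'Rain', 'Sunny'],
    'Humidity':   ['High',  'Normal',   'High', 'Normal'],
    'Wind':       ['Weak',  'Strong',   'Weak', 'Weak'],
    'PlayTennis': ['no',    'yes',      'yes',  'yes'],
})

X = pd.get_dummies(data[['Outlook', 'Humidity', 'Wind']])  # one-hot encode the categorical attributes
y = data['PlayTennis']

clf = tree.DecisionTreeClassifier(criterion='entropy')  # entropy-based splits, in the spirit of ID3
clf.fit(X, y)
print(clf.predict(X))  # predictions on the training rows themselves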


While constructing the decision tree we come across issues regarding choices like:
1. When to stop:
     we stop when
       i)   there are no more input features,
       ii)  all the examples are classified, or
       iii) too few examples remain to make an informative split.
2. Which test to split on:
        i)  the split which gives the smallest error;
        ii) with a multi-valued feature, whether to
        iii) split on all values, or
        iv) split the values into two halves.


Given below is the video in which I have explained about decision tree:

                                      

Hope you enjoyed reading this article. In the next article I will be discussing more about decision trees. Till then, enjoy learning!!!





Friday, January 20, 2017

Implementation of Linear Regression model

Now it's time to implement our first machine learning model, i.e. Linear Regression.
To do this we need various packages and libraries:
1. NumPy          : mainly used for its N-dimensional array object.
2. Pandas         : a Python data analysis library including structures such as data frames.
3. Matplotlib     : a 2D plotting library producing publication-quality figures.
4. Scikit-Learn   : a machine learning library used for data analysis and data mining tasks.

Anaconda is a software distribution which includes these libraries. It is powered by Python, and we use the Jupyter (IPython) notebook in which we can do our programming implementation of machine learning.
In the video given below I have demonstrated step by step where to download Anaconda from and how to run an IPython notebook.

Now let's see how to implement the Linear Regression model.
We use a simple function to generate the data for our linear regression:

$$y = \frac{x}{2}+sin(x)+\epsilon$$

    where $\epsilon \sim \mathcal{N}(0,1)$ 
Given below is the piece of code in which I have imported various libraries:
numpy, which for simplicity in coding I have abbreviated as np;
linear_model, which brings in the linear regression model from the sklearn library, along with datasets;
and, to display our dataset on a plot, matplotlib, abbreviated as plt for readability.

import numpy as np
from sklearn import linear_model, datasets, tree
import matplotlib.pyplot as plt
%matplotlib inline

Now I will prepare the data for the equation:


$$y = \frac{x}{2}+sin(x)+\epsilon$$


     where $\epsilon \sim \mathcal{N}(0,1)$ is a Gaussian Noise

In the following piece of code I have first set the number of samples to 100. Then I have generated x as evenly spaced values ranging from a negative to a positive value and stored them in the object x.
Then I have computed y from the function above, calling np.random.random() to add randomness to the data.
Then I have used plt.scatter(), which displays the input dataset on the plot as scattered dots, each drawn in black.
Then I have used plt.xlabel and plt.ylabel to label the x and y axes as x-input feature and y-target values, plt.title to display the title of the plot, and plt.show() to display the output.


number_of_samples = 100
x = np.linspace(-np.pi, np.pi, number_of_samples) #100 evenly spaced input values between -pi and pi
y = 0.5*x+np.sin(x)+np.random.random(x.shape) #target values from the function above plus random noise
plt.scatter(x,y,color='black') #Plot y-vs-x in dots
plt.xlabel('x-input feature')
plt.ylabel('y-target values')
plt.title('Fig 1: Data for linear regression')
plt.show()

The output plot for the above code will look like this:

Now we will split our data set, as it is always encouraged in machine learning to split the available data into training, validation and test sets.
The training set is used to train the model. The model is evaluated on the validation set after every episode of training.
The performance on the validation set gives a measure of how well the model generalizes.
Various hyperparameters of the model are tuned to improve performance on the validation set. Finally, when the model is completely optimized and ready for deployment, it is evaluated on the test data and that performance is reported in the final description of the model.
To do this I have split the dataset in the ratio of 70%, 15%, 15% for the training, validation and test sets respectively.


random_indices = np.random.permutation(number_of_samples)
#Training set
x_train = x[random_indices[:70]]
y_train = y[random_indices[:70]]
#Validation set
x_val = x[random_indices[70:85]]
y_val = y[random_indices[70:85]]
#Test set
x_test = x[random_indices[85:]]
y_test = y[random_indices[85:]]


Now we fit a line to our data. Linear regression learns to fit a hyperplane to our data in the feature space. For one-dimensional data, the hyperplane reduces to a straight line. We will fit a line to our data using sklearn.linear_model.LinearRegression.



model = linear_model.LinearRegression() #Create a least squared error linear regression object

#sklearn takes the inputs as matrices. Hence we reshape the arrays into column matrices
x_train_for_line_fitting = np.matrix(x_train.reshape(len(x_train),1))
y_train_for_line_fitting = np.matrix(y_train.reshape(len(y_train),1))

#Fit the line to the training data
model.fit(x_train_for_line_fitting, y_train_for_line_fitting)

#Plot the line
plt.scatter(x_train, y_train, color='black')
plt.plot(x.reshape((len(x),1)),model.predict(x.reshape((len(x),1))),color='blue')
plt.xlabel('x-input feature')
plt.ylabel('y-target values')
plt.title('Fig 2: Line fit to training data')

The output plot of the above code will be




As our model is ready, we now evaluate it. In a linear regression scenario, it's common to evaluate the model in terms of the mean squared error on the validation and test sets.


mean_val_error = np.mean( (y_val - model.predict(x_val.reshape(len(x_val),1)))**2 )
mean_test_error = np.mean( (y_test - model.predict(x_test.reshape(len(x_test),1)))**2 )

print 'Validation MSE: ', mean_val_error, '\nTest MSE: ', mean_test_error


The output of the above code will be:

Validation MSE:  3.67954814357 
Test MSE:  4.96638767482





Now we come to the end of our first implementation of a machine learning model. In the next article I will explain decision trees. Till then, enjoy learning!!!

Wednesday, January 11, 2017

Linear Regression model in Machine Learning

There are many models which could be used to fit the data. The simplest one is Linear Regression.




In this we fit a line that passes through the data points. In linear regression we model the relationship between a dependent variable y and an independent variable x.
  This relationship is modeled using a linear predictor function whose unknown model parameters are estimated from the data.

The simple linear regression is given by the following equation:
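For reference, the standard simple linear regression model can be written as:

$$y = \beta_0 + \beta_1 x + \epsilon$$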


Here ϵ is Gaussian noise due to the randomness of the data.

The plot will then be:


Now our main objective is to learn the parameters.
We can estimate the parameters of the linear regression equation from the data as follows.
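For reference, the standard least-squares estimates for the slope and intercept are:

$$\beta_1 = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sum_i (x_i-\bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1\bar{x}$$

where $\bar{x}$ and $\bar{y}$ are the means of x and y.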
For multi-variable linear regression, the model is given by the following equation:
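For reference, the usual form with p input variables is:

$$y = \beta_0 + \sum_{j=1}^{p}\beta_j X_j + \epsilon$$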



Here β0 is the intercept and βj is the slope for the j-th variable Xj.

Now, to draw the line, we select the one for which the sum of squared errors is as small as possible.
For multiple linear regression the hypothesis h(x) is a linear function of the input features with parameters θ (the standard forms are written out below the list).
To learn these parameters we use the Least Mean Squares (LMS) algorithm:
  • First we make h(x) close to y for the available training examples.
  • Then we define the cost function J(θ).
  • Then we find the θ that minimizes J(θ).
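For reference, in the usual notation (with x_0 = 1 so that θ_0 is the intercept, and m training examples), the hypothesis and the least-squares cost function are:

$$h_\theta(x) = \sum_{j=0}^{p}\theta_j x_j = \theta^{T}x, \qquad J(\theta) = \frac{1}{2}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$$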

There are various ways of minimizing this cost function to get the optimal solution. One of them is known as Gradient Descent.
To do this:

         1. First we start with an initial guess of θ.
         2. Then we repeatedly update θ to make J(θ) smaller until it converges to a minimum.


We update θ by following the gradient of J(θ).
Following this update rule, the equation becomes:
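For reference, the standard LMS (batch gradient descent) update rule, with learning rate α, is:

$$\theta_j := \theta_j + \alpha\sum_{i=1}^{m}\big(y^{(i)} - h_\theta(x^{(i)})\big)\,x_j^{(i)}$$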


Given below is the video in which I have explained the linear regression model:
Hope you have enjoyed reading this article. In the next article I will be discussing the platform on which I will implement this algorithm and how to implement it. Till then, enjoy learning.