IP University ML Unit 2 Notes
23 Jun 2020

This post gives a quick review of the second unit of the syllabus for the Machine Learning (ML) subject at IP University. It is a continuation of the first post on the subject.
Logistic Regression
 classification algorithm used to assign observations to a discrete set of classes
 Unlike linear regression, which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value, which can then be mapped to two or more discrete classes
 Types
 Binary (Pass/Fail)
 Multi (Cats, Dogs, Sheep)
 Ordinal (Low, Medium, High)
\[S(z)=\frac{1}{1+e^{-z}}\]
Cost function: cross-entropy, \[J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right]\] where \(h_\theta(x)=S(\theta^{T}x)\)
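A minimal sketch of the sigmoid and the cross-entropy cost in Python (NumPy assumed; the function names are illustrative, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    # S(z) = 1 / (1 + e^(-z)): squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(X, y, theta):
    # X: (m, n) feature matrix, y: (m,) labels in {0, 1}, theta: (n,) weights
    h = sigmoid(X @ theta)          # predicted probabilities h_theta(x)
    m = len(y)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
```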
Perceptron
A Perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers decide whether an input, usually represented by a series of vectors, belongs to a specific class.
In short, a perceptron is a single-layer neural network. It consists of four main parts: input values, synaptic weights and a bias, a net sum, and an activation function.
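A from-scratch sketch of those four parts (illustrative code, NumPy assumed; labels taken as {-1, +1}):

```python
import numpy as np

def perceptron_train(X, y, epochs=10, lr=1.0):
    # X: (m, n) input values, y: (m,) labels in {-1, +1}
    w = np.zeros(X.shape[1])   # synaptic weights
    b = 0.0                    # bias
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            net = np.dot(w, xi) + b          # net sum
            pred = 1 if net >= 0 else -1     # step activation function
            if pred != yi:                   # update weights only on mistakes
                w += lr * yi * xi
                b += lr * yi
    return w, b
```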
Exponential Family
 a class of distributions whose probability density can be written in the form \(p(y;\eta)=b(y)\exp(\eta^{T}T(y)-a(\eta))\), where \(\eta\) is the natural parameter, \(T(y)\) the sufficient statistic and \(a(\eta)\) the log-partition function
 many common distributions (Bernoulli, Gaussian, Poisson, multinomial) are members of this family, which is the basis of generalized linear models
Generative and Discriminative models
 training classifiers involves estimating \(f: x \rightarrow y\), or \(p(y\mid x)\)
 both of these model families are generally used in supervised learning problems
Discriminative model
 models the decision boundary between the classes
 learns the conditional probability distribution \(p(y\mid x)\)
 they:
 assume some functional form for \(p(y\mid x)\)
 estimate parameters of \(p(y\mid x)\) directly from training data
 examples :
 logistic regression
 support vector machines
 traditional neural networks
 nearest neighbour
 conditional random fields (CRF’s)
Generative model
 explicitly models the actual distribution of each class
 learns the joint probability distribution \(p(x,y)\)
 predicts the conditional probability with the help of Bayes' theorem
 they:
 assume some functional form for \(p(y)\), \(p(x\mid y)\)
 estimate parameters of \(p(x\mid y)\), \(p(y)\) directly from training data
 use Bayes' rule to calculate \(p(y\mid x)\)
 in the end, both of them predict the conditional probability \(p(y\mid x)\)
 examples
 naïve Bayes
 Bayesian networks
 Markov random fields
 hidden Markov models (HMM’s)
Gaussian Discriminant Analysis
 explain generative models
 there’s one part missing: how do we get \(P(x\mid y)\)?
 GDA makes the assumption that
 \(P(x\mid y)\) is distributed according to a multivariate normal distribution
 \(P(y)\) is distributed according to a Bernoulli distribution
 it means we know a little more about our data. With this additional help, we get better results than, say, logistic regression
 also means that GDA usually requires less data as compared to logistic regression
 doesn’t work very well when our assumption doesn’t hold
Steps:
 Train on the data and obtain a discriminant function, which tells which class a data point has a higher probability of belonging to
 compute \(\mu\) and \(\Sigma\) for each class, then calculate the probability that a data point belongs to each class; the class with the highest probability is chosen
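A rough sketch of these two steps (NumPy and SciPy assumed, binary labels, shared covariance as in standard GDA; the names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gda_fit(X, y):
    # assumptions: P(y) ~ Bernoulli(phi), P(x|y) ~ N(mu_y, Sigma), y in {0, 1}
    phi = y.mean()                                   # P(y = 1)
    mu = [X[y == k].mean(axis=0) for k in (0, 1)]    # per-class means
    centered = X - np.where(y[:, None] == 1, mu[1], mu[0])
    sigma = centered.T @ centered / len(y)           # shared covariance
    return phi, mu, sigma

def gda_predict(x, phi, mu, sigma):
    # pick the class with the higher P(x|y) * P(y)
    p0 = multivariate_normal.pdf(x, mu[0], sigma) * (1 - phi)
    p1 = multivariate_normal.pdf(x, mu[1], sigma) * phi
    return int(p1 > p0)
```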
Naive Bayes
 a supervised learning algorithm, which is based on Bayes theorem and used for solving classification problems
 mainly used in text classification, where the training dataset is typically high-dimensional
 one of the simplest and most effective classification algorithms
 probabilistic classifier, which means it predicts on the basis of the probability of an object
 examples are spam filtering, sentiment analysis, and classifying articles
 it assumes that the occurrence of a certain feature is independent of the occurrence of other features
 uses Bayes theorem : \(p(C_{k}\mid \mathbf {x} )={\frac {p(C_{k})\ p(\mathbf {x} \mid C_{k})}{p(\mathbf {x} )}}\)
Working
 Convert the given dataset into frequency tables.
 Generate Likelihood table by finding the probabilities of given features.
 Now, use Bayes theorem to calculate the posterior probability
e.g. \(p(Y\mid sun)={\frac {p(Y)\,p(sun\mid Y)}{p(sun)}},\quad p(N\mid sun)={\frac {p(N)\,p(sun\mid N)}{p(sun)}}\)
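A small worked version of these three steps in Python (the weather/play counts below are made up purely for illustration):

```python
from collections import Counter

# toy (weather, play) observations -- illustrative data only
data = [("sun", "Y"), ("sun", "Y"), ("sun", "N"), ("rain", "N"),
        ("rain", "Y"), ("sun", "Y"), ("rain", "N"), ("sun", "N")]

# step 1: frequency tables
label_counts = Counter(label for _, label in data)   # counts of Y and N
joint_counts = Counter(data)                          # counts of (weather, label) pairs

def posterior(label, weather="sun"):
    # steps 2-3: p(label|weather) = p(label) * p(weather|label) / p(weather)
    prior = label_counts[label] / len(data)
    likelihood = joint_counts[(weather, label)] / label_counts[label]
    evidence = sum(1 for w, _ in data if w == weather) / len(data)
    return prior * likelihood / evidence

print(posterior("Y"), posterior("N"))   # 0.6 and 0.4 -- they sum to 1 for this single feature
```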
Types
 Gaussian:
 assumes that features follow a normal distribution
 if predictors take continuous values instead of discrete, then the model assumes that these values are sampled from the Gaussian distribution
 Multinomial:
 used when the data is multinomially distributed
 used for document classification problems
 uses the frequency of words for the predictors
 Bernoulli:
 works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables
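A quick sketch of the three variants, assuming scikit-learn is available (the toy documents and labels are illustrative):

```python
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer

# Multinomial NB: word-count features for document classification
docs = ["free offer now", "meeting at noon", "free free prize"]   # toy documents
labels = [1, 0, 1]                                                  # 1 = spam, 0 = ham
X_counts = CountVectorizer().fit_transform(docs)
spam_model = MultinomialNB().fit(X_counts, labels)

# Bernoulli NB: binary (word present / absent) features
X_binary = (X_counts > 0).astype(int)
bern_model = BernoulliNB().fit(X_binary, labels)

# Gaussian NB: continuous features assumed normally distributed, e.g.
# GaussianNB().fit(X_continuous, y)
```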
Support Vector Machines
 are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis
 goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put new data points in the correct category in the future
 this decision boundary is called a hyperplane
 a data point is viewed as a \(p\)-dimensional vector
 we want to know whether we can separate such points with a \((p-1)\)-dimensional hyperplane; this is called a linear classifier
 SVMs can also efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces
Hyperplane and Support vector
 A hyperplane in an n-dimensional Euclidean space is a flat, (n-1)-dimensional subset of that space that divides the space into two disconnected parts.
 there can be multiple lines/decision boundaries to segregate the classes in n-dimensional space
 the SVM algorithm finds the closest points to the decision boundary from both classes; these points are called support vectors
 the distance between the support vectors and the hyperplane is called the margin; the goal of SVM is to maximize this margin
 the hyperplane with maximum margin is called the optimal hyperplane.
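A small sketch with scikit-learn (assumed available) showing the hyperplane parameters and the support vectors on toy data:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# two well-separated clusters, so a linear hyperplane with a clear margin exists
X, y = make_blobs(n_samples=60, centers=2, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)        # the points closest to the hyperplane
print(clf.coef_, clf.intercept_)   # w and b of the hyperplane w.x + b = 0
```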
Types
 Linear SVM: used when the data is linearly separable by a straight line/hyperplane
 Non-linear SVM: used when the data is not linearly separable; it can be made separable by adding a new dimension, e.g. \(z = x^2 + y^2\), as sketched below
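A sketch of that trick on toy circular data (scikit-learn assumed): adding \(z = x^2 + y^2\) as a third feature makes the classes linearly separable.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# concentric circles: not separable by any straight line in 2-D
X, y = make_circles(n_samples=100, factor=0.3, noise=0.05, random_state=0)

# add the extra dimension z = x^2 + y^2
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X3 = np.hstack([X, z])

print(SVC(kernel="linear").fit(X, y).score(X, y))    # poor fit in 2-D
print(SVC(kernel="linear").fit(X3, y).score(X3, y))  # near-perfect with the new dimension
```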
Kernel
 SVM algorithms use a set of mathematical functions that are defined as the kernel
 the function of the kernel is to take data as input and transform it into the required form
 different SVM algorithms use different types of kernel functions, for example linear, non-linear, polynomial, radial basis function (RBF), and sigmoid
 the most used type of kernel function is RBF, because it has a localized and finite response along the entire x-axis
 kernel functions return the inner product between two points in a suitable feature space, thus defining a notion of similarity, with little computational cost even in very highdimensional spaces
 examples:
 Polynomial
 Gaussian
 Gaussian radial basis function
 Laplace RBF
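In scikit-learn (assumed here) the kernel is just a parameter of the classifier, so the different types can be compared directly; a rough sketch:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)   # toy non-linear data

# RBF (the Gaussian radial basis function) is the usual default
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    score = cross_val_score(SVC(kernel=kernel), X, y, cv=5).mean()
    print(kernel, round(score, 3))
```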
Ensemble Models
Bagging
 objective is to create several subsets of data from the training sample, chosen randomly with replacement
 each subset of data is used to train its own decision tree
 we get an ensemble of different models
 the average of all the predictions from the different trees is used, which is more robust than a single decision tree classifier
Steps:
 Suppose there are \(N\) observations and \(M\) features in the training data set; a sample is taken from the training data set randomly with replacement
 A subset of the \(M\) features is selected randomly, and whichever feature gives the best split is used to split the node iteratively
 The tree is grown to the largest extent possible
 Above steps are repeated \(n\) times and prediction is given based on the aggregation of predictions from \(n\) number of trees.
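These steps are essentially what a random forest does; a minimal sketch with scikit-learn (assumed available):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)   # toy data

# bootstrap samples with replacement + a random feature subset at each split,
# with predictions aggregated over n_estimators trees
bagged = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0).fit(X, y)
print(bagged.score(X, y))
```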
Advantages:
 Reduces overfitting of the model.
 Handles higher dimensionality data very well.
 Maintains accuracy for missing data.
Disadvantages:
 Since the final prediction is based on the mean of predictions from the subset trees, it won't give precise values for the classification and regression model.
Boosting
 used to create a collection of predictors
 learners are trained sequentially, with early learners fitting simple models to the data and then analysing the data for errors
 consecutive trees are fit and at every step, the goal is to improve the accuracy from the prior tree
 when an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly
 this process converts weak learners into a better performing model
Steps:
 Draw a random subset of training samples \(d_1\) without replacement from the training set \(D\) to train a weak learner \(C_1\)
 Draw a second random training subset \(d_2\) without replacement from the training set and add 50 percent of the samples that were previously misclassified to train a weak learner \(C_2\)
 Find the training samples \(d_3\) in the training set \(D\) on which \(C_1\) and \(C_2\) disagree to train a third weak learner \(C_3\)
 Combine all the weak learners via majority voting.
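The three-learner scheme above is the classic description; library implementations follow the same sequential, error-correcting idea. A sketch using scikit-learn's gradient boosting (an assumption, not the exact algorithm described above):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, n_features=10, random_state=0)   # toy data

# trees are fit one after another; each new tree tries to correct
# the mistakes of the ensemble built so far
booster = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                     max_depth=3, random_state=0).fit(X, y)
print(booster.score(X, y))
```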
Advantages
 Supports different loss functions
 Works well with interactions.
Disadvantages
 Prone to overfitting.
 Requires careful tuning of different hyperparameters.
Adaboost
 Weak models are added sequentially, trained using the weighted training data.
 The training weights are updated giving more weight to incorrectly predicted instances, and less weight to correctly predicted instances.
 The process continues until a preset number of weak learners have been created (a user parameter) or no further improvement can be made on the training dataset.
 Once completed, you are left with a pool of weak learners each with a stage value.
 A stage value is calculated for the trained model which provides a weighting for any predictions that the model makes.
 Predictions are made by calculating the weighted average of the weak classifiers.
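A from-scratch sketch of these weight and stage-value updates (decision stumps from scikit-learn as the weak learners, labels assumed in {-1, +1}; illustrative, not a reference implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=20):
    # y in {-1, +1}; returns a list of (weak learner, stage value) pairs
    m = len(y)
    w = np.full(m, 1.0 / m)                   # uniform training weights to start
    ensemble = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()              # weighted training error
        if err >= 0.5:                        # no better than chance -> stop early
            break
        alpha = 0.5 * np.log((1 - err) / (err + 1e-12))   # stage value
        # increase the weight of misclassified points, decrease it for correct ones
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        ensemble.append((stump, alpha))
    return ensemble

def adaboost_predict(X, ensemble):
    # weighted average (vote) of the weak classifiers
    scores = sum(alpha * stump.predict(X) for stump, alpha in ensemble)
    return np.sign(scores)
```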