IP University - ML - Unit 2 Notes
This post gives a quick review of the second unit of the Machine Learning (ML) subject in the IP University syllabus. It is a continuation of the first post on the subject.
Logistic Regression
- classification algorithm used to assign observations to a discrete set of classes
- Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value which can then be mapped to two or more discrete classes
- Types
- Binary (Pass/Fail)
- Multi (Cats, Dogs, Sheep)
- Ordinal (Low, Medium, High)
\[S(z)=\frac{1}{1+e^{-z}}\]
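As a minimal sketch (not from the syllabus) of how this works in practice, the snippet below trains binary logistic regression by gradient descent on the log-loss, using the sigmoid above; the toy dataset and learning rate are made-up assumptions for illustration.

```python
# Minimal logistic regression sketch: sigmoid + gradient descent on log-loss.
import numpy as np

def sigmoid(z):
    """Logistic sigmoid S(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: one feature x, binary label y (hypothetical values).
X = np.array([[0.5], [1.5], [2.0], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Add a bias column of ones so the weight vector includes an intercept.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
w = np.zeros(Xb.shape[1])

lr = 0.1
for _ in range(1000):
    p = sigmoid(Xb @ w)              # predicted probabilities
    grad = Xb.T @ (p - y) / len(y)   # gradient of the log-loss
    w -= lr * grad

print(sigmoid(np.array([1.0, 2.5]) @ w))  # probability that x = 2.5 belongs to class 1
```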
Perceptron
A perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers decide whether an input, usually represented by a vector of numbers, belongs to a specific class.
In short, a perceptron is a single-layer neural network. It consists of four main parts: input values, synaptic weights and bias, a net sum, and an activation function.
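A small sketch of those four parts (inputs, weights and bias, net sum, step activation) learning the logical AND function; the data, learning rate, and number of passes are illustrative assumptions.

```python
# Minimal perceptron sketch: net sum -> step activation -> weight update.
import numpy as np

def step(z):
    """Step activation: 1 if the net sum is non-negative, else 0."""
    return 1 if z >= 0 else 0

# Toy data: learn the logical AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # synaptic weights
b = 0.0           # bias
lr = 0.1

for _ in range(10):                   # a few passes over the data
    for xi, target in zip(X, y):
        net = np.dot(w, xi) + b       # net sum
        error = target - step(net)
        w += lr * error * xi          # perceptron learning rule
        b += lr * error

print([step(np.dot(w, xi) + b) for xi in X])  # expected: [0, 0, 0, 1]
```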
Exponential Family
- a class of distributions that can all be written in the form
\[p(y;\eta)=b(y)\,\exp\left(\eta^{T}T(y)-a(\eta)\right)\]
- here \(\eta\) is the natural parameter, \(T(y)\) the sufficient statistic, \(a(\eta)\) the log-partition function, and \(b(y)\) the base measure
- Bernoulli, Gaussian, Poisson, and multinomial distributions are all members, which is why this family underlies generalised linear models such as logistic regression
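As a hedged worked example (a standard derivation, not taken from the notes), the Bernoulli distribution fits this form:

\[p(y;\phi)=\phi^{y}(1-\phi)^{1-y}=\exp\left(y\log\frac{\phi}{1-\phi}+\log(1-\phi)\right)\]

so \(\eta=\log\frac{\phi}{1-\phi}\), \(T(y)=y\), \(a(\eta)=\log(1+e^{\eta})\), and \(b(y)=1\); inverting the natural parameter gives \(\phi=\frac{1}{1+e^{-\eta}}\), the sigmoid used in logistic regression.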
Generative and Discriminative models
- training classifiers involves estimating \(f: x \rightarrow y\), or \(p(y\|x)\)
- both families of models are generally used in supervised learning problems
Discriminative model
- models the decision boundary between the classes
- learns the conditional probability distribution \(p(y\|x)\)
- they:
- assume some functional form for \(p(y\|x)\)
- estimate parameters of \(p(y\|x)\) directly from training data
- examples :
- logistic regression
- support vector machines
- traditional neural networks
- nearest neighbour
- conditional random fields (CRF’s)
Generative model
- explicitly models the actual distribution of each class
- learns the joint probability distribution \(p(x,y)\)
- predicts the conditional probability with the help of Bayes theorem
- they:
- assume some functional form for \(p(y), p(x\|y)\)
- estimate parameters of \(p(x\|y), p(y)\) directly from training data
- use Bayes rule to calculate \(p(y \|x)\)
- in the end, both approaches predict the same conditional probability \(p(y\|x)\)
- examples
- naïve Bayes
- Bayesian networks
- Markov random fields
- hidden Markov models (HMM’s)
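As a quick illustration of the contrast, here is a hedged scikit-learn sketch fitting a discriminative model (logistic regression) and a generative one (Gaussian naive Bayes) on the same toy data; both end up producing \(p(y\|x)\), but they get there differently.

```python
# Discriminative vs generative on the same synthetic data.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

disc = LogisticRegression().fit(X, y)   # discriminative: models p(y|x) directly
gen = GaussianNB().fit(X, y)            # generative: models p(x|y) and p(y), then applies Bayes rule

print(disc.predict_proba(X[:1]))        # both report a conditional probability p(y|x)
print(gen.predict_proba(X[:1]))
```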
Gaussian Discriminant Analysis
- a generative model, as described above
- the part still missing there is how we actually get \(P(x\|y)\)
- GDA makes the assumption that
- \(P(x\|y)\) is distributed according to a Multivariate normal distribution
- \(P(y)\) is distributed according to a Bernoulli distribution
- it means we know a little more about our data. With this additional help, we get better results than say, logistic regression
- also means that GDA usually requires less data as compared to logistic regression
- doesn’t work very well when our assumption doesn’t hold
Steps:
- Fit the model to the training data to obtain a discriminant function, which tells which class a data point has a higher probability of belonging to
- compute \(\mu\) and \(\Sigma\) for each class, then calculate the probability that a data point belongs to each class; the class with the highest probability is chosen (sketched below)
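A minimal sketch of the two steps above, assuming a tiny two-feature toy dataset: estimate the Bernoulli prior, the per-class means, and a shared covariance matrix, then pick the class with the higher posterior.

```python
# GDA sketch: Gaussian class-conditionals with a shared covariance + Bernoulli prior.
import numpy as np
from scipy.stats import multivariate_normal

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],   # class 0
              [4.0, 4.5], [4.2, 4.0], [3.8, 4.8]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

phi = y.mean()                                    # Bernoulli prior p(y = 1)
mu0 = X[y == 0].mean(axis=0)                      # class means
mu1 = X[y == 1].mean(axis=0)
centered = np.where(y[:, None] == 0, X - mu0, X - mu1)
sigma = centered.T @ centered / len(y)            # shared covariance matrix

def predict(x):
    p0 = (1 - phi) * multivariate_normal.pdf(x, mu0, sigma)  # p(y=0) p(x|y=0)
    p1 = phi * multivariate_normal.pdf(x, mu1, sigma)        # p(y=1) p(x|y=1)
    return int(p1 > p0)                                      # class with higher posterior

print(predict([1.1, 2.0]), predict([4.1, 4.3]))  # expected: 0 1
```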
Naive Bayes
- a supervised learning algorithm, which is based on Bayes theorem and used for solving classification problems
- mainly used in text classification that includes a high-dimensional training dataset
- one of the simplest and most effective classification algorithms
- a probabilistic classifier, which means it predicts on the basis of the probability of an object
- examples are spam filtering, sentiment analysis, and classifying articles
- it assumes that the occurrence of a certain feature is independent of the occurrence of other features
- uses Bayes theorem : \(p(C_{k}\mid \mathbf {x} )={\frac {p(C_{k})\ p(\mathbf {x} \mid C_{k})}{p(\mathbf {x} )}}\)
Working
- Convert the given dataset into frequency tables.
- Generate Likelihood table by finding the probabilities of given features.
- Now, use Bayes theorem to calculate the posterior probability
eg. \(p(Y\|sun)={\frac {p(Y)p(sun\|Y)}{p(sun)}}, p(N\|sun)={\frac {p(N)p(sun\|N)}{p(sun)}}\)
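A small worked version of the weather example above, with hypothetical counts, computing \(p(Y\|sun)\) directly from the frequency and likelihood tables via Bayes theorem.

```python
# Worked naive Bayes posterior on a tiny (hypothetical) weather / play dataset.
weather = ["Sunny", "Sunny", "Overcast", "Rainy", "Sunny", "Overcast",
           "Rainy", "Sunny", "Rainy", "Overcast"]
play =    ["No",    "Yes",   "Yes",      "Yes",   "Yes",   "Yes",
           "No",    "Yes",   "No",       "Yes"]

n = len(play)
p_yes = play.count("Yes") / n                     # prior p(Y)
p_sunny = weather.count("Sunny") / n              # evidence p(sun)
p_sunny_given_yes = sum(1 for w, p in zip(weather, play)
                        if w == "Sunny" and p == "Yes") / play.count("Yes")

p_yes_given_sunny = p_yes * p_sunny_given_yes / p_sunny   # Bayes theorem
print(round(p_yes_given_sunny, 3))                        # 0.75 for these counts
```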
Types
- Gaussian:
- assumes that features follow a normal distribution
- if predictors take continuous values instead of discrete, then the model assumes that these values are sampled from the Gaussian distribution
- Multinomial:
- used when the data is multinomial distributed
- used for document classification problems
- uses the frequency of words for the predictors
- Bernoulli:
- works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables
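A quick hedged sketch of the three variants using scikit-learn's GaussianNB, MultinomialNB, and BernoulliNB; the tiny datasets are made up only to match the expected feature types.

```python
# The three naive Bayes variants on data matching their feature assumptions.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X_cont = np.array([[1.2], [0.8], [3.1], [2.9]])                     # continuous feature
X_counts = np.array([[2, 0, 1], [0, 3, 1], [1, 1, 4], [0, 2, 3]])   # word-count features
X_bool = (X_counts > 0).astype(int)                                 # word present / absent
y = np.array([0, 0, 1, 1])

print(GaussianNB().fit(X_cont, y).predict([[3.0]]))            # Gaussian features
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 3]]))   # multinomial (count) features
print(BernoulliNB().fit(X_bool, y).predict([[1, 0, 1]]))       # Boolean features
```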
Support Vector Machines
- are supervised learning models with associated learning algorithms that analyze data for classification and regression
- goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future
- this decision boundary is called a hyperplane
- a data point is viewed as a \(p\)-dimensional vector
- we want to know whether we can separate such points with a \((p-1)\)-dimensional hyperplane - this is called a linear classifier
- SVMs can also efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces
Hyperplane and Support vector
- A hyperplane in an \(n\)-dimensional Euclidean space is a flat, \((n-1)\)-dimensional subset of that space that divides the space into two disconnected parts.
- there can be multiple lines/decision boundaries to segregate the classes in n-dimensional space
- SVM algorithm finds the closest point of the lines from both the classes, these points are called support vectors
- the distance between the support vectors and the hyperplane is called the margin; the goal of SVM is to maximize this margin
- the hyperplane with maximum margin is called the optimal hyperplane.
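A minimal sketch of a linear SVM on toy 2-D data; scikit-learn's `support_vectors_` attribute exposes the support vectors that define the maximum-margin hyperplane. The data and parameters are illustrative assumptions.

```python
# Linear SVM: inspect the support vectors and the hyperplane parameters.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [1, 2], [2, 1],        # class 0
              [4, 4], [4, 5], [5, 4]])       # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)            # points closest to the separating hyperplane
print(clf.coef_, clf.intercept_)       # w and b of the hyperplane w.x + b = 0
print(clf.predict([[2, 2], [5, 5]]))   # expected: [0, 1]
```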
Types
- Linear SVM: used for linearly separable data, i.e. a dataset that can be classified into two classes by a single straight line
- Non-linear SVM: used for data that is not linearly separable; a kernel is used to map the data into a higher-dimensional space where it becomes separable
Kernel
- SVM algorithms use a set of mathematical functions that are defined as the kernel
- the function of a kernel is to take data as input and transform it into the required form
- different SVM algorithms use different types of kernel functions, for example linear, polynomial, radial basis function (RBF), and sigmoid
- the most commonly used kernel is RBF, because it has a localized and finite response along the entire x-axis
- kernel functions return the inner product between two points in a suitable feature space, thus defining a notion of similarity, with little computational cost even in very high-dimensional spaces
- examples:
- linear: \(K(x,z)=x^{T}z\)
- polynomial: \(K(x,z)=(x^{T}z+c)^{d}\)
- RBF (Gaussian): \(K(x,z)=\exp(-\gamma\lVert x-z\rVert^{2})\)
- sigmoid: \(K(x,z)=\tanh(\alpha x^{T}z+c)\)
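A hedged sketch computing these kernel functions directly with NumPy for two toy vectors; the hyper-parameter values (degree, \(\gamma\), and so on) are arbitrary choices for illustration.

```python
# Common SVM kernels evaluated for a pair of toy vectors.
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.5])

linear = x @ z                               # linear kernel: x . z
poly = (x @ z + 1.0) ** 3                    # polynomial kernel, c = 1, degree 3
rbf = np.exp(-0.5 * np.sum((x - z) ** 2))    # RBF kernel, gamma = 0.5
sig = np.tanh(0.1 * (x @ z))                 # sigmoid kernel, alpha = 0.1, c = 0

print(linear, poly, rbf, sig)
```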
Ensemble Models
Bagging
- objective is to create several subsets of data from training sample chosen randomly with replacement
- each subset of data is used to train its own decision tree
- we get an ensemble of different models
- the average of all the predictions from the different trees is used, which is more robust than a single decision tree classifier
Steps:
- Suppose there are \(N\) observations and \(M\) features in training data set, sample from training data set is taken randomly with replacement
- A subset of the \(M\) features is selected randomly, and whichever feature gives the best split is used to split the node iteratively
- The tree is grown to the largest extent possible
- The above steps are repeated \(n\) times and the prediction is given based on the aggregation of predictions from the \(n\) trees (sketched below)
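A hedged sketch of the bagging procedure above using scikit-learn's BaggingClassifier over decision trees; the synthetic dataset and parameter values are illustrative assumptions.

```python
# Bagging: bootstrap samples -> one decision tree per sample -> aggregated prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),   # base learner grown on each bootstrap sample
    n_estimators=25,            # number of bootstrap samples / trees
    bootstrap=True,             # sample with replacement
    random_state=0,
).fit(X_train, y_train)

print(bag.score(X_test, y_test))  # accuracy of the aggregated (majority-vote) prediction
```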
Advantages:
- Reduces over-fitting of the model.
- Handles higher dimensionality data very well.
- Maintains accuracy for missing data.
Disadvantages:
- Since the final prediction is based on the mean of the predictions from the subset trees, it won't give precise values for the classification and regression models.
Boosting
- used to create a collection of predictors
- learners are trained sequentially, with early learners fitting simple models to the data and later ones analysing the data for errors
- consecutive trees are fit and at every step, the goal is to improve the accuracy from the prior tree
- when an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly
- this process converts weak learners into a better-performing model
Steps:
- Draw a random subset of training samples \(d1\) without replacement from the training set \(D\) to train a weak learner \(C1\)
- Draw second random training subset \(d2\) without replacement from the training set and add \(50\) percent of the samples that were previously falsely classified/misclassified to train a weak learner \(C2\)
- Find the training samples \(d3\) in the training set \(D\) on which \(C1\) and \(C2\) disagree, and use them to train a third weak learner \(C3\)
- Combine all the weak learners via majority voting.
Advantages
- Supports different loss functions.
- Works well with interactions.
Disadvantages
- Prone to over-fitting.
- Requires careful tuning of different hyper-parameters.
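For reference, here is a hedged sketch of a boosted tree ensemble as typically used in practice: scikit-learn's GradientBoostingClassifier fits trees sequentially, each one correcting the errors of the previous ones, in the spirit of the scheme described above. The dataset and parameter values are illustrative assumptions.

```python
# Sequential (gradient) boosting of shallow decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=100,    # number of sequential trees
    learning_rate=0.1,   # shrinks each tree's contribution
    max_depth=3,         # keeps each learner weak
    random_state=0,
).fit(X_train, y_train)

print(gb.score(X_test, y_test))
```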
Adaboost
- Weak models are added sequentially, trained using the weighted training data.
- The training weights are updated giving more weight to incorrectly predicted instances, and less weight to correctly predicted instances.
- The process continues until a pre-set number of weak learners have been created (a user parameter) or no further improvement can be made on the training dataset.
- Once completed, you are left with a pool of weak learners each with a stage value.
- A stage value is calculated for the trained model which provides a weighting for any predictions that the model makes.
- Predictions are made by calculating the weighted average of the weak classifiers.
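A minimal AdaBoost sketch with scikit-learn, using decision stumps as the weak learners; the `estimator_weights_` attribute corresponds to the stage values described above. The synthetic dataset and parameters are illustrative assumptions.

```python
# AdaBoost: sequentially weighted decision stumps combined by a weighted vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # decision stumps as weak learners
    n_estimators=50,
    learning_rate=1.0,
    random_state=0,
).fit(X, y)

print(ada.estimator_weights_[:5])  # stage values of the first five stumps
print(ada.score(X, y))             # training accuracy of the weighted vote
```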
This post is licensed under CC BY 4.0 by the author.