# IP University - ML - Unit 2 Notes

This post gives a quick review on the second unit of the syllabus for the Machine Learning (ML) subject in the IP University syllabus. It is a continuation of the first post on the subject.

## Logistic Regression

• classification algorithm used to assign observations to a discrete set of classes
• Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value which can then be mapped to two or more discrete classes
• Types
• Binary (Pass/Fail)
• Multi (Cats, Dogs, Sheep)
• Ordinal (Low, Medium, High)

$S(z)=\frac{1}{1+e^{−z}}$ Cost function :  ## Perceptron

A Perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers decide whether an input, usually represented by a series of vectors, belongs to a specific class.

In short, a perceptron is a single-layer neural network. They consist of four main parts including input values, synaptic weights and bias, net sum, and an activation function. ## Exponential Family   ## Generative and Discriminative models

• training classifiers involve estimating $$f: x \rightarrow y$$, or $$p(y|x)$$
• both of these models were generally used in supervised learning problems

#### Discriminative model

• models the decision boundary between the classes
• learns the conditional probability distribution $$p(y|x)$$
• they:
1. assume some functional form for $$p(y|x)$$
2. estimate parameters of $$p(y|x)$$ directly from training data
• examples :
• logistic regression
• scalar vector machine
• nearest neighbour
• conditional random fields (CRF’s)

#### Generative model

• explicitly models the actual distribution of each class
• learns the joint probability distribution $$p(x,y)$$
• predicts the conditional probability with the help of bayes theorem
• they:
1. assume some functional form for $$p(y), p(x|y)$$
2. estimate parameters of $$p(x|y), p(y)$$ directly from training data
3. use bayes rule to calculate $$p(y |x)$$
• in the end both of them is predicting the conditional probability $$p(y|x)$$
• examples
• naïve Bayes
• Bayesian networks
• Markov random fields
• hidden Markov models (HMM’s)

## Gaussian Discriminant Analysis

• explain generative models
• there’s one part missing: how do we get $$P(x|y)$$
• GDA makes the assumption that
• $$P(x|y)$$ is distributed according to a Multivariate normal distribution
• $$P(y)$$ is distributed according to a Bernoulli distribution
• it means we know a little more about our data. With this additional help, we get better results than say, logistic regression
• also means that GDA usually requires less data as compared to logistic regression
• doesn’t work very well when our assumption doesn’t hold

#### Steps:

1. Train the data and obtain a discriminate fn, tells which class a data point has higher prob of belonging to
2. compute $$\mu$$ and $$\sigma$$ for each class, then calculate prob that data belongs to it, class with highest prob chosen  ## Naive Bayes

• a supervised learning algorithm, which is based on Bayes theorem and used for solving classification problems
• mainly used in text classification that includes a high-dimensional training dataset
• simple and most effective Classification algorithms
• probabilistic classifier, which means it predicts on the basis of the probability of an object
• examples are spam filtration, sentimental analysis, and classifying articles
• it assumes that the occurrence of a certain feature is independent of the occurrence of other features
• uses Bayes theorem : $$p(C_{k}\mid \mathbf {x} )={\frac {p(C_{k})\ p(\mathbf {x} \mid C_{k})}{p(\mathbf {x} )}}$$

#### Working

1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability

eg. $$p(Y|sun)={\frac {p(Y)p(sun|Y)}{p(sun)}}, p(N|sun)={\frac {p(N)p(sun|N)}{p(sun)}}$$

#### Types

1. Gaussian:
• assumes that features follow a normal distribution
• if predictors take continuous values instead of discrete, then the model assumes that these values are sampled from the Gaussian distribution
2. Multinomial:
• used when the data is multinomial distributed
• used for document classification problems
• uses the frequency of words for the predictors
3. Bernoulli:
• works similar to the Multinomial classifier, but the predictor variables are the independent Booleans variables

## Support Vector Machines

• are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis
• goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future
• this decision boundary is called a hyperplane
• a data point is viewed as a $$p$$-dimensional vector
• we want to know whether we can separate such points with a $$(p-1)$$-dimensional hyperplane - this is called a linear classifier
• SVMs can efficiently also perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces

#### Hyperplane and Support vector

• A hyperplane in an n-dimensional Euclidean space is a flat, n-1 dimensional subset of that space that divides the space into two disconnected parts.
• there can be multiple lines/decision boundaries to segregate the classes in n-dimensional space
• SVM algorithm finds the closest point of the lines from both the classes, these points are called support vectors
• the distance between the vectors and the hyperplane is called as margin, the goal of SVM is to maximize this margin
• the hyperplane with maximum margin is called the optimal hyperplane.

#### Types

1. Linear SVM 2. Non-linear SVM, by adding a new dimension, $$z = x^2 + y^2$$ #### Kernel

• SVM algorithms use a set of mathematical functions that are defined as the kernel
• function of kernel is to take data as input and transform it into the required form
• different SVM algorithms use different types of kernel functions. These functions can be different types. For example linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid
• most used type of kernel function is RBF. Because it has localized and finite response along the entire x-axis
• kernel functions return the inner product between two points in a suitable feature space, thus defining a notion of similarity, with little computational cost even in very high-dimensional spaces
• examples:
1. Polynomial 2. Gaussian 3. Gaussian radial basis function 4. Laplace RBF ## Ensemble Models

### Bagging • objective is to create several subsets of data from training sample chosen randomly with replacement
• each collection of subset data is used to train their decision trees
• we get an ensemble of different models
• average of all the predictions from different trees are used which is more robust than a single decision tree classifier

#### Steps:

1. Suppose there are $$N$$ observations and $$M$$ features in training data set, sample from training data set is taken randomly with replacement
2. A subset of $$M$$ features are selected randomly and whichever feature gives the best split is used to split the node iteratively
3. The tree is grown to the largest
4. Above steps are repeated $$n$$ times and prediction is given based on the aggregation of predictions from $$n$$ number of trees.

• Reduces over-fitting of the model.
• Handles higher dimensionality data very well.
• Maintains accuracy for missing data.

• Since final prediction is based on the mean predictions from subset trees, it won’t give precise values for the classification and regression model.

### Boosting • used to create a collection of predictors
• learners are learned sequentially with early learners fitting simple models to the data and then analysing data for errors
• consecutive trees are fit and at every step, the goal is to improve the accuracy from the prior tree
• when an input is misclassified by a hypothesis, its weight is increased so that next hypothesis is more likely to classify it correctly
• process converts weak learners into better performing model.

#### Steps:

1. Draw a random subset of training samples $$d1$$ without replacement from the training set $$D$$ to train a weak learner $$C1$$
2. Draw second random training subset $$d2$$ without replacement from the training set and add $$50$$ percent of the samples that were previously falsely classified/misclassified to train a weak learner $$C2$$
3. Find the training samples d3 in the training set D on which $$C1$$ and $$C2$$ disagree to train a third weak learner $$C3$$
4. Combine all the weak learners via majority voting.

• Supports different loss function
• Works well with interactions.

• Prone to over-fitting.
• Requires careful tuning of different hyper-parameters. 