IP University - ML - Unit 2 Notes
This post gives a quick review of the second unit of the Machine Learning (ML) subject in the IP University syllabus. It is a continuation of the first post on the subject.
Logistic Regression
- classification algorithm used to assign observations to a discrete set of classes
- Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value which can then be mapped to two or more discrete classes
- Types
- Binary (Pass/Fail)
- Multi (Cats, Dogs, Sheep)
- Ordinal (Low, Medium, High)
\[S(z)=\frac{1}{1+e^{-z}}\]
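As a minimal sketch (not from the syllabus) of how this works in practice, the snippet below trains binary logistic regression by gradient descent on the log-loss, using the sigmoid above; the toy dataset and learning rate are made-up assumptions for illustration.

```python
# Minimal logistic regression sketch: sigmoid + gradient descent on log-loss.
import numpy as np

def sigmoid(z):
    """Logistic sigmoid S(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: one feature x, binary label y (hypothetical values).
X = np.array([[0.5], [1.5], [2.0], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Add a bias column of ones so the weight vector includes an intercept.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
w = np.zeros(Xb.shape[1])

lr = 0.1
for _ in range(1000):
    p = sigmoid(Xb @ w)              # predicted probabilities
    grad = Xb.T @ (p - y) / len(y)   # gradient of the log-loss
    w -= lr * grad

print(sigmoid(np.array([1.0, 2.5]) @ w))  # probability that x = 2.5 belongs to class 1
```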
Perceptron
A perceptron is an algorithm used for supervised learning of binary classifiers. Binary classifiers decide whether an input, usually represented by a vector of numbers, belongs to a specific class.
In short, a perceptron is a single-layer neural network. It consists of four main parts: input values, synaptic weights and bias, a net sum, and an activation function.
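A small sketch of those four parts (inputs, weights and bias, net sum, step activation) learning the logical AND function; the data, learning rate, and number of passes are illustrative assumptions.

```python
# Minimal perceptron sketch: net sum -> step activation -> weight update.
import numpy as np

def step(z):
    """Step activation: 1 if the net sum is non-negative, else 0."""
    return 1 if z >= 0 else 0

# Toy data: learn the logical AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # synaptic weights
b = 0.0           # bias
lr = 0.1

for _ in range(10):                   # a few passes over the data
    for xi, target in zip(X, y):
        net = np.dot(w, xi) + b       # net sum
        error = target - step(net)
        w += lr * error * xi          # perceptron learning rule
        b += lr * error

print([step(np.dot(w, xi) + b) for xi in X])  # expected: [0, 0, 0, 1]
```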
Exponential Family
- a class of distributions that can all be written in the form
\[p(y;\eta)=b(y)\,\exp\left(\eta^{T}T(y)-a(\eta)\right)\]
- here \(\eta\) is the natural parameter, \(T(y)\) the sufficient statistic, \(a(\eta)\) the log-partition function, and \(b(y)\) the base measure
- Bernoulli, Gaussian, Poisson, and multinomial distributions are all members, which is why this family underlies generalised linear models such as logistic regression
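As a hedged worked example (a standard derivation, not taken from the notes), the Bernoulli distribution fits this form:

\[p(y;\phi)=\phi^{y}(1-\phi)^{1-y}=\exp\left(y\log\frac{\phi}{1-\phi}+\log(1-\phi)\right)\]

so \(\eta=\log\frac{\phi}{1-\phi}\), \(T(y)=y\), \(a(\eta)=\log(1+e^{\eta})\), and \(b(y)=1\); inverting the natural parameter gives \(\phi=\frac{1}{1+e^{-\eta}}\), the sigmoid used in logistic regression.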
Generative and Discriminative models
- training classifiers involves estimating \(f: x \rightarrow y\), or \(p(y\|x)\)
- both families of models are generally used in supervised learning problems
Discriminative model
- models the decision boundary between the classes
- learns the conditional probability distribution \(p(y\|x)\)
- they:
- assume some functional form for \(p(y\|x)\)
- estimate parameters of \(p(y\|x)\) directly from training data
- examples :
- logistic regression
- support vector machines
- traditional neural networks
- nearest neighbour
- conditional random fields (CRF’s)
Generative model
- explicitly models the actual distribution of each class
- learns the joint probability distribution \(p(x,y)\)
- predicts the conditional probability with the help of Bayes theorem
- they:
- assume some functional form for \(p(y), p(x\|y)\)
- estimate parameters of \(p(x\|y), p(y)\) directly from training data
- use Bayes rule to calculate \(p(y \|x)\)
- in the end, both approaches predict the same conditional probability \(p(y\|x)\)
- examples
- naïve Bayes
- Bayesian networks
- Markov random fields
- hidden Markov models (HMM’s)
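As a quick illustration of the contrast, here is a hedged scikit-learn sketch fitting a discriminative model (logistic regression) and a generative one (Gaussian naive Bayes) on the same toy data; both end up producing \(p(y\|x)\), but they get there differently.

```python
# Discriminative vs generative on the same synthetic data.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

disc = LogisticRegression().fit(X, y)   # discriminative: models p(y|x) directly
gen = GaussianNB().fit(X, y)            # generative: models p(x|y) and p(y), then applies Bayes rule

print(disc.predict_proba(X[:1]))        # both report a conditional probability p(y|x)
print(gen.predict_proba(X[:1]))
```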
Gaussian Discriminant Analysis
- a generative model, as described above
- the part still missing there is how we actually get \(P(x\|y)\)
- GDA makes the assumption that
- \(P(x\|y)\) is distributed according to a Multivariate normal distribution
- \(P(y)\) is distributed according to a Bernoulli distribution
- it means we know a little more about our data. With this additional help, we get better results than say, logistic regression
- also means that GDA usually requires less data as compared to logistic regression
- doesn’t work very well when our assumption doesn’t hold
Steps:
- Fit the model to the training data to obtain a discriminant function, which tells which class a data point has a higher probability of belonging to
- compute \(\mu\) and \(\Sigma\) for each class, then calculate the probability that a data point belongs to each class; the class with the highest probability is chosen (sketched below)
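A minimal sketch of the two steps above, assuming a tiny two-feature toy dataset: estimate the Bernoulli prior, the per-class means, and a shared covariance matrix, then pick the class with the higher posterior.

```python
# GDA sketch: Gaussian class-conditionals with a shared covariance + Bernoulli prior.
import numpy as np
from scipy.stats import multivariate_normal

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],   # class 0
              [4.0, 4.5], [4.2, 4.0], [3.8, 4.8]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

phi = y.mean()                                    # Bernoulli prior p(y = 1)
mu0 = X[y == 0].mean(axis=0)                      # class means
mu1 = X[y == 1].mean(axis=0)
centered = np.where(y[:, None] == 0, X - mu0, X - mu1)
sigma = centered.T @ centered / len(y)            # shared covariance matrix

def predict(x):
    p0 = (1 - phi) * multivariate_normal.pdf(x, mu0, sigma)  # p(y=0) p(x|y=0)
    p1 = phi * multivariate_normal.pdf(x, mu1, sigma)        # p(y=1) p(x|y=1)
    return int(p1 > p0)                                      # class with higher posterior

print(predict([1.1, 2.0]), predict([4.1, 4.3]))  # expected: 0 1
```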
Naive Bayes
- a supervised learning algorithm, which is based on Bayes theorem and used for solving classification problems
- mainly used in text classification that includes a high-dimensional training dataset
- one of the simplest and most effective classification algorithms
- a probabilistic classifier, which means it predicts on the basis of the probability of an object
- examples are spam filtering, sentiment analysis, and classifying articles
- it assumes that the occurrence of a certain feature is independent of the occurrence of other features
- uses Bayes theorem : \(p(C_{k}\mid \mathbf {x} )={\frac {p(C_{k})\ p(\mathbf {x} \mid C_{k})}{p(\mathbf {x} )}}\)
Working
- Convert the given dataset into frequency tables.
- Generate Likelihood table by finding the probabilities of given features.
- Now, use Bayes theorem to calculate the posterior probability
eg. \(p(Y\|sun)={\frac {p(Y)p(sun\|Y)}{p(sun)}}, p(N\|sun)={\frac {p(N)p(sun\|N)}{p(sun)}}\)
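A small worked version of the weather example above, with hypothetical counts, computing \(p(Y\|sun)\) directly from the frequency and likelihood tables via Bayes theorem.

```python
# Worked naive Bayes posterior on a tiny (hypothetical) weather / play dataset.
weather = ["Sunny", "Sunny", "Overcast", "Rainy", "Sunny", "Overcast",
           "Rainy", "Sunny", "Rainy", "Overcast"]
play =    ["No",    "Yes",   "Yes",      "Yes",   "Yes",   "Yes",
           "No",    "Yes",   "No",       "Yes"]

n = len(play)
p_yes = play.count("Yes") / n                     # prior p(Y)
p_sunny = weather.count("Sunny") / n              # evidence p(sun)
p_sunny_given_yes = sum(1 for w, p in zip(weather, play)
                        if w == "Sunny" and p == "Yes") / play.count("Yes")

p_yes_given_sunny = p_yes * p_sunny_given_yes / p_sunny   # Bayes theorem
print(round(p_yes_given_sunny, 3))                        # 0.75 for these counts
```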
Types
- Gaussian:
- assumes that features follow a normal distribution
- if predictors take continuous values instead of discrete, then the model assumes that these values are sampled from the Gaussian distribution
- Multinomial:
- used when the data is multinomial distributed
- used for document classification problems
- uses the frequency of words for the predictors
- Bernoulli:
- works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables
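A quick hedged sketch of the three variants using scikit-learn's GaussianNB, MultinomialNB, and BernoulliNB; the tiny datasets are made up only to match the expected feature types.

```python
# The three naive Bayes variants on data matching their feature assumptions.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

X_cont = np.array([[1.2], [0.8], [3.1], [2.9]])                     # continuous feature
X_counts = np.array([[2, 0, 1], [0, 3, 1], [1, 1, 4], [0, 2, 3]])   # word-count features
X_bool = (X_counts > 0).astype(int)                                 # word present / absent
y = np.array([0, 0, 1, 1])

print(GaussianNB().fit(X_cont, y).predict([[3.0]]))            # Gaussian features
print(MultinomialNB().fit(X_counts, y).predict([[1, 0, 3]]))   # multinomial (count) features
print(BernoulliNB().fit(X_bool, y).predict([[1, 0, 1]]))       # Boolean features
```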
Support Vector Machines
- are supervised learning models with associated learning algorithms that analyze data for classification and regression
- goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future
- this decision boundary is called a hyperplane
- a data point is viewed as a \(p\)-dimensional vector
- we want to know whether we can separate such points with a \((p-1)\)-dimensional hyperplane - this is called a linear classifier
- SVMs can also efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces
Hyperplane and Support vector
- A hyperplane in an \(n\)-dimensional Euclidean space is a flat, \((n-1)\)-dimensional subset of that space that divides the space into two disconnected parts.
- there can be multiple lines/decision boundaries to segregate the classes in n-dimensional space
- SVM algorithm finds the closest point of the lines from both the classes, these points are called support vectors
- the distance between the support vectors and the hyperplane is called the margin; the goal of SVM is to maximize this margin
- the hyperplane with maximum margin is called the optimal hyperplane.
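A minimal sketch of a linear SVM on toy 2-D data; scikit-learn's `support_vectors_` attribute exposes the support vectors that define the maximum-margin hyperplane. The data and parameters are illustrative assumptions.

```python
# Linear SVM: inspect the support vectors and the hyperplane parameters.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [1, 2], [2, 1],        # class 0
              [4, 4], [4, 5], [5, 4]])       # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)            # points closest to the separating hyperplane
print(clf.coef_, clf.intercept_)       # w and b of the hyperplane w.x + b = 0
print(clf.predict([[2, 2], [5, 5]]))   # expected: [0, 1]
```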
Types
- Linear SVM: used for linearly separable data, i.e. a dataset that can be classified into two classes by a single straight line
- Non-linear SVM: used for data that is not linearly separable; a kernel is used to map the data into a higher-dimensional space where it becomes separable
Kernel
- SVM algorithms use a set of mathematical functions that are defined as the kernel
- the function of a kernel is to take data as input and transform it into the required form
- different SVM algorithms use different types of kernel functions, for example linear, polynomial, radial basis function (RBF), and sigmoid
- the most commonly used kernel is RBF, because it has a localized and finite response along the entire x-axis
- kernel functions return the inner product between two points in a suitable feature space, thus defining a notion of similarity, with little computational cost even in very high-dimensional spaces
- examples:
- linear: \(K(x,z)=x^{T}z\)
- polynomial: \(K(x,z)=(x^{T}z+c)^{d}\)
- RBF (Gaussian): \(K(x,z)=\exp(-\gamma\lVert x-z\rVert^{2})\)
- sigmoid: \(K(x,z)=\tanh(\alpha x^{T}z+c)\)
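A hedged sketch computing these kernel functions directly with NumPy for two toy vectors; the hyper-parameter values (degree, \(\gamma\), and so on) are arbitrary choices for illustration.

```python
# Common SVM kernels evaluated for a pair of toy vectors.
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.5])

linear = x @ z                               # linear kernel: x . z
poly = (x @ z + 1.0) ** 3                    # polynomial kernel, c = 1, degree 3
rbf = np.exp(-0.5 * np.sum((x - z) ** 2))    # RBF kernel, gamma = 0.5
sig = np.tanh(0.1 * (x @ z))                 # sigmoid kernel, alpha = 0.1, c = 0

print(linear, poly, rbf, sig)
```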
Ensemble Models
Bagging
- objective is to create several subsets of data from training sample chosen randomly with replacement
- each subset of data is used to train its own decision tree
- we get an ensemble of different models
- the average of all the predictions from the different trees is used, which is more robust than a single decision tree classifier
Steps:
- Suppose there are \(N\) observations and \(M\) features in training data set, sample from training data set is taken randomly with replacement
- A subset of the \(M\) features is selected randomly, and whichever feature gives the best split is used to split the node iteratively
- The tree is grown to the largest extent possible
- The above steps are repeated \(n\) times and the prediction is given based on the aggregation of predictions from the \(n\) trees (sketched below)
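A hedged sketch of the bagging procedure above using scikit-learn's BaggingClassifier over decision trees; the synthetic dataset and parameter values are illustrative assumptions.

```python
# Bagging: bootstrap samples -> one decision tree per sample -> aggregated prediction.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),   # base learner grown on each bootstrap sample
    n_estimators=25,            # number of bootstrap samples / trees
    bootstrap=True,             # sample with replacement
    random_state=0,
).fit(X_train, y_train)

print(bag.score(X_test, y_test))  # accuracy of the aggregated (majority-vote) prediction
```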
Advantages:
- Reduces over-fitting of the model.
- Handles higher dimensionality data very well.
- Maintains accuracy for missing data.
Disadvantages:
- Since the final prediction is based on the mean of the predictions from the subset trees, it won't give precise values for the classification and regression models.
Boosting
- used to create a collection of predictors
- learners are trained sequentially, with early learners fitting simple models to the data and later ones analysing the data for errors
- consecutive trees are fit and at every step, the goal is to improve the accuracy from the prior tree
- when an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly
- this process converts weak learners into a better-performing model
Steps:
- Draw a random subset of training samples \(d1\) without replacement from the training set \(D\) to train a weak learner \(C1\)
- Draw second random training subset \(d2\) without replacement from the training set and add \(50\) percent of the samples that were previously falsely classified/misclassified to train a weak learner \(C2\)
- Find the training samples \(d3\) in the training set \(D\) on which \(C1\) and \(C2\) disagree, and use them to train a third weak learner \(C3\)
- Combine all the weak learners via majority voting.
Advantages
- Supports different loss functions.
- Works well with interactions.
Disadvantages
- Prone to over-fitting.
- Requires careful tuning of different hyper-parameters.
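For reference, here is a hedged sketch of a boosted tree ensemble as typically used in practice: scikit-learn's GradientBoostingClassifier fits trees sequentially, each one correcting the errors of the previous ones, in the spirit of the scheme described above. The dataset and parameter values are illustrative assumptions.

```python
# Sequential (gradient) boosting of shallow decision trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=100,    # number of sequential trees
    learning_rate=0.1,   # shrinks each tree's contribution
    max_depth=3,         # keeps each learner weak
    random_state=0,
).fit(X_train, y_train)

print(gb.score(X_test, y_test))
```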
Adaboost
- Weak models are added sequentially, trained using the weighted training data.
- The training weights are updated giving more weight to incorrectly predicted instances, and less weight to correctly predicted instances.
- The process continues until a pre-set number of weak learners have been created (a user parameter) or no further improvement can be made on the training dataset.
- Once completed, you are left with a pool of weak learners each with a stage value.
- A stage value is calculated for the trained model which provides a weighting for any predictions that the model makes.
- Predictions are made by calculating the weighted average of the weak classifiers.
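A minimal AdaBoost sketch with scikit-learn, using decision stumps as the weak learners; the `estimator_weights_` attribute corresponds to the stage values described above. The synthetic dataset and parameters are illustrative assumptions.

```python
# AdaBoost: sequentially weighted decision stumps combined by a weighted vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),  # decision stumps as weak learners
    n_estimators=50,
    learning_rate=1.0,
    random_state=0,
).fit(X, y)

print(ada.estimator_weights_[:5])  # stage values of the first five stumps
print(ada.score(X, y))             # training accuracy of the weighted vote
```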
This post is licensed under CC BY 4.0 by the author.