AWS ML Speciality (Part 3.2)

08 Feb 2023

This post gives a quick review of selecting the appropriate model(s) for a given machine learning problem.

Select the appropriate model(s) for an ML problem

The four aspects of a problem used for model selection:

Data types and format
Learning paradigm or domain
Problem type
Use case examples

Linear regression

linear approach for modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables)
- only one input variable -> simple linear regression
- more than one input variable -> multiple linear regression
Cost function measures how well a linear regression model is performing (how accurately the mapping function / hypothesis function fits the data); the regression coefficients or weights are optimized by minimizing it. Eg : Mean Squared Error (MSE) cost function
use gradient descent to update the coefficients of the line step by step so that the cost function decreases
\(R^{2}\) (coefficient of determination) is a statistical measure of fit that indicates how much of the variation of the dependent variable is explained by the independent variable(s) in a regression model
- usually ranges from 0 to 1 (or 0 to 100%); higher means more of the variation is explained
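A minimal sketch of the ideas above, assuming a single input variable: batch gradient descent on the MSE cost, followed by \(R^{2}\) for the fitted line. The function names, learning rate, epoch count and toy data are illustrative choices, not taken from any library.

```python
def fit_line(xs, ys, lr=0.05, epochs=5000):
    """Fit y = w*x + b by batch gradient descent on the MSE cost."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of MSE = (1/n) * sum((w*x + b - y)^2) w.r.t. w and b
        grad_w = (2.0 / n) * sum((w * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((w * x + b - y) for x, y in zip(xs, ys))
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def r_squared(xs, ys, w, b):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Toy data (assumed for illustration), roughly y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]
w, b = fit_line(xs, ys)
r2 = r_squared(xs, ys, w, b)
```

The learning rate has to be small enough for the updates to converge; on differently scaled data it would likely need retuning.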
Assumptions of Linear Regression
- Linear relationship between the features and target
- Small or no multicollinearity between the features
- homoscedasticity: errors are of the same variance throughout
- Normal distribution of error terms:
  - checked using the q-q plot. If the plot shows a straight line without any deviation, the errors are normally distributed
- No autocorrelation: autocorrelation occurs when there is a dependency between residual errors
polynomial regression : the original features are converted into polynomial features of the required degree \(2,3,..,n\) and then modeled using a linear model

Logistic regression

used for predicting a categorical dependent variable from a given set of independent variables
uses the logistic/sigmoid function \(h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}\)
its output is a probability, which lies between 0 and 1
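A small sketch of the hypothesis function above, assuming the weights \(\theta\) are already known. The weight values, the leading bias feature and the 0.5 decision threshold are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function; maps any real z into (0, 1)."""
    # Note: math.exp can overflow for very large negative z in this naive form.
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """h_theta(x) = sigmoid(theta^T x)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)


theta = [-1.0, 2.0]   # hypothetical weights; the first one acts as the bias
x = [1.0, 0.5]        # leading 1.0 is the constant bias feature
p = predict_proba(theta, x)          # here theta^T x = 0
label = 1 if p >= 0.5 else 0         # threshold the probability to get a class
```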
Assumptions of Logistic Regression
- The dependent variable must be categorical in nature.
- The independent variables should not be multicollinear.

K-nearest neighbour

non-parametric, lazy learner algorithm
for classification, the output is a class membership: the object is assigned to the class most common among its k nearest neighbors
for regression, the output is the average of the values of the k nearest neighbors
larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct
Advantages : simple, robust to noisy training data, can be effective when the training data is large
Disadvantages : the value of k has to be chosen, and computation cost is high because the distance to every training sample must be calculated
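A minimal sketch of both uses, assuming Euclidean distance and brute-force search over all training points. The function names and toy data are illustrative only.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """Return the k training items closest to the query (Euclidean distance)."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]


def knn_classify(train, query, k=3):
    """Majority vote among the k nearest neighbours."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Average of the target values of the k nearest neighbours."""
    neighbours = k_nearest(train, query, k)
    return sum(value for _, value in neighbours) / len(neighbours)


points = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
          ((4.0, 4.2), "b"), ((4.1, 3.9), "b"), ((3.8, 4.0), "b")]
class_near_a = knn_classify(points, (1.1, 1.0))
class_near_b = knn_classify(points, (4.0, 4.0))

values = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]
estimate = knn_regress(values, (2.0,), k=3)
```

There is no training step, which is why the algorithm is called lazy: all the distance work happens at prediction time.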

Support Vector Machines

maps training examples to points in space so as to maximise the width of the gap between the two categories
new examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall
best decision boundary is called a hyperplane
extreme points/vectors that help in creating the hyperplane are called support vectors
kernel functions : linear, polynomial, gaussian radial basis functions, sigmoid …
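The kernels listed above can be written as plain functions of two vectors. This is only a sketch; the parameter names (`degree`, `gamma`, `coef0`) and their default values are assumptions chosen for illustration.

```python
import math


def linear_kernel(x, y):
    """Plain dot product."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) ** degree"""
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian radial basis function: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=0.1, coef0=0.0):
    """tanh(gamma * x . y + coef0)"""
    return math.tanh(gamma * linear_kernel(x, y) + coef0)


u, v = (1.0, 2.0), (3.0, 4.0)
k_lin = linear_kernel(u, v)        # 1*3 + 2*4
k_poly = polynomial_kernel(u, v)   # (11 + 1) ** 2
k_rbf_same = rbf_kernel(u, u)      # identical points give the maximum value
k_rbf = rbf_kernel(u, v)           # squared distance is 8
```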

Naïve Bayes

family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong (naive) independence assumptions between the features
advantage of naive Bayes is that it requires only a small amount of training data to estimate the parameters necessary for classification
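A rough sketch of a naive Bayes classifier for categorical features. The independence assumption shows up as a simple sum of per-feature log-probabilities. Laplace (add-one) smoothing, the function names and the toy weather data are assumptions added for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feat_counts = defaultdict(Counter)   # (class, feature index) -> value counts
    values = defaultdict(set)            # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            feat_counts[(label, i)][v] += 1
            values[i].add(v)
    return class_counts, feat_counts, values


def predict_nb(model, row):
    """Pick the class with the highest log posterior (up to a constant)."""
    class_counts, feat_counts, values = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)    # log prior
        for i, v in enumerate(row):
            # Add-one smoothing so unseen values do not give probability zero
            score += math.log((feat_counts[(c, i)][v] + 1) / (n_c + len(values[i])))
        if score > best_score:
            best, best_score = c, score
    return best


rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes", "yes", "yes"]
model = train_nb(rows, labels)
pred_rainy = predict_nb(model, ("rainy", "cool"))
pred_sunny = predict_nb(model, ("sunny", "hot"))
```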

K-means

Decision Trees

tree is built by splitting the source set, constituting the root node of the tree, into subsets—which constitute the successor children
splitting is based on a set of splitting rules based on classification features
process is repeated on each derived subset in a recursive manner called recursive partitioning
recursion is completed when the subset at a node has all the same values of the target variable, or when splitting no longer adds value to the predictions
Entropy is a measure of disorder or uncertainty
- \(E(S) = - \sum_{i=1}^{c} p_i \log_2 p_i\)
we need a metric to measure the reduction of this disorder in our target variable/class given additional information (features / independent variables) about it. This is Information Gain
- \(IG(Y,X) = E(Y) - E(Y \mid X)\)
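A short sketch of both quantities for a categorical feature, where the conditional entropy \(E(Y \mid X)\) is the size-weighted average of the entropies of the subsets produced by splitting on \(X\). The function names and toy labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """E(S) = -sum(p_i * log2(p_i)) over the classes present in labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """IG(Y, X) = E(Y) - E(Y | X) for a categorical feature X."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional


target = ["yes", "yes", "no", "no"]
perfect_split = ["a", "a", "b", "b"]   # separates the classes completely
useless_split = ["a", "b", "a", "b"]   # each subset is still 50/50

base = entropy(target)
gain_perfect = information_gain(perfect_split, target)
gain_useless = information_gain(useless_split, target)
```

A tree builder would evaluate this gain for every candidate feature and split on the one with the highest value.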

Ensemble

Stacking

architecture of a stacking model involves two or more base models, often referred to as level-0 models, and a meta-model that combines the predictions of the base models, referred to as a level-1 model
meta-model is trained on predictions made by base models on out-of-sample data
data not used to train the base models is fed to the base models, predictions are made, and these predictions, along with the expected outputs, provide the input and output pairs of the training dataset used to fit the meta-model
common approach to preparing the training dataset for the meta-model is via k-fold cross-validation of the base models, where the out-of-fold predictions are used as the basis for the training dataset for the meta-model
appropriate when multiple different machine learning models have skill on a dataset, but have skill in different ways
meta-model is often simple, such as linear regression (for regression tasks) or logistic regression (for classification tasks)
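A sketch of how the out-of-fold predictions for the meta-model could be built with k-fold cross-validation. The two base models here (a mean predictor and a 1-nearest-neighbour regressor) are deliberately trivial stand-ins, and all names are hypothetical.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size


def fit_mean(xs, ys):
    """Base model 1: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_nearest(xs, ys):
    """Base model 2: predict the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def out_of_fold_predictions(xs, ys, fitters, k=3):
    """Row i holds each base model's prediction for sample i,
    made by a model that never saw sample i during fitting."""
    oof = [[None] * len(fitters) for _ in xs]
    for train, val in k_fold_indices(len(xs), k):
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        for j, fit in enumerate(fitters):
            model = fit(tx, ty)
            for i in val:
                oof[i][j] = model(xs[i])
    return oof


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
meta_features = out_of_fold_predictions(xs, ys, [fit_mean, fit_nearest], k=3)
```

The pairs `(meta_features[i], ys[i])` would then form the training set of the level-1 meta-model.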

Blending

Blending describes the specific application of stacking where the meta-model is trained on the predictions made by the base models on a hold-out validation dataset. In this context, the term stacking is reserved for a meta-model that is trained on out-of-fold predictions during a cross-validation procedure

Bagging

objective is to create several subsets of data from the training sample, chosen randomly with replacement
each subset is used to train its own decision tree
we get an ensemble of different models
the average of the predictions from the different trees is used, which is more robust than a single decision tree
Advantages: reduces over-fitting, handles higher dimensionality data, maintains accuracy for missing data
Disadvantages: since the final prediction is the mean of the predictions from the subset trees, it might be imprecise
Random forests : bagging done on decision trees, where each split additionally considers only a random subset of the features
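A minimal sketch of bagging, assuming one numeric feature and depth-1 regression trees (stumps) as the base learners instead of full decision trees. The names, the number of models and the seed are illustrative assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def fit_stump(xs, ys):
    """Depth-1 regression tree: pick the threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:          # threshold at the maximum splits nothing
            continue
        lm, rm = mean(left), mean(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    if best is None:           # all x values identical: fall back to the mean
        m = mean(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def bagged_stumps(xs, ys, n_models=25, seed=0):
    """Train each stump on a bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]   # sampling with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(m(x) for m in models) / len(models)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
predict = bagged_stumps(xs, ys)
low, high = predict(1.0), predict(8.0)
```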

Boosting

used to create a collection of predictors
learners are trained sequentially, with early learners fitting simple models to the data, which is then analysed for errors
consecutive trees are fit and at every step, the goal is to improve the accuracy from the prior tree
when an input is misclassified by a hypothesis, its weight is increased so that the next hypothesis is more likely to classify it correctly
process converts weak learners into a better-performing model
Advantages : supports different loss functions, works well with interactions
Disadvantages : prone to over-fitting, requires careful tuning of different hyperparameters
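The reweighting step can be sketched with the AdaBoost update rule. AdaBoost is one specific boosting algorithm, chosen here as an assumption since the notes above do not name one; the function name and toy weights are illustrative.

```python
import math


def reweight(weights, correct):
    """One AdaBoost-style round.

    weights: current example weights (summing to 1)
    correct: booleans, True where the current weak learner was right
    Returns the normalised new weights and the learner's vote alpha.
    """
    err = sum(w for w, ok in zip(weights, correct) if not ok)   # weighted error
    alpha = 0.5 * math.log((1.0 - err) / err)                   # needs 0 < err < 1
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return [w / total for w in raw], alpha


weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]      # the last example was misclassified
new_weights, alpha = reweight(weights, correct)
```

After the update the misclassified example carries half of the total weight, so the next weak learner is pushed to get it right.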

Gradient boosting

objective here is to minimize the loss function by adding weak learners using gradient descent. Since it is based on a loss function, the loss is chosen to suit the task, e.g. mean squared error for regression and negative log-likelihood (log loss) for classification.
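A rough sketch for the regression case, assuming squared-error loss, one numeric feature and depth-1 trees as weak learners. For this loss the negative gradient is simply the residual, so each new stump is fit to the residuals of the current ensemble. The names, round count and learning rate are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def fit_stump(xs, ys):
    """Depth-1 regression tree: pick the threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not right:
            continue
        lm, rm = mean(left), mean(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best    # assumes at least two distinct x values
    return lambda x: lm if x <= t else rm


def gradient_boost(xs, ys, n_rounds=50, lr=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = mean(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # Residuals = negative gradient of 0.5 * (y - pred)^2 w.r.t. pred
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)


def mse(predict, xs, ys):
    return mean([(y - predict(x)) ** 2 for x, y in zip(xs, ys)])


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = gradient_boost(xs, ys)
baseline_mse = mse(lambda x: mean(ys), xs, ys)
boosted_mse = mse(model, xs, ys)
```

The small learning rate shrinks each stump's contribution, trading more rounds for a smoother fit.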

XGBoost

open-source implementation of the gradient boosted trees algorithm
XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function (based on the difference between the predicted and target outputs) and a penalty term for model complexity (in other words, the regression tree functions)
training proceeds iteratively, adding new trees that predict the residuals or errors of prior trees that are then combined with previous trees to make the final prediction. It’s called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models

RNN

class of artificial NNs where connections between nodes can create a cycle -> allowing output from some nodes to affect subsequent input to the same nodes -> exhibit temporal dynamic behavior
can use their internal state (memory) to process variable length sequences of inputs
recurrent networks can have additional stored states, and the storage can be under direct control by the neural network -> referred to as gated state or gated memory, which is part of LSTMs and GRUs
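A toy sketch of the recurrence, assuming a single scalar hidden state with a tanh activation and no gating. The weights are arbitrary illustrative values, not trained ones.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit recurrent cell over a sequence of any length.

    h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)
    """
    h = 0.0                      # initial internal state (memory)
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)   # previous state feeds back in
        states.append(h)
    return states


short = rnn_forward([1.0, 0.0, 0.0])
longer = rnn_forward([1.0, 0.0, 0.0, 0.0, 0.0])
```

Even though the later inputs are zero, the state stays non-zero for a while because the first input is carried forward through `w_h`.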

CNN

class of artificial NNs that use a mathematical operation called convolution in place of general matrix multiplication in at least one of their layers
specifically designed to process pixel data and are used in image recognition and processing
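A bare-bones sketch of the operation a convolutional layer applies, assuming a single channel, stride 1 and no padding. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation. The image and kernel values are made up for illustration.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# 4x4 image with a vertical edge between columns 1 and 2
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# 2x2 kernel that responds to a dark-to-bright change from left to right
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, kernel)
```

The same small kernel is reused at every position, which is what keeps the number of weights low compared with a fully connected layer.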

Transfer learning

is about leveraging feature representations from a pre-trained model, so you don’t have to train a new model from scratch
pre-trained models are usually trained on massive datasets that are standard CV benchmarks
including the pre-trained models in a new model leads to lower training time and lower generalization error
useful when you have a small training dataset
- you can use the weights from the pre-trained models to initialize the weights of the new model
can also be applied to NLP problems

Incremental training

use the artifacts from an existing model and use an expanded dataset to train a new model
saves both time and resources
use it to:
- train a new model using an expanded dataset that contains an underlying pattern that was not accounted for in the previous training and which resulted in poor model performance
- use the model artifacts or a portion of the model artifacts from a popular publicly available model in a training job
- resume a training job that was stopped
- train several variants of a model, either with different hyperparameter settings or using different datasets
SageMaker built-in algorithms that currently support incremental training: Object Detection, Image Classification, and Semantic Segmentation
Amazon Rekognition does not support incremental training

Collaborative filtering

is a technique used by recommender systems
based on (user, item, rating) tuples
unlike content-based filtering, it leverages other users’ experiences
concept behind collaborative filtering is that users with similar tastes (based on observed user-item interactions) are more likely to have similar interactions with items they haven’t seen before
provides better results for:
- diversity (how dissimilar recommended items are)
- serendipity (a measure of how surprising the successful or relevant recommendations are)
- novelty (how unknown recommended items are to a user)
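A small sketch of user-based collaborative filtering on (user, item, rating) tuples. Mean-centred cosine similarity over co-rated items is one common choice and is an assumption here, as are the function names and the toy ratings.

```python
import math
from collections import defaultdict


def build_profiles(ratings):
    """Turn (user, item, rating) tuples into {user: {item: rating}}."""
    profiles = defaultdict(dict)
    for user, item, rating in ratings:
        profiles[user][item] = rating
    return profiles


def predict_rating(profiles, user, item):
    """Predict a rating from other users' deviations from their own means,
    weighted by how similar their tastes are to the target user's."""
    means = {u: sum(p.values()) / len(p) for u, p in profiles.items()}

    def similarity(u, v):
        common = set(profiles[u]) & set(profiles[v])
        du = [profiles[u][i] - means[u] for i in common]
        dv = [profiles[v][i] - means[v] for i in common]
        norm = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
        return sum(a * b for a, b in zip(du, dv)) / norm if norm else 0.0

    num = den = 0.0
    for other, profile in profiles.items():
        if other == user or item not in profile:
            continue
        s = similarity(user, other)
        num += s * (profile[item] - means[other])
        den += abs(s)
    return means[user] + num / den if den else means[user]


ratings = [
    ("ann", "m1", 5), ("ann", "m2", 4),
    ("bob", "m1", 5), ("bob", "m2", 4), ("bob", "m3", 5),
    ("cat", "m1", 1), ("cat", "m2", 2), ("cat", "m3", 1),
]
profiles = build_profiles(ratings)
ann_m3 = predict_rating(profiles, "ann", "m3")
```

Here ann has never rated m3; the prediction ends up above her own average because bob, whose preferences line up with hers, rated it relatively highly, while cat, whose preferences run opposite, rated it relatively low.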

Sharat Sachin Maximizing my potential
