IP University - ML - Unit 3 Notes

24 Jul 2020

This post gives a quick review on the third unit of the syllabus for the Machine Learning (ML) subject in the IP University syllabus. It is a continuation of the second post on the subject.

Unsupervised Learning

type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision
forms one of the three main categories of machine learning
Some of the most common algorithms used in unsupervised learning include:
1. Clustering
2. Anomaly detection
3. Neural Networks
4. Approaches for learning latent variable models.

Clustering

task of dividing the population or data points into a number of groups such that
- data points in the same groups are more similar to other data points in the same group than those in other groups
- to segregate groups with similar traits and assign them into clusters
is used for analyzing data which does not include pre-labeled classes
grouped together using the concept of maximizing intra class similarity and minimizing the similarity between differing classes
applications
- pattern recognition
- image analysis
- information retrieval
- bioinformatics
- data compression
- computer graphics
- machine learning

K-Means clustering

Given an initial set of \(k\) means \(m_1, m_2, …, m_k\) (see below), the algorithm proceeds by alternating between two steps:

Assignment step: Assign each observation to the cluster with the nearest mean: that with the least squared Euclidean distance. Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means. Each observation is assigned to exactly one cluster, even if it could be assigned to two or more of them.

Update step: Recalculate means (centroids) for observations assigned to each cluster.

algorithm has converged when the assignments no longer change
does not guarantee to find the optimum
assigning objects to the nearest cluster by distance
a different distance function other than (squared) Euclidean distance may stop the algorithm from converging

Maximum Likelihood Estimation

common modeling problem involves how to estimate a joint probability distribution for a dataset
Density estimation involves selecting a probability distribution function and the parameters of that distribution that best explain the joint probability distribution of the observed data.
many techniques for solving this problem, although a common approach is called maximum likelihood estimation, or simply “maximum likelihood”
Maximum Likelihood Estimation involves treating the problem as an optimization or search problem, where we seek a set of parameters that results in the best fit for the joint probability of the data sample
limitation of maximum likelihood estimation is that it assumes that the dataset is complete, or fully observed. This does not mean that the model has access to all data; instead, it assumes that all variables that are relevant to the problem are present.

This is not always the case. There may be datasets where only some of the relevant variables can be observed, and some cannot, and although they influence other random variables in the dataset, they remain hidden, these unobserved or hidden variables are referred to as latent variables.

Mixture Model

mixture model is a model comprised of an unspecified combination of multiple probability distribution functions
statistical procedure or learning algorithm is used to estimate the parameters of the probability distributions to best fit the density of a given training dataset
Gaussian Mixture Model (GMM) is a mixture model that uses a combination of Gaussian (Normal) probability distributions and requires the estimation of the mean and standard deviation parameters for each
consider the case where a dataset is comprised of many points that happen to be generated by two different processes. The points for each process have a Gaussian probability distribution, but the data is combined and the distributions are similar enough that it is not obvious to which distribution a given point may belong
processes used to generate the data point represents a latent variable, e.g. process 0 and process 1. It influences the data but is not observable. As such, the EM algorithm is an appropriate approach to use to estimate the parameters of the distributions.

Brilliant

Expectation Maximization Algorithm

A latent variable model is a statistical model that relates a set of observable variables to a set of latent variables.

In the EM algorithm, the estimation-step would estimate a value for the process latent variable for each data point, and the maximization step would optimize the parameters of the probability distributions in an attempt to best capture the density of the data. The process is repeated until a good set of latent values and a maximum likelihood is achieved that fits the data.

E-Step: Estimate the expected value for each latent variable, which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters
M-Step: Which computes parameters maximizing the expected log-likelihood found on the E step. These parameter-estimates are then used to determine the distribution of the latent variables in the next E step.

Principal Component Analysis

The principal components of a collection of points in a real p-space are a sequence of \(p\) direction vectors where the \(i^{th}\) vector is the direction of a line that best fits the data while being orthogonal to the first \(i-1\) vectors.

a best-fitting line is defined as one that minimizes the average squared distance from the points to the line
directions constitute an orthonormal basis in which different individual dimensions of the data are linearly uncorrelated

PCA is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest.

used in exploratory data analysis and for making predictive models
commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible
first principal component can equivalently be defined as a direction that maximizes the variance of the projected data
the \(i^{th}\) principal component can be taken as a direction orthogonal to the first \(i-1\) principal components that maximizes the variance of the projected data
can be shown that the principal components are eigenvectors of the data’s covariance matrix.

Independent component analysis

a statistical and computational technique for revealing hidden factors that underlie sets of random variables, measurements, or signals
defines a generative model for the observed multivariate data, which is typically given as a large database of samples
the data variables are assumed to be linear mixtures of some unknown latent variables, and the mixing system is also unknown
the latent variables are assumed nongaussian and mutually independent, and they are called the independent components of the observed data
these independent components, also called sources or factors, can be found by ICA.

ICA is superficially related to principal component analysis and factor analysis. ICA is a much more powerful technique, however, capable of finding the underlying factors or sources when these classic methods fail completely.

Latent Semantic Indexing

an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text
based on the principle that words that are used in the same contexts tend to have similar meanings
feature of LSI is its ability to extract the conceptual content of a body of text by establishing associations between those terms that occur in similar contexts.
called “latent semantic indexing” because of its ability to correlate semantically related terms that are latent in a collection of text
uncovers the underlying latent semantic structure in the usage of words in a body of text and how it can be used to extract the meaning of the text in response to user queries, commonly referred to as concept searches
queries, or concept searches, against a set of documents that have undergone LSI will return results that are conceptually similar in meaning to the search criteria even if the results don’t share a specific word or words with the search criteria.

Benefits of LSI

helps overcome synonymy by increasing recall
used to perform automated document categorization
dynamic clustering based on the conceptual content of documents can also be accomplished
is inherently independent of language
not restricted to working only with words
automatically adapts to new and changing terminology, and has been shown to be very tolerant of noise
text does not need to be in sentence form for LSI to be effective

Spectral Clustering

a growing clustering algorithm which has performed better than many traditional clustering algorithms in many cases
treats each data point as a graph-node and thus transforms the clustering problem into a graph-partitioning problem
three fundamental steps:-

building the similarity graph:
- in the form of an adjacency matrix which is represented by \(A\)
- built in the following manners:-
  1. epsilon-neighbourhood graph:
    - parameter \(\epsilon\) is fixed beforehand
    - each point is connected to all the points which lie in it’s \(\epsilon\)-radius
    - if all the distances between any two points are similar in scale then typically the weights of the edges ie the distance between the two points are not stored
    - graph built is an undirected and unweighted graph.
  2. k-nearest neighbours
    - parameter \(k\) is fixed beforehand
    - \(u\) and \(v\), an edge is directed from \(u\) to \(v\) only if \(v\) is among the \(k\)-nearest neighbours of \(u\)
    - weighted and directed graph because it is not always the case that for each \(u\) having \(v\) as one of the \(k\)-nearest neighbours, it will be the same case for \(v\) having \(u\) among its \(k\)-nearest neighbours
  3. fully-connected graph: to build this graph, each point is connected with an undirected edge-weighted by the distance between the two points to every other point
projecting the data onto a lower dimensional space:
- to account for the possibility that members of the same cluster may be far away in the given dimensional space
- the dimensional space is reduced so that those points are closer in the reduced dimensional space and thus can be clustered together by a traditional clustering algorithm
- this matrix is then normalized for mathematical efficiency
clustering the data:
- involves clustering the reduced data by using any traditional clustering technique – typically k-means clustering
- each node is assigned a row of the normalized graph laplacian matrix
- then this data is clustered using any traditional technique
- properties:
  1. assumption-less: this clustering technique, unlike other traditional techniques do not assume the data to follow some property
  2. ease of implementation and speed: this algorithm is easier to implement than other clustering algorithms and is also very fast
  3. not-scalable: since it involves the building of matrices and computation of eigenvalues and eigenvectors it is time-consuming for dense datasets

Markov Model

a stochastic model used to model randomly changing systems
assumed that future states depend only on the current state, not on the events that occurred before it (that is, it assumes the Markov property)
this assumption enables reasoning and computation with the model that would otherwise be intractable
in the fields of predictive modelling and probabilistic forecasting, it is desirable for a given model to exhibit the Markov property.

Hidden Markov Model

Markov chain for which the state is only partially observable
observations are related to the state of the system, but they are typically insufficient to precisely determine the state
several well-known algorithms for hidden Markov models exist
given a sequence of observations
- the Viterbi algorithm will compute the most-likely corresponding sequence of states
- the forward algorithm will compute the probability of the sequence of observations
- the Baum–Welch algorithm will estimate the starting probabilities, the transition function, and the observation function of a hidden Markov model.

One common use is for speech recognition, where the observed data is the speech audio waveform and the hidden state is the spoken text. In this example, the Viterbi algorithm finds the most likely sequence of spoken words given the speech audio.

Numericals in

k-means
PCA
Markov models

Sharat Sachin Maximizing my potential