AWS ML Speciality Notes (Part 2.1)
This post gives a quick review of sanitizing and preparing data for modelling.
Sanitizing and preparing data for modelling
Data labeling tools
SageMaker Ground Truth
- data labeling service with the option to use human annotators through Amazon Mechanical Turk, third-party vendors, or your own private workforce
- can also generate labeled synthetic data without manually collecting or labeling real-world data
SageMaker Ground Truth Plus
- create high-quality training datasets without building labeling applications or managing workforces
- provides an expert workforce that is trained on ML tasks
- upload your data, and then SageMaker Ground Truth Plus creates and manages data labeling workflows and the workforce on your behalf
Mechanical Turk
- crowdsourcing marketplace
- makes it easier for customers to outsource jobs to a distributed workforce that performs them virtually
Descriptive Statistics
- aim to summarize a sample (summary statistics)
- in contrast to inferential statistics - which use the data to learn about the population that the sample of data is thought to represent
- types (a quick Python sketch of these statistics follows this list):
- a measure of location, or central tendency
- a measure of statistical dispersion
- standard deviation
- variance
- square of the standard deviation; shares the same sensitivity to outliers, but is expressed in squared units
- IQR (Interquartile range) \(= Q3 - Q1\)
- much less affected by outliers or a skewed data set
- semi-interquartile range \(= \frac{1}{2} (Q3 - Q1)\)
- however, these quartile-based measures do not take every data point into account
- a measure of the shape of the distribution
- skewness
- measure of the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
- negative/left-skewed, left-tailed -> the left tail is drawn out
- often places the mean to the left of the median
- but this is not guaranteed
- positive/right-skewed, right-tailed -> vice-versa
- kurtosis
- high kurtosis -> heavy tails, or outliers
- low kurtosis -> light tails, or lack of outliers
- a measure of statistical dependence
- list of probability distributions
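A minimal sketch of these summary statistics, assuming NumPy and SciPy are available (the sample values are hypothetical):

```python
import numpy as np
from scipy import stats

# hypothetical sample with one large outlier (a long right tail)
sample = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 21], dtype=float)

# measures of location / central tendency
print("mean  :", np.mean(sample))
print("median:", np.median(sample))

# measures of statistical dispersion
print("std   :", np.std(sample, ddof=1))   # sample standard deviation
print("var   :", np.var(sample, ddof=1))   # square of the std, in squared units
q1, q3 = np.percentile(sample, [25, 75])
print("IQR   :", q3 - q1)                  # much less affected by the outlier (21)

# measures of the shape of the distribution
print("skew  :", stats.skew(sample))       # > 0 -> right-skewed
print("kurt  :", stats.kurtosis(sample))   # > 0 -> heavier tails than a normal distribution
```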
Formatting, normalizing, augmenting, and scaling data
- about data cleansing
- data augmentation - increases the amount of data by adding slightly modified copies of existing data or newly created synthetic data derived from it (a minimal sketch of the computer vision transforms follows the audio examples below)
- computer vision
- cropping, flipping, translation, scaling, rotation, color, adding noise
- NLP
- synonym replacement, text substitution, random insertion/swap/deletion, word/sentence shuffling
- audio
- cropping out a portion of the data, noise injection, time shifting, speed tuning, pitch changing, mixing in background noise, and frequency masking
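To make the computer vision transforms concrete, a minimal NumPy-only sketch (the image array, crop size, and noise level are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
image = rng.random((64, 64, 3))          # hypothetical RGB image with values in [0, 1]

flipped = np.fliplr(image)               # horizontal flip
rotated = np.rot90(image, k=1)           # 90-degree rotation

# random crop to 48x48 (in practice, resized back to the original shape)
top, left = rng.integers(0, 64 - 48, size=2)
cropped = image[top:top + 48, left:left + 48]

# additive Gaussian noise, clipped back into the valid range
noisy = np.clip(image + rng.normal(0, 0.05, image.shape), 0.0, 1.0)
```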
- about feature scaling
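A minimal sketch of the two most common feature scaling schemes, standardization and min-max scaling, using scikit-learn (the feature matrix is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# hypothetical features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# standardization: zero mean, unit variance per feature
X_std = StandardScaler().fit_transform(X)

# min-max scaling: rescales each feature into [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)
```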
Inferential statistics
- gain an understanding of the population by analyzing samples obtained from it
- hypothesis testing:
- test assumptions and draw conclusions about the population from the available sample data
- involves setting a null and alternative hypothesis, then conducting a statistical test of significance
- hypotheses :
- alternative hypothesis : \(H_1 \rightarrow\) there is an effect
- the thing we are trying to prove
- null hypothesis : \(H_0 \rightarrow\) there is no effect
- opposite of alternative, or the status quo
- should include equality (\(\leq\), \(\geq\), or \(=\))
- hypotheses are always about the population parameters, not the sample values / statistics
- level of significance : \(\alpha = 0.05\)
- the probability of rejecting \(H_0\) when it is actually true
- this mistake is called a Type 1 error
- \(p\)-value : the probability that, if the null hypothesis were true, sampling variation would produce an estimate further away from the hypothesised value than our data estimate
- how likely is it to get a result like this if the null hypothesis were true
- if \(p < \alpha \rightarrow\) reject the null hypothesis
- \(p\) is low, null must go
- if \(p \geq \alpha \rightarrow\) unable to reject the null hypothesis
- z-test - used when the population variance is known (or the sample is large)
- t-test - used when the population variance is unknown and must be estimated from the sample (a worked example follows this list)
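A minimal sketch of the reject / fail-to-reject decision using a one-sample t-test from SciPy (the sample and the hypothesised mean are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=5.6, scale=1.0, size=30)  # hypothetical measurements

alpha = 0.05
# H0: population mean = 5.0, H1: population mean != 5.0
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0")   # "p is low, null must go"
else:
    print(f"p = {p_value:.3f} >= {alpha}: unable to reject H0")
```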
Handling missing values and outliers
- Article on handling missing data
- Multiple Imputations by Chained Equations (MICE) algorithm
- imputes or ‘fills in’ the missing data in a dataset through an iterative series of predictive models
- each specified variable in the dataset is imputed in each iteration using the other variables in the dataset
- iterations are repeated until convergence is reached
- MICE is a better imputation method than naive approaches (e.g., filling missing values with 0, or dropping columns)
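scikit-learn's IterativeImputer is modelled on MICE, although it returns a single imputation rather than multiple ones; a minimal sketch with a hypothetical feature matrix:

```python
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# hypothetical data with missing values
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 8.0, 9.0],
              [np.nan, 10.0, 12.0]])

# each feature with missing values is modelled from the other features, iteratively
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```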