ML Review - Linear Regression
This post is a quick review of linear regression, aimed not at first-time learners but at those who have already studied it and want a refresher.
- Linear regression can be used to fit a model to an observed dataset of values of the response (dependent variable) and the explanatory variables (independent variables / features)
$x \in \mathbb{R}^{n+1}$ is the vector of input variables / features, $x = (x_0, x_1, \ldots, x_n)^T$, where $n$ is the number of features, with $x_0 = 1$ being the intercept term. $y$ is the output variable / target. $(x^{(i)}, y^{(i)})$ is a training example. $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ is the training set, where $m$ is the number of examples in the training set.
Goal:
to learn a function $h : X \mapsto Y$ so that $h(x)$ is a good predictor for the corresponding value of $y$
Equations
If we decide to approximate $y$ as a linear function of $x$:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \sum_{j=0}^{n} \theta_j x_j = \theta^T x$$

This is called simple / univariate linear regression for $n = 1$, and multiple / multivariate linear regression for $n > 1$.
Then we can define the cost function as:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

This is the ordinary least squares (OLS) cost function, which works to minimize the mean squared error (MSE).
Goal:
to choose $\theta$ so as to minimize $J(\theta)$
Vectorized
Let $X$ be the $m \times (n+1)$ design matrix whose rows are the training inputs $(x^{(i)})^T$, and let $\vec{y}$ be the $m$-vector of targets. Then the vector of predictions is

$$\hat{y} = X\theta$$

We can rewrite the least-squares cost as follows, replacing the explicit sum by matrix multiplication:

$$J(\theta) = \frac{1}{2m} (X\theta - \vec{y})^T (X\theta - \vec{y})$$
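As a concrete check of these formulas, here is a minimal NumPy sketch of the vectorized hypothesis and cost (the function names `predict` and `ols_cost` are my own):

```python
import numpy as np

def predict(X, theta):
    """Vectorized hypothesis: y_hat = X @ theta.
    X is the m x (n+1) design matrix (first column all ones); theta has shape (n+1,)."""
    return X @ theta

def ols_cost(X, y, theta):
    """OLS cost J(theta) = (1/2m) (X theta - y)^T (X theta - y)."""
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)
```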
Normal Equation
The normal equation is an analytical solution to the linear regression problem with an ordinary least squares cost function. That is, to find the value of $\theta$ that minimizes $J(\theta)$, we set its gradient to zero, which gives:

$$X^T X \theta = X^T \vec{y}$$

Solving for $\theta$:

$$\theta = (X^T X)^{-1} X^T \vec{y}$$
Here is a post containing the derivation of the normal equation.
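For illustration, a minimal NumPy sketch of the closed-form solution (the function name is my own); solving the linear system directly is numerically safer than forming the inverse explicitly:

```python
import numpy as np

def normal_equation(X, y):
    """Solve the normal equation X^T X theta = X^T y for theta.
    np.linalg.solve avoids explicitly computing (X^T X)^{-1};
    it assumes X^T X is invertible (X has full column rank)."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```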
Gradient Descent
Gradient descent is based on the observation that if the function $J(\theta)$ is differentiable at a point $\theta$, then $J(\theta)$ decreases fastest in the direction of the negative gradient, $-\nabla J(\theta)$.
Thus if we repeatedly apply the following update rule, $J(\theta)$ will converge to a (local) minimum:

$$\theta := \theta - \alpha \nabla J(\theta)$$

where $\alpha$ is the learning rate.
For a specific parameter $\theta_j$, the update rule is:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

Using the definition of $J(\theta)$, the partial derivative works out to:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

Therefore, we repeatedly apply the following update rule:

$$\theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad \text{(simultaneously for all } j\text{)}$$
This method looks at every example in the entire training set on every step, and is called batch gradient descent (BGD).
When the cost function $J$ is convex, as the OLS cost is, gradient descent converges to the global minimum (provided the learning rate $\alpha$ is not too large).
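A minimal BGD sketch in NumPy, following the update rule above (the function name and default hyperparameters are my own choices):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """BGD: each update uses the gradient over the full training set."""
    m, n_params = X.shape
    theta = np.zeros(n_params)
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / m  # (1/m) X^T (X theta - y)
        theta = theta - alpha * gradient      # simultaneous update of all theta_j
    return theta
```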
There is an alternative to BGD that also works very well:

$$\theta_j := \theta_j - \alpha \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \qquad \text{(for every } j\text{)}$$

This is stochastic gradient descent (SGD) (also called incremental gradient descent), where we repeatedly run through the training set and, for each training example, update the parameters using the gradient of the error for that training example only.
Whereas BGD has to scan the entire training set before taking a single step, SGD can start making progress right away with each example it looks at.
Often, SGD gets $\theta$ close to the minimum much faster than BGD. Note, however, that it may never converge to the minimum, with $\theta$ oscillating around it instead.
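A minimal SGD sketch under the same conventions; shuffling the examples each epoch is a common refinement I am assuming, not something stated above:

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=10, seed=0):
    """SGD: update theta using one training example at a time."""
    m, n_params = X.shape
    theta = np.zeros(n_params)
    rng = np.random.default_rng(seed)
    for _ in range(n_epochs):
        for i in rng.permutation(m):      # visit examples in random order
            error = X[i] @ theta - y[i]   # scalar error for example i
            theta = theta - alpha * error * X[i]
    return theta
```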
Adding regularization
Regularization is a technique to reduce overfitting in machine learning. It discourages learning a more complex or flexible model by shrinking the parameters towards $0$.
We can regularize machine learning methods through the cost function using $\ell_1$ regularization (lasso) or $\ell_2$ regularization (ridge):

$$J_{\text{lasso}}(\theta) = \text{MSE}(\theta) + \lambda \sum_{j=1}^{n} |\theta_j|$$

$$J_{\text{ridge}}(\theta) = \text{MSE}(\theta) + \lambda \sum_{j=1}^{n} \theta_j^2$$
The partial derivative of the cost function for lasso linear regression is:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \lambda \, \text{sign}(\theta_j)$$

Similarly, for ridge linear regression:

$$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + 2\lambda \theta_j$$
These equations can be substituted into the general gradient descent update rule to get the specific lasso / ridge update rules.
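As a sketch, the lasso / ridge gradients from the equations above in NumPy (function names are mine; leaving $\theta_0$ unpenalized anticipates the note below):

```python
import numpy as np

def lasso_gradient(X, y, theta, lam):
    """MSE gradient plus the l1 penalty term lam * sign(theta_j)."""
    m = X.shape[0]
    grad = X.T @ (X @ theta - y) / m
    penalty = lam * np.sign(theta)
    penalty[0] = 0.0  # the intercept theta_0 is not regularized
    return grad + penalty

def ridge_gradient(X, y, theta, lam):
    """MSE gradient plus the l2 penalty term 2 * lam * theta_j."""
    m = X.shape[0]
    grad = X.T @ (X @ theta - y) / m
    penalty = 2 * lam * theta
    penalty[0] = 0.0  # again, leave the intercept unpenalized
    return grad + penalty
```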
Note:
- $\theta_0$ is NOT constrained (the intercept term is not regularized)
- Scale the data before using Ridge regression
- $\lambda$ is a hyperparameter: a bigger $\lambda$ results in a flatter and smoother model
- Lasso tends to completely eliminate the weights of the least important features (i.e., set them to 0), so it automatically performs feature selection
- One more way to constrain the weights is Elastic Net, a combination of the Ridge and Lasso penalties
- When to use which?
  - Ridge is a good default
  - If you suspect some features are not useful, use Lasso or Elastic Net
  - When there are more features than training examples, prefer Elastic Net over Lasso (see the scikit-learn sketch after this list)
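In practice all three penalties ship with scikit-learn. A quick usage sketch (the `alpha` / `l1_ratio` values here are arbitrary, and note that scikit-learn calls the regularization strength `alpha` and scales it slightly differently from the $\lambda$ in the equations above):

```python
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the data first (see the note above), then fit the regularized model.
ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.1))
elastic = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5))

# Usage: ridge.fit(X_train, y_train); ridge.predict(X_test)
```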