In this post, let’s explore the linear regression algorithm and script it from scratch in Python and R.
Linear regression is a simple algorithm used in machine learning for predictive analysis. It establishes a relationship between an input variable (X) and an output variable (Y). A model with a single independent variable is called simple linear regression, and a model with more than one independent variable is called multiple linear regression.
Model Representation
The input variable is the independent variable, and the output variable is the dependent variable.
Equation of the line,
Y = b0 + b1·X
where b1 is the slope of the line and b0 (the bias) is the Y-intercept. Together they are also called the weights.
Our end goal is to draw a line through X (independent) and Y (dependent) that best captures the relationship between them.
The problem is how to find these coefficients (weights). This is where the learning procedure comes in.
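For illustration, once we have some values for b0 and b1, predicting Y is just evaluating the line. A minimal sketch in plain Python (the weights below are arbitrary placeholders, not learned values):
```python
def predict(x, b0, b1):
    """Predict Y for a given X using the line Y = b0 + b1 * X."""
    return b0 + b1 * x

# Arbitrary placeholder weights, just to illustrate the model form.
print(predict(5, b0=2.0, b1=0.5))  # 2.0 + 0.5 * 5 = 4.5
```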
Loss Function (Cost Function)
A loss function (error) is a measure of the performance of a prediction model, in terms of its ability to predict the expected outcome. By reducing the error we can obtain accurate values for b0 and b1. Mean Squared Error (MSE) is the function used here for error calculation.
The error calculation steps are as follows.
- Find the difference between actual Y and predicted Y with respect to each value of X.
- Square the difference.
- Calculate the mean of the squared differences over all values of X.
MSE = (1/n) · Σᵢ (yᵢ − ŷᵢ)², where n is the number of data points, yᵢ is the actual value and ŷᵢ is the predicted value.
Thus the loss function is defined.
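A minimal sketch of the MSE calculation described above, in plain Python with illustrative example values:
```python
def mean_squared_error(y_actual, y_predicted):
    """Mean Squared Error: average of squared differences between actual and predicted Y."""
    n = len(y_actual)
    return sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_predicted)) / n

# Example: actual vs. predicted values (illustrative numbers only).
print(mean_squared_error([3, 5, 7], [2.8, 5.4, 6.9]))
```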
Let’s get into the part of reducing the error by finding b0 and b1 (the weights).
Ordinary Least Square Method(OLS)
OLS is one of the easiest and most basic methods for calculating the coefficients.
The coefficients are given by b1 = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)² and b0 = ȳ − b1·x̄, where x̄ is the mean value of the input variable X and ȳ is the mean value of the output variable Y.
The OLS method is predominantly used for small datasets with a single input variable. It is not recommended for multiple regression, as the computation becomes more expensive and the task more complex as the number of variables grows.
Implementation of OLS in Python.
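A minimal sketch of how the OLS formulas above could be implemented in plain Python (the function name ols_fit and the toy data are illustrative):
```python
def ols_fit(x, y):
    """Fit simple linear regression with Ordinary Least Squares.

    b1 = sum((xi - x_mean) * (yi - y_mean)) / sum((xi - x_mean) ** 2)
    b0 = y_mean - b1 * x_mean
    """
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    b1 = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / \
         sum((xi - x_mean) ** 2 for xi in x)
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Toy data (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = ols_fit(x, y)
print(b0, b1)  # intercept and slope of the fitted line
```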
Gradient Descent (GD)
Gradient descent is an iterative optimization algorithm used in machine learning to reduce the cost function (error), so that we have models that make accurate predictions. Here it is used to find the optimized values for the coefficients b0 and b1.
- Initially, assign the coefficients the value zero (0) and fix the value of the learning rate (α).
- Calculate the gradient of the loss.
- Update the coefficients and calculate the loss value.
- Repeat the gradient and update steps iteratively to reduce the error as much as possible.
The update rule is βj := βj − α · ∂J(β)/∂βj, where α is the learning rate and the term after the minus sign is the partial derivative of the cost function with respect to βj. This is called the gradient.
By changing the value of βj we can reduce the cost function value. The change in βj happens in small steps so that we can reach the minimum value of the loss. If the value changes drastically, we might overshoot the minimum. To keep each change small, we use a parameter called the learning rate (α) to control the gradient update.
Learning Rate
It controls how much we modify the weights (b0, b1) with respect to the loss gradient. The learning rate can be initialized randomly or set to a specific value.
Implementation of gradient descent in Python.
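A minimal gradient descent sketch for the two coefficients, following the steps listed above (the hyperparameter values alpha and epochs, and the toy data, are illustrative assumptions):
```python
def gradient_descent(x, y, alpha=0.01, epochs=1000):
    """Fit b0 and b1 by gradient descent on the MSE loss.

    Gradients of J(b0, b1) = (1/n) * sum((b0 + b1*xi - yi) ** 2):
      dJ/db0 = (2/n) * sum(error_i)
      dJ/db1 = (2/n) * sum(error_i * xi)
    """
    b0, b1 = 0.0, 0.0  # step 1: start with zero coefficients
    n = len(x)
    for _ in range(epochs):
        errors = [(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
        grad_b0 = (2 / n) * sum(errors)                              # step 2: gradients
        grad_b1 = (2 / n) * sum(e * xi for e, xi in zip(errors, x))
        b0 -= alpha * grad_b0                                        # step 3: update
        b1 -= alpha * grad_b1
    return b0, b1

# Same toy data as before (illustrative only).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(gradient_descent(x, y, alpha=0.05, epochs=2000))
```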
Evaluation Metrics
R-squared Value
The R-squared value is a statistical measure of how closely the data fit the regression line. It usually takes a value between 0 and 1 and measures how well our regression line fits our data.
R² = 1 − (SSres / SStot), where SSres is the residual sum of squared errors, Σᵢ (yᵢ − ŷᵢ)², and SStot is the total sum of squared errors, Σᵢ (yᵢ − ȳ)², with ȳ the mean of the actual values.
It can become negative if the model is completely wrong.
If the R-squared value is close to 1, the fit is a good one.
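A short sketch of the R-squared calculation using the same notation as above (the example values are illustrative):
```python
def r_squared(y_actual, y_predicted):
    """Coefficient of determination: R^2 = 1 - SSres / SStot."""
    y_mean = sum(y_actual) / len(y_actual)
    ss_res = sum((ya - yp) ** 2 for ya, yp in zip(y_actual, y_predicted))
    ss_tot = sum((ya - y_mean) ** 2 for ya in y_actual)
    return 1 - ss_res / ss_tot

# Example with illustrative actual vs. predicted values.
print(r_squared([3, 5, 7], [2.8, 5.4, 6.9]))
```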
Here is the Git reference for the whole blog.
I would be very happy to receive your queries and comments. Kindly post them in the comments section. Do follow me for more ML/DL blogs.