# What is Regression?
* In regression, we fit a line or curve that best matches the given data points, and the machine learning model can then deliver predictions regarding the data.
* In simple words, “Regression shows a line or curve that passes through the data points on a target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum.”
# Types of Regression models
* Linear Regression
* Polynomial Regression
* Logistic Regression
# Linear Regression in Machine Learning:
* Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical method that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc.
* The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
* The linear regression model provides a sloped straight line representing the relationship between the variables.
* Linear regression is a quite simple statistical regression method used for predictive analysis and shows the relationship between continuous variables: the independent variable on the X-axis and the dependent variable on the Y-axis.
* If there is a single input variable (x), such linear regression is called simple linear regression. If there is more than one input variable, it is called multiple linear regression.
* The best-fit line follows the traditional slope-intercept form:
y = mx + c = a0 + a1·x
where a1 is the slope (m) and a0 is the intercept (c).
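As an illustration, here is a minimal Python sketch of fitting this slope-intercept form with NumPy's least-squares polynomial fit; the data values are made up for the example:

```python
import numpy as np

# Hypothetical data: years of experience (x) vs. salary in thousands (y)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30, 35, 42, 48, 55], dtype=float)

# A degree-1 polynomial fit returns [slope, intercept],
# i.e. a1 (m) and a0 (c) in y = a0 + a1*x
a1, a0 = np.polyfit(x, y, deg=1)

print(f"best-fit line: y = {a0:.2f} + {a1:.2f}*x")
print("predictions:", a0 + a1 * x)
```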
# Finding the best fit line:
* When working with linear regression, our main goal is to find the best-fit line, which means that the error between the predicted values and the actual values should be minimized. The best-fit line will have the least error.
* Different values for the weights or coefficients of the line (a0, a1) give different regression lines, so we need to calculate the best values for a0 and a1 to find the best-fit line. To calculate these, we use a cost function.
# Cost function
* Different values for the weights or coefficients of the line (a0, a1) give different regression lines, and the cost function is used to estimate the values of the coefficients for the best-fit line.
* The cost function is used to optimize the regression coefficients or weights.
* It measures how well a linear regression model is performing.
* We can use the cost function to find the accuracy of the mapping function, which maps the input variable to the output variable. This mapping function is also known as the hypothesis function.
* For linear regression, we use the Mean Squared Error (MSE) cost function, which is the average of the squared errors between the predicted values and the actual values.
* Using the MSE function, we change the values of a0 and a1 so that the MSE value settles at its minimum. The model parameters (a0, a1) are adjusted to minimize the cost function; these parameters can be determined using the gradient descent method so that the cost function value is minimal.
* It can be written as follows. Given our simple linear equation y = m·x + b, we can calculate MSE as:

MSE = (1/N) · Σ_{i=1..N} (y_i − (m·x_i + b))²

For multiple input features with weights W1, W2, W3 (using the common convention of an extra factor of 1/2, which simplifies the derivative), the same cost becomes:

MSE = (1/2N) · Σ_{i=1..N} (y_i − (W1·x_i1 + W2·x_i2 + W3·x_i3))²

* N is the total number of observations (data points)
* (1/N) · Σ_{i=1..N} is the mean over all observations
* y_i is the actual value of an observation and m·x_i + b is our prediction
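As a concrete sketch, the MSE for a candidate line can be computed as follows; the data and the coefficients m, b are made up for illustration:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared residuals."""
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical data and a candidate line y = m*x + b
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30, 35, 42, 48, 55], dtype=float)
m, b = 6.0, 23.0   # candidate slope (a1) and intercept (a0)

y_pred = m * x + b
residuals = y - y_pred            # actual minus predicted values
print("residuals:", residuals)
print("MSE:", mse(y, y_pred))
```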
# Residuals:
* The distance between the actual value and the predicted value is called a residual. If the observed points are far from the regression line, the residuals will be high, and so the cost function will be high.
* If the scatter points are close to the regression line, the residuals will be small, and hence the cost function will be small as well.
# Gradient descent:
* Gradient descent is a method of updating a0 and a1 to minimize the cost function (MSE). A regression model uses gradient descent to update the coefficients of the line (a0, a1): it starts from a random selection of coefficient values and then iteratively updates the values to reach the minimum of the cost function.
* In the gradient descent algorithm, the size of each step you take is the learning rate, and this decides how fast the algorithm converges to the minima.
* To update a0 and a1, we take gradients from the cost function. To find these gradients, we take the partial derivatives with respect to a0 and a1.
* We can calculate the gradient of this cost function as:
* f′(m, b) = [df/dm, df/db], where
df/dm = (1/N) · Σ −2·x_i·(y_i − (m·x_i + b)) = −(2/N) · Σ x_i·(y_i − (m·x_i + b))
df/db = (1/N) · Σ −2·(y_i − (m·x_i + b)) = −(2/N) · Σ (y_i − (m·x_i + b))
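Putting the pieces together, a minimal gradient-descent sketch for this cost function might look like the following; the learning rate, iteration count, and data are arbitrary choices for the example:

```python
import numpy as np

def gradient_descent(x, y, lr=0.02, n_iters=5000):
    """Fit y = m*x + b by repeatedly stepping against the MSE gradient."""
    m, b = 0.0, 0.0                          # start from arbitrary coefficients
    n = len(x)
    for _ in range(n_iters):
        y_pred = m * x + b
        # Partial derivatives of MSE with respect to m and b
        dm = (-2.0 / n) * np.sum(x * (y - y_pred))
        db = (-2.0 / n) * np.sum(y - y_pred)
        # Move in the direction opposite to the gradient
        m -= lr * dm
        b -= lr * db
    return m, b

# Hypothetical data
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30, 35, 42, 48, 55], dtype=float)
m, b = gradient_descent(x, y)
print(f"learned line: y = {m:.2f}*x + {b:.2f}")
```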
# Summary:
* In regression, we fit a line or curve that best matches the given data points. Linear regression shows the linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis). To calculate the best-fit line, linear regression uses the traditional slope-intercept form. A regression line can show a positive linear relationship or a negative linear relationship.
* The goal of the linear regression algorithm is to get the best values for a0 and a1 so that the best-fit line has the least error. In linear regression, the Mean Squared Error (MSE) cost function is used, which helps figure out the best possible values for a0 and a1 that provide the best-fit line for the data points. Using the MSE function, we change the values of a0 and a1 so that the MSE value settles at its minimum. Gradient descent is the method used to update a0 and a1 to minimize the cost function (MSE).
# Assumptions of Linear Regression:
Below are some important assumptions of linear regression. These are formal checks to perform while building a linear regression model, and they ensure we get the best possible result from the given dataset.
* Linear relationship between the features and target:
Linear regression assumes a linear relationship between the dependent and independent variables.
* Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable; in other words, it is difficult to determine which predictor variable is affecting the target variable and which is not. So the model assumes either little or no multicollinearity between the features or independent variables. A quick correlation-based check is sketched below.
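As one simple illustration of such a check, the pairwise correlation between features can be computed with NumPy; the feature values here are made up, and off-diagonal values near +1 or −1 suggest multicollinearity:

```python
import numpy as np

# Hypothetical feature matrix: each column is one independent variable
X = np.array([
    [1.0, 2.1, 10.0],
    [2.0, 3.9, 12.0],
    [3.0, 6.2, 11.5],
    [4.0, 8.1, 13.0],
    [5.0, 9.8, 12.5],
])

# Correlation between columns (features); off-diagonal values close to
# +/-1 indicate highly correlated, potentially collinear features
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```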
* Homoscedasticity Assumption:
Homoscedasticity is the situation in which the variance of the error term is the same for all values of the independent variables. With homoscedasticity, there should be no clear pattern in the distribution of points in a scatter plot of the residuals.
* Normal distribution of error terms:
Linear regression assumes that the error terms follow a normal distribution. If the error terms are not normally distributed, confidence intervals will become either too wide or too narrow, which may cause difficulties in finding the coefficients.
This can be checked using a Q-Q plot, as sketched below. If the plot shows a roughly straight line without large deviations, the errors are normally distributed.
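A minimal sketch of such a Q-Q plot, using SciPy and Matplotlib with made-up residuals, could look like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical residuals from a fitted linear regression model
residuals = np.array([0.5, -1.2, 0.3, 0.8, -0.4, 1.1, -0.7, 0.2, -0.9, 0.6])

# Q-Q plot: sample quantiles of the residuals against theoretical normal
# quantiles; points close to the reference line suggest normal errors
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q plot of residuals")
plt.show()
```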
* No autocorrelations:
The linear regression model assumes no autocorrelation in the error terms. If there is any correlation in the error terms, it will drastically reduce the accuracy of the model. Autocorrelation usually occurs when there is a dependency between residual errors; one quick check is sketched below.
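One common way to quantify this is the Durbin-Watson statistic (values near 2 indicate little autocorrelation); here is a minimal NumPy sketch with made-up residuals:

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: ~2 means little autocorrelation;
    values toward 0 or 4 suggest positive or negative autocorrelation."""
    diffs = np.diff(residuals)
    return np.sum(diffs ** 2) / np.sum(residuals ** 2)

# Hypothetical residuals, in time order
residuals = np.array([0.5, -1.2, 0.3, 0.8, -0.4, 1.1, -0.7, 0.2, -0.9, 0.6])
print("Durbin-Watson:", round(durbin_watson(residuals), 2))
```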
Python code is available on GitHub; please find the code there.