Understanding Linear Regression: A Fundamental Tool for Analyzing Relationships in Data

Linear regression is used to model and analyze the relationship between a dependent variable and one or more independent variables. In other words, it quantifies the relationship observed in the dataset and allows us to test whether that relationship is statistically significant.

The Linear Regression Equation

In its simplest form, linear regression is represented as follows:

$$y = b_0 + b_1x$$

where,

  • y is the dependent variable

  • x is the independent variable

  • b_0 is the y-intercept

  • b_1 is the slope

The y-intercept represents the value of the dependent variable when the independent variable is zero.

The slope represents how much the dependent variable changes for a unit change in the independent variable.
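For example, consider the hypothetical line y = 1 + 2x: the y-intercept of 1 is the value of y when x is 0, and the slope of 2 means that y increases by 2 for every one-unit increase in x.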

Building on this, the core idea of Linear Regression is to find the best-fitting line that represents the relationship between both variables.

Finding the Best-Fitting Line

To find the best-fitting line, the regression model estimates the values of the y-intercept (b_0) and the slope (b_1) in a way that minimizes the difference between the actual values and the predicted values. This is done using the method of least squares.

Understanding Least Squares

Least squares aims to find the values of the regression coefficients b_0 and b_1 such that the sum of the squared residuals is minimized. The residuals are squared so that positive and negative differences do not cancel each other out during the process.

Residual - the difference between actual y values and predicted y values

Once the least squares method finds the optimal values of b_0 and b_1, the resulting regression line represents the best linear approximation of the relationship between the independent variable x and the dependent variable y.
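Concretely, least squares minimizes the sum of squared residuals, which leads to the standard closed-form estimates shown below; these are exactly the quantities computed in the Python implementation that follows.

$$\min_{b_0,\, b_1} \sum_{i=1}^{n} \left( y_i - (b_0 + b_1 x_i) \right)^2$$

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$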

Python Implementation: Calculating b0 and b1 using least squares

import numpy as np

# Sample data
X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 7, 8, 11])

n = len(X)

# Means of X and Y
X_mean = np.mean(X)
Y_mean = np.mean(Y)

# Covariance of X and Y, and variance of X
# (the common factor 1/n cancels when taking the ratio below)
covariance = np.sum((X - X_mean) * (Y - Y_mean)) / n
variance = np.sum((X - X_mean) ** 2) / n

# Least squares estimates of the slope and the y-intercept
b1 = covariance / variance
b0 = Y_mean - b1 * X_mean

print("y-intercept: ", b0)
print("slope: ", b1)

Output

y-intercept:  -0.20000000000000018
slope:  2.2

The regression line equation will be

$$y = -0.2 + 2.2x$$
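As a quick sanity check, the same coefficients can be recovered with NumPy's built-in polyfit, and the fitted line can then be used to predict y for a new x value. This is a minimal sketch, and x = 6 is just an arbitrary example input.

import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 7, 8, 11])

# np.polyfit with degree 1 returns [slope, intercept]
slope, intercept = np.polyfit(X, Y, 1)
print("slope: ", slope)            # approximately 2.2
print("y-intercept: ", intercept)  # approximately -0.2

# Use the fitted line to predict y for a new value of x
x_new = 6
y_pred = intercept + slope * x_new
print("predicted y: ", y_pred)     # -0.2 + 2.2 * 6 = 13.0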

Evaluating the Regression Model

Having obtained the coefficient values, the next crucial step is to validate the results and ensure their correctness. When it comes to evaluating how well the model fits the data and whether the relationship is statistically significant, three key statistical metrics come to the rescue:

  • R-squared (Coefficient of Determination): It is a statistical measure that represents the proportion of the variance in the dependent variable y that is explained by the independent variable x through the linear regression model. In simple terms, it tells how good the fit is. The value lies between 0 and 1 (the formulas for R-squared and Adjusted R-squared are shown after this list).

    • If the R-squared value is zero, the model explains none of the variance in y, indicating that the line does not fit the data well.

    • If the R-squared value is one, the model explains all of the variance in y, indicating that the line fits the data perfectly.

The higher the R-squared value, the better the line fits the data. However, R-squared can be misleading because it never decreases when more features (independent variables) are added to the model, even if they are irrelevant. In this scenario, Adjusted R-squared is used.

  • Adjusted R-squared: It is a modification of R-squared that takes the number of independent variables in the model into account. It provides a more accurate evaluation of the model’s fit to the data by penalizing the addition of irrelevant independent variables.

Similar to R-squared, a higher adjusted R-squared indicates a better fit of the model.

  • p-value: It is used to test the statistical significance of the relationship between the independent variable and the dependent variable. Specifically, it tests whether the regression coefficient of the independent variable, b_1, is significantly different from zero.

    • If the p-value is low (usually < 0.05), the independent variable has a significant impact on the dependent variable.

    • If the p-value is high (usually > 0.05), the independent variable may not have a significant impact on the dependent variable.

In other words, a high p-value suggests that the apparent relationship could simply be due to random chance.
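For reference, the two goodness-of-fit measures above can be written as:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}$$

Here the numerator of the first fraction is the sum of squared residuals and the denominator is the total variation of y around its mean, while n is the number of observations and k is the number of independent variables (k = 1 in simple linear regression).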

Python Implementation: Calculating R-squared, Adjusted R-squared and p-value

import statsmodels.api as sm
import numpy as np

X = np.array([1, 2, 3, 4, 5])
Y = np.array([2, 4, 7, 8, 11])

# Add a constant column to X so that the model estimates an intercept (b0)
X_with_const = sm.add_constant(X)

# Fit an ordinary least squares (OLS) model
model = sm.OLS(Y, X_with_const)
results = model.fit()

r_squared = results.rsquared
adjusted_r_squared = results.rsquared_adj
p_value = results.pvalues[1]  # p-value for the slope coefficient b1

print("R-squared: ", r_squared)
print("Adjusted R-squared: ", adjusted_r_squared)
print("p-value for b1: ", p_value)

Output

R-squared:  0.9837398373983739
Adjusted R-squared:  0.9783197831978319
p-value for b1:  0.000884
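For a more detailed report, the full regression table, which includes the coefficients, their standard errors, t-statistics, p-values and the R-squared values, can be printed with:

print(results.summary())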

Assumptions of Linear Regression

Finally, Linear Regression comes with certain assumptions that need to be satisfied for the results to be reliable (a short sketch of how some of them can be checked follows the list below).

  • Linearity: The relationship between the independent variable x and the dependent variable y should be approximately linear.

  • Independence of Errors: The residuals in the model should be independent of each other. It ensures that each data point provides unique information to the model.

  • Homoscedasticity: The spread of the residuals around the regression line should be roughly constant across all values of the independent variable.

  • Normality of errors: It implies that the residuals follow a normal distribution with a mean of zero.

  • No multicollinearity: In multivariate linear regression, the independent variables should not be highly correlated with each other.
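As an illustration, here is a minimal sketch of how a few of these assumptions might be checked in Python, assuming the results object from the statsmodels fit above. With only five data points these checks are not very informative, so this is purely illustrative.

import matplotlib.pyplot as plt
from scipy.stats import shapiro
from statsmodels.stats.stattools import durbin_watson

residuals = results.resid

# Linearity / homoscedasticity: a plot of residuals against fitted values
# should show no obvious pattern and a roughly constant spread
plt.scatter(results.fittedvalues, residuals)
plt.axhline(0)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# Independence of errors: a Durbin-Watson statistic close to 2
# suggests no autocorrelation in the residuals
print("Durbin-Watson: ", durbin_watson(residuals))

# Normality of errors: Shapiro-Wilk test
# (a p-value above 0.05 suggests the residuals look approximately normal)
stat, p = shapiro(residuals)
print("Shapiro-Wilk p-value: ", p)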

As this is my first article, please let me know if there are any errors or improvements I could make. Thank you!