Linear Regression

By Alomgir Kabir • Published about a year ago • 3 min read

Introduction

Linear Regression is one of the most fundamental algorithms in machine learning, commonly used for predicting continuous numerical values such as house prices, sales revenue, stock prices, and temperature. It is a statistical method that models the relationship between independent variables (features) and a dependent variable (target) using a straight line (or, when there are several features, a plane or hyperplane).

This article will cover the basics of linear regression, its mathematical formulation, assumptions, and a Python implementation using Scikit-Learn.

Understanding Linear Regression

Types of Linear Regression

Linear regression can be classified into two main types:

Simple Linear Regression – Uses one independent variable to predict the dependent variable.

Multiple Linear Regression – Uses multiple independent variables to predict the dependent variable.

Mathematical Representation

Simple Linear Regression Formula

$$y = mX + c$$

Where:

y is the dependent variable (predicted value).

X is the independent variable (input feature).

m is the slope (coefficient).

c is the y-intercept.
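As a quick illustration of the formula, the sketch below evaluates the line for a few inputs. The slope and intercept values are made up for the example; they are not fitted from any data.

```python
import numpy as np

# Hypothetical slope and intercept, chosen only for illustration
m = 2.0
c = 1.0

# Three example inputs for the independent variable X
X = np.array([1.0, 2.0, 3.0])

# Apply y = mX + c element-wise
y = m * X + c
print(y)  # [3. 5. 7.]
```

Each unit increase in X raises the prediction by m, and c is the prediction when X is 0.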

Multiple Linear Regression Formula

$$y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$$

Where:

$b_0$ is the intercept.

$b_1, b_2, \dots, b_n$ are the coefficients of the independent variables $X_1, X_2, \dots, X_n$.
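With several features, the same formula can be written as an intercept plus a matrix-vector product. The sketch below assumes three features and uses invented coefficient values purely to show the arithmetic.

```python
import numpy as np

# Hypothetical intercept b0 and coefficients b1..b3 (illustrative values only)
b0 = 0.5
b = np.array([2.0, -1.0, 0.25])

# Two observations (rows), three independent variables X1..X3 (columns)
X = np.array([
    [1.0, 2.0, 4.0],
    [3.0, 0.0, 8.0],
])

# y = b0 + b1*X1 + b2*X2 + b3*X3, computed for every row at once
y = b0 + X @ b
print(y)  # [1.5 8.5]
```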

Objective of Linear Regression

The primary goal of linear regression is to find the best-fitting line, that is, the line that minimizes the error between actual and predicted values. This is done with the ordinary least squares (OLS) method, which minimizes the sum of squared errors or, equivalently, the Mean Squared Error (MSE):

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where:

$y_i$ is the actual value.

$\hat{y}_i$ is the predicted value.

n is the total number of observations.
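To make the objective concrete, here is a minimal sketch that fits a single-feature line with the standard closed-form least squares estimates and compares its MSE with that of a slightly different line. The data points and the helper name `mse` are invented for the example.

```python
import numpy as np

# Small made-up dataset (illustrative values only)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])


def mse(y_true, y_hat):
    """Mean of the squared differences between actual and predicted values."""
    return np.mean((y_true - y_hat) ** 2)


# Closed-form least squares estimates for one feature:
# slope = covariance(X, y) / variance(X), intercept = mean(y) - slope * mean(X)
m = np.sum((X - X.mean()) * (y - y.mean())) / np.sum((X - X.mean()) ** 2)
c = y.mean() - m * X.mean()

mse_best = mse(y, m * X + c)
# A line with a slightly different slope fits worse
mse_other = mse(y, (m + 0.1) * X + c)

print(f"slope={m:.3f}, intercept={c:.3f}")
print(f"MSE (least squares line): {mse_best:.4f}")
print(f"MSE (perturbed slope):    {mse_other:.4f}")
```

The least squares line has the lowest MSE of any straight line on this data, so any change to the slope or intercept makes the error larger.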

Assumptions of Linear Regression

For linear regression to work effectively, the following assumptions should be met:

Linearity – The relationship between independent and dependent variables should be linear.

Independence – Observations should be independent of each other.

Homoscedasticity – The variance of residuals (errors) should remain constant.

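Normality – Residuals should be approximately normally distributed. This matters mainly for confidence intervals and hypothesis tests rather than for the fitted line itself.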
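These assumptions are usually checked by looking at the residuals after fitting. The sketch below shows a few rough numerical checks on synthetic data that satisfies the assumptions by construction. The thresholds you would apply and the specific checks are a judgment call, and residual plots are often more informative than any single number.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a linear trend plus independent, constant-variance, normal noise
X = np.linspace(0, 10, 200)
y = 3.0 * X + 2.0 + rng.normal(0.0, 1.0, size=X.size)

# Fit a line with least squares (np.polyfit returns slope first, then intercept)
m, c = np.polyfit(X, y, deg=1)
residuals = y - (m * X + c)

# Homoscedasticity: residual spread should be similar for low and high X
half = X.size // 2
spread_ratio = residuals[half:].std() / residuals[:half].std()

# Independence: neighbouring residuals should be roughly uncorrelated
lag1_corr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]

# Normality: skewness of the standardized residuals should be near 0
z = (residuals - residuals.mean()) / residuals.std()
skewness = np.mean(z ** 3)

print(f"spread ratio (high X / low X): {spread_ratio:.2f}")
print(f"lag-1 autocorrelation:         {lag1_corr:.2f}")
print(f"skewness:                      {skewness:.2f}")
```

A spread ratio far from 1, a large lag-1 correlation, or strong skewness would each suggest that the corresponding assumption deserves a closer look.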

Implementing Linear Regression in Python

Let's implement a simple linear regression model to predict house prices based on the number of rooms in a house using the Boston Housing Dataset.

Step 1: Importing Required Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

Step 2: Loading the Dataset

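# Load the Boston Housing data from a public CSV file
data = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv")
# Show the first five rows to confirm the columns loaded as expected
print(data.head())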

Step 3: Selecting Features and Target Variable

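For this example, we will use ‘rm’ (average number of rooms per dwelling) as our independent variable and ‘medv’ (median value of owner-occupied homes, in thousands of dollars) as the dependent variable.

X = data[['rm']]  # double brackets keep X two-dimensional, as scikit-learn expects
y = data['medv']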

Step 4: Splitting Data into Training and Testing Sets

We split the data into 80% training and 20% testing sets. Setting random_state fixes the shuffle, so the split is reproducible.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
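To see what the split produces, the sketch below applies the same call to a tiny made-up dataset of ten rows and checks the resulting sizes. The data is invented for illustration and is not the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up observations with one feature (illustrative data only)
X = np.arange(10).reshape(-1, 1)
y = 2 * np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 1) (2, 1)

# The same random_state gives the same split on every run
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test_again))  # True
```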

Step 5: Training the Linear Regression Model

model = LinearRegression()
model.fit(X_train, y_train)
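After fitting, the learned slope and intercept are available on the model as `coef_` and `intercept_`. The sketch below uses noise-free synthetic data generated from y = 3x + 4, an assumption made only so the expected values are known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data generated from y = 3x + 4 (illustrative only)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X.ravel() + 4.0

model = LinearRegression()
model.fit(X, y)

# coef_ holds one slope per feature; intercept_ holds the constant term
slope = model.coef_[0]
intercept = model.intercept_
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

On the housing model, the same attributes give the estimated change in `medv` per additional room and the predicted value at zero rooms.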

Step 6: Making Predictions

y_pred = model.predict(X_test)

Step 7: Evaluating the Model

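We calculate the Mean Squared Error (MSE) and the R-squared (R²) score to measure model performance. A lower MSE means smaller prediction errors, and an R² closer to 1 means the model explains more of the variance in the target.

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")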
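For intuition about what these two numbers mean, the sketch below computes them by hand with NumPy on a few made-up values. R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The root of the MSE (RMSE) is included as an optional extra because it is in the same units as the target.

```python
import numpy as np

# Made-up actual and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# Mean squared error and its square root
mse = np.mean((y_true - y_hat) ** 2)
rmse = np.sqrt(mse)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MSE={mse:.4f}, RMSE={rmse:.4f}, R^2={r2:.4f}")
# MSE=0.1875, RMSE=0.4330, R^2=0.9625
```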

Step 8: Visualizing the Regression Line

# Actual test-set prices as points
plt.scatter(X_test, y_test, color='blue', label='Actual Prices')
# Model predictions on the same inputs, drawn as the fitted line
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel('Average Number of Rooms')
plt.ylabel('Median House Price')
plt.title('Linear Regression Model')
plt.legend()
plt.show()

Extending to Multiple Linear Regression

To include more features, such as the crime rate (crim), the property tax rate (tax), and the pupil-teacher ratio (ptratio), we select additional columns for X:

X = data[['rm', 'crim', 'tax', 'ptratio']]
y = data['medv']

We then follow the same steps as above, but the model will now take multiple independent variables into account.
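The same workflow with several features might look like the sketch below. It uses a synthetic four-feature dataset as a stand-in so that it runs without downloading anything. The coefficient values, noise level, and sample size are all assumptions made for illustration and say nothing about the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for a four-feature dataset (not the housing data)
n_samples = 200
X = rng.normal(size=(n_samples, 4))
true_coefs = np.array([5.0, -1.0, -0.5, -2.0])  # hypothetical values
y = 20.0 + X @ true_coefs + rng.normal(scale=0.5, size=n_samples)

# Same steps as before: split, fit, predict, evaluate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# One coefficient per feature, plus a single intercept
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R-squared:", r2_score(y_test, y_pred))
```

The only structural difference from the single-feature case is that `coef_` now contains four values, one for each column of X.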

Key Takeaways

Linear Regression is a powerful technique for predicting continuous values.

It finds the best-fit line by minimizing the Mean Squared Error.

Simple Linear Regression considers only one independent variable, whereas Multiple Linear Regression considers multiple variables.

It is computationally efficient and easy to interpret but assumes a linear relationship between variables.

Conclusion

Linear Regression is a fundamental technique in machine learning that is widely used for predictive modeling in various fields such as finance, real estate, and healthcare. It is an essential concept for any data scientist or machine learning engineer to understand.
