
Introduction
Linear Regression is one of the most fundamental algorithms in machine learning, commonly used for predicting continuous numerical values such as house prices, sales revenue, stock prices, and temperature. It is a statistical method that models the relationship between independent variables (features) and a dependent variable (target) as a linear function, which is a straight line when there is a single feature.
This article will cover the basics of linear regression, its mathematical formulation, assumptions, and a Python implementation using Scikit-Learn.
Understanding Linear Regression
Types of Linear Regression
Linear regression can be classified into two main types:
Simple Linear Regression: uses one independent variable to predict the dependent variable.
Multiple Linear Regression: uses multiple independent variables to predict the dependent variable.
Mathematical Representation
Simple Linear Regression Formula
y=mX+c
Where:
y is the dependent variable (predicted value).
X is the independent variable (input feature).
m is the slope (coefficient).
c is the y-intercept.
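As a quick illustration of the formula, the sketch below evaluates y = mX + c for a few inputs. The slope and intercept values are made up for the example and are not fitted to any dataset.

```python
import numpy as np

# Hypothetical slope and intercept, chosen only for illustration
m = 2.0
c = 1.0

# A few input values for the independent variable X
X = np.array([0.0, 1.0, 2.0, 3.0])

# Apply y = mX + c element-wise
y = m * X + c
print(y)  # [1. 3. 5. 7.]
```

Each increase of 1 in X raises the prediction by m, and c is the prediction when X is 0.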
Multiple Linear Regression Formula
y = b0 + b1X1 + b2X2 + ... + bnXn
Where:
b0 is the intercept.
b1, b2, ..., bn are the coefficients of the independent variables X1, X2, ..., Xn.
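The multiple regression formula can be read as an intercept plus a dot product between a row of feature values and the coefficient vector. The sketch below shows this with NumPy. The coefficient and feature values are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical intercept and coefficients (not fitted to any data)
b0 = 0.5
b = np.array([2.0, -1.0, 0.5])  # b1, b2, b3

# Two observations, each with three feature values X1, X2, X3
X = np.array([
    [1.0, 2.0, 4.0],
    [0.0, 1.0, 2.0],
])

# y = b0 + b1*X1 + b2*X2 + b3*X3, computed for every row at once
y = b0 + X @ b
print(y)  # [2.5 0.5]
```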
Objective of Linear Regression
The primary goal of linear regression is to find the best-fitting line, the one that minimizes the error between actual and predicted values. This is done using the least squares method, which minimizes the Mean Squared Error (MSE):
MSE = (1/n) Σ (yi - ŷi)^2, summed over i = 1, ..., n
Where:
yi is the actual value.
ŷi is the predicted value.
n is the total number of observations.
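To make the objective concrete, here is a minimal sketch that computes the MSE by hand and uses the standard closed-form least squares solution for a single feature. The data is synthetic and lies exactly on y = 2X + 1, and the helper name mse is an assumption made for this example.

```python
import numpy as np

# Synthetic data lying exactly on the line y = 2X + 1
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

def mse(y_true, y_hat):
    """Mean of the squared differences between actual and predicted values."""
    return np.mean((y_true - y_hat) ** 2)

# Closed-form least squares estimates for one feature
x_mean, y_mean = X.mean(), y.mean()
m = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
c = y_mean - m * x_mean

mse_best = mse(y, m * X + c)       # error of the least squares line
mse_other = mse(y, 1.0 * X + 0.0)  # error of an arbitrary line y = X

print(m, c)
print(mse_best, mse_other)
```

The least squares line recovers a slope of 2 and an intercept of 1 with zero error, while the arbitrary line y = X has an MSE of 18 on the same points.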
Assumptions of Linear Regression
For linear regression to work effectively, the following assumptions should be met:
Linearity: The relationship between independent and dependent variables should be linear.
Independence: Observations should be independent of each other.
Homoscedasticity: The variance of residuals (errors) should remain constant.
Normality: Residuals should be normally distributed.
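Some of these assumptions can be checked informally by looking at the residuals of a fitted line. The sketch below is a rough illustration on synthetic data, not a formal statistical test. It fits a line with np.polyfit, then compares the residual variance for low and high fitted values as a crude homoscedasticity check. The data, the noise level, and the split at the median are all assumptions made for this example.

```python
import numpy as np

# Synthetic data: a linear trend plus constant-variance Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
y = 3.0 * X + 5.0 + rng.normal(0, 1.0, size=200)

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(X, y, deg=1)
fitted = slope * X + intercept
residuals = y - fitted

# Crude homoscedasticity check: residual variance in the lower
# and upper half of the fitted values should be of similar size
median_fitted = np.median(fitted)
lower = residuals[fitted <= median_fitted]
upper = residuals[fitted > median_fitted]
variance_ratio = upper.var() / lower.var()

print(f"mean residual: {residuals.mean():.2e}")
print(f"variance ratio (upper / lower): {variance_ratio:.2f}")
```

A variance ratio far from 1, or residuals that show a clear pattern when plotted against the fitted values, would suggest that the linearity or homoscedasticity assumption does not hold.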
Implementing Linear Regression in Python
Let's implement a simple linear regression model to predict house prices based on the number of rooms in a house using the Boston Housing Dataset.
Step 1: Importing Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
Step 2: Loading the Dataset
data = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv")
print(data.head())
Step 3: Selecting Features and Target Variable
For this example, we will use 'rm' (average number of rooms per dwelling) as our independent variable and 'medv' (median home value, in thousands of dollars) as the dependent variable.
X = data[['rm']]
y = data['medv']
Step 4: Splitting Data into Training and Testing Sets
We split the data into 80% training and 20% testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
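To see what the split produces, the following sketch runs train_test_split on a small synthetic array instead of the housing data, so it works without a download. With 10 samples and test_size=0.2, 8 rows go to training and 2 to testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples, 1 feature (not the housing dataset)
X = np.arange(10).reshape(-1, 1)
y = 2 * X.ravel() + 1

# Same call as in the article: 20% held out, fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (8, 1) (2, 1)
```

Fixing random_state makes the shuffle repeatable, so the same rows land in the test set on every run.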
Step 5: Training the Linear Regression Model
model = LinearRegression()
model.fit(X_train, y_train)
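After fitting, the learned slope and intercept are available through the model's coef_ and intercept_ attributes. The sketch below shows this on noise-free synthetic data generated from y = 3X + 2, which is an assumption made so the expected values are known in advance. On the housing data the numbers will differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data generated from y = 3X + 2
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = 3.0 * X.ravel() + 2.0

model = LinearRegression()
model.fit(X, y)

# coef_ holds one coefficient per feature; intercept_ is the constant term
slope = model.coef_[0]
intercept = model.intercept_
print(f"slope (m): {slope:.3f}")
print(f"intercept (c): {intercept:.3f}")
```

These two values correspond to m and c in the simple linear regression formula.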
Step 6: Making Predictions
y_pred = model.predict(X_test)
Step 7: Evaluating the Model
We calculate the Mean Squared Error (MSE) and R-squared (R²) score to measure model performance.
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")
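To show what these two metrics measure, the sketch below computes them by hand on a tiny set of made-up values and compares the results with Scikit-Learn's functions. R² is computed as 1 minus the ratio of the residual sum of squares to the total sum of squares. The numbers are hypothetical and are not output from the housing model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# MSE: average squared difference between actual and predicted values
mse_manual = np.mean((y_true - y_hat) ** 2)

# R-squared: share of the variance in y_true explained by the predictions
ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

A lower MSE means predictions are closer to the actual values, and an R² closer to 1 means the model explains more of the variation in the target.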
Step 8: Visualizing the Regression Line
plt.scatter(X_test, y_test, color='blue', label='Actual Prices')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel('Average Number of Rooms')
plt.ylabel('Median House Price')
plt.title('Linear Regression Model')
plt.legend()
plt.show()
Extending to Multiple Linear Regression
If we want to include more features, such as crime rate (crim), property tax rate (tax), and pupil-teacher ratio (ptratio), we change the feature selection:
X = data[['rm', 'crim', 'tax', 'ptratio']]
y = data['medv']
We then follow the same steps as above, but the model will now take multiple independent variables into account.
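The full multiple regression workflow can be sketched end to end as follows. To keep the example self-contained, it uses a synthetic DataFrame with three generic feature columns (x1, x2, x3) and a noise-free target instead of the housing data. The column names and the coefficients used to generate the target are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic features (hypothetical stand-ins for real columns)
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "x1": rng.uniform(0, 10, 100),
    "x2": rng.uniform(0, 5, 100),
    "x3": rng.uniform(-1, 1, 100),
})

# Noise-free target built from known coefficients: 1.5 + 2*x1 - 3*x2 + 0.5*x3
target = 1.5 + 2.0 * features["x1"] - 3.0 * features["x2"] + 0.5 * features["x3"]

X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# One coefficient per feature column, in column order
for name, coef in zip(features.columns, model.coef_):
    print(f"{name}: {coef:.3f}")
print(f"intercept: {model.intercept_:.3f}")

test_r2 = r2_score(y_test, model.predict(X_test))
print(f"test R-squared: {test_r2:.3f}")
```

Because the target here has no noise, the model recovers the generating coefficients almost exactly. On real data the coefficients are estimates and the R² will be below 1.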
Key Takeaways
Linear Regression is a powerful technique for predicting continuous values.
It finds the best-fit line by minimizing the Mean Squared Error.
Simple Linear Regression considers only one independent variable, whereas Multiple Linear Regression considers multiple variables.
It is computationally efficient and easy to interpret but assumes a linear relationship between variables.
Conclusion
Linear Regression is a fundamental technique in machine learning that is widely used for predictive modeling in various fields such as finance, real estate, and healthcare. It is an essential concept for any data scientist or machine learning engineer to understand.
About the Creator
Alomgir Kabir
I am a machine learning engineer. I work with computer vision, NLP, AI, generative AI, LLM models, Python, PyTorch, Pandas, NumPy, audio processing, video processing, and Selenium web scraping.
