
Logistic Regression – Used for classification problems (e.g., spam detection)

Logistic Regression

By Alomgir Kabir · Published 11 months ago · 3 min read

Logistic regression is a fundamental statistical and machine learning technique widely used for binary classification problems, such as distinguishing between spam and non-spam emails. It is a type of regression analysis but differs from linear regression by predicting categorical outcomes rather than continuous ones. Despite its simplicity, logistic regression often yields competitive performance for a wide range of classification tasks.

What is Logistic Regression?

At its core, logistic regression is a method that models the probability of a binary outcome based on one or more predictor variables. These predictor variables (features) can be either continuous or categorical. Unlike linear regression, which predicts a continuous output, logistic regression produces a probability score between 0 and 1, which can then be mapped to a class label (0 or 1).
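Concretely, that probability comes from passing a weighted sum of the features through the logistic (sigmoid) function. Here is a minimal NumPy sketch; the intercept, weights, and feature values are made-up numbers purely for illustration:

import numpy as np

def sigmoid(z):
    # Squashes any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# Hypothetical intercept (b0), weights (b), and one example's feature values (x)
b0 = -1.0
b = np.array([0.8, 2.5])
x = np.array([1.2, 0.4])

z = b0 + np.dot(b, x)   # weighted sum of the features
p = sigmoid(z)          # probability of the positive class (1)
print(round(p, 2))      # about 0.72 for these made-up numbers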

How Does Logistic Regression Work?

The logistic regression model is trained using a technique called Maximum Likelihood Estimation (MLE), which finds the values of the coefficients that maximize the likelihood of the observed data given the model. The logistic function ensures that the output of the model is always between 0 and 1, making it interpretable as a probability.
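To make "maximize the likelihood" concrete, the quantity below is what MLE pushes as high as possible by adjusting the coefficients. The labels and predicted probabilities here are invented just to show the calculation:

import numpy as np

y_true = np.array([1, 0, 1, 1, 0])            # observed labels
p_hat = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # model's predicted probabilities for class 1

# Log-likelihood: log(p) for the positive examples, log(1 - p) for the negative ones.
# Training searches for coefficients that make this value as large (least negative) as possible.
log_likelihood = np.sum(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
print(log_likelihood)   # about -1.30 here; a better-fitting model would be closer to 0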

For binary classification, the predicted probability is compared to a threshold (usually 0.5). If the probability is greater than or equal to the threshold, the instance is classified as the positive class (1), otherwise, it is classified as the negative class (0).
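In scikit-learn, predict() applies the 0.5 threshold for you, but you can pull out the raw probabilities with predict_proba() and choose your own cut-off. A short sketch, assuming a model trained the way the full script later in this article does it:

# Column 1 of predict_proba holds the probability of the positive class (spam = 1)
probs = model.predict_proba(X_test_scaled)[:, 1]

# Default behaviour: spam if the probability is at least 0.5
y_pred_default = (probs >= 0.5).astype(int)

# A stricter cut-off flags fewer emails as spam (useful when false positives are costly)
y_pred_strict = (probs >= 0.8).astype(int)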

Key Assumptions of Logistic Regression

Linear Relationship: Logistic regression assumes a linear relationship between the log-odds of the dependent variable and the independent variables.

Independence of Observations: The model assumes that the observations are independent of one another, meaning the outcome of one observation does not influence another.

No or Little Multicollinearity: The predictor variables should not be highly correlated with each other, as this can lead to unstable estimates (a quick way to check this is sketched right after this list).

Large Sample Size: Logistic regression performs best with a relatively large sample size to ensure reliable estimates of the coefficients.
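As mentioned under the multicollinearity assumption, a quick screen before fitting is to look at pairwise correlations between the features. A minimal sketch with pandas, reusing the hypothetical email_data.csv from the code later in this article (the 0.9 cut-off is a common rule of thumb, not a hard rule):

import pandas as pd

data = pd.read_csv('email_data.csv')
features = data.drop('label', axis=1)

corr = features.corr().abs()   # absolute pairwise correlations between features
# List feature pairs that are very highly correlated with each other
high_pairs = [(a, b, round(corr.loc[a, b], 2))
              for i, a in enumerate(corr.columns)
              for b in corr.columns[i + 1:]
              if corr.loc[a, b] > 0.9]
print(high_pairs)   # consider dropping or combining one feature from each pair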

Applications of Logistic Regression

Logistic regression is widely used in various domains, including:

Spam Detection: One of the most common uses of logistic regression is for email spam classification. By analyzing features like the frequency of certain words, the length of the email, or the sender’s address, logistic regression can classify an email as either spam (1) or not spam (0).

Medical Diagnosis: Logistic regression is used to model the likelihood of disease presence. For example, a logistic regression model might predict the probability that a patient has a certain condition based on features like age, blood pressure, and cholesterol levels.
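As a toy illustration of that medical use case (the numbers below are entirely invented, not clinical data), a model fitted on age, blood pressure, and cholesterol can return a risk probability for a new patient:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Made-up training data: [age, blood pressure, cholesterol] with 0/1 disease labels
X = np.array([[45, 120, 180], [62, 150, 240], [50, 130, 200],
              [70, 160, 260], [58, 148, 245], [52, 135, 210]], dtype=float)
y = np.array([0, 1, 1, 1, 0, 0])

# Scale the features, then fit logistic regression (the same recipe as the spam example below)
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X, y)

new_patient = np.array([[58, 145, 230]])
print(clf.predict_proba(new_patient)[0, 1])   # estimated probability that the condition is present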

Code

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Sample dataset: replace this with your actual dataset.
# For example, let's assume 'email_data.csv' contains features like 'email_length', 'word_count', etc.,
# and a 'label' column which is 0 (not spam) or 1 (spam).
data = pd.read_csv('email_data.csv')

# Features (X) and labels (y)
X = data.drop('label', axis=1)  # Drop the label column to get the feature matrix
y = data['label']               # Label column (target)

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling (important for logistic regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create the logistic regression model
model = LogisticRegression()

# Train the model
model.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# For additional insights, we can also check the model coefficients (weights)
print("\nModel Coefficients (weights):")
print(model.coef_)
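One more note on those coefficients: because the model is linear in the log-odds, exponentiating a weight turns it into an odds ratio, roughly how much the odds of spam are multiplied when that (scaled) feature increases by one unit. A small optional addition to the script above, using whatever feature columns your CSV actually contains:

# Odds ratios: exp(weight) > 1 pushes an email toward spam, < 1 pushes it away
odds_ratios = np.exp(model.coef_[0])
for feature, ratio in zip(X.columns, odds_ratios):
    print(f"{feature}: odds ratio = {ratio:.2f}")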


About the Creator

Alomgir Kabir

I am a machine learning engineer. I work with computer vision, NLP, AI, generative AI, LLMs, Python, PyTorch, Pandas, NumPy, audio processing, video processing, and Selenium web scraping.
