Decision Trees in Python

Decision Trees

By Alomgir Kabir • Published about a year ago • 3 min read

Decision trees are versatile and intuitive models that are widely used for both classification and regression tasks. They split the data into subsets based on feature values, creating a tree-like structure where each internal node represents a decision based on a feature, and each leaf node represents a predicted outcome (class or value).

What is a Decision Tree?

A decision tree is a supervised machine learning algorithm that recursively splits the dataset into subsets based on the feature values. These splits are made to maximize the homogeneity of the target variable in each subset. In classification tasks, the goal is to separate the data into distinct classes, while in regression tasks, the aim is to predict a continuous value.

Structure of a Decision Tree

Root Node: Represents the entire dataset. It is where the first split happens, using the feature and threshold that best separate the data under the chosen criterion.

Internal Nodes: Represent features and the conditions that split the data.

Branches: The outcome of a decision from a parent node, leading to another internal node or a leaf node.

Leaf Nodes: Represent the final prediction, which is either a class label (in classification) or a continuous value (in regression).
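The four parts described above can be sketched as a small data structure. This is a minimal illustration, not how any particular library stores its trees: the Node class, its field names, and the hand-built tree (with made-up thresholds on age and income) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical structure for illustration)."""
    feature: Optional[str] = None        # feature tested at an internal node
    threshold: Optional[float] = None    # go left if value <= threshold
    left: Optional["Node"] = None        # branch taken when the condition holds
    right: Optional["Node"] = None       # branch taken when it does not
    prediction: Union[str, float, None] = None  # set only on leaf nodes

    def is_leaf(self) -> bool:
        return self.prediction is not None


def predict(node: Node, sample: dict) -> Union[str, float]:
    """Follow branches from the root until a leaf node is reached."""
    while not node.is_leaf():
        if sample[node.feature] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction


# Hand-built tree: the root splits on income, one branch splits again on age.
# The thresholds are invented for this example.
root = Node(
    feature="income", threshold=50_000,
    left=Node(prediction="no"),
    right=Node(
        feature="age", threshold=30,
        left=Node(prediction="no"),
        right=Node(prediction="yes"),
    ),
)

print(predict(root, {"age": 45, "income": 72_000}))  # prints "yes"
```

Each call to predict walks exactly one root-to-leaf path, which is why a trained tree is cheap to evaluate and easy to read as a chain of if/else rules.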

How Does a Decision Tree Work?

At each step, the decision tree algorithm selects the feature that best splits the data based on a criterion such as:

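Gini Impurity (for classification): A measure of how often a randomly chosen element from the set would be incorrectly classified if it were labeled at random according to the class distribution in that set. It is 0 when every sample in a node belongs to the same class, and its maximum is 1 - 1/k for k equally represented classes (0.5 for a binary problem), so it never actually reaches 1.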

Entropy (for classification): A measure of uncertainty or impurity in a dataset. The aim is to reduce entropy (i.e., disorder) with each split.

Mean Squared Error (MSE) (for regression): A measure of the difference between actual and predicted values. The goal is to minimize this error.
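The three criteria above can be written out in a few lines of plain Python. This is a sketch using the standard textbook definitions (Gini as 1 minus the sum of squared class proportions, entropy in bits, MSE around the node mean); the function names are my own and are not taken from any library.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the classes present."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def mse(values):
    """Mean squared error of a node that predicts the mean of its values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


mixed = ["yes", "yes", "no", "no"]   # evenly mixed binary node
pure = ["yes", "yes", "yes"]         # perfectly homogeneous node

print(gini(mixed))      # 0.5
print(gini(pure))       # 0.0
print(entropy(mixed))   # 1.0
print(mse([2.0, 4.0]))  # 1.0
```

A candidate split is scored by computing one of these measures on each child subset and taking the average weighted by subset size; the split with the lowest weighted value wins.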

Once the data is split, the process repeats recursively for each subset until a stopping criterion is met, such as a maximum tree depth or a minimum number of samples in a node.
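The recursion and its stopping criteria can be sketched from scratch as follows. This is a simplified illustration assuming numeric features, Gini impurity, and an exhaustive search over observed values as thresholds; production libraries such as scikit-learn use more efficient and more careful implementations. The toy rows (age, income in thousands) and all function names are invented for the example.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send samples both ways
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build(rows, labels, depth=0, max_depth=3, min_samples=2):
    """Recursively grow a tree; leaves are class labels, internal nodes are dicts."""
    majority = Counter(labels).most_common(1)[0][0]
    # Stopping criteria: depth limit, too few samples, or a pure node
    if depth >= max_depth or len(rows) < min_samples or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels)
    if split is None:  # no split reduces impurity
        return majority
    f, t = split
    go_left = [r[f] <= t for r in rows]
    left_rows = [r for r, g in zip(rows, go_left) if g]
    left_labels = [y for y, g in zip(labels, go_left) if g]
    right_rows = [r for r, g in zip(rows, go_left) if not g]
    right_labels = [y for y, g in zip(labels, go_left) if not g]
    return {
        "feature": f,
        "threshold": t,
        "left": build(left_rows, left_labels, depth + 1, max_depth, min_samples),
        "right": build(right_rows, right_labels, depth + 1, max_depth, min_samples),
    }


def predict(tree, row):
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Invented toy data: (age, income in thousands) -> bought?
rows = [(22, 25), (25, 32), (47, 60), (52, 85), (46, 70), (56, 40)]
labels = ["no", "no", "yes", "yes", "yes", "no"]

tree = build(rows, labels)
print(tree)
print(predict(tree, (30, 90)))
```

On this toy data a single split on income (feature index 1) at 40 separates the classes completely, so both children are pure and the recursion stops after one level even though max_depth would allow more.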

Example of a Classification Decision Tree

Consider a simple classification task where we want to predict whether a customer will buy a product based on their age and income.
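A small sketch of that scenario with scikit-learn is shown below. The eight customers and their values are invented purely for illustration, and the column names (age, income, bought) are assumptions; export_text is used only to print the learned rules in a readable form.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data, invented for illustration only
customers = pd.DataFrame({
    "age":    [22, 25, 47, 52, 46, 56, 33, 28],
    "income": [25_000, 32_000, 60_000, 85_000, 70_000, 40_000, 58_000, 30_000],
    "bought": [0, 0, 1, 1, 1, 0, 1, 0],
})

X = customers[["age", "income"]]
y = customers["bought"]

# A shallow tree keeps the learned rules easy to read
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Print the learned splits as nested if/else rules
print(export_text(clf, feature_names=["age", "income"]))

# Predict for a new (also hypothetical) customer
new_customer = pd.DataFrame({"age": [40], "income": [65_000]})
print(clf.predict(new_customer))
```

In this made-up dataset income alone separates buyers from non-buyers, so the tree needs only one split. Real customer data is rarely this clean, which is where deeper trees and the pruning techniques in the next section come in.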

Pruning the Tree

Pruning is a technique used to combat overfitting by trimming down the tree. This can be done in two ways:

Pre-pruning: Stopping the tree from growing further by setting constraints (e.g., limiting the maximum depth or the minimum number of samples per leaf).

Post-pruning: Growing the tree fully and then trimming it back by removing nodes that provide little predictive power.
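Both approaches can be tried in scikit-learn, as sketched below. Pre-pruning uses the max_depth and min_samples_leaf constructor parameters. For post-pruning, scikit-learn offers minimal cost-complexity pruning through the ccp_alpha parameter, which is one specific post-pruning method and not the only one. The dataset (scikit-learn's bundled breast cancer data) and the choice of the middle alpha value are arbitrary assumptions made for the example; in practice alpha would be chosen by cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Bundled example dataset, chosen only so the sketch runs without a CSV file
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: grow the tree until every leaf is pure
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: constrain growth up front
pre = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: compute the candidate alphas for cost-complexity pruning,
# then refit with one of them (the middle value here is an arbitrary pick)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

for name, tree in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(f"{name:12s} leaves={tree.get_n_leaves():3d} "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```

Comparing the leaf counts and test accuracy of the three trees shows the trade-off directly: the pruned trees are smaller, and on many datasets they generalize as well as or better than the fully grown one.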

Applications of Decision Trees

Classification Problems:

Email spam detection

Customer churn prediction

Medical diagnosis (e.g., predicting diseases based on symptoms)

Regression Problems:

Predicting house prices based on features like location, size, and number of rooms

Stock price prediction

Forecasting sales based on historical data
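For the regression case, scikit-learn provides DecisionTreeRegressor, which is used the same way as the classifier but predicts a continuous value. The sketch below is loosely modeled on the house-price example: the data is synthetic, generated from a made-up formula with noise, and the feature names (size, rooms) are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data from an invented formula; not real house prices
rng = np.random.default_rng(0)
size = rng.uniform(50, 250, size=200)      # floor area
rooms = rng.integers(1, 7, size=200)       # number of rooms (1 to 6)
price = 1_000 * size + 5_000 * rooms + rng.normal(0, 10_000, size=200)

X = np.column_stack([size, rooms])

# Train on the first 160 rows, hold out the last 40
reg = DecisionTreeRegressor(max_depth=4, random_state=0)
reg.fit(X[:160], price[:160])

print("R^2 on held-out rows:", reg.score(X[160:], price[160:]))
print("Predicted price:", reg.predict([[120, 3]]))
```

Each leaf of a regression tree predicts the mean target value of the training samples that fall into it, so the predictions form a step function over the feature space.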

Decision Trees in Python (using scikit-learn)

Here's an example implementation of a decision tree for classification using the scikit-learn library:

Code

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

# Import necessary libraries
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Load dataset (replace 'data.csv' and the 'label' column name with your own)
data = pd.read_csv('data.csv')

# Features (X) and labels (y)
X = data.drop('label', axis=1)  # Features
y = data['label']               # Target

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the decision tree classifier
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy Score:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Visualizing the decision tree (optional)
# class_names is taken from the fitted model so it matches the actual labels,
# whatever their number or values
plt.figure(figsize=(12, 8))
plot_tree(
    model,
    filled=True,
    rounded=True,
    feature_names=list(X.columns),
    class_names=[str(c) for c in model.classes_],
)
plt.show()


