Generate Synthetic Data to Build Robust Machine Learning Models in Data-Scarce Scenarios
Explore statistical approaches to transform expert knowledge into data, with practical examples

Kuriko Iwai
Introduction
Machine learning models need to be trained on sufficient, high-quality data that reflects the patterns they will face in the future in order to make accurate predictions.
Generating synthetic data is a powerful technique to address various challenges, especially when real-world data is inaccurate or insufficient for model training.
In this article, I’ll explore major synthetic data generation methods that leverage statistical / probabilistic models. I’ll examine:
univariate approaches driven by PDF estimations and
multivariate approaches like Kernel Density Estimation and Bayesian Networks,
using a real-world use case as an example.
What is Synthetic Data Generation
Synthetic data generation is a data enhancement technique in machine learning that generates new data from scratch.
Its fundamental approaches involve using statistical models or deep generative models to analyze the patterns and relationships within existing data to produce new data:
Statistical / Probabilistic Models:
Univariate approach: Column-by-column PDF Estimation
Multivariate approach: Kernel Density Estimation (KDE), Bayesian Networks
Deep Generative Models:
Generative Adversarial Networks (GANs)
Variational Autoencoders (VAEs)
In the statistical model approach, we can take univariate approaches, where each column (feature) is examined and generated independently, or multivariate approaches, where multiple correlated columns are generated and updated together, as sketched below.
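To make the contrast concrete, here is a minimal sketch of the two statistical approaches on a toy dataset, assuming scipy and pandas are available. The column names ("age", "income") and the underlying distributions are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical "real" data with two correlated-looking numeric columns
rng = np.random.default_rng(42)
real = pd.DataFrame({
    "age": rng.normal(40, 10, 500),
    "income": rng.normal(60_000, 15_000, 500),
})

# Univariate approach: fit and sample each column independently
# (any correlation between columns is lost).
synthetic_uni = pd.DataFrame({
    col: stats.norm(*stats.norm.fit(real[col])).rvs(500, random_state=0)
    for col in real.columns
})

# Multivariate approach: a Gaussian KDE over all columns jointly
# (the joint structure between columns is preserved).
kde = stats.gaussian_kde(real.to_numpy().T)
synthetic_multi = pd.DataFrame(kde.resample(500, seed=0).T, columns=real.columns)
```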
Why Synthetic Data: Typical Use Cases
Even sophisticated machine learning algorithms perform poorly when training data is scarce or inaccurate.
But securing high-quality data, crucial for robust models, is challenging due to its unavailability or imperfections.
Among many data enhancement techniques, generating synthetic data can offer comprehensive solutions to tackle these challenges:
Real data is unavailable or extremely limited: New products, rare events, niche scenarios, hypothetical or future conditions lack historical data to test the model’s resilience.
Data privacy is paramount: Sensitive information (e.g., medical records, financial transactions, personally identifiable information) cannot be directly used for development or sharing due to regulations (GDPR, HIPAA) or ethical concerns.
Accelerating development: Providing immediate access to large datasets, removing dependencies on lengthy data collection or access approval processes.
Now, I’ll detail how we can leverage this in a real-world use case.
Univariate Approach: Column-by-Column PDF Estimation
Univariate approaches focus on understanding a probability density function (PDF) of each individual column (or feature) in the dataset.
This approach is based on the assumption that each column is independent, with no correlation to the other columns in the dataset.
Hence, sampling occurs independently for each column. When generating synthetic data, values for one column are drawn from its estimated univariate distribution, regardless of the values being generated for any other column.
Best when:
The dataset has dominant columns or is simple enough to disregard correlations.
Computational efficiency is required.
How the Univariate Approach Works
In the univariate approach, we take three simple steps to generate a synthetic dataset:
Step 1. Estimate a theoretical PDF of the column we want to generate.
Step 2. Based on the estimation, generate new synthetic values for the column (I’ll cover various methods later).
Step 3. Combine the synthetic column with an existing dataset.
Step 1 is critical for synthetic data to accurately reflect a true underlying distribution of real-world data.
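Here is a minimal sketch of these three steps, assuming scipy and pandas; the column name "unit_price" and the candidate distributions are illustrative assumptions, not part of a specific use case.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical real-world column to mimic
rng = np.random.default_rng(0)
df = pd.DataFrame({"unit_price": rng.lognormal(mean=3.0, sigma=0.4, size=300)})

# Step 1: estimate a theoretical PDF by fitting candidate distributions
# and keeping the one with the best (lowest) Kolmogorov-Smirnov statistic.
candidates = {"lognorm": stats.lognorm, "gamma": stats.gamma, "norm": stats.norm}
best_name, best_dist, best_ks = None, None, np.inf
for name, dist in candidates.items():
    params = dist.fit(df["unit_price"])
    ks = stats.kstest(df["unit_price"], dist(*params).cdf).statistic
    if ks < best_ks:
        best_name, best_dist, best_ks = name, dist(*params), ks

# Step 2: generate new synthetic values from the estimated PDF.
synthetic_values = best_dist.rvs(size=500, random_state=1)

# Step 3: combine the synthetic column into a new dataset.
synthetic_df = pd.DataFrame({"unit_price": synthetic_values})
print(best_name, round(best_ks, 3))
```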
In the next section, I’ll explore how the algorithm mimics the PDFs of real-world data, starting with an overview of PDFs.
What is a Probability Density Function (PDF)
A Probability Density Function (PDF) is a statistical function that describes the relative likelihood of a continuous random variable taking on a given value.
The following figure visualizes a valid PDF for a continuous random variable X uniformly distributed between 0 and 5 (uniform distribution).
The area under the PDF curve indicates the probability of a random value falling within a certain range.
For instance, the probability of a random value x being in the range of 0.9 to 1.6 is 14%, because the area under the PDF curve over that range is approximately 0.14, as highlighted in pink.
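That 14% figure can be reproduced with a few lines of scipy; this is a small sketch of the same calculation for the Uniform(0, 5) distribution shown above.

```python
from scipy import stats

# Uniform distribution between 0 and 5, so the PDF height is 1/5 = 0.2
X = stats.uniform(loc=0, scale=5)

# Area under the PDF between 0.9 and 1.6 = P(0.9 <= X <= 1.6)
prob = X.cdf(1.6) - X.cdf(0.9)
print(prob)  # 0.14, i.e. a 14% probability
```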

