Generate Synthetic Data to Build Robust Machine Learning Models in Data-Scarce Scenarios
Explore statistical approaches to transform expert knowledge into data, with practical examples

Kuriko Iwai
Introduction
Machine learning models need to be trained on sufficient, high-quality data that reflects the patterns they will face in the future in order to make accurate predictions.
Generating synthetic data is a powerful technique to address various challenges, especially when real-world data is inaccurate or insufficient for model training.
In this article, I’ll explore major synthetic data generation methods that leverage statistical / probabilistic models. I’ll examine:
univariate approaches driven by PDF estimations and
multivariate approaches like Kernel Density Estimation and Bayesian Networks,
using a real-world use case as an example.
What is Synthetic Data Generation
Synthetic data generation is a data enhancement technique in machine learning that generates new data from scratch.
Its fundamental approaches involve using statistical models or deep generative models to analyze the patterns and relationships within existing data to produce new data:
Statistical / Probabilistic Models:
Univariate approach: Column-by-column PDF Estimation
Multivariate approach: Kernel Density Estimation (KDE), Bayesian Networks
Deep Generative Models:
Generative Adversarial Networks (GANs)
Variational Autoencoders (VAEs)
In the statistical model approach, we can take univariate approaches, where each column (feature) is examined and generated independently, or multivariate approaches, where multiple correlated columns are generated and updated together, as sketched below.
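To make the contrast concrete, here is a minimal sketch of the two statistical approaches on a toy dataset, assuming scipy and pandas are available. The column names ("age", "income") and the underlying distributions are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical "real" data with two correlated-looking numeric columns
rng = np.random.default_rng(42)
real = pd.DataFrame({
    "age": rng.normal(40, 10, 500),
    "income": rng.normal(60_000, 15_000, 500),
})

# Univariate approach: fit and sample each column independently
# (any correlation between columns is lost).
synthetic_uni = pd.DataFrame({
    col: stats.norm(*stats.norm.fit(real[col])).rvs(500, random_state=0)
    for col in real.columns
})

# Multivariate approach: a Gaussian KDE over all columns jointly
# (the joint structure between columns is preserved).
kde = stats.gaussian_kde(real.to_numpy().T)
synthetic_multi = pd.DataFrame(kde.resample(500, seed=0).T, columns=real.columns)
```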
Why Synthetic Data: Typical Use Cases
Even sophisticated machine learning algorithms perform poorly when training data is scarce or inaccurate.
But securing high-quality data, crucial for robust models, is challenging due to its unavailability or imperfections.
Among many data enhancement techniques, generating synthetic data can offer comprehensive solutions to tackle these challenges:
Real data is unavailable or extremely limited: New products, rare events, niche scenarios, hypothetical or future conditions lack historical data to test the model’s resilience.
Data privacy is paramount: Sensitive information (e.g., medical records, financial transactions, personally identifiable information) cannot be directly used for development or sharing due to regulations (GDPR, HIPAA) or ethical concerns.
Accelerating development: Providing immediate access to large datasets, removing dependencies on lengthy data collection or access approval processes.
Now, I’ll detail how we can leverage this in a real-world use case.
Univariate Approach: Column-by-Column PDF Estimation
Univariate approaches focus on understanding a probability density function (PDF) of each individual column (or feature) in the dataset.
This approach is based on the assumption that each column is independent, with no correlation to the other columns in the dataset.
Hence, sampling occurs independently for each column. When generating synthetic data, values for one column are drawn from its estimated univariate distribution, regardless of the values being generated for any other column.
Best when:
The dataset has dominant columns or is simple enough to disregard correlations.
Computational efficiency is required.
How the Univariate Approach Works
In the univariate approach, we take three simple steps to generate a synthetic dataset:
Step 1. Estimate a theoretical PDF of the column we want to generate.
Step 2. Based on the estimation, generate new synthetic values for the column (I’ll cover various methods later).
Step 3. Combine the synthetic column with an existing dataset.
Step 1 is critical for synthetic data to accurately reflect a true underlying distribution of real-world data.
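Here is a minimal sketch of these three steps, assuming scipy and pandas; the column name "unit_price" and the candidate distributions are illustrative assumptions, not part of a specific use case.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical real-world column to mimic
rng = np.random.default_rng(0)
df = pd.DataFrame({"unit_price": rng.lognormal(mean=3.0, sigma=0.4, size=300)})

# Step 1: estimate a theoretical PDF by fitting candidate distributions
# and keeping the one with the best (lowest) Kolmogorov-Smirnov statistic.
candidates = {"lognorm": stats.lognorm, "gamma": stats.gamma, "norm": stats.norm}
best_name, best_dist, best_ks = None, None, np.inf
for name, dist in candidates.items():
    params = dist.fit(df["unit_price"])
    ks = stats.kstest(df["unit_price"], dist(*params).cdf).statistic
    if ks < best_ks:
        best_name, best_dist, best_ks = name, dist(*params), ks

# Step 2: generate new synthetic values from the estimated PDF.
synthetic_values = best_dist.rvs(size=500, random_state=1)

# Step 3: combine the synthetic column into a new dataset.
synthetic_df = pd.DataFrame({"unit_price": synthetic_values})
print(best_name, round(best_ks, 3))
```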
In the next section, I’ll explore how the algorithm mimics the PDFs of real-world data, starting with an overview of PDFs.
What is a Probability Density Function (PDF)
A Probability Density Function (PDF) is a statistical function that describes the relative likelihood of a continuous random variable taking on a given value.
The following figure visualizes a valid PDF for a continuous random variable X uniformly distributed between 0 and 5 (uniform distribution).
The area under the PDF curve indicates the probability of a random value falling within a certain range.
For instance, the probability of a random value x being in the range of 0.9 to 1.6 is 14%, because the area under the PDF curve over that range is approximately 0.14, as highlighted in pink.
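That 14% figure can be reproduced with a few lines of scipy; this is a small sketch of the same calculation for the Uniform(0, 5) distribution shown above.

```python
from scipy import stats

# Uniform distribution between 0 and 5, so the PDF height is 1/5 = 0.2
X = stats.uniform(loc=0, scale=5)

# Area under the PDF between 0.9 and 1.6 = P(0.9 <= X <= 1.6)
prob = X.cdf(1.6) - X.cdf(0.9)
print(prob)  # 0.14, i.e. a 14% probability
```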

