
Data Analytics and Data Science

By Bahati Mulishi · Published about a year ago · 10 min read

Essential Data Analysis Techniques Every Analyst Should Know

1. Descriptive Statistics: Understanding measures of central tendency (mean, median, mode) and measures of spread (variance, standard deviation) to summarize data (a short pandas/SciPy sketch follows this list).

2. Data Cleaning: Techniques to handle missing values, outliers, and inconsistencies in data, ensuring that the data is accurate and reliable for analysis.

3. Exploratory Data Analysis (EDA): Using visualization tools like histograms, scatter plots, and box plots to uncover patterns, trends, and relationships in the data.

4. Hypothesis Testing: The process of making inferences about a population based on sample data, including understanding p-values, confidence intervals, and statistical significance.

5. Correlation and Regression Analysis: Techniques to measure the strength of relationships between variables and predict future outcomes based on existing data.

6. Time Series Analysis: Analyzing data collected over time to identify trends, seasonality, and cyclical patterns for forecasting purposes.

7. Clustering: Grouping similar data points based on characteristics, useful in customer segmentation and market analysis.

8. Dimensionality Reduction: Techniques like PCA (Principal Component Analysis) to reduce the number of variables in a dataset while preserving as much information as possible.

9. ANOVA (Analysis of Variance): A statistical method used to compare the means of three or more samples, determining if at least one mean is different.

10. Machine Learning Integration: Applying machine learning algorithms to enhance data analysis, enabling predictions, and automation of tasks.
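As noted in point 1, here is a minimal sketch of descriptive statistics and hypothesis testing (points 1 and 4) using pandas and SciPy; the store-revenue numbers and the 0.05 significance threshold are invented for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical daily revenue for two store locations
df = pd.DataFrame({
    "store_a": [120, 135, 128, 150, 142, 138, 125, 160],
    "store_b": [110, 118, 122, 130, 127, 115, 121, 133],
})

# 1. Descriptive statistics: central tendency and spread
print(df.describe())      # count, mean, std, quartiles
print(df.median())        # median per column
print(df.var(ddof=1))     # sample variance

# 4. Hypothesis testing: do the two stores differ in mean revenue?
t_stat, p_value = stats.ttest_ind(df["store_a"], df["store_b"])
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the means differ significantly.")
else:
    print("Fail to reject the null hypothesis.")
```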

Coding and Aptitude Round before the interview

Coding challenges are meant to test your coding skills (especially if you are applying for an ML engineer role). They can contain algorithm and data structure problems of varying difficulty, and they are timed according to how complicated the questions are. The goal is to test your basic algorithmic thinking.

Sometimes a more involved data science question, like making predictions from Twitter data, is also given. These challenges are hosted on platforms such as HackerRank, HackerEarth, and CoderByte. In addition, you may be asked multiple-choice questions on the fundamentals of data science and statistics. This is meant to be a filtering round in which candidates whose fundamentals are a little shaky are eliminated. These rounds are typically conducted without any manual intervention, so it is important to be well prepared.

Sometimes a separate aptitude test is conducted, either on its own or alongside the technical round. A Data Scientist is expected to have good aptitude, as the field is continuously evolving and a Data Scientist encounters new challenges every day. If you have taken the GMAT, GRE, or CAT, this should be easy for you.

Resources for Prep:

For algorithms and data structures prep, LeetCode and HackerRank are good resources.

For aptitude prep, you can refer to IndiaBix and Practice Aptitude.

For data science challenges, practice on GLabs and Kaggle.

Brilliant is an excellent resource for tricky math and statistics questions.

For practicing SQL, SQL Zoo and Mode Analytics are good resources that allow you to solve the exercises in the browser itself.

Things to Note:

Ensure that you are calm and relaxed before you attempt the challenge. Read through all the questions before you start answering them. Let your mind go into problem-solving mode before your fingers do!

If you finish the test before time is up, recheck your answers and then submit.

Sometimes these rounds don’t go your way: you had a brain fade, it just wasn’t your day. Don’t worry! Shake it off, there is always a next time, and this is not the end of the world.

NLP Steps

1. Import Libraries:

NLP modules: Popular choices include NLTK and spaCy. These libraries offer functionalities for various NLP tasks like tokenization, stemming, and lemmatization.

2. Load the Dataset:

This involves loading the text data you want to analyze. This could be from a text file, CSV file, or even an API that provides textual data.

3. Text Preprocessing:

This is a crucial step that cleans and prepares the text data for further processing. Here's a breakdown of the common sub-steps:

Removing HTML Tags: This removes any HTML code embedded within the text, as it's not relevant for NLP tasks.

Removing Punctuation: Punctuation marks like commas and periods don't hold much meaning on their own, so removing them can improve the analysis.

Stemming (Optional): This reduces words to their base form (e.g., "running" becomes "run").

Expanding Contractions: This expands contractions like "don't" to "do not" for better understanding by the NLP system.
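A minimal sketch of these preprocessing sub-steps, assuming BeautifulSoup is available for HTML stripping; the contraction map is deliberately tiny and purely illustrative.

```python
import re
import string

from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

# Tiny illustrative contraction map (a real one would be much larger)
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def preprocess(text: str) -> str:
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    # Lowercase for consistency
    text = text.lower()
    # Expand a few common contractions (before punctuation removal)
    for contraction, expanded in CONTRACTIONS.items():
        text = text.replace(contraction, expanded)
    # Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse extra whitespace
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("<p>I don't like spam!</p>"))  # -> "i do not like spam"
```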

4. Tokenization:

This breaks down the text into individual units, typically words. It allows us to analyze the text one element at a time.
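A quick NLTK tokenization sketch (the tokenizer models need a one-time download, and the exact resource name can vary slightly between NLTK versions):

```python
import nltk

nltk.download("punkt", quiet=True)  # one-time download of the tokenizer models

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Tokenization breaks text into units. Words are the most common unit."
print(sent_tokenize(text))  # two sentence strings
print(word_tokenize(text))  # ['Tokenization', 'breaks', 'text', 'into', 'units', '.', ...]
```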

5. Stemming (Optional, can be done in Text Preprocessing):

As mentioned earlier, stemming reduces words to their base form.

6. Part-of-Speech (POS) Tagging:

This assigns a grammatical tag (e.g., noun, verb, adjective) to each word in the text. It helps understand the function of each word in the sentence.

7. Lemmatization:

Similar to stemming, lemmatization reduces words to their base form, but it considers the context and aims for a grammatically correct root word (e.g., "ran" becomes "run" and "better" becomes "good", which a crude stemmer cannot do).
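Steps 5-7 in one short NLTK sketch (the tagger and WordNet data need one-time downloads, and resource names can differ slightly across NLTK versions):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads (names may vary slightly by NLTK version)
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)
nltk.download("wordnet", quiet=True)

tokens = nltk.word_tokenize("The children were running faster races")

# 5. Stemming: crude suffix stripping
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in tokens])   # e.g. 'running' -> 'run', 'races' -> 'race'

# 6. POS tagging: grammatical role of each token
print(nltk.pos_tag(tokens))                # e.g. ('running', 'VBG')

# 7. Lemmatization: dictionary-based root form (here treating tokens as verbs)
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])  # 'were' -> 'be', 'running' -> 'run'
```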

8. Label Encoding (if applicable):

If your task involves classifying text data, you might need to convert textual labels (e.g., "positive," "negative") into numerical values for the model to understand.
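A label-encoding sketch with scikit-learn, assuming sentiment-style string labels:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["positive", "negative", "neutral", "positive", "negative"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(list(encoded))                       # [2, 0, 1, 2, 0] (classes are sorted alphabetically)
print(list(encoder.classes_))              # ['negative', 'neutral', 'positive']
print(encoder.inverse_transform([2, 0]))   # ['positive' 'negative']
```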

9. Feature Extraction:

This step involves creating features from the preprocessed text data that can be used by machine learning models.

Bag-of-Words (BOW): Represents text as a histogram of word occurrences.

10. Text to Numerical Vector Conversion:

This converts the textual features into numerical vectors that machine learning models can understand. Here are some common techniques:

BOW (CountVectorizer): Creates a vector representing word frequencies.

TF-IDF Vectorizer: Similar to BOW but considers the importance of words based on their document and corpus frequency.

Word2Vec: This technique represents words as vectors based on their surrounding words, capturing semantic relationships.

GloVe: Another word embedding technique similar to Word2Vec, trained on a large text corpus.
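A short scikit-learn sketch of the two count-based options above (Word2Vec and GloVe need separate libraries such as gensim, so they are not shown); the corpus is a toy example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Bag-of-Words: raw term counts per document
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(X_bow.toarray())

# TF-IDF: counts reweighted by how rare each term is across the corpus
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```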

11. Data Splitting:

The preprocessed data is often split into training, validation, and test sets.

12. Model Building:

This involves choosing and training an NLP model suitable for your task.
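Putting steps 10-12 together, here is a hedged end-to-end sketch with scikit-learn: TF-IDF features, a train/test split, and a logistic regression classifier. The texts, labels, and split ratio are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy sentiment dataset (1 = positive, 0 = negative)
texts = [
    "loved the movie", "great acting and plot", "what a fantastic film",
    "terrible and boring", "worst movie ever", "the acting was awful",
    "really enjoyed it", "hated every minute",
]
labels = [1, 1, 1, 0, 0, 0, 1, 0]

# 11. Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# 10 + 12. Vectorize and train a classifier in one pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print(model.predict(["an absolutely great film"]))  # predicted label for a new review
```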

Data Science job listings can be confusing because:

- some expect Data Scientists to be like Data Engineers and want you to build ridiculous pipelines from scratch

- some expect Data Scientists to be like Business Analysts and require you to build Tableau dashboards for shareholders

- some expect Data Scientists to be like Software Engineers and want you to create scalable applications for serving ML models

- some expect Data Scientists to be like MLOps Engineers and ask you to set up and maintain CI/CD workflows

A-Z of essential data science concepts

A: Algorithm - A set of rules or instructions for solving a problem or completing a task.

B: Big Data - Large and complex datasets that traditional data processing applications are unable to handle efficiently.

C: Classification - A type of machine learning task that involves assigning labels to instances based on their characteristics.

D: Data Mining - The process of discovering patterns and extracting useful information from large datasets.

E: Ensemble Learning - A machine learning technique that combines multiple models to improve predictive performance.

F: Feature Engineering - The process of selecting, extracting, and transforming features from raw data to improve model performance.

G: Gradient Descent - An optimization algorithm used to minimize the error of a model by adjusting its parameters iteratively (a short sketch follows this list).

H: Hypothesis Testing - A statistical method used to make inferences about a population based on sample data.

I: Imputation - The process of replacing missing values in a dataset with estimated values.

J: Joint Probability - The probability of the intersection of two or more events occurring simultaneously.

K: K-Means Clustering - A popular unsupervised machine learning algorithm used for clustering data points into groups.

L: Logistic Regression - A statistical model used for binary classification tasks.

M: Machine Learning - A subset of artificial intelligence that enables systems to learn from data and improve performance over time.

N: Neural Network - A computer system inspired by the structure of the human brain, used for various machine learning tasks.

O: Outlier Detection - The process of identifying observations in a dataset that significantly deviate from the rest of the data points.

P: Precision and Recall - Evaluation metrics used to assess the performance of classification models.

Q: Quantitative Analysis - The process of using mathematical and statistical methods to analyze and interpret data.

R: Regression Analysis - A statistical technique used to model the relationship between a dependent variable and one or more independent variables.

S: Support Vector Machine - A supervised machine learning algorithm used for classification and regression tasks.

T: Time Series Analysis - The study of data collected over time to detect patterns, trends, and seasonal variations.

U: Unsupervised Learning - Machine learning techniques used to identify patterns and relationships in data without labeled outcomes.

V: Validation - The process of assessing the performance and generalization of a machine learning model using independent datasets.

W: Weka - A popular open-source software tool used for data mining and machine learning tasks.

X: XGBoost - An optimized implementation of gradient boosting that is widely used for classification and regression tasks.

Y: YARN - A resource manager used in Apache Hadoop for managing resources across distributed clusters.

Z: Zero-Inflated Model - A statistical model used to analyze data with excess zeros, commonly found in count data.
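To make one of the entries above concrete, here is the short gradient descent sketch referenced under entry G: plain Python fitting a one-variable linear model by repeatedly stepping the parameters against the gradient of the mean squared error. The data, learning rate, and iteration count are arbitrary choices for illustration.

```python
# Toy data, roughly y = 2x
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.0, 6.2, 8.1, 9.9]

w, b = 0.0, 0.0          # model parameters for y_hat = w * x + b
learning_rate = 0.01
n = len(xs)

for _ in range(2000):
    # Gradients of the mean squared error with respect to w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Step in the opposite direction of the gradient to reduce the error
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.2f}, b = {b:.2f}")  # should approach roughly w = 2, b = 0
```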

If you want to become a Data Scientist, you NEED to have product sense.

10 interview questions to test your product sense 👇

1. Netflix believes that viewers who watch foreign language content are more likely to remain subscribed. How would you prove or disprove this hypothesis?

2. LinkedIn believes that users who regularly update their skills get more job offers. How would you go about investigating this?

3. Snapchat is considering ways to capture an older demographic. As a Data Scientist, how would you advise your team on this?

4. Spotify leadership is wondering if they should divest from any product lines. How would you go about making a recommendation to the leadership team?

5. YouTube believes that creators who produce Shorts get better distribution on their Longs. How would you prove or disprove this hypothesis?

6. What are some suggestions you have for improving the Airbnb app? How would you go about testing this?

7. Instagram wants to develop features to help travelers. What are some ideas you have to help achieve this goal?

8. Amazon Web Services (AWS) leadership is wondering if they should discontinue any of their cloud services. How would you go about making a recommendation to the leadership team?

9. Salesforce is considering ways to better serve small businesses. As a Data Scientist, how would you advise your team on this?

10. Asana is a B2B business, and they’re considering ways to increase user adoption of their product. How would you advise your team on this?

Pandas vs. Polars: Which one should you use for your next data project?

Here’s a comparison to help you choose the right tool:

1. Performance:

Pandas: Great for small to medium-sized datasets, but it can slow down on larger data because it runs single-threaded and executes everything eagerly.

Polars: Optimized for speed with a columnar (Apache Arrow) memory layout, making it much faster for large datasets and complex operations.

2. Ease of Use:

Pandas: Highly intuitive and widely adopted, making it easy to find resources, tutorials, and community support.

Polars: Newer and less intuitive at first for those used to Pandas, but it's catching up quickly with comprehensive documentation and growing community support.

3. Memory Efficiency:

Pandas: Can be memory-intensive, especially with large DataFrames. Requires careful management to avoid memory issues.

Polars: Designed for efficient memory usage, handling larger datasets better without requiring extensive optimization.

4. API and Syntax:

Pandas: Large and mature API with extensive functionality for data manipulation and analysis.

Polars: Offers a similar API to Pandas but takes a more modern, expression-based approach. Some differences in syntax involve a learning curve.

5. Parallelism:

Pandas: Lacks built-in parallelism, requiring additional libraries like Dask for parallel processing.

Polars: Built-in parallelism out of the box, leveraging multi-threading to speed up computations.

Choose Pandas for its simplicity and compatibility with existing projects. Go for Polars when performance and efficiency with large datasets are important.
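As a hedged side-by-side sketch, here is the same group-by aggregation in both libraries (assuming pandas and a recent Polars release, where the method is spelled group_by); the toy data is invented.

```python
import pandas as pd
import polars as pl

data = {
    "city": ["Nairobi", "Nairobi", "Kigali", "Kigali"],
    "sales": [100, 150, 80, 120],
}

# Pandas: eager execution, familiar groupby API
pdf = pd.DataFrame(data)
print(pdf.groupby("city", as_index=False)["sales"].mean())

# Polars: columnar, expression-based API with built-in parallelism
pldf = pl.DataFrame(data)
print(pldf.group_by("city").agg(pl.col("sales").mean()))

# Polars can also run lazily, optimizing the query plan before execution
print(pldf.lazy().group_by("city").agg(pl.col("sales").mean()).collect())
```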

Randomized experiments are the gold standard for measuring impact. Here’s how to measure impact with randomized trials. 👇

1. Design Experiment

Planning the structure and methodology of the experiment, including defining the hypothesis, selecting metrics, and conducting a power analysis to determine sample size.

⤷ Ensures the experiment is well-structured and statistically sound, minimizing bias and maximizing reliability.

2. Implement Variants

Creating different versions of the intervention by developing and deploying the control (A) and treatment (B) versions.

⤷ Allows for a clear comparison between the current state and the proposed change.

3. Conduct Test

Choosing the right statistical test and calculating test statistics, such as confidence intervals, p-values, and effect sizes.

⤷ Ensures the results are statistically valid and interpretable.
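As a hedged sketch of this step, the snippet below compares a simulated control (A) and treatment (B) metric with SciPy's two-sample t-test and reports a simple effect size (Cohen's d). The simulated data, the pooled-standard-deviation formula, and the 0.05 threshold are illustrative assumptions, not a prescribed methodology.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated metric (e.g., minutes on site) for control (A) and treatment (B)
control = rng.normal(loc=10.0, scale=2.0, size=500)
treatment = rng.normal(loc=10.4, scale=2.0, size=500)

# Two-sample t-test: p-value for "the group means are equal"
t_stat, p_value = stats.ttest_ind(treatment, control)

# Effect size (Cohen's d) using the pooled standard deviation
pooled_std = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_std

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {cohens_d:.2f}")
if p_value < 0.05:
    print("Statistically significant difference between A and B.")
else:
    print("No statistically significant difference detected.")
```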

4. Analyze Results

Evaluating the data collected from the experiment, interpreting confidence intervals, p-values, and effect sizes to determine statistical significance and practical impact.

⤷ Helps determine whether the observed changes are meaningful and should be implemented.

5. Additional Factors

⤷ Network Effects: User interactions affecting experiment outcomes.

⤷ P-Hacking: Manipulating data for significant results.

⤷ Novelty Effects: Temporary boost from new features.

