The Data Science skill no one talks about
1. How can we deal with problems that arise when the data flows in from a variety of sources?

The Data Science skill no one talks about...
Every aspiring data scientist I talk to thinks their job starts when someone else gives them:
1. a dataset, and
2. a clearly defined metric to optimize for, e.g. accuracy
But it doesn’t.
It starts with a business problem you need to understand, frame, and solve. This is the key data science skill that separates senior from junior professionals.
Let’s go through an example.
Example
Imagine you are a data scientist at Uber. And your product lead tells you:
👩💼: “We want to decrease user churn by 5% this quarter”
We say that a user churns when she decides to stop using Uber.
But why?
There are different reasons why a user would stop using Uber. For example:
1. “Lyft is offering better prices for that geo” (pricing problem)
2. “Car waiting times are too long” (supply problem)
3. “The Android version of the app is very slow” (client-app performance problem)
You build this list ↑ by asking the right questions to the rest of the team. You need to understand the user’s experience using the app, from HER point of view.
Typically there is no single reason behind churn, but a combination of a few of these. The question is: which one should you focus on?
This is when you pull out your great data science skills and EXPLORE THE DATA 🔎.
You explore the data to understand how plausible each of the above explanations is. The output from this analysis is a single hypothesis you should consider further. Depending on the hypothesis, you will solve the data science problem differently.
For example…
Scenario 1: “Lyft Is Offering Better Prices” (Pricing Problem)
One solution would be to detect/predict the segment of users who are likely to churn (possibly using an ML Model) and send personalized discounts via push notifications. To test your solution works, you will need to run an A/B test, so you will split a percentage of Uber users into 2 groups:
The A group. No user in this group will receive any discount.
The B group. Users from this group that the model thinks are likely to churn, will receive a price discount in their next trip.
You could add more groups (e.g. C, D, E…) to test different pricing points.
In a nutshell
1. Translating business problems into data science problems is the key data science skill that separates a senior from a junior data scientist.
2. Ask the right questions, list possible solutions, and explore the data to narrow down the list to one.
3. Solve this one data science problem
How can we deal with problems that arise when the data flows in from a variety of sources?
There are many ways to go about dealing with multi-source problems. However, these are done primarily to solve the problems of:
Identifying the presence of similar/same records and merging them into a single recordRe-structuring the schema to ensure there is good schema integration
2. Where is Time Series Analysis used?
Since time series analysis (TSA) has a wide scope of usage, it can be used in multiple domains. Here are some of the places where TSA plays an important role:
Statistics
Signal processing
Econometrics
Weather forecasting
Earthquake prediction
Astronomy
Applied science
3. What are the ideal situations in which t-test or z-test can be used?
It is a standard practice that a t-test is used when there is a sample size less than 30 and the z-test is considered when the sample size exceeds 30 in most cases.
4. What is the usage of the NVL() function?
The NVL() function is used to convert the NULL value to the other value. The function returns the value of the second parameter if the first parameter is NULL. If the first parameter is anything other than NULL, it is left unchanged. This function is used in Oracle, not in SQL and MySQL. Instead of NVL() function, MySQL have IFNULL() and SQL Server have ISNULL() function.
5. What is the difference between DROP and TRUNCATE commands?
If a table is dropped, all things associated with that table are dropped as well. This includes the relationships defined on the table with other tables, access privileges, and grants that the table has, as well as the integrity checks and constraints.
However, if a table is truncated, there are no such problems as mentioned above. The table retains its original structure and the data is dropped.




Comments
There are no comments for this story
Be the first to respond and start the conversation.