The Guide on Data Exploration article provided very good tips and various insights on data exploration.
If you are in a state of mind that machine learning can sail you away from every data storm, trust me, it won't. There are no shortcuts for data exploration. Let manual data exploration techniques come to your rescue.
Data exploration, cleaning, and preparation can take up to 70% of your total project time. For enterprise data (such as customer data) or research-oriented data, the collection, exploration, cleaning, and preparation alone can take many months and sometimes years.
The following activities are involved:
1. Variable Identification
2. Univariate Analysis
3. Bivariate Analysis
4. Missing Value Treatment
5. Outlier Detection
6. Feature Engineering - Variable Transformation and Variable Creation
My takeaways are as follows:
1. Variables can be inputs/predictors or outputs/targets. Their data types (based on business context) and their category, continuous or discrete (categorical), are important because each must be treated accordingly.
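As a minimal sketch in pandas (the data and column names are hypothetical, not from the article), the stored types can be listed and split into continuous and categorical groups:

import pandas as pd

# Hypothetical customer data with a mix of variable categories
df = pd.DataFrame({
    "age": [25, 32, 47, 51],         # continuous predictor
    "gender": ["M", "F", "F", "M"],  # categorical predictor
    "churned": [0, 1, 0, 1],         # target
})

print(df.dtypes)
continuous = list(df.select_dtypes(include="number").columns)
categorical = list(df.select_dtypes(include="object").columns)
print("Continuous:", continuous)    # note: 'churned' lands here by dtype,
print("Categorical:", categorical)  # so business context must override it

This also illustrates the business-context point: churned is stored as a number but is categorical in meaning.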
2. For univariate analysis of a single variable, we measure central tendency and dispersion; a normality check also becomes part of it.
Central tendency: mean, median, mode (min and max mark the extremes rather than the center).
Dispersion: range, quartiles, IQR (interquartile range), variance, standard deviation, skewness, and kurtosis.
Visual tools: histogram and box plot.
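A minimal sketch of these univariate measures with pandas, on a toy series of my own (not from the article):

import pandas as pd

s = pd.Series([2, 4, 4, 5, 7, 9, 30])  # toy continuous variable

# Central tendency and extremes
print(s.mean(), s.median(), s.mode().iloc[0], s.min(), s.max())

# Dispersion measures
print("range:", s.max() - s.min())
print("quartiles:\n", s.quantile([0.25, 0.5, 0.75]))
print("IQR:", s.quantile(0.75) - s.quantile(0.25))
print("variance:", s.var(), "std:", s.std())
print("skewness:", s.skew(), "kurtosis:", s.kurt())

# Visual checks (uncomment if matplotlib is installed)
# s.plot.hist()
# s.plot.box()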
3. For bivariate analysis, we can analyze in different ways based on the variable categories.
For continuous-continuous data, correlation and covariance are calculated:
Correlation(X, Y) = Cov(X, Y) / sqrt(Var(X) * Var(Y))
For discrete-discrete data: two-way table, stacked column chart, and Chi-square test.
For discrete-continuous data: z-test/t-test and ANOVA.
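The following sketch runs each of these bivariate checks on toy data (assuming pandas and SciPy are available; the values are illustrative only):

import numpy as np
import pandas as pd
from scipy import stats

# Continuous-continuous: correlation derived from covariance
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 8.1, 9.8])
corr = x.cov(y) / np.sqrt(x.var() * y.var())
print(corr, x.corr(y))  # both values should match

# Discrete-discrete: two-way table and Chi-square test
gender = pd.Series(["M", "F", "M", "F", "M", "F"])
bought = pd.Series(["yes", "yes", "no", "yes", "no", "no"])
chi2, p, dof, expected = stats.chi2_contingency(pd.crosstab(gender, bought))

# Discrete-continuous: t-test between two groups (ANOVA for 3+ groups)
t, p_t = stats.ttest_ind([5.1, 4.9, 5.4], [6.2, 6.0, 6.4])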
4. The article makes no mention of an explicit normality check, but it is very important to carry one out. Since checking the entire population is rarely feasible, the check is run on a sample.
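A minimal sample-based check, here with the Shapiro-Wilk test from SciPy (one common choice of test, not one prescribed by the article):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=200)  # stand-in for a data sample

# Shapiro-Wilk test; the null hypothesis is that the sample is normal
stat, p = stats.shapiro(sample)
print("looks normal" if p > 0.05 else "deviates from normal")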
5. Treatment of missing values can be as simple as deletion or mean/median imputation, or as complex as predicting the missing values from the non-missing records, or filling them with randomly drawn values.
6. Where possible, correct the missing data at the source itself and avoid imputation.
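A quick sketch of the simple options from point 5 with pandas, on a hypothetical column; the model-based route is only hinted at in comments:

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [40.0, np.nan, 55.0, 60.0, np.nan, 52.0]})

dropped = df.dropna()                                  # simple: deletion
filled = df.fillna({"income": df["income"].median()})  # simple: median fill

# More complex: model-based imputation from the non-missing records
# (sketch only; requires scikit-learn)
# from sklearn.impute import KNNImputer
# filled = pd.DataFrame(KNNImputer().fit_transform(df), columns=df.columns)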
7. An outlier can be real or artificial. If real, treat that set of records separately; if artificial, treat it in a similar fashion to missing data.
8. Commonly observed causes are experimental error, missing observations, measurement error, data entry error, intentional outliers (such as self-reported height or weight), data processing error, and sampling error.
9. Missing data and outliers can have a drastic effect on the model.
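One common detection rule is the 1.5 * IQR box-plot fence from point 2; the sketch below (toy data, my own illustration) flags such values and treats an artificial outlier like missing data, per point 7:

import pandas as pd

s = pd.Series([12, 13, 14, 15, 14, 13, 90])  # 90 looks artificial

# Flag values beyond 1.5 * IQR from the quartiles (the box-plot rule)
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Artificial outliers treated like missing data: here, median replacement
cleaned = s.mask(s.isin(outliers), s.median())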
10. Variable transformation: log (positive data), square root (zero or positive data), cube root (negative, zero, or positive data), and binning. This is done to handle non-linearity and curvilinear spread, and mainly to reduce skewness and kurtosis so the distribution is closer to normal.
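A minimal sketch of these transformations with NumPy and pandas, on a toy right-skewed series:

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 4.0, 8.0, 64.0])  # right-skewed, strictly positive

log_t = np.log(s)    # log: strictly positive data only
sqrt_t = np.sqrt(s)  # square root: zero or positive data
cbrt_t = np.cbrt(s)  # cube root: negative, zero, or positive data
bins = pd.cut(s, bins=3, labels=["low", "mid", "high"])  # binning

print(s.skew(), log_t.skew())  # skewness should drop after the log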
11. Variable creation: deriving weekday, weekend, and holiday flags from a date field; dummy variables for categories, such as separate 1/0 indicator columns for male and female; derived variables such as an age or marital-status hint from the salutation (Mr, Miss, Mrs).
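A short pandas sketch of each creation technique; the data and the salutation mapping are hypothetical illustrations:

import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-08"]),
    "gender": ["M", "F", "M"],
    "salutation": ["Mr", "Miss", "Mrs"],
})

# Date-derived variables
df["weekday"] = df["date"].dt.day_name()
df["is_weekend"] = df["date"].dt.dayofweek >= 5  # 5 = Saturday, 6 = Sunday

# Dummy variables: one 0/1 indicator column per category
df = pd.concat([df, pd.get_dummies(df["gender"], prefix="gender")], axis=1)

# Derived variable from the salutation (hypothetical mapping, illustrative only)
df["is_married_female"] = df["salutation"] == "Mrs"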
Reference:
https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/