Machine Learning Cheat Sheet — Data Processing Techniques
Skewed Data
Outliers affect the distribution. If a value is significantly below the expected range, it will drag the distribution to the left, making the graph left-skewed or negative. Alternatively, if a value is significantly above the expected range, it will drag the distribution to the right, making the graph right-skewed or positive.
There are different ways to handle skewed data:
- Log Function + 1, Normalization
- Hyperbolic Tangent
- Percentile Linearization
Data Normalization
For tree based models, we may not need data normalization;
For linear models, we need to normalize the data, so that all the feature values fall in range (0, 1). Otherwise, the model prediction results will be biased on the features with large values.
Disadvantage:
Data normalization is sensitive to outliers.
One-hot Encoding
Convert categorical data into binary variables. For example, convert feature gender into two columns, male and female, with value 0 or 1.
Imbalanced Data Set
Data is not well distributed among different classes. For example, only 0.1% of the transactions are fraud.