Machine Learning Cheat Sheet — Data Processing Techniques

Data Pre-processing

Y Tech
5 min read · Jun 27, 2019

Skewed Data

Outliers affect the shape of a distribution. A value far below the expected range stretches the tail to the left, making the distribution left-skewed (negatively skewed). A value far above the expected range stretches the tail to the right, making it right-skewed (positively skewed).
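To make the effect of a single outlier concrete, here is a minimal sketch that computes the moment-based skewness coefficient with the Python standard library. The helper name `skewness` and the sample values are illustrative assumptions, not part of the original article.

```python
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed standardized deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

# Made-up sample data: a symmetric set, then the same set with one outlier.
symmetric = [1, 2, 3, 4, 5]
with_high_outlier = symmetric + [50]   # one value far above the range
with_low_outlier = symmetric + [-50]   # one value far below the range

print(skewness(symmetric))          # approximately zero
print(skewness(with_high_outlier))  # positive: right-skewed
print(skewness(with_low_outlier))   # negative: left-skewed
```

The sign of the coefficient matches the direction of the long tail, which is why one extreme value is enough to flip a symmetric sample into a skewed one.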

There are different ways to handle skewed data:

  • Log transform, log(x + 1), then normalization
  • Hyperbolic Tangent
  • Percentile Linearization
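The three techniques listed above can be sketched in plain Python as follows. This is one possible reading, not a definitive implementation: the function names, the `scale` parameter of the tanh transform, and the interpretation of percentile linearization as "fraction of values less than or equal to x" are all assumptions made for illustration.

```python
import bisect
import math

def log1p_transform(values):
    """Apply log(x + 1); assumes every input is non-negative."""
    return [math.log1p(v) for v in values]

def tanh_transform(values, scale):
    """Squash values into (-1, 1). `scale` is a hypothetical tuning knob."""
    return [math.tanh(v / scale) for v in values]

def percentile_linearize(values):
    """Replace each value with the fraction of values less than or equal to it."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect.bisect_right(ordered, v) / n for v in values]

# Made-up right-skewed sample: four typical values and one large outlier.
incomes = [20, 25, 30, 35, 1000]

print(log1p_transform(incomes))
print(tanh_transform(incomes, scale=100))
print(percentile_linearize(incomes))  # [0.2, 0.4, 0.6, 0.8, 1.0]
```

In the percentile version the outlier lands at 1.0, only one step above its neighbor, because only the ordering of the values is kept.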

Data Normalization

For tree-based models, data normalization is usually unnecessary, because splits depend only on the ordering of feature values.
For linear models, normalize the data so that all feature values fall in the range [0, 1]. Otherwise, features with large values can dominate the model's predictions.

Disadvantage:
Data normalization is sensitive to outliers. A single extreme value stretches the range and compresses all the other values into a narrow band.
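As a rough illustration, the sketch below implements min-max scaling, one common way to map values into [0, 1], and shows how one outlier affects it. The function name `min_max_scale` and the sample data are assumptions for this example; the article does not say which normalization formula it has in mind.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; return zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

clean = [10, 20, 30, 40, 50]
with_outlier = clean + [1000]

print(min_max_scale(clean))         # [0.0, 0.25, 0.5, 0.75, 1.0]
print(min_max_scale(with_outlier))  # the first five values all fall below 0.05
```

With the outlier present, the five original values are squeezed into a small fraction of the [0, 1] range, which is the sensitivity described above.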

One-hot Encoding

Convert a categorical feature into binary indicator columns, one per category. For example, the feature gender becomes two columns, male and female, each holding 0 or 1.
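The gender example can be sketched in a few lines of plain Python. The helper `one_hot` and the sample rows are hypothetical; in practice a library routine would usually do this, but the logic is the same.

```python
def one_hot(values):
    """Return a dict mapping each category to its 0/1 indicator column."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Made-up sample column.
gender = ["male", "female", "female", "male"]
encoded = one_hot(gender)

print(encoded["male"])    # [1, 0, 0, 1]
print(encoded["female"])  # [0, 1, 1, 0]
```

Each row has exactly one 1 across the generated columns, which is where the name "one-hot" comes from.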

Imbalanced Data Set

The classes are unevenly represented in the data. For example, only 0.1% of the transactions are fraudulent.
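A quick way to check for imbalance is to compute the share of each class. The sketch below uses made-up labels that match the 0.1% fraud example; the function name `class_distribution` and the label strings are assumptions for illustration.

```python
from collections import Counter

def class_distribution(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical data set: 1 fraudulent transaction out of 1000.
labels = ["fraud"] * 1 + ["legit"] * 999
dist = class_distribution(labels)

print(dist["fraud"])  # 0.001
print(dist["legit"])  # 0.999

# Accuracy of a model that always predicts the majority class.
majority_baseline_accuracy = max(dist.values())
print(majority_baseline_accuracy)  # 0.999
```

The last line shows why imbalance matters: a model that never flags fraud would still be 99.9% accurate on this data, so accuracy alone says little about how well the rare class is detected.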
