Data Preprocessing & Feature Engineering

Data preprocessing techniques & pipeline for ML

1. Data Quality & Data Leakage Checks
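A minimal sketch of basic quality checks with pandas, using a hypothetical toy DataFrame (the column names are made up for illustration):

```python
import pandas as pd

# Hypothetical toy dataset for illustration
df = pd.DataFrame({
    "age": [25, 25, 40, None],
    "city": ["NY", "NY", "LA", "SF"],
})

# Count missing values per column
print(df.isna().sum())

# Count fully duplicated rows (here the first two rows match)
print(df.duplicated().sum())  # → 1
```

Checking for duplicates matters for leakage too: if duplicated rows end up on both sides of a train/test split, the test score is inflated.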

2. Handling Missing Data

Tip

To prevent data leakage, always perform imputation after splitting your data into training and test sets: fit the imputer on the training set only, then apply it to both sets.
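The tip above can be sketched with scikit-learn's `SimpleImputer` (toy data, mean imputation chosen as an example strategy):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [np.nan]])
y = np.array([0, 1, 0, 1, 0, 1])

# Split first, then fit the imputer on the training portion only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

imputer = SimpleImputer(strategy="mean")     # replace NaN with the train-set mean
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)           # reuse train statistics: no leakage
```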

3. Handling Outliers
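One common detection rule is the 1.5 × IQR fence; a sketch with pandas on made-up data:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as outliers
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # → [95]
```

Whether to drop, cap (winsorize), or keep flagged values depends on the domain; the rule only flags candidates.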

4. Splitting Datasets into Training and Test Sets
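A sketch with scikit-learn's `train_test_split` on toy data; `stratify` is included because it keeps the class ratio equal in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# 80/20 split; stratify keeps the class ratio in both sets,
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # → (8, 2) (2, 2)
```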

5. Categorical Encoding

= converting categorical columns to numerical columns

Note

In pandas, using the category data type instead of object saves a ton of memory for low-cardinality string columns.
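The memory difference can be measured directly (hypothetical column with only three distinct values repeated many times):

```python
import pandas as pd

# A low-cardinality string column repeated many times
s = pd.Series(["red", "green", "blue"] * 100_000)

as_object = s.memory_usage(deep=True)
as_category = s.astype("category").memory_usage(deep=True)
print(as_object, as_category)  # the category version is far smaller
```

Internally, `category` stores each string once and keeps only small integer codes per row.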

Common approaches

Other techniques
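Two of the most common approaches, ordinal encoding for ordered categories and one-hot encoding for unordered ones, can be sketched with pandas (toy columns invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["S", "M", "L", "M"],
    "color": ["red", "blue", "red", "green"],
})

# Ordinal encoding: "size" has a natural order, so map it explicitly
df["size_enc"] = df["size"].map({"S": 0, "M": 1, "L": 2})

# One-hot encoding: "color" has no order, so expand it into dummy columns;
# drop_first=True avoids the redundant (collinear) dummy column
df = pd.get_dummies(df, columns=["color"], drop_first=True)
print(df)
```

For production pipelines, scikit-learn's `OneHotEncoder`/`OrdinalEncoder` are usually preferred over `get_dummies` because they remember the categories seen at fit time.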

6. Feature Scaling

= a method used to normalize the range of independent variables or features of data

Standardization

$x' = \dfrac{x - \bar{x}}{\sigma}$

Normalization

$x' = \dfrac{x - \min(x)}{\max(x) - \min(x)}$
Which to choose, from The Hundred-Page Machine Learning Book

  • unsupervised learning algorithms, in practice, more often benefit from standardization than from normalization
  • standardization is also preferred for a feature if the values this feature takes are distributed close to a normal distribution (so-called bell curve)
  • again, standardization is preferred for a feature if it can sometimes have extremely high or low values (outliers); this is because normalization will “squeeze” the normal values into a very small range
  • In all other cases, normalization is preferable.
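The two formulas above map directly onto scikit-learn's `StandardScaler` and `MinMaxScaler`; a sketch on toy data:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Standardization: (x - mean) / std
X_std = StandardScaler().fit_transform(X)

# Normalization (min-max): (x - min) / (max - min)
X_norm = MinMaxScaler().fit_transform(X)

print(X_std.mean(), X_std.std())   # mean ≈ 0, std ≈ 1
print(X_norm.min(), X_norm.max())  # min = 0, max = 1
```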

Why use feature scaling

It is optional, but it speeds up gradient descent and numerical computation; it is useful in at least three circumstances:

  • algorithms trained with gradient descent (e.g., linear/logistic regression, neural networks) converge faster on scaled features
  • distance-based algorithms (e.g., k-NN, k-means, SVM) would otherwise be dominated by the features with the largest ranges
  • methods sensitive to feature variance (e.g., PCA, L1/L2 regularization) treat features more fairly when they are on the same scale

Important

  • don’t apply feature scaling to dummy variables!
  • to avoid data leakage, don’t fit the scaler on the combined data. Best approach: fit the scaler on the training set, then apply the same fitted scaler to both the training and test sets.
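The second point can be sketched as follows (toy data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [10.0]])

scaler = StandardScaler().fit(X_train)  # learn mean/std from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)     # same train statistics applied to test data
```

Note that the test set is never passed to `fit`; its values (even extreme ones like 10.0) do not influence the scaling parameters.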

7. Transforming Skewed Feature Distributions
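A common fix for right-skewed features is a log transform; a sketch with made-up, income-like data (`log1p` is used because it handles zeros safely):

```python
import numpy as np
import pandas as pd

# Right-skewed data (hypothetical, e.g. incomes)
x = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 20, 100], dtype=float)

x_log = np.log1p(x)  # log(1 + x), safe for zero values

# The sample skewness drops substantially after the transform
print(x.skew(), x_log.skew())
```

Alternatives include square-root and Box-Cox/Yeo-Johnson transforms (scikit-learn's `PowerTransformer`).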

8. Feature Engineering

= selecting, extracting, and transforming the most relevant features from the available data, after the raw data has been cleaned and normalized, in order to build better models
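Two typical examples, extracting parts of a datetime and combining existing columns into a ratio, sketched with pandas on a hypothetical transactions table:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-05 09:00", "2024-01-06 18:30"]),
    "price": [100.0, 250.0],
    "quantity": [4, 10],
})

# Extract parts of a datetime as new features
df["hour"] = df["timestamp"].dt.hour
df["is_weekend"] = df["timestamp"].dt.dayofweek >= 5  # Sat=5, Sun=6

# Combine existing columns into a more informative ratio
df["unit_price"] = df["price"] / df["quantity"]
print(df)
```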

Best practice for feature engineering