Best Approaches for Missing Data Imputation in scikit-learn ColumnTransformer Class:

Replace Missing Numeric Values with Predictions

Question

You are a machine learning consultant who has been contracted to evaluate and correct a model built by a client's machine learning team.

When the team fits the model, it performs poorly against their selected metric.

The model is built using scikit-learn and uses the RandomForest algorithm.

After doing some investigation you see that the data source has missing values in both numeric and categorical features and the machine learning team chose the strategy of dropping the features with missing data.

You want to use the scikit-learn ColumnTransformer class to transform the missing data, specifically replacing the missing data in the numeric and categorical data features using imputation.

You have decided to replace the numeric missing values with predictions for the missing values, and the categorical missing values with the most frequent value in the feature.

Which of the following are the best approaches to achieving your goal? (Select TWO)

Answers

Explanations


A. Create a one-step preprocessing transformer for the numerical missing values that uses a SimpleImputer using an ExtraTreesRegressor estimator.

B. Create a one-step preprocessing transformer for the numerical missing values that uses a SimpleImputer using the constant strategy.

C. Create a two-step preprocessing transformer for the categorical missing values that uses a SimpleImputer using the most_frequent strategy then uses the OneHotEncoder in step two to encode the categorical data.

D. Create a two-step preprocessing transformer for the categorical missing values that uses an IterativeImputer using the most_frequent strategy then uses the OneHotEncoder in step two to encode the categorical data.

E. Create a one-step preprocessing transformer for the numerical missing values that uses a KNNImputer.

Correct Answers: C and E.

Option A is incorrect.

The scikit-learn SimpleImputer does not accept an estimator; it supports only the mean, median, most_frequent, and constant strategies. Estimator-based imputation with a model such as ExtraTreesRegressor is provided by the IterativeImputer class instead.
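To see the distinction, here is a minimal sketch of what estimator-based imputation actually looks like in scikit-learn: the estimator is passed to IterativeImputer, not SimpleImputer (the toy array is illustrative only):

```python
import numpy as np

# IterativeImputer is experimental and must be enabled before import.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.impute import IterativeImputer

# Estimator-based imputation: each feature's missing values are
# predicted from the other features by the supplied regressor.
imputer = IterativeImputer(
    estimator=ExtraTreesRegressor(n_estimators=10, random_state=0),
    random_state=0,
)

X = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0], [8.0, 8.0]])
X_filled = imputer.fit_transform(X)  # no NaNs remain after imputation
```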

Option B is incorrect.

This approach will replace all of your missing numerical values with 0, the default fill_value for the constant strategy on numeric data.

This will not achieve your goal of replacing the missing numeric values with predictions for the missing values.
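A short sketch of the behavior, using a toy array for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# With strategy="constant" and no fill_value given, numeric NaNs become 0.
imputer = SimpleImputer(strategy="constant")
X = np.array([[1.0, np.nan], [np.nan, 4.0]])
X_filled = imputer.fit_transform(X)
# X_filled == [[1.0, 0.0], [0.0, 4.0]] -- a fixed value, not a prediction
```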

Option C is correct.

Using the scikit-learn SimpleImputer with the most_frequent strategy and then using the OneHotEncoder class will get the results you are trying to achieve.
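A minimal sketch of that two-step transformer, using an illustrative single-column array:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Step 1 fills missing categories with the most frequent value;
# step 2 one-hot encodes the now-complete column.
categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

X = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)
encoded = categorical_pipeline.fit_transform(X)
# The NaN is imputed as "red" (the mode) before encoding, so the
# third row encodes into the "red" column.
```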

Option D is incorrect.

You could set the IterativeImputer class's initial_strategy parameter to most_frequent, but the IterativeImputer class is released as experimental and must be explicitly enabled before it can be imported.

The SimpleImputer class using the most_frequent strategy is a better choice.

Option E is correct.

Using the scikit-learn KNNImputer class to impute your numeric missing values gives you what you need: each missing value is replaced with an estimate computed from the nearest neighboring samples.
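A minimal sketch of KNN-based imputation on an illustrative toy array:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Each missing value is replaced by the mean of that feature across the
# k nearest neighbors, found with a NaN-aware Euclidean distance.
X = np.array([[1.0, 2.0], [3.0, 4.0], [np.nan, 6.0], [8.0, 8.0]])
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
# The NaN row (second feature = 6.0) is closest to the rows with second
# feature 4.0 and 8.0, so the estimate is (3.0 + 8.0) / 2 = 5.5
```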

Reference:

Please see the Scikit-learn modules page titled sklearn.impute.SimpleImputer (https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html), the Scikit-learn modules page titled sklearn.impute.IterativeImputer (https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer), the Scikit-learn modules page titled sklearn.impute.KNNImputer (https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer), the Scikit-learn user guide page titled 6.4 Imputation of missing values (https://scikit-learn.org/stable/modules/impute.html), and the Kaggle page titled Pipelines (https://www.kaggle.com/alexisbcook/pipelines).

The goal is to use the scikit-learn ColumnTransformer class to replace the missing data in the numeric and categorical features using imputation: the numeric missing values should be replaced with predictions, and the categorical missing values with the most frequent value in the feature. Evaluating each option against this goal:

A. Create a one-step preprocessing transformer for the numerical missing values that uses a SimpleImputer using an ExtraTreesRegressor estimator. This approach is incorrect because SimpleImputer has no estimator parameter: it supports only the mean, median, most_frequent, and constant strategies. Estimator-based imputation with a model such as ExtraTreesRegressor is done with the IterativeImputer class, so this option describes an API that does not exist.

B. Create a one-step preprocessing transformer for the numerical missing values that uses a SimpleImputer using the constant strategy. This approach is incorrect because the constant strategy replaces every missing value with a fixed fill value (0 by default for numeric data). While simple and easy to implement, it does not use the available data to predict the missing values, so it does not meet the stated goal.

C. Create a two-step preprocessing transformer for the categorical missing values that uses a SimpleImputer using the most_frequent strategy then uses the OneHotEncoder in step two to encode the categorical data. This approach is a good choice for the categorical features. It first uses the SimpleImputer with the most_frequent strategy to replace the missing values with the most frequent value in the feature. Then it uses the OneHotEncoder to encode the categorical data. This approach ensures that the missing values are replaced with a meaningful value that is representative of the feature.

D. Create a two-step preprocessing transformer for the categorical missing values that uses an IterativeImputer using the most_frequent strategy then uses the OneHotEncoder in step two to encode the categorical data. This approach is not a good choice. IterativeImputer is a model-based method designed for numeric data; most_frequent is only available as its initial_strategy, which seeds the iterations rather than producing the final imputed values, and the class is still released as experimental. SimpleImputer with the most_frequent strategy is the better choice for categorical features.

E. Create a one-step preprocessing transformer for the numerical missing values that uses a KNNImputer. This approach is a good choice for the numeric features. The KNNImputer estimates each missing value from the values of that feature in the nearest neighboring samples, which satisfies the goal of replacing missing numeric values with predictions. It can be computationally expensive on large datasets, but it directly meets the requirement.

In summary, the best approaches to achieving the goal of transforming the missing data in this scenario are C and E. Approach C uses the SimpleImputer with the most_frequent strategy to replace the missing values in the categorical features, followed by one-hot encoding, and approach E uses the KNNImputer to replace the missing numeric values with predictions.
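Putting the two correct options together, here is a minimal sketch of the full ColumnTransformer. The DataFrame, column names, and values are hypothetical, chosen only to illustrate the wiring:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with missing numeric and categorical values.
X = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "income": [50000.0, 62000.0, np.nan, 58000.0],
    "color": ["red", np.nan, "blue", "red"],
})

categorical_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", KNNImputer(n_neighbors=2), ["age", "income"]),  # option E
    ("cat", categorical_pipeline, ["color"]),                # option C
])

X_prep = preprocessor.fit_transform(X)  # no missing values remain
```

The fitted preprocessor can then be placed in front of the RandomForest model inside a single Pipeline, so imputation is refit on each training fold.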