Data Preprocessing Methods for Categorical Values and Missing Values in Machine Learning | AWS Certified Machine Learning Exam Answer


Question

You work as a machine learning specialist for a scientific research lab that analyzes fossils found in geological research digs worldwide.

You are currently working on a project that is analyzing bones of ancient mammals found during archaeological excavations in Africa.

For each specimen, the archaeologists provide exact data: measurements, density, skeletal structural component, etc.

Your SageMaker Linear Learner model predicts the age of the specimen based on the collected data.

This data needs to be sanitized and categorized before you can feed it into your inference engine to get the estimated age.

You are using the SageMaker built-in Scikit-learn library to do your data preprocessing.

You need to transform categorical values such as skeletal components (femur, skull, rib cage, etc.) into numerical values.

You also need to replace missing values with meaningful estimates.

Which Scikit-learn library methods should you use to perform these data preprocessing tasks? (Select TWO)

Answers

A. Normalizer

B. StandardScaler

C. SimpleImputer

D. Binarizer

E. OneHotEncoder

Explanations

Correct Answers: C and E.

Option A is incorrect.

The Scikit-learn Normalizer rescales each sample (row) to unit norm.

You need to transform categorical values into numerical representations, and you need to replace missing values.

Option B is incorrect.

The Scikit-learn StandardScaler standardizes each feature by removing the mean and scaling to unit variance.

You need to transform categorical values into numerical representations, and you need to replace missing values.

Option C is correct.

The SimpleImputer completes or estimates missing values.

This is one of the two preprocessing tasks you need to perform.

Option D is incorrect.

The Scikit-learn Binarizer sets feature values to 0 or 1 according to a threshold.

You need to transform categorical values into numerical values that can represent many different categories, and you need to replace missing values.

Option E is correct.

The OneHotEncoder encodes categorical features into a one-hot numeric array with each entry in the array representing a category.

There are as many entries in the array as there are categories in the feature.

A 1 in a given array position indicates that the sample belongs to the corresponding category; every other position is 0.

References:

AWS Machine Learning Blog: Preprocess input data before making predictions using Amazon SageMaker inference pipelines and Scikit-learn (https://aws.amazon.com/blogs/machine-learning/preprocess-input-data-before-making-predictions-using-amazon-sagemaker-inference-pipelines-and-scikit-learn/)

Amazon SageMaker Examples: Inference Pipeline with Scikit-learn and Linear Learner (https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_inference_pipeline/Inference%20Pipeline%20with%20Scikit-learn%20and%20Linear%20Learner.ipynb)

Amazon SageMaker Developer Guide: Use Scikit-learn with Amazon SageMaker (https://docs.aws.amazon.com/sagemaker/latest/dg/sklearn.html)

Scikit-learn API page: sklearn.preprocessing.OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)

Scikit-learn API page: sklearn.impute.SimpleImputer (https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)

Scikit-learn API page: API Reference (https://scikit-learn.org/stable/modules/classes.html)

The two Scikit-learn library methods that should be used for the data preprocessing tasks are SimpleImputer and OneHotEncoder.
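As a rough sketch of how these two transformers might be combined in one preprocessing step, the example below uses ColumnTransformer to impute the numeric columns and one-hot encode the categorical column. The column names (length_cm, density_g_cm3, component), the sample values, and the choice of the median strategy are illustrative assumptions, not part of the exam scenario.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical specimen records; names and values are made up for illustration.
specimens = pd.DataFrame({
    "length_cm": [42.0, np.nan, 18.5, 30.0],
    "density_g_cm3": [1.9, 1.7, np.nan, 1.8],
    "component": ["femur", "skull", "rib cage", "femur"],
})

preprocess = ColumnTransformer(
    transformers=[
        # Fill missing numeric values with the column median (assumed strategy).
        ("impute", SimpleImputer(strategy="median"), ["length_cm", "density_g_cm3"]),
        # Expand the categorical column into one binary column per category.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["component"]),
    ],
    sparse_threshold=0.0,  # always return a dense NumPy array
)

# Two imputed numeric columns followed by three one-hot columns.
features = preprocess.fit_transform(specimens)
print(features.shape)
```

The resulting array contains only numbers and no missing entries, which is the form a downstream linear model expects.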

SimpleImputer fills in missing values in a dataset. It replaces each missing entry with a statistic computed from the observed values in the same column, such as the mean, the median, or the most frequent value (the mode), or with a fixed constant. Imputing keeps incomplete records usable rather than discarding them, and most estimators cannot handle missing values directly. In this scenario, SimpleImputer replaces the missing values with meaningful estimates before the dataset is fed into the inference engine to obtain the estimated age of the specimen.
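A minimal sketch of SimpleImputer on a small numeric array, comparing two strategies. The values are invented for illustration; which strategy is appropriate depends on the real data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up measurements with two missing entries (np.nan).
X = np.array([
    [10.0, 1.0],
    [np.nan, 3.0],
    [30.0, np.nan],
    [50.0, 3.0],
])

# Replace each missing entry with the mean of the observed values in its column.
mean_imputer = SimpleImputer(strategy="mean")
X_mean = mean_imputer.fit_transform(X)

# Replace each missing entry with the most frequent observed value in its column.
mode_imputer = SimpleImputer(strategy="most_frequent")
X_mode = mode_imputer.fit_transform(X)

print(mean_imputer.statistics_)  # per-column fill values learned during fit
print(X_mean)
print(X_mode)
```

The statistics are learned in fit, so the same fill values can be reused when transforming new records at inference time.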

OneHotEncoder transforms categorical values into numerical values. Categorical values such as skeletal components (femur, skull, rib cage, etc.) must be converted to numbers before the machine learning model can process them. OneHotEncoder maps each categorical value to a binary vector in which each column represents one possible category. The column corresponding to the observed category is set to 1, and all other columns are set to 0. Unlike assigning each category an arbitrary integer, this representation does not impose an artificial ordering on the categories, which matters for a linear model. In this scenario, OneHotEncoder transforms categorical values such as skeletal components into numerical values before the dataset is fed into the inference engine to obtain the estimated age of the specimen.
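A minimal sketch of OneHotEncoder applied to the skeletal components named in the question. The handle_unknown="ignore" setting and the unseen "tibia" value are assumptions added to show what happens with a category that was not present during fit.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature, one row per specimen.
components = np.array([["femur"], ["skull"], ["rib cage"], ["femur"]])

# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")

# The default output is a SciPy sparse matrix; convert it to a dense array to inspect it.
encoded = encoder.fit_transform(components).toarray()

# Categories are sorted, so the columns are: femur, rib cage, skull.
categories = encoder.categories_[0]

# A component that was not seen during fit (hypothetical value).
unseen = encoder.transform(np.array([["tibia"]])).toarray()

print(categories)
print(encoded)
print(unseen)
```

Each row contains exactly one 1, in the column of its category, and the row for the unseen category contains only zeros.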

Normalizer scales individual samples to unit norm. That is, each sample (each row of the dataset) is rescaled independently so that its norm, by default the Euclidean (L2) norm, equals 1. This is useful when only the direction of a sample vector matters, for example when comparing sparse rows by cosine similarity. However, in this scenario Normalizer is not the right choice because it neither encodes categorical values nor fills in missing values.
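To illustrate the row-wise behavior, here is a small sketch with made-up values. It shows that Normalizer rescales each row to length 1, so two rows that point in the same direction become identical.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up numeric samples; the first and third rows point in the same direction.
X = np.array([
    [3.0, 4.0],
    [1.0, 0.0],
    [6.0, 8.0],
])

# Rescale every row independently to unit Euclidean (L2) norm.
X_unit = Normalizer(norm="l2").fit_transform(X)

# Each row now has length 1.
row_norms = np.linalg.norm(X_unit, axis=1)

print(X_unit)
print(row_norms)
```

Because the magnitude of each row is discarded, this transformer changes the scale of existing numbers only; it does not create numbers from categories or fill in gaps.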

StandardScaler scales features by removing the mean and scaling to unit variance, column by column. This is useful when features have different units or scales. However, in this scenario StandardScaler does not address either task: it neither encodes categorical values nor replaces missing values.
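A short sketch of StandardScaler on invented values. The second part illustrates that a missing value is passed through unchanged rather than filled in, which is why this transformer does not solve the imputation task.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales.
X = np.array([
    [10.0, 1.5],
    [20.0, 1.7],
    [30.0, 1.9],
])

# Center each column on zero and scale it to unit variance.
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0), X_std.std(axis=0))

# A missing value is ignored when fitting and is still missing after transform.
X_missing = np.array([[1.0], [np.nan], [3.0]])
X_missing_std = StandardScaler().fit_transform(X_missing)
print(X_missing_std)
```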

Binarizer thresholds numeric features: values greater than the threshold become 1, and values less than or equal to it become 0. This is useful for converting a continuous variable into a binary one. However, in this scenario Binarizer is not suitable: it operates on numeric input and produces only two outcomes, so it cannot represent a categorical feature with many categories, and it does not replace missing values.
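For comparison, a minimal sketch of Binarizer on a made-up density column. The threshold of 1.8 is an arbitrary assumption chosen to show how a value equal to the threshold is treated.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up density values, one per specimen.
density = np.array([[1.5], [1.8], [2.1]])

# Values strictly greater than the threshold map to 1; all others map to 0.
binarizer = Binarizer(threshold=1.8)
flags = binarizer.fit_transform(density)

print(flags)
```

The output has only two possible values per feature, so a feature with three or more categories, such as the skeletal component, cannot be represented this way.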