Data Sanitization and Feature Preparation for Machine Learning | Best Practices | Exam MLS-C01


Question

You work as a machine learning specialist for a polling organization using US census data to predict whether a given polling respondent earns greater than $75,000.

Your company will then sell the polling prediction data to candidates running for various political office positions across the country.

You need to clean the polling data on which you wish to train your binary classification model.

Specifically, you need to remove duplicate rows containing erroneous data, transform the income column into a label column with two values, transform the age column into a categorical feature by binning it, scale the capital gain and capital loss columns, and finally split the data into train and test datasets.

Which of the options are the most efficient ways to achieve your data sanitizing and feature preparation? (Select TWO)

Answers

A. Create a SageMaker Processing job using the SageMaker Scala SDK with a Processing container leveraging the pandas PDLearnProcessor package that performs your required preprocessing, sanitizing, and feature preparation tasks and then splits the data into the training and test datasets.

B. Create a SageMaker Processing job using the SageMaker Python SDK with a Processing container leveraging the scikit-learn SKLearnProcessor package that performs your required preprocessing, sanitizing, and feature preparation tasks and then splits the data into the training and test datasets.

C. Create a SageMaker Processing job using the SageMaker Python SDK with a Data Wrangler container leveraging the scikit-learn SKLearnProcessor package that performs your required preprocessing, sanitizing, and feature preparation tasks and then splits the data into the training and test datasets.

D. Create a SageMaker Processing job using the SageMaker Python SDK with a Processing container leveraging the Spark PySparkProcessor package that performs your required preprocessing, sanitizing, and feature preparation tasks and then splits the data into the training and test datasets.

E. Create a SageMaker Processing job using the SageMaker Python SDK with a Processing container leveraging the SparkMLProcessor package that performs your required preprocessing, sanitizing, and feature preparation tasks and then splits the data into the training and test datasets.

Correct Answers: B and D.

Explanations

Option A is incorrect.

There is no SageMaker Scala SDK.

Also, there is no pandas PDLearnProcessor package.

SageMaker Processing jobs can be written in Python, using the SageMaker Python SDK.

You can leverage the PySparkProcessor, SparkJarProcessor, or SKLearnProcessor classes to perform your preprocessing, sanitizing, and feature preparation tasks and also split your data into training and test datasets.

Option B is correct.

SageMaker Processing jobs can be written in Python, using the SageMaker Python SDK. The SKLearnProcessor class lets you run a scikit-learn script in a managed Processing container to perform your preprocessing, sanitizing, and feature preparation tasks and split your data into training and test datasets.

Option C is incorrect.

There is no Data Wrangler container in the SageMaker Processing Job containers.

Option D is correct.

SageMaker Processing jobs can be written in Python, using the SageMaker Python SDK. The PySparkProcessor class lets you run a PySpark script in a managed Spark Processing container to perform the same preprocessing, sanitizing, and feature preparation tasks and split your data into training and test datasets.

Option E is incorrect.

There is no SparkMLProcessor package in the SageMaker Processing service.

References:

Please see the AWS SageMaker developer guide titled Data Processing with Apache Spark (https://docs.aws.amazon.com/sagemaker/latest/dg/use-spark-processing-container.html) and the AWS examples GitHub repository notebook titled Amazon SageMaker Processing jobs (https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.ipynb).

The task is to clean and prepare polling data for binary classification, including removing duplicates, transforming columns, and splitting the data into training and test datasets. We need to choose the most efficient options to achieve this using Amazon SageMaker, a fully managed machine learning service provided by AWS.
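To make the required transformations concrete, the preprocessing script itself could look roughly like the sketch below, using pandas and scikit-learn. Note that the column names, bin edges, and split ratio are illustrative assumptions, not the actual census schema:

```python
# Illustrative sketch only: column names ("age", "income", "capital_gain",
# "capital_loss"), the bin edges, and the 80/20 split are assumptions,
# not values taken from the actual census dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def prepare(df: pd.DataFrame):
    # 1. Remove duplicate rows with erroneous data.
    df = df.drop_duplicates().copy()

    # 2. Transform income into a two-value label column (> $75,000 or not).
    df["label"] = (df["income"] > 75_000).astype(int)
    df = df.drop(columns=["income"])

    # 3. Bin the continuous age column into a categorical feature.
    df["age_bucket"] = pd.cut(
        df["age"],
        bins=[0, 25, 45, 65, 120],
        labels=["young", "adult", "middle_aged", "senior"],
    )
    df = df.drop(columns=["age"])

    # 4. Scale the capital gain and capital loss columns.
    scaler = StandardScaler()
    df[["capital_gain", "capital_loss"]] = scaler.fit_transform(
        df[["capital_gain", "capital_loss"]]
    )

    # 5. Split into train and test datasets.
    return train_test_split(df, test_size=0.2, random_state=42)
```

Such a script would typically be saved (e.g. as preprocess.py) and handed to a Processing job, which runs it in a managed container against data staged from S3.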

Option A: Create a SageMaker Processing job using a SageMaker Scala SDK with Processing container leveraging the pandas PDLearnProcessor package that performs your required preprocessing sanitizing and feature preparation tasks and then splits the data into the training and test datasets.

This option is incorrect because neither a SageMaker Scala SDK nor a pandas PDLearnProcessor package exists. SageMaker Processing jobs are launched with the SageMaker Python SDK, and the pandas-style preprocessing described in the question would instead be handled by a real processor class such as the SKLearnProcessor.

Option B: Create a SageMaker Processing job using a SageMaker Python SDK with Processing container leveraging the scikit-learn SKLearnProcessor package that performs your required preprocessing sanitizing and feature preparation tasks and then splits the data into the training and test datasets.

This option uses the Python SDK and the scikit-learn SKLearnProcessor package to preprocess the data and split it into the training and test datasets. This is a good option as scikit-learn is a popular and widely used library for data preprocessing and machine learning. The SKLearnProcessor package is optimized for efficient processing on SageMaker and can be easily used with the Python SDK.
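As a rough sketch of how such a job might be launched with the SageMaker Python SDK (the role ARN, S3 paths, container version, and script name below are placeholders, not values from the question):

```python
# Hypothetical SKLearnProcessor job launch; all ARNs, paths, and the
# framework version are placeholders for illustration.
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(
    framework_version="1.2-1",  # scikit-learn container version (example)
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

processor.run(
    code="preprocess.py",  # script containing the cleaning/split logic
    inputs=[
        ProcessingInput(
            source="s3://my-bucket/census/raw",  # placeholder S3 path
            destination="/opt/ml/processing/input",
        )
    ],
    outputs=[
        ProcessingOutput(source="/opt/ml/processing/train"),
        ProcessingOutput(source="/opt/ml/processing/test"),
    ],
)
```

The Processing service stages the input from S3 into the container, runs the script, and uploads the train and test outputs back to S3.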

Option C: Create a SageMaker Processing job using a SageMaker Python SDK with Data Wrangler container leveraging the scikit-learn SKLearnProcessor package that performs your required preprocessing sanitizing and feature preparation tasks and then splits the data into the training and test datasets.

This option is incorrect because there is no Data Wrangler container among the SageMaker Processing job containers. SageMaker Data Wrangler is a separate, interactive data preparation tool in SageMaker Studio; the SKLearnProcessor runs in the standard scikit-learn Processing container, not a Data Wrangler container.

Option D: Create a SageMaker Processing job using a SageMaker Python SDK with Processing container leveraging the Spark PySparkProcessor package that performs your required preprocessing sanitizing and feature preparation tasks and then splits the data into the training and test datasets.

This option is correct. The PySparkProcessor package lets you run an Apache Spark preprocessing script as a SageMaker Processing job, performing the same cleaning, transformation, and splitting steps. Spark is especially useful if the census data is large, and it is a fully supported, efficient way to handle this workload.

Option E: Create a SageMaker Processing job using a SageMaker Python SDK with Processing container leveraging the SparkMLProcessor package that performs your required preprocessing sanitizing and feature preparation tasks and then splits the data into the training and test datasets.

This option is incorrect because there is no SparkMLProcessor package in the SageMaker Processing service. Spark-based processing is done with the PySparkProcessor or SparkJarProcessor instead.

In conclusion, the most efficient options for the data sanitizing and feature preparation tasks are B and D. Option B automates the data preparation process with the popular scikit-learn library via the SKLearnProcessor, while option D achieves the same result with Apache Spark via the PySparkProcessor, which also scales well if the dataset grows.