Real Estate Property Foreclosure Prediction with SageMaker Linear Learner Algorithm

Clean and Format Data for SageMaker Linear Learner Algorithm

Question

You work as a machine learning specialist at a firm that runs a web application that allows users to research and compare real estate properties worldwide.

You are working on a property foreclosure model to predict potential price drops.

You have decided to use the SageMaker Linear Learner algorithm.

Here is a small sample of the data you'll have to work with:

| Type  | Bedrooms | Area | Solar_Rating | Price  | Foreclosed |
| condo | 2        | 2549 | H            | 125400 | N          |
| house | 4        | 4124 | M            | 250250 | Y          |
| house | 3        | 3250 |              | 200000 | N          |
| condo | 1        | 900  | N            | 90250  | N          |
| condo | 2        | ?    | L            | 125400 | Y          |

In order to feed this data into your model, you will first need to clean and format it. Which of the following SageMaker built-in scikit-learn library transformers would you use to clean and format your data? Select 4.

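Answers

A. StandardScaler to encode the Solar_Rating feature
B. OneHotEncoder to encode the Area feature
C. SimpleImputer to complete the missing values in the Solar_Rating and Area features
D. OneHotEncoder to encode the Type feature
E. OrdinalEncoder to complete the missing values in the Solar_Rating and Area features
F. OrdinalEncoder to encode the Solar_Rating feature
G. LabelBinarizer to encode the Foreclosed feature
H. MinMaxScaler to encode the Foreclosed feature

Explanations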

Answers: C, D, F, and G.

Option A is incorrect.

From the scikit-learn API Reference: the StandardScaler transformer standardizes features by removing the mean and scaling to unit variance. It operates on numerical data, so it cannot encode the categorical Solar_Rating feature.

The OrdinalEncoder transformer is the better choice for Solar_Rating because its values are ordered: H > M > L > N.

Option B is incorrect.

The OneHotEncoder transforms nominal categorical features by creating a new binary column for each category.

The Area feature holds numerical (quantitative) data, which does not need categorical encoding.

Option C is correct.

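The Solar_Rating and Area features have missing data in some observations: one Solar_Rating cell is blank, and one Area cell holds the placeholder ?.

From the scikit-learn API Reference: the SimpleImputer transformer is used to complete missing values. Its default missing-value marker is NaN, so a placeholder such as ? has to be converted to NaN before imputing.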
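As a rough illustration of option C, the sketch below imputes the two columns from the sample. It assumes the ? placeholder and the blank cell were already converted to NaN when the data was loaded, and the choice of the median and most-frequent strategies is an assumption made for this example, not something the question prescribes.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Area column from the sample; the "?" placeholder is assumed to have been
# replaced by np.nan at load time.
area = np.array([[2549.0], [4124.0], [3250.0], [900.0], [np.nan]])

# Solar_Rating column; the blank cell is represented as np.nan.
solar = np.array([["H"], ["M"], [np.nan], ["N"], ["L"]], dtype=object)

# Numerical feature: fill the gap with the median of the observed values.
area_imputer = SimpleImputer(strategy="median")
area_filled = area_imputer.fit_transform(area)

# Categorical feature: fill the gap with the most frequent category.
# In this five-row sample every rating occurs once, so the fill value is
# effectively arbitrary; on real data the mode is more meaningful.
solar_imputer = SimpleImputer(strategy="most_frequent")
solar_filled = solar_imputer.fit_transform(solar)

print(area_filled.ravel())
print(solar_filled.ravel())
```

The median of the four observed Area values (900, 2549, 3250, 4124) is 2899.5, which is the value written into the last row.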

Option D is correct.

The Type feature is a good candidate for the OneHotEncoder transformer since the Type feature holds a limited number of categorical types.

The OneHotEncoder transforms nominal categorical features by creating a new binary column for each category.
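A minimal sketch of option D, using only the Type values from the sample. The variable names are illustrative; the encoder is left at its defaults, and the sparse result is converted to a dense array for display.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Type column from the sample, shaped as a single-feature 2D array.
types = np.array([["condo"], ["house"], ["house"], ["condo"], ["condo"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; densify it for printing.
type_encoded = encoder.fit_transform(types).toarray()

print(encoder.categories_[0])  # learned categories, in sorted order
print(type_encoded)            # one binary column per category
```

Each row gets a 1 in the column for its category and 0 elsewhere, so no artificial ordering is implied between condo and house.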

Option E is incorrect.

From the scikit-learn API Reference: the OrdinalEncoder transformer encodes categorical features as an integer array.

This encoder does not complete missing values.

Option F is correct.

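From the scikit-learn API Reference: the OrdinalEncoder transformer encodes categorical features as an integer array. Supplying the category order explicitly preserves the ordinal nature of the data; by default the encoder sorts the categories it finds, which for strings is alphabetical order.

Since H > M > L > N, the Solar_Rating feature has ordinal values.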
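The ordering H > M > L > N can be handed to the encoder explicitly. The sketch below is one way to do that, using the four ratings present in the sample; it also shows what the default, data-derived ordering would look like for comparison.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# The four Solar_Rating values that appear in the sample.
solar = np.array([["H"], ["M"], ["N"], ["L"]], dtype=object)

# Explicit order from lowest to highest, so that N < L < M < H maps to 0..3.
encoder = OrdinalEncoder(categories=[["N", "L", "M", "H"]])
solar_encoded = encoder.fit_transform(solar)

# For comparison: with the default settings the categories are sorted,
# which for these strings means alphabetical order, not rating order.
default_order = OrdinalEncoder().fit(solar).categories_[0]

print(solar_encoded.ravel())
print(default_order)
```

With the explicit list, H maps to 3 and N to 0; without it, the alphabetical order H, L, M, N would assign integers that do not reflect the rating scale.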

Option G is correct.

The Foreclosed feature holds one of two values, either 'Y' or 'N'.

Therefore, this feature is a good candidate for the LabelBinarizer.

From the scikit-learn API Reference: the LabelBinarizer transformer binarizes labels in a one-vs-all fashion.
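A short sketch of option G on the Foreclosed values from the sample. The variable names are illustrative only.

```python
from sklearn.preprocessing import LabelBinarizer

# Foreclosed column from the sample.
foreclosed = ["N", "Y", "N", "N", "Y"]

binarizer = LabelBinarizer()
# For a two-class label the result is a single column of 0s and 1s.
y = binarizer.fit_transform(foreclosed)

print(binarizer.classes_)
print(y.ravel())
```

The classes are sorted, so 'N' becomes 0 and 'Y' becomes 1, which gives a numeric 0/1 target suitable for a binary classifier.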

Option H is incorrect.

From the scikit-learn API Reference: the MinMaxScaler transformer transforms features by scaling each feature to a given range.

The Foreclosed feature has binary categorical data, either 'Y' or 'N', so it is better suited to the LabelBinarizer transformer.

Reference:

Please see the Amazon SageMaker developer guide titled Use Scikit-learn with Amazon SageMaker, and the scikit-learn API Reference.

To clean and format the data for use with the SageMaker Linear Learner algorithm, we need to preprocess the data using various scikit-learn transformers.

Here is the breakdown of the data preprocessing steps we need to take:

  1. Encode categorical features
  2. Impute missing values
  3. Scale numerical features

The data contains both categorical and numerical features, so we will use a combination of different transformers to preprocess the data.

Here are the transformers we would use:

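  1. OneHotEncoder, OrdinalEncoder, or LabelBinarizer to encode categorical features, depending on whether the feature is nominal, ordinal, or a binary label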
  2. SimpleImputer to impute missing values
  3. StandardScaler or MinMaxScaler to scale numerical features
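The three steps listed above can be combined into a single preprocessing object. The following is a sketch under several assumptions: the sample is held in a NumPy object array with missing cells already set to NaN, columns are addressed by position, Solar_Rating is treated as ordinal (N < L < M < H) as in the answer key, and the imputation strategies and the use of StandardScaler are choices made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Feature columns: Type, Bedrooms, Area, Solar_Rating, Price.
# Foreclosed is the label and is handled separately.
X = np.array(
    [
        ["condo", 2, 2549.0, "H", 125400.0],
        ["house", 4, 4124.0, "M", 250250.0],
        ["house", 3, 3250.0, np.nan, 200000.0],
        ["condo", 1, 900.0, "N", 90250.0],
        ["condo", 2, np.nan, "L", 125400.0],
    ],
    dtype=object,
)

# Numerical columns: impute with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Ordinal column: impute with the most frequent rating, then encode in order.
ordinal = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder(categories=[["N", "L", "M", "H"]])),
])

preprocess = ColumnTransformer(
    [
        ("type", OneHotEncoder(), [0]),    # Type -> 2 binary columns
        ("numeric", numeric, [1, 2, 4]),   # Bedrooms, Area, Price
        ("solar", ordinal, [3]),           # Solar_Rating -> 1 integer column
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(X)
print(features.shape)
```

The output has six columns: two for Type, three scaled numerical columns, and one ordinal code for Solar_Rating. Keeping the steps in one ColumnTransformer means the same fitted statistics can be reused when transforming new data.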

Let's go through each of the answer choices and see which transformers are appropriate:

A. StandardScaler to encode the Solar_Rating feature. StandardScaler is used to scale numerical features; it cannot encode categorical features like Solar_Rating. Therefore, A is not a correct answer.

B. OneHotEncoder to encode the Area feature. OneHotEncoder encodes categorical features, but Area is a numerical feature, so it should not be one-hot encoded. Therefore, B is not a correct answer.

C. SimpleImputer to complete the missing values in the Solar_Rating and Area features. This is a correct answer. SimpleImputer can impute missing values in numerical features like Area and in categorical features like Solar_Rating.

D. OneHotEncoder to encode the Type feature. This is a correct answer. OneHotEncoder is suited to nominal categorical features like Type.

E. OrdinalEncoder to complete the missing values in the Solar_Rating and Area features. OrdinalEncoder encodes categorical features as integers; it does not impute missing values. Therefore, E is not a correct answer.

F. OrdinalEncoder to encode the Solar_Rating feature. This is a correct answer. OrdinalEncoder encodes categorical features as integers, and the Solar_Rating values have a natural order (H > M > L > N) that the encoding can preserve.

G. LabelBinarizer to encode the Foreclosed feature. This is a correct answer. LabelBinarizer encodes a two-valued feature like Foreclosed ('Y' or 'N') as 0 and 1.

H. MinMaxScaler to encode the Foreclosed feature. MinMaxScaler is used to scale numerical features; it cannot encode categorical features like Foreclosed. Therefore, H is not a correct answer.

Therefore, the correct answers are C, D, F, and G, matching the answer key above. None of the listed options applies a scaler correctly, but scaling the numerical features remains a reasonable additional preprocessing step.

We could use StandardScaler to standardize numerical features to zero mean and unit variance, or MinMaxScaler to rescale them to a fixed range (0 to 1 by default). The choice between the two depends on the specific requirements of the model and the characteristics of the data.
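To make the difference between the two scalers concrete, the sketch below applies both to the Price column from the sample. It is only an illustration of what each scaler outputs, not a recommendation of one over the other.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Price column from the sample.
price = np.array([[125400.0], [250250.0], [200000.0], [90250.0], [125400.0]])

# StandardScaler: subtract the mean and divide by the standard deviation.
standardized = StandardScaler().fit_transform(price)

# MinMaxScaler: map the minimum to 0 and the maximum to 1 (default range).
rescaled = MinMaxScaler().fit_transform(price)

print(standardized.ravel())
print(rescaled.ravel())
```

After standardization the column has mean 0 and standard deviation 1, while min-max scaling pins the smallest price (90250) to 0 and the largest (250250) to 1.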