Direct Marketing Campaign for Receptive Customers

Prepare Data for XGBoost Algorithm

Question

You work for a retail clothing manufacturer that has a very active online web store.

You have been assigned the task of building a model to contact customers for a direct marketing campaign based on their predicted receptiveness to the campaign.

Some of your customers have been contacted in the past for other marketing campaigns.

You don't want to contact these previously contacted customers for this latest campaign. Before training the model, you need to clean your data and prepare it for the XGBoost algorithm you are going to use.

You have written your cleaning/preparation code in your SageMaker notebook.

Based on the following code, what happens on lines 19, 21, and 22? (Select THREE)

1  import sagemaker
2  import boto3
3  from sagemaker.predictor import csv_serializer
4  import numpy as np
5  import pandas as pd
6  from time import gmtime, strftime
7  import os
8  region = boto3.Session().region_name
9  smclient = boto3.Session().client('sagemaker')
10 from sagemaker import get_execution_role
11 role = get_execution_role()
12 bucket = 'sagemakerS3Bucket'
13 prefix = 'sagemaker/xgboost'
14 !wget -N https://.../bank.zip
15 !unzip -o bank.zip
16 data = pd.read_csv('./bank/bank-full.csv', sep=';')
17 pd.set_option('display.max_columns', 500)
18 pd.set_option('display.max_rows', 5)
19 data['no_previous_campaign'] = np.where(data['contacted'] == 999, 1, 0)
20 data['not_employed'] = np.where(np.in1d(data['job'], ['student', 'retired', 'unempl']), 1, 0)
21 model_data = pd.get_dummies(data)
22 model_data = model_data.drop(['duration', 'employee.rate', 'construction.price.idex', 'construction.confidence.idx', 'lifetime.rate', 'region'], axis=1)
23 train_data, validation_data, test_data = np.split(model_data.sample(frac=1, random_state=1729), [int(0.7 * len(model_data)), int(0.9 * len(model_data))])
24 pd.concat([train_data['y_yes'], train_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('train.csv', index=False, header=False)
25 pd.concat([validation_data['y_yes'], validation_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('validation.csv', index=False, header=False)
26 pd.concat([test_data['y_yes'], test_data.drop(['y_no', 'y_yes'], axis=1)], axis=1).to_csv('test.csv', index=False, header=False)
27 boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train/train.csv')).upload_file('train.csv')
28 boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'validation/validation.csv')).upload_file('validation.csv')

Answers

Explanations



Answers: C, D, F.

Option A is incorrect.

This option describes what happens on line 23, not what happens on lines 19, 21, or 22.

Option B is incorrect.

Line 19 does not set the attribute no_previous_campaign to 999.

It sets the attribute no_previous_campaign to 1 or 0 depending on whether the customer in the observation has been contacted in a previous campaign; a value of 999 in the contacted column indicates that there was no previous contact.

Option C is correct.

Line 19 sets the attribute no_previous_campaign to 1 when the contacted column equals 999, which indicates the customer has not been contacted via a previous campaign, and to 0 otherwise.

Option D is correct.

Line 21 uses the pandas library get_dummies method to convert the categorical attributes in the dataframe to dummy (or indicator) variables.

Option E is incorrect.

Line 21 does not convert empty attributes to dummy variables.

It uses the pandas library get_dummies method to convert the categorical attributes in the dataframe to dummy (or indicator) variables.

Option F is correct.

Line 22 removes (or drops) several features, presumably because you have deemed the features inconsequential to the training of your model.

Option G is incorrect.

Line 22, in this usage, calls the pandas drop method with axis=1, which removes features (columns), not observations (rows).

Reference:

Please see the NumPy numpy.where documentation (for line 19), the pandas get_dummies documentation (for line 21), and the pandas DataFrame.drop documentation (for line 22).

The code provided prepares the data for the XGBoost algorithm by cleaning, transforming, and splitting the dataset into training, validation, and test datasets. The following are the explanations for lines 19, 21, and 22:
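Although the question asks only about lines 19, 21, and 22, line 23 is where the shuffled dataset is cut into the three subsets. As a rough, illustrative sketch (the toy data and sizes below are invented; only the split pattern mirrors line 23):

import numpy as np
import pandas as pd

# Toy stand-in for model_data (illustrative values only)
model_data = pd.DataFrame({'x': range(10), 'y_yes': [0, 1] * 5})

# Shuffle the rows, then cut at the 70% and 90% marks -> 70/20/10 split
train_data, validation_data, test_data = np.split(
    model_data.sample(frac=1, random_state=1729),
    [int(0.7 * len(model_data)), int(0.9 * len(model_data))])

print(len(train_data), len(validation_data), len(test_data))  # 7 2 1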

Line 19:

data['no_previous_campaign'] = np.where(data['contacted'] == 999, 1, 0)

This line creates a new column called no_previous_campaign in the data DataFrame. The np.where() function sets the value of this column to 1 if the customer in the observation has not been contacted via a previous campaign (i.e., the value in the contacted column is 999), and to 0 otherwise.
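As a minimal, self-contained sketch of the same np.where() pattern (the toy contacted values below are invented for illustration, not taken from the real dataset):

import numpy as np
import pandas as pd

# 999 in 'contacted' stands for "never reached by a previous campaign"
data = pd.DataFrame({'contacted': [999, 3, 999, 12]})

# 1 where contacted == 999 (no previous campaign), 0 otherwise
data['no_previous_campaign'] = np.where(data['contacted'] == 999, 1, 0)

print(data['no_previous_campaign'].tolist())  # [1, 0, 1, 0]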

Line 21:

model_data = pd.get_dummies(data)

This line converts the categorical data in the data DataFrame to a set of indicator variables, also known as dummy variables. Dummy variables are used to represent categorical variables in regression analysis. Each category in a categorical variable is represented by a binary variable (0 or 1). For example, if the job column has three categories (student, retired, and unempl), three new columns (job_student, job_retired, and job_unempl) will be created in the model_data DataFrame. The value of each new column will be 1 if the original column had that category in that row, and 0 otherwise.
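For illustration only, here is a minimal sketch of pd.get_dummies() on a toy DataFrame (the column values are invented; the real bank dataset has many more columns and categories):

import pandas as pd

# Toy data with one categorical and one numeric column
data = pd.DataFrame({'job': ['student', 'retired', 'unempl', 'student'],
                     'age': [23, 67, 41, 25]})

# Numeric columns pass through; each 'job' category becomes its own indicator column
model_data = pd.get_dummies(data)

print(model_data.columns.tolist())
# ['age', 'job_retired', 'job_student', 'job_unempl']

Depending on the pandas version, the indicator columns hold 0/1 integers or True/False booleans, but either way each column flags membership in a single category.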

Line 22:

model_data = model_data.drop(['duration', 'employee.rate', 'construction.price.idex', 'construction.confidence.idx','lifetime.rate', 'region'], axis=1)

This line removes the features that are deemed inconsequential from the model_data DataFrame. The features that are removed are duration, employee.rate, construction.price.idex, construction.confidence.idx, lifetime.rate, and region. The drop() function is used to remove the columns from the DataFrame. The axis=1 parameter is used to indicate that the columns are being dropped.
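A minimal sketch of the same drop() pattern on a toy DataFrame (column names chosen only for illustration):

import pandas as pd

# Toy frame with one feature we want to discard
model_data = pd.DataFrame({'duration': [10, 20],
                           'age': [23, 67],
                           'y_yes': [1, 0]})

# axis=1 drops columns (features); axis=0 would drop rows (observations)
model_data = model_data.drop(['duration'], axis=1)

print(model_data.columns.tolist())  # ['age', 'y_yes']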