Feature Engineering for Optimal Training Runs in Azure ML Studio | Exam DP-100

Question

Your training dataset contains, among many others, the following attributes: row_id, transaction_date, transaction_value.

In order to optimize the training runs, you need to do some feature engineering on these data.

You are using the AutoML functionality in Azure ML Studio.

Which actions should you take?

Answers

Explanations

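A. Normalize the row_id values; transform the transaction_date column using custom Python code; replace missing transaction_value entries with random numbers.
B. Leave all the feature engineering tasks to AutoML.
C. Use the AutoML functionality to drop high-cardinality features, generate additional features from date-type columns, and impute missing values.
D. Drop the row_id column; derive additional year and month columns from transaction_date using custom Python code; remove rows with missing transaction_value entries.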

Answer: B.

Option A is incorrect because it makes no sense to normalize an attribute that holds serial-number-like values.

The attribute should be dropped instead.
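To illustrate why normalizing a serial identifier adds nothing, here is a minimal sketch in plain Python. The row_id values and the helper name min_max_normalize are made up for this example. After scaling, every row still has its own unique value that only encodes the row's position, so the column carries no pattern a model could generalize from.

```python
def min_max_normalize(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical serial identifiers, one per row.
row_ids = [1001, 1002, 1003, 1004, 1005]
normalized_ids = min_max_normalize(row_ids)

# The scaled values are evenly spaced and still unique per row:
# the column is as uninformative as before, only on a different scale.
print(normalized_ids)
print(len(set(normalized_ids)) == len(row_ids))
```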

Option B is CORRECT because AutoML's built-in featurization removes row_id, generates derived features from transaction_date, and fills missing values in transaction_value with the mean of the existing values.

All this is done automatically.
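The three transformations named above can be sketched by hand in plain Python, purely to show what each step does to the data. This is an illustration under assumptions, not AutoML's implementation: the sample rows, the function name featurize, and the particular derived date features (year, month, day, day_of_week) are chosen for this example only.

```python
from datetime import date
from statistics import mean

# Hypothetical sample rows; the second one has a missing transaction_value.
rows = [
    {"row_id": 1, "transaction_date": date(2023, 1, 15), "transaction_value": 120.0},
    {"row_id": 2, "transaction_date": date(2023, 2, 3), "transaction_value": None},
    {"row_id": 3, "transaction_date": date(2023, 2, 20), "transaction_value": 80.0},
]


def featurize(rows):
    """Hand-written stand-in for the three steps described in the text."""
    # Mean of the values that are present, used to fill the gaps.
    known = [r["transaction_value"] for r in rows if r["transaction_value"] is not None]
    fill_value = mean(known)

    features = []
    for r in rows:
        d = r["transaction_date"]
        value = r["transaction_value"]
        features.append({
            # row_id is deliberately not copied: it is dropped.
            # Derived date features (this selection is an assumption).
            "year": d.year,
            "month": d.month,
            "day": d.day,
            "day_of_week": d.weekday(),
            # Mean imputation for the missing value.
            "transaction_value": fill_value if value is None else value,
        })
    return features


features = featurize(rows)
for f in features:
    print(f)
```

With AutoML's featurization enabled, none of this code has to be written; the sketch only makes the effect of each step visible.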

Option C is incorrect because when featurization is enabled, these transformations are applied automatically by AutoML.

ML Studio offers only limited options for modifying the featurization settings.

Option D is incorrect because, in general, all the featurization can be left to AutoML.

Specifically, removing rows that have missing values changes the data the model is trained on and therefore affects the result of the training process.
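A small sketch of the difference between removing rows and imputing, using a made-up transaction_value column (the numbers are hypothetical). Dropping shrinks the training set, while mean imputation keeps every row.

```python
from statistics import mean

# Hypothetical transaction_value column with two missing entries.
values = [120.0, None, 80.0, None, 100.0]

# Strategy 1: remove rows with missing values.
dropped = [v for v in values if v is not None]

# Strategy 2: fill missing values with the mean of the present ones.
fill_value = mean(dropped)
imputed = [fill_value if v is None else v for v in values]

print(len(values), len(dropped), len(imputed))
print(imputed)
```

In a real dataset, the dropped rows would also take the values of all their other attributes with them.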

The best option for feature engineering depends on the nature of the data and the problem being solved. However, option C seems to be the most appropriate given the available information.

Option A suggests normalizing the row_id values, transforming the transaction_date column using custom Python code, and replacing missing transaction_value with random numbers. Normalizing the row_id values may not be necessary for the model, especially if row_id does not contain any meaningful information. Replacing missing transaction_value with random numbers may introduce noise to the data, which could affect the model's performance. Finally, transforming transaction_date using custom Python code may be unnecessary since Azure AutoML provides built-in functionality to transform date type columns.

Option B suggests leaving all the feature engineering tasks to Azure AutoML. While this option may work in some cases, it may not be the best approach for all problems. For example, if row_id contains meaningful information, it should not be removed.

Option C suggests using the Azure AutoML functionality to drop high cardinality features, generate additional features from date type columns, and impute missing values. Dropping high cardinality features is useful to reduce the dimensionality of the data, which can lead to better performance. Generating additional features from date type columns can provide more information to the model, especially if the date is relevant to the problem. Finally, imputing missing values can help to reduce the impact of missing data on the model's performance.
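One simple way to picture a high-cardinality check is the share of distinct values in a column. The sketch below is an assumption-laden illustration: the helper names, the sample table, and the 0.95 threshold are all hypothetical, and the rule AutoML actually applies is not stated in the text.

```python
def unique_ratio(column):
    """Fraction of the column's values that are distinct."""
    return len(set(column)) / len(column)


def high_cardinality_columns(table, threshold=0.95):
    """Return names of columns whose distinct-value share reaches the threshold.

    The threshold is a hypothetical cut-off chosen for this example.
    A check like this is meant for identifier-like or categorical columns;
    continuous numeric columns are naturally near-unique and would need
    to be excluded beforehand.
    """
    return [name for name, col in table.items() if unique_ratio(col) >= threshold]


# Hypothetical table: row_id is unique per row, store repeats.
table = {
    "row_id": [1, 2, 3, 4, 5, 6],
    "store": ["north", "south", "north", "east", "south", "north"],
}

flagged = high_cardinality_columns(table)
print(flagged)
```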

Option D suggests dropping the row_id column, deriving additional year and month columns from transaction_date using custom Python code, and removing rows with missing transaction_value entries. Dropping row_id may not be appropriate if it contains meaningful information. Deriving additional year and month columns from transaction_date can be done using Azure AutoML, so custom Python code may not be necessary. Finally, removing rows with missing transaction_value entries may result in a loss of data, which can negatively impact the model's performance.

In conclusion, option C seems to be the best approach given the available information. However, it is important to note that feature engineering is an iterative process, and the best approach may change as the problem and data evolve.