Selecting Relevant Columns for Efficient ML Model Training | DP-100 Exam Preparation


Question

You have to train an ML model on a dataset that consists of a large number of columns.

Based on your experience, you anticipate long and expensive training runs.

In order to improve the time- and cost-efficiency of your work, you want to decrease the amount of input data by removing columns of little relevance.

ML Designer offers several modules to use in your pipeline:

1 - Select Columns Transform
2 - Filter Based Feature Selection
3 - Apply Transformation
4 - Permutation Feature Importance (PFI)

Which designer modules should you include, and in what order?

Answers

Explanations


A. B. C. D.

Answer: B.

Option A is incorrect because, although you do need these three modules, the sequence is wrong: Filter Based Feature Selection must come first because it holds the logic that calculates the relevance of each feature.

Option B is CORRECT because, to filter out irrelevant columns from your dataset before the model is created, you need Filter Based Feature Selection: you choose a statistical measure, the module calculates it for each feature, and the resulting scores determine each feature's relevance.

Only the columns with the best scores are then kept, which reduces the input data and therefore the training time and cost.

You then need to add Select Columns Transform to generate a dynamic set of columns, followed by Apply Transformation.
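The scoring step described for Option B can be illustrated outside the designer. The following is a minimal sketch in plain Python, assuming Pearson correlation as the statistical measure; the function names and the toy dataset are hypothetical and are not part of ML Designer.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def top_k_features(columns, target, k):
    """Score every column by |r| against the target and keep the best k."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy dataset: column name -> values
columns = {
    "size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
    "const": [7.0, 7.0, 7.0, 7.0, 7.0],
}
target = [2.1, 3.9, 6.2, 8.0, 9.9]

selected, scores = top_k_features(columns, target, k=1)
print(selected, scores)
```

Note that the scores are computed from the data alone; no model is involved, which is why this kind of selection can run before training.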

Option C is incorrect because it selects the wrong modules: Filter Based Feature Selection and PFI cannot be combined this way.

Option D is incorrect because, if you want to filter out columns from your dataset before the model is created, you need to use Filter Based Feature Selection.

Permutation Feature Importance generates a set of feature scores only after the model has been trained.

PFI takes a trained model and a dataset as inputs, while FBFS takes only a dataset (typically the training data) as input.

Hence, using PFI is incorrect.
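To make the difference in inputs concrete, here is a rough sketch of permutation importance in plain Python. It is not the designer module's implementation; `trained_model` is a hypothetical stand-in for a model that has already been fitted, and mean squared error is an assumed metric.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of the model's predictions."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_features, repeats=20, seed=0):
    """Average increase in error when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    importances = []
    for j in range(n_features):
        increases = []
        for _ in range(repeats):
            shuffled = [r[j] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild the rows with only column j permuted
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, shuffled)]
            increases.append(mse(model, permuted, targets) - baseline)
        importances.append(sum(increases) / repeats)
    return importances


# Hypothetical stand-in for a trained model: it relies on feature 0 only
def trained_model(row):
    return 2.0 * row[0]


rows = [[float(i), float((i * 7) % 5)] for i in range(10)]
targets = [2.0 * r[0] for r in rows]

importances = permutation_importance(trained_model, rows, targets, n_features=2)
print(importances)
```

The function cannot be called without a model, which is the point: the cost of training has already been paid by the time these scores exist, so they cannot help shorten that training run.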


When training a machine learning model, it is often beneficial to remove any columns of data that are not useful for the prediction task. This reduces the amount of data that needs to be processed, which can save time and cost during the training phase. In this scenario, there are four ML Designer modules to consider in order to accomplish this task: Select Columns Transform, Filter Based Feature Selection, Apply Transformation, and Permutation Feature Importance (PFI).

The first step in this process should be the Filter Based Feature Selection module, which scores each column against the target variable using a statistical measure that you choose, such as Pearson correlation or chi-squared, and keeps only the highest-scoring columns. This is useful when the dataset has so many columns that judging their relevance manually is impractical. Therefore, the correct first module in the pipeline is 2 - Filter Based Feature Selection.

Next, the Select Columns Transform module takes the filtered dataset and generates a transformation that selects that same subset of columns. Because the subset is derived from the feature scores rather than from a hard-coded list, the column selection is dynamic. Therefore, the correct second module in the pipeline is 1 - Select Columns Transform.

After that, the Apply Transformation module applies the saved column selection to a dataset, so that the downstream steps receive only the relevant columns. This keeps the data used for training and the data used later for scoring in the same shape. Therefore, the correct third module in the pipeline is 3 - Apply Transformation.

The Permutation Feature Importance (PFI) module does not belong in this pipeline. It works by randomly permuting the values in each column and measuring the effect on the model's performance, so it requires a model that has already been trained. It is useful for explaining a model after training, but it cannot reduce the input data before training. Therefore, the fourth module is not required in this case.

Therefore, the correct answer is B: Filter Based Feature Selection, then Select Columns Transform, then Apply Transformation.
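The division of labour between Select Columns Transform and Apply Transformation resembles a "build once, apply many times" pattern. The sketch below illustrates that pattern in plain Python under the assumption that a dataset is a dictionary of columns; `make_select_columns_transform` and the sample data are hypothetical and do not reflect the designer's internals.

```python
def make_select_columns_transform(reference_columns):
    """Build a reusable transform that keeps only the reference columns."""
    kept = list(reference_columns)

    def transform(dataset):
        missing = [c for c in kept if c not in dataset]
        if missing:
            raise KeyError(f"columns missing from dataset: {missing}")
        return {c: dataset[c] for c in kept}

    return transform


# Hypothetical output of a feature selection step: only relevant columns remain
filtered_train = {"size": [1.0, 2.0], "price": [10.0, 20.0]}

# A later dataset that still contains every original column
new_data = {"size": [3.0], "noise": [9.0], "price": [30.0]}

select = make_select_columns_transform(filtered_train.keys())  # build the transform
reduced = select(new_data)                                     # apply it elsewhere
print(reduced)
```

Building the selection from a dataset instead of typing column names by hand means the kept columns follow whatever the scoring step produced.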