SageMaker Data Wrangler - Target Leakage Metrics for Regression and Classification

Question

You work as a machine learning specialist for an oil refining and exploration company.

You are building a machine learning model to predict the viability of various potential drilling sites around the world.

Your training data has many features, and you are performing feature engineering to make sure it contains no target leakage.

You plan to use both a regression and a classification model to see which gives you better predictive results.

When using SageMaker Data Wrangler to visualize your target leakage report, which two metrics (for regression and classification) can you use to measure your target leakage? (Select TWO)

Answers

Explanations

A. B. C. D. E.

Correct Answers: B and C.

Option A is incorrect.

The two metrics used by the SageMaker Data Wrangler target leakage analysis visualization are AUC-ROC and R2.

MSE is not used in the target leakage analysis visualization.

Option B is correct.

The two metrics used by the SageMaker Data Wrangler target leakage analysis visualization are AUC-ROC and R2.

Option C is correct.

The two metrics used by the SageMaker Data Wrangler target leakage analysis visualization are AUC-ROC and R2.

Option D is incorrect.

The two metrics used by the SageMaker Data Wrangler target leakage analysis visualization are AUC-ROC and R2.

F1 is not used in the target leakage analysis visualization.

Option E is incorrect.

The two metrics used by the SageMaker Data Wrangler target leakage analysis visualization are AUC-ROC and R2.

Accuracy is not used in the target leakage analysis visualization.

Reference:

Please see the Amazon SageMaker developer guide titled Analyze and Visualize (https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-analyses.html) and the Amazon SageMaker developer guide titled Prepare ML Data with Amazon SageMaker Data Wrangler (https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler.html).

Target leakage occurs when the training data contains a feature that is strongly correlated with the target but will not be available at inference time, often because it is recorded only after the outcome is known. Leakage inflates performance metrics during training and validation and leads to poor performance in production. SageMaker Data Wrangler, the data preparation feature of Amazon SageMaker, includes a target leakage analysis that scores how well each feature predicts the target on its own.

For regression problems, the target leakage analysis reports the coefficient of determination (R2) for each feature. R2 is the proportion of the variance in the target variable that the feature explains. A score close to 1 means the feature predicts the target almost perfectly on its own, which often indicates target leakage rather than a genuinely good predictor. Mean Squared Error (MSE), the average of the squared differences between predicted and actual values, is a common regression metric, but the target leakage analysis does not use it.
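To illustrate how a per-feature R2 can expose a leaky column, here is a minimal sketch in plain Python. It is not Data Wrangler's implementation: it fits a one-variable least-squares line in-sample, without cross-validation, and the drilling-site columns (barrels, revenue, depth_km) are hypothetical values invented for the example.

```python
from statistics import mean


def single_feature_r2(feature, target):
    """R2 of a one-variable least-squares line fitting target from feature."""
    x_bar, y_bar = mean(feature), mean(target)
    sxx = sum((x - x_bar) ** 2 for x in feature)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(feature, target))
    slope = sxy / sxx if sxx else 0.0
    intercept = y_bar - slope * x_bar
    # Residual and total sums of squares give R2 = 1 - SS_res / SS_tot.
    ss_res = sum((y - (intercept + slope * x)) ** 2
                 for x, y in zip(feature, target))
    ss_tot = sum((y - y_bar) ** 2 for y in target)
    return 1.0 - ss_res / ss_tot


# Hypothetical data: the target is barrels produced per day at each site.
barrels = [120.0, 340.0, 90.0, 410.0, 275.0, 60.0]
# Leaky column: revenue is only known after production, and it is a
# deterministic function of the target.
revenue = [b * 70.0 for b in barrels]
# Ordinary column: known before drilling, weakly related to the target here.
depth_km = [2.1, 3.4, 2.9, 3.0, 1.8, 3.6]

r2_revenue = single_feature_r2(revenue, barrels)
r2_depth = single_feature_r2(depth_km, barrels)
print(f"revenue  R2 = {r2_revenue:.3f}")
print(f"depth_km R2 = {r2_depth:.3f}")
```

A column whose score sits at or very near 1, like revenue here, is the kind of feature to investigate and usually drop before training.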

For classification problems, the analysis reports the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) for each feature, computed with cross-validation on a sample of the data. AUC-ROC measures how well the feature separates the positive and negative classes. A score of 1 indicates perfect predictive ability, which often indicates target leakage; a score of 0.5 or lower indicates that the feature provides no useful information about the target on its own. F1-score, the harmonic mean of precision and recall, is a common classification metric, but the target leakage analysis does not use it.
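The same idea for classification can be sketched by using a single feature's raw value as the score and computing AUC-ROC as the probability that a random positive row outranks a random negative row. This is an illustrative simplification, not Data Wrangler's procedure, and the columns (viable, permit_issued, seismic_score) are hypothetical.

```python
def single_feature_auc(feature, labels):
    """AUC-ROC when the raw feature value is used directly as the score.

    Computed as the fraction of (positive, negative) row pairs in which the
    positive row has the higher feature value; ties count as half.
    """
    pos = [x for x, y in zip(feature, labels) if y == 1]
    neg = [x for x, y in zip(feature, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 means the site turned out to be viable.
viable = [1, 0, 1, 1, 0, 0, 1, 0]
# Leaky column: assume a production permit is issued only after viability
# has been confirmed, so it mirrors the label exactly.
permit_issued = [1, 0, 1, 1, 0, 0, 1, 0]
# Ordinary column: a pre-drilling survey score with some predictive signal.
seismic_score = [0.7, 0.4, 0.3, 0.8, 0.6, 0.2, 0.5, 0.5]

auc_permit = single_feature_auc(permit_issued, viable)
auc_seismic = single_feature_auc(seismic_score, viable)
print(f"permit_issued AUC-ROC = {auc_permit:.3f}")
print(f"seismic_score AUC-ROC = {auc_seismic:.3f}")
```

The leaky column separates the classes perfectly, while the survey score lands between 0.5 and 1, which is what a legitimately informative feature tends to look like.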

Accuracy is not used by the target leakage analysis either. It measures only the overall proportion of correctly classified instances at a single decision threshold and is sensitive to class imbalance, so it is less suited to ranking the predictive power of individual features.

In summary, the two metrics used to measure target leakage in SageMaker Data Wrangler are R2 for regression problems and AUC-ROC for classification problems. MSE, F1-score, and accuracy are not part of the target leakage analysis.