Model Evaluation Techniques for Logistic Regression - MLS-C01 Exam

Discovering Optimal Classification Thresholds for Logistic Regression Models

Question

You are a machine learning specialist on a software development team in a real estate company.

Your management team has asked your team to build a logistic regression model to predict whether a person will buy a given listing, based on multiple attributes of the sale, the property, and the customer profile.

Your team lead has assigned your team to find the optimal model with an ideal classification threshold. Which model evaluation technique should your team use to discover how different classification thresholds will affect the model's performance?

Answers

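A. Rand Index

B. Root Mean Square Error (RMSE)

C. Mean Absolute Error (MAE)

D. Receiver Operating Characteristic (ROC) curve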

Answer: D.

Option A is incorrect.

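The Rand Index is used to evaluate unsupervised clustering models by measuring the agreement between two sets of cluster assignments. It is not a technique for examining the classification thresholds of a supervised model like logistic regression.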

Option B is incorrect.

The RMSE evaluation technique is used for regression problems where you are solving for a continuous variable.

In this use case, you are solving a binary classification: will or will not buy.

Option C is incorrect.

The MAE evaluation technique is also used for regression problems where you are solving for a continuous variable.

In this use case, you are solving a binary classification: will or will not buy.

Option D is CORRECT.

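The Receiver Operating Characteristic (ROC) curve evaluation technique is used for classification problems where you are predicting a binary variable. It shows how the model's true positive rate and false positive rate change as the classification threshold varies.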

In this use case, you are solving a binary classification: will or will not buy.

Reference:

Please see the Google Machine Learning Crash Course article titled Classification: ROC Curve and AUC.

Please refer to the Towards Data Science article titled Understanding AUC - ROC Curve.

Please review the Data Institute article titled Choosing the Right Metric for Evaluating Machine Learning Models - Part 1.

The correct answer is D. Receiver Operating Characteristic (ROC) curve.

Explanation: The ROC curve is a graphical representation of the performance of a classification model that plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. TPR is also known as sensitivity, recall, or hit rate, while FPR is also known as the fall-out or false alarm rate.
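The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name roc_points and the labels and scores are hypothetical, made up for the example. In practice a library routine such as scikit-learn's roc_curve would normally be used instead.

```python
def roc_points(y_true, y_score):
    """Return (threshold, fpr, tpr) tuples, one per distinct score.

    An instance is predicted positive when its score >= threshold.
    """
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    points = []
    # Sweep the threshold from the highest score down to the lowest.
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


# Hypothetical data: 1 = bought the listing, 0 = did not buy.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

for threshold, fpr, tpr in roc_points(y_true, y_score):
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Each printed row is one point on the ROC curve. Lowering the threshold can only raise TPR and FPR, never lower them, which is why the curve runs from the bottom left to the top right.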

The ROC curve is an ideal model evaluation technique for discovering how different classification thresholds will affect the model's performance. The area under the ROC curve (AUC) is a common metric for measuring the overall performance of a binary classification model. The AUC ranges from 0 to 1, where 0.5 indicates a random guess, and 1 indicates a perfect classifier. The higher the AUC, the better the performance of the model.
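One way to make the AUC concrete is its ranking interpretation: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way using only the standard library. The function name auc_score and the data are hypothetical, and the pairwise loop is chosen for readability rather than speed; scikit-learn's roc_auc_score is the usual choice in practice.

```python
def auc_score(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical data: 1 = bought the listing, 0 = did not buy.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

print(auc_score(y_true, y_score))  # 0.6875 (11 of 16 pairs ranked correctly)
```

A model whose scores carry no information lands near 0.5, and values below 0.5 indicate a ranking that is worse than random, typically a sign that the predictions are inverted.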

In logistic regression, the model outputs probabilities of class membership for each instance, which can be converted into binary predictions using a classification threshold. By varying the threshold, you can trade off between TPR and FPR, which is particularly useful when you need to balance the cost of false positives and false negatives in your problem domain.
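The trade-off described above can be sketched as a small cost-based threshold search. Everything here is an assumption made for illustration: the helper names, the probabilities, and the per-error costs are hypothetical, and a real project would tune the threshold on held-out validation data rather than on a handful of hand-written values.

```python
def predict_labels(probabilities, threshold):
    """Convert predicted probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in probabilities]


def total_cost(y_true, y_pred, cost_fp, cost_fn):
    """Total cost of the false positives and false negatives."""
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    return fp * cost_fp + fn * cost_fn


def best_threshold(y_true, probabilities, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    candidates = sorted(set(probabilities))
    return min(
        candidates,
        key=lambda t: total_cost(
            y_true, predict_labels(probabilities, t), cost_fp, cost_fn
        ),
    )


# Hypothetical data: 1 = bought the listing, 0 = did not buy.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
probabilities = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

# Missing a buyer (false negative) is 5x as costly as a wasted follow-up.
print(best_threshold(y_true, probabilities, cost_fp=1, cost_fn=5))  # 0.2

# A wasted follow-up (false positive) is 5x as costly as a missed buyer.
print(best_threshold(y_true, probabilities, cost_fp=5, cost_fn=1))  # 0.9
```

When false negatives are expensive the search settles on a low threshold that accepts more false positives, and when false positives are expensive it moves to a high threshold. These are the two ends of the ROC curve.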

In the context of the real estate company, the logistic regression model predicts whether a person will buy a given listing based on attributes of the sale, the property, and the customer profile, for example price, location, square footage, number of bedrooms, age of the property, income, and credit score. Because the AUC summarizes performance across all thresholds, it does not depend on any single threshold. The team can use it to compare candidate models, then use the ROC curve of the chosen model to pick the classification threshold that best balances TPR and FPR for their specific problem domain.