Outlier Detection Algorithm for Novel User Entries | Best Algorithm for Census Data | AWS Certified Machine Learning Exam MLS-C01

Outlier Detection Algorithm

Question

You are a machine learning specialist working for a government agency that uses a series of web application forms to gather citizen data for census purposes.

You have been tasked with finding novel user entries as they are entered by your citizens.

A novel user entry is defined as an outlier compared to the established set of citizen entries in your datastore. You have cleaned your citizen datastore to remove any existing outliers.

You now need to build a model to determine whether new entries on your web application are novel.

Which algorithm best fits these requirements?

Answers

Explanations

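A. Multinomial Naive Bayes
B. Bernoulli Naive Bayes
C. Principal Component Analysis (PCA)
D. Support Vector Machine (SVM)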

Answer: D.

Option A is incorrect.

The Multinomial Naive Bayes algorithm is a supervised classifier best suited to features that are discrete counts, such as how often each word occurs in a document. It assigns an observation to one of several known, labeled classes.

You are trying to determine whether you have a novel observation.

Option B is incorrect.

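The Bernoulli Naive Bayes algorithm is a supervised classifier for binary features, where each feature records whether something is present or absent in the observation. It also assigns an observation to one of several known, labeled classes.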

You are trying to determine whether you have a novel observation.

Option C is incorrect.

The Principal Component Analysis algorithm is used to reduce feature dimensionality.

You are trying to determine whether you have a novel observation.

Option D is correct.

The Support Vector Machine algorithm, in its one-class form (One-Class SVM), can be used when your training data has no outliers and you want to detect whether a new observation is a novel entry.
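As a rough illustration of this setup, the sketch below trains scikit-learn's OneClassSVM on clean data and then checks two new entries. The two numeric fields (age and household size), the synthetic data, and the nu value are assumptions made for the example; they are not part of the question.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical cleaned datastore: two numeric fields per entry
# (age and household size), with outliers already removed.
clean_entries = np.column_stack([
    rng.normal(40.0, 12.0, size=500),  # age
    rng.normal(3.0, 1.0, size=500),    # household size
])

# nu bounds the fraction of training points allowed outside the learned
# boundary; 0.01 is an assumed value chosen for this sketch.
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel="rbf", gamma="scale", nu=0.01),
)
model.fit(clean_entries)

new_entries = np.array([
    [38.0, 3.0],    # close to the bulk of the training data
    [400.0, 60.0],  # far outside anything seen in training
])

# predict returns +1 for inliers and -1 for novel observations.
labels = model.predict(new_entries)
is_novel = labels == -1
```

In practice, nu and gamma would need tuning against real entries, since they control how tightly the boundary wraps the training data.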

Reference:

Please see the scikit-learn page titled 1.4. Support Vector Machines (https://scikit-learn.org/stable/modules/svm.html), the scikit-learn page titled 2.7. Novelty and Outlier Detection (https://scikit-learn.org/stable/modules/outlier_detection.html#outlier-detection), and the Amazon SageMaker developer guide titled Principal Component Analysis (PCA) Algorithm (https://docs.aws.amazon.com/sagemaker/latest/dg/pca.html).

Option A, Multinomial Naive Bayes, is typically used for text classification tasks where the data is represented as a bag of words or a frequency count of words in a document. It assumes that, given the class, the count of each word is independent of the count of every other word. It is a supervised classifier that needs labeled examples of every class it predicts, which makes it a poor fit for novelty detection, where the training data contains only normal entries.

Option B, Bernoulli Naive Bayes, is also used for text classification tasks, but it assumes that the input features are binary, i.e., a word is either present or absent in the document. Like Multinomial Naive Bayes, it needs labeled classes and is not suited to novelty detection.

Option C, Principal Component Analysis (PCA), is a dimensionality reduction technique that is often used to reduce the number of features in a dataset. PCA works by finding the directions in the data that contain the most variance, and projecting the data onto those directions. PCA can support outlier detection, for example by flagging points that have a large reconstruction error after projection, but it does not itself classify data points as outliers or not outliers. A scoring rule and a threshold have to be added on top of it.
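To make the distinction concrete, here is a minimal sketch of how PCA is sometimes paired with a separate scoring rule. The synthetic data, the single retained component, and the 99th-percentile threshold are all assumptions for illustration. Note that the flagging step is extra logic added by the user, not something PCA provides.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical clean data: three features that vary mostly along one direction.
t = rng.normal(size=(500, 1))
clean = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(500, 3))

pca = PCA(n_components=1).fit(clean)

def reconstruction_error(X):
    # Project onto the retained component, map back to the original
    # feature space, and measure the squared distance that was lost.
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.sum((X - X_hat) ** 2, axis=1)

# Assumed scoring rule: flag anything above the 99th percentile of the
# reconstruction errors observed on the clean training data.
threshold = np.quantile(reconstruction_error(clean), 0.99)

new_points = np.array([
    [1.0, 2.0, -1.0],  # lies along the main direction of variance
    [1.0, -2.0, 1.0],  # points away from the main direction
])
flags = reconstruction_error(new_points) > threshold
```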

Option D, Support Vector Machine (SVM), is best known as a classification algorithm that finds the hyperplane separating the classes with the largest margin. For novelty detection, the one-class variant (One-Class SVM) is trained only on the clean data and learns a boundary that encloses it. New observations that fall outside this boundary are flagged as novel.

Given the requirements of the task, the best algorithm is option D, Support Vector Machine. Because the datastore has already been cleaned of outliers, a One-Class SVM (the SVM variant for novelty detection) can be trained on it directly and then used to check each new entry as it arrives. With a kernel, it can learn a non-linear boundary, and SVMs are effective in high-dimensional feature spaces. SVMs operate on numeric features, so categorical form fields need to be encoded (for example, one-hot encoded) before training.
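Census forms usually mix numeric and categorical fields, so an encoding step is typically placed in front of the model. The sketch below shows one possible arrangement with scikit-learn's ColumnTransformer. The field names (age, region), the synthetic data, and the nu value are hypothetical and chosen only for the example.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
n = 400

# Hypothetical clean entries: column 0 is age (numeric),
# column 1 is region (categorical).
clean = np.empty((n, 2), dtype=object)
clean[:, 0] = rng.normal(40.0, 12.0, size=n)
clean[:, 1] = rng.choice(["north", "south"], size=n)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), [0]),
    # handle_unknown="ignore" encodes unseen categories as all zeros
    # instead of raising an error at prediction time.
    ("cat", OneHotEncoder(handle_unknown="ignore"), [1]),
])

# nu=0.05 is an assumed value for this sketch.
model = make_pipeline(preprocess, OneClassSVM(kernel="rbf", gamma="scale", nu=0.05))
model.fit(clean)

new_entries = np.array([
    [41.0, "south"],   # typical entry
    [400.0, "north"],  # implausible age
], dtype=object)

# +1 means the entry looks like the training data, -1 means novel.
preds = model.predict(new_entries)
```

One-hot encoding is only one option. How a category that never appeared in training should be scored is a design decision that this sketch does not settle.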