Unsupervised Learning Algorithm for Identifying Shark Populations | NOAA Research

Best Algorithm in SageMaker for Identifying Longitude and Latitude Groupings

Question

You work as a machine learning specialist for the National Oceanic and Atmospheric Administration (NOAA Research).

NOAA has developed a great white shark detection program to help warn shore populations when the sharks are in the area of a populated beach.

Your assignment is to use your machine learning expertise to decide where to place 10 high-tech shark detection sensors on the ocean floor. The placement is part of a pilot that will determine whether NOAA invests broadly in these very expensive sensors.

You have great white sightings data from around the globe, gathered over the past several years, to use as your model training and test data.

The model dataset contains several useful features, such as the longitude and latitude of each sighting. You have decided to use an unsupervised learning algorithm that attempts to find discrete groupings within the data.

Specifically, you want to group sightings whose longitude and latitude values are similar.

You need to produce 10 longitude and latitude pairs to determine where to place the sensors. Which algorithm can you use in SageMaker that best suits this task?

Answers

A. Linear Learner
B. Neural Topic Model (NTM)
C. K-Means
D. Random Cut Forest (RCF)
E. Semantic Segmentation
F. XGBoost

Explanations

Answer: C.

Option A is incorrect.

From the Amazon SageMaker developer guide titled Linear Learner Algorithm, “Linear models are supervised learning algorithms used for solving either classification or regression problems.” But you are trying to solve a data clustering problem so that you can find the ten best clustered sightings to determine where to place your shark detection sensors.

Option B is incorrect.

From the Amazon SageMaker developer guide titled Neural Topic Model (NTM) Algorithm, “Amazon SageMaker NTM is an unsupervised learning algorithm that is used to organize a corpus of documents into topics that contain word groupings based on their statistical distribution.” So this algorithm is used for natural language processing, not data clustering.

Option C is correct.

The k-means algorithm is a clustering algorithm.

From the Amazon SageMaker developer guide titled K-Means Algorithm, “K-means is an unsupervised learning algorithm. It attempts to find discrete groupings within data, where members of a group are as similar as possible to one another and as different as possible from members of other groups.” By setting the k hyperparameter to 10, this algorithm will allow you to find the 10 best groupings of shark sightings worldwide.

Option D is incorrect.

From the Amazon SageMaker developer guide titled Random Cut Forest (RCF) Algorithm, “Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set.” But you are trying to solve a data clustering problem so you can find the ten best clustered sightings to determine where to place your shark detection sensors.

Option E is incorrect.

From the Amazon SageMaker developer guide titled Semantic Segmentation Algorithm, “The Amazon SageMaker semantic segmentation algorithm provides a fine-grained, pixel-level approach to developing computer vision applications.” So the Semantic Segmentation algorithm is used for computer vision applications, but you are trying to solve a data clustering problem.

Option F is incorrect.

The XGBoost algorithm is a gradient boosting algorithm.

From the Amazon SageMaker developer guide titled XGBoost Algorithm, “gradient boosting is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler, weaker models.” You are not trying to predict a target value; you are trying to find discrete groupings in your dataset.

Reference:

Please see the Amazon SageMaker developer guide titled Use Amazon SageMaker Built-in Algorithms.

The algorithm that best suits this task is K-Means.

K-Means is an unsupervised machine learning algorithm that is commonly used for clustering tasks. Clustering is the process of grouping together data points based on their similarities, where data points within a cluster are more similar to each other than to data points in other clusters.
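The notion of "similarity" above can be made concrete with the standard k-means objective. The symbols here are notation introduced for illustration, not taken from the question: $x_i$ is a data point, $C_k$ is the set of points assigned to cluster $k$, and $\mu_k$ is the centroid (mean) of that cluster.

```latex
\min_{\mu_1,\dots,\mu_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

A smaller value of this sum means the points in each cluster sit closer to their own centroid.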

In this scenario, the goal is to find similarities in the longitude and latitude of great white shark sightings and group them into clusters. The K-Means algorithm can group these data points based on their geographic location.

The algorithm works by randomly selecting K initial centroids (where K is the number of clusters we want to create), assigning each data point to the closest centroid, and then recalculating the centroids based on the mean of the data points assigned to each cluster. This process is repeated iteratively until the centroids no longer move or until a maximum number of iterations is reached.
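The loop described above can be sketched in a few lines of plain Python. This is the textbook iteration, not SageMaker's implementation (the SageMaker documentation describes its version as a modified web-scale k-means). The function name `kmeans`, its `init` and `seed` parameters, and the sample coordinates are hypothetical, and treating longitude and latitude as flat Euclidean coordinates is a simplifying assumption that ignores the curvature of the Earth and the wraparound at 180 degrees.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two (longitude, latitude) pairs."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans(points, k, max_iter=100, init=None, seed=0):
    """Textbook k-means on a list of (longitude, latitude) tuples."""
    rng = random.Random(seed)
    # Step 1: start from caller-supplied centroids, or k sightings picked at random.
    centroids = list(init) if init is not None else rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                lon = sum(m[0] for m in members) / len(members)
                lat = sum(m[1] for m in members) / len(members)
                new_centroids.append((lon, lat))
            else:
                new_centroids.append(old)  # empty cluster: keep the old centroid
        # Step 4: stop once no centroid has moved.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


# Made-up sightings in three well-separated regions (illustrative values only).
sightings = [
    (-122.5, 37.0), (-122.7, 37.2), (-122.3, 36.8),
    (18.4, -34.2), (18.6, -34.0), (18.2, -34.4),
    (153.0, -28.0), (153.2, -27.8), (152.8, -28.2),
]
print(kmeans(sightings, k=3))
```

With real sightings data, calling the function with `k=10` would return ten centroid pairs. Because the starting centroids are random, different seeds can converge to different groupings, which is one reason production implementations use more careful initialization.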

Once the algorithm has converged, we can use the location of each centroid as a longitude and latitude pair to determine where to place the sensors. Since we want to place 10 sensors, we would set K=10 in the K-Means algorithm.

SageMaker is a cloud-based AWS service that offers a suite of machine learning tools and algorithms. K-Means is one of its built-in algorithms.

Therefore, the correct answer is C. K-Means.