SageMaker Ground Truth: Complete Data Labeling for Machine Learning Models

SageMaker Ground Truth Labeling Process

Question

You have just landed a position as a machine learning specialist at a large financial services firm.

Your new team is working on a fraud detection model using the SageMaker built-in linear learner algorithm.

You are gathering the data required for your machine learning model.

The dataset you intend to produce will contain well over 5,000 objects that need to be labeled.

Your team wants to control the costs of labeling your data.

Therefore, the team has decided to use SageMaker Ground Truth active learning to automate your data labeling. The Ground Truth automated labeling job initially follows this set of steps:

  1. Selects a random sample of data.

  2. Sends the sample data to human workers.

  3. Uses the human-labeled data as validation data.

  4. Runs a SageMaker batch transform using the validation set, which generates a quality metric used to estimate the potential quality of auto-labeling the rest of the unlabeled data.

  5. Runs a SageMaker batch transform on the unlabeled data.

  6. Labels the data where the expected quality of automatically labeling the data is above the requested level of accuracy.

After performing the above steps, what does Ground Truth do next to complete the labeling of ALL of your data?

Answers

Explanations



Answer: D.

Option A is incorrect.

This option doesn't state that the new sample is selected from the hardest-to-identify unlabeled data.

It also doesn't state that the new human-labeled data is used together with the existing labeled data to train a new model.

Option B is incorrect.

This option doesn't state that the new sample is selected from the hardest-to-identify unlabeled data.

Option C is incorrect.

This option doesn't state that the new human-labeled data is used with the existing labeled data to train a new model.

Option D is correct.

This option describes the set of steps Ground Truth uses to iterate over the remaining unlabeled data, alternating human labeling with model training, until your large dataset is completely labeled.

Reference:

Please see the Amazon SageMaker developer guide titled Amazon SageMaker Ground Truth, and the Amazon SageMaker developer guide titled Using Automated Data Labeling.

SageMaker Ground Truth is an AWS service for building labeled training datasets. Its automated data labeling feature uses active learning: a machine learning model automatically labels the objects it can label confidently, which reduces the time and cost of labeling everything manually. AWS recommends automated data labeling for datasets of at least 5,000 objects, which is why the dataset size matters in this scenario.
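For context, automated data labeling is enabled by including a LabelingJobAlgorithmsConfig block when the labeling job is created. The boto3 sketch below illustrates this under stated assumptions: all job names, S3 paths, and ARNs are placeholders, and the real algorithm-specification and Lambda ARNs are region-specific values listed in the AWS documentation.

```python
import boto3

sm = boto3.client("sagemaker")

# Minimal sketch of creating a Ground Truth labeling job with automated
# data labeling (active learning) enabled. All names, S3 URIs, and ARNs
# below are placeholders, not real resources.
sm.create_labeling_job(
    LabelingJobName="fraud-transactions-labeling",
    LabelAttributeName="fraud-label",
    InputConfig={
        "DataSource": {
            "S3DataSource": {"ManifestS3Uri": "s3://example-bucket/input.manifest"}
        }
    },
    OutputConfig={"S3OutputPath": "s3://example-bucket/labeling-output/"},
    RoleArn="arn:aws:iam::111122223333:role/GroundTruthExecutionRole",
    LabelCategoryConfigS3Uri="s3://example-bucket/label-categories.json",
    # This block is what turns on automated data labeling; the real ARN of
    # the built-in labeling algorithm is a region-specific value from the docs.
    LabelingJobAlgorithmsConfig={
        "LabelingJobAlgorithmSpecificationArn": (
            "arn:aws:sagemaker:us-east-1:111122223333:"
            "labeling-job-algorithm-specification/text-classification"
        )
    },
    HumanTaskConfig={
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/example-team",
        "UiConfig": {"UiTemplateS3Uri": "s3://example-bucket/template.liquid"},
        # Pre-annotation and consolidation Lambdas are AWS-provided,
        # region-specific functions; placeholders are shown here.
        "PreHumanTaskLambdaArn": "arn:aws:lambda:us-east-1:111122223333:function:PRE-TextMultiClass",
        "AnnotationConsolidationConfig": {
            "AnnotationConsolidationLambdaArn": "arn:aws:lambda:us-east-1:111122223333:function:ACS-TextMultiClass"
        },
        "TaskTitle": "Classify transactions",
        "TaskDescription": "Label each transaction as fraudulent or legitimate",
        "NumberOfHumanWorkersPerDataObject": 3,
        "TaskTimeLimitInSeconds": 300,
    },
)
```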

In this scenario, the team is using SageMaker Ground Truth active learning to automate the data labeling process for their fraud detection model. The process involves several steps:

  1. Select a random sample of data: SageMaker Ground Truth selects a random sample of data from the dataset.

  2. Send the sample data to human workers: The selected data is sent to human workers for labeling.

  3. Use the human-labeled data as validation data: The human-labeled data is used as validation data to measure the quality of the labels generated by the machine learning algorithm.

  4. Run a SageMaker batch transform using the validation set: SageMaker runs a batch transform using the validation set to generate a quality metric that estimates the potential quality of auto-labeling the rest of the unlabeled data (a sketch of a comparable batch transform job appears after this list).

  5. Run a SageMaker batch transform on the unlabeled data: SageMaker runs a batch transform on the remaining unlabeled data.

  6. Label the data where expected quality of auto-labeling is above the requested level of accuracy: Data where the expected quality of auto-labeling is above the requested level of accuracy is automatically labeled.
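Steps 4 and 5 refer to batch transform jobs that Ground Truth creates and runs on your behalf. Purely for illustration, a comparable batch transform launched with the SageMaker Python SDK might look like the following sketch; the model name, instance type, and S3 paths are assumptions.

```python
from sagemaker.transformer import Transformer

# Illustrative only: Ground Truth runs equivalent batch transform jobs
# automatically. Model name, instance type, and S3 paths are placeholders.
transformer = Transformer(
    model_name="active-learning-model",  # model trained on the human-labeled data
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/auto-label-predictions/",
)

# Score the remaining unlabeled objects; predictions whose expected label
# quality clears the accuracy threshold become machine-generated labels.
transformer.transform(
    data="s3://example-bucket/unlabeled.jsonl",
    content_type="application/jsonlines",
    split_type="Line",
)
transformer.wait()
```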

After completing these initial steps, Ground Truth continues the active learning loop to finish labeling the rest of the dataset:

  1. Select a new sample of the hardest-to-identify unlabeled data and send it to human workers: Ground Truth chooses the unlabeled objects its model is least confident about (the most ambiguous data) and sends them to human workers for labeling.

  2. Use the new human-labeled data together with the existing labeled data to train a new model: the additional human labels are combined with all previously labeled data to train an improved model, which is then used to automatically label the remaining data it can label at the requested level of accuracy.

  3. Repeat these steps until all the data in the dataset is labeled: Ground Truth keeps alternating human labeling of the most ambiguous objects with model retraining and automatic labeling until every object in the dataset has a label.

Option D describes this process correctly. Options A, B, and C are incorrect because each omits part of the loop: selecting the hardest-to-identify unlabeled data for human labeling, combining the new human-labeled data with the existing labeled data to train a new model, or both.
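One practical way to watch this loop converge is to poll the labeling job's label counters with boto3; a minimal sketch follows, where the job name is a placeholder.

```python
import boto3

sm = boto3.client("sagemaker")

# Poll the labeling job (name is a placeholder) and watch the counters:
# HumanLabeled + MachineLabeled grows each iteration until Unlabeled is 0.
resp = sm.describe_labeling_job(LabelingJobName="fraud-transactions-labeling")
counters = resp["LabelCounters"]
print(
    f"human-labeled: {counters['HumanLabeled']}, "
    f"machine-labeled: {counters['MachineLabeled']}, "
    f"unlabeled: {counters['Unlabeled']}"
)
```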