AWS Certified Machine Learning - Specialty: Input Data Channel Specifications for Unsupervised Learning with CSV Files on S3

Question

You work as a machine learning specialist for a robotics manufacturer where you are attempting to use unsupervised learning to train your robots to perform their prescribed tasks.

You have engineered your data, produced a CSV file, and placed it on S3. Which of the following input data channel specifications is correct for your data?

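Answers

A. Metadata Content-Type is identified as text/csv

B. Metadata Content-Type is identified as application/x-recordio-protobuf;boundary=1

C. Metadata Content-Type is identified as application/x-recordio-protobuf;label_size=1

D. Metadata Content-Type is identified as text/csv;label_size=0

Explanations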

Answer: D.

Option A is incorrect.

A Content-Type of text/csv with no label_size is used when the data includes a target, usually in column one. The default value for label_size is 1, meaning the data has one target column.

(See the Amazon SageMaker developer guide titled Common Data Formats for Training)
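
The effect of the default can be illustrated with a small sketch. This is not SageMaker's own parser; parse_content_type and split_row are hypothetical helpers written for this explanation, and the only behavior assumed is the one stated above (label_size defaults to 1 and the label columns come first).

```python
import csv
import io


def parse_content_type(content_type):
    """Split a Content-Type string into (media_type, params dict)."""
    media_type, *raw_params = [part.strip() for part in content_type.split(";")]
    params = {}
    for raw in raw_params:
        if not raw:
            continue
        key, _, value = raw.partition("=")
        params[key.strip().lower()] = value.strip()
    return media_type.lower(), params


def split_row(row, content_type):
    """Return (labels, features) for one CSV row.

    Assumption: label_size defaults to 1 when it is omitted,
    and the label columns are the leading columns of the row.
    """
    _, params = parse_content_type(content_type)
    label_size = int(params.get("label_size", 1))
    return row[:label_size], row[label_size:]


# One made-up observation with three feature values and no target.
row = next(csv.reader(io.StringIO("0.5,1.2,3.4\n")))

# Without label_size, the first feature is mistaken for a label.
print(split_row(row, "text/csv"))

# With label_size=0, every column is kept as a feature.
print(split_row(row, "text/csv;label_size=0"))
```

With plain text/csv, the first feature value would be consumed as a target, which is why option A does not fit data without labels.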

Option B is incorrect.

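The boundary parameter is not relevant to CSV files.

It is used with multipart content types such as multipart/form-data, and the application/x-recordio-protobuf media type does not describe a CSV file.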

Option C is incorrect.

For unsupervised learning, the label_size should equal 0, indicating the absence of a target.

(See the Amazon SageMaker developer guide titled Common Data Formats for Training)

Option D is correct.

For unsupervised learning, the label_size equals 0, indicating the absence of a target.

(See the Amazon SageMaker developer guide titled Common Data Formats for Training)
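
As a minimal sketch, the channel specification for option D could be written as one entry of the InputDataConfig list passed to the CreateTrainingJob API. The helper build_csv_channel, the channel name, and the S3 URI are hypothetical and chosen only for illustration; the code builds the dictionary and does not call AWS.

```python
def build_csv_channel(s3_uri, label_size=0, channel_name="train"):
    """Build one InputDataConfig channel entry for CSV training data.

    label_size=0 declares that the CSV has no target column,
    which is the unsupervised case described above.
    """
    if label_size < 0:
        raise ValueError("label_size must be zero or a positive integer")
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
        # The content type carries both the file format and the label count.
        "ContentType": f"text/csv;label_size={label_size}",
        "CompressionType": "None",
    }


# Hypothetical bucket and key, used only for illustration.
channel = build_csv_channel("s3://example-bucket/robots/train.csv")
print(channel["ContentType"])
```

If the SageMaker Python SDK is used instead, the same content type string can be passed as the content_type argument of sagemaker.inputs.TrainingInput.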

Reference:

Please see the Amazon SageMaker developer guide, specifically Common Data Formats for Built-in Algorithms and Common Data Formats for Training.

The correct input data channel specification depends on two things: the file format, which determines the media type, and whether the data contains a target column, which determines label_size.

Option A: Metadata Content-Type is identified as text/csv. The media type matches a CSV file in which each row represents an observation and each column represents a feature. However, when label_size is omitted it defaults to 1, so the first column is treated as a target. Unsupervised learning data has no labels, so this specification is incorrect.

Option B: Metadata Content-Type is identified as application/x-recordio-protobuf;boundary=1. This media type describes the binary RecordIO format that uses Protocol Buffers, not a CSV file. The boundary parameter belongs to multipart content types and says nothing about labels, so this specification is incorrect.

Option C: Metadata Content-Type is identified as application/x-recordio-protobuf;label_size=1. This media type also describes the binary RecordIO format that uses Protocol Buffers rather than CSV. In addition, label_size=1 indicates that each data point has one label associated with it, which does not hold for unsupervised learning. This specification is incorrect.

Option D: Metadata Content-Type is identified as text/csv;label_size=0. This uses the same media type as option A and sets the label_size parameter to 0, indicating that no labels are associated with the data in the CSV file. This specification is correct for unsupervised learning, which does not use labels.

Therefore, the correct answer is option D: Metadata Content-Type is identified as text/csv;label_size=0.