Preparing Data for Amazon Machine Learning - Best Practices

Question

A company plans to use the Amazon Machine Learning service to perform a predictive analysis.

Several input files will be submitted to Amazon Machine Learning.

How should you prepare the data to ensure it can be used as input data for Amazon Machine Learning? Choose 2 answers from the options given below.

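Answers

A. Ensure that all input files are in JSON format

B. Ensure that all input files are in CSV format

C. Ensure that all input files have the same data schema

D. Ensure that all input files have different data schemas

Explanations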

Answer - B and C.

The AWS documentation states the following:

Input data is the data that you use to create a datasource.

You must save your input data in the comma-separated values (.csv) format.

You can provide your input to Amazon ML as a single file, or as a collection of files.

Collections must satisfy these conditions:

All files must have the same data schema.

All files must reside in the same Amazon Simple Storage Service (Amazon S3) prefix, and the path that you provide for the collection must end with a forward slash ('/') character.
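The two collection conditions above can be checked before upload. The following is a minimal sketch, not an AWS tool: the function names, the bucket path, and the in-memory `files` dictionary are hypothetical, and it assumes every file carries a header row. It compares only column names and their order, not data types.

```python
import csv
import io


def read_header(csv_text):
    """Return the first row of a CSV document as a list of column names."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader, [])


def check_collection(s3_path, files):
    """Return a list of problems found; an empty list means the checks passed.

    `files` maps a file name to its CSV text (a hypothetical in-memory
    stand-in for the objects stored under the S3 prefix).
    """
    problems = []
    # Condition from the documentation: the collection path ends with '/'.
    if not s3_path.endswith("/"):
        problems.append("collection path must end with '/'")

    expected = None
    for name, text in sorted(files.items()):
        if not name.lower().endswith(".csv"):
            problems.append(f"{name}: not a .csv file")
            continue
        header = read_header(text)
        if expected is None:
            # The first file defines the header every other file must match.
            expected = header
        elif header != expected:
            problems.append(f"{name}: header {header} differs from {expected}")
    return problems


files = {
    "part-1.csv": "age,income,label\n34,52000,1\n",
    "part-2.csv": "age,income,label\n51,61000,0\n",
    "part-3.csv": "income,age,label\n48000,29,1\n",
}
print(check_collection("s3://example-bucket/training", files))
```

With this sample data the check reports two problems: the path lacks the trailing slash, and part-3.csv lists its columns in a different order from the other files.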

Since the AWS documentation states these requirements explicitly, all other options are incorrect.

For more information on the data format for Amazon Machine Learning, refer to the URL below.

https://docs.aws.amazon.com/machine-learning/latest/dg/understanding-the-data-format-for-amazon-ml.html

To use the Amazon Machine Learning service, data must be prepared in a format that the service can consume. This means that the data must be in a structured format and must have the same schema across all input files. The answer options are:

A. Ensure that all input files are in JSON format: This option is incorrect. According to the AWS documentation quoted above, input data must be saved in the comma-separated values (.csv) format, so JSON files would need to be converted to CSV before they can be used as input data.

B. Ensure that all input files are in CSV format: CSV (comma-separated values) is the format that Amazon Machine Learning requires for input data. It is also easy to read and parse, and it works with a wide range of tools and platforms.

C. Ensure that all input files have the same data schema: The input data must have the same schema across all input files. This means that each input file must have the same fields, with the same data types, in the same order. This is important because the machine learning algorithm will use the schema to understand and interpret the input data.

D. Ensure that all input files have different data schemas: This option is incorrect. Files with different data schemas cannot be combined into one collection, and the machine learning algorithm would not be able to interpret the data consistently.

Therefore, the correct answers are B and C. Input data should be in CSV format and should have the same data schema across all input files to ensure that the machine learning algorithm can process the data effectively.
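If source data arrives as JSON, it can be converted to CSV before upload. The sketch below uses only the Python standard library and assumes the JSON is an array of flat objects; the function name `json_records_to_csv` and the sample records are hypothetical. Passing an explicit column list gives every output file the same column order, which helps keep the schema identical across files.

```python
import csv
import io
import json


def json_records_to_csv(json_text, columns):
    """Convert a JSON array of flat objects to CSV text with a fixed column order."""
    records = json.loads(json_text)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing keys become empty fields; unexpected keys raise ValueError.
        writer.writerow(record)
    return buffer.getvalue()


sample = (
    '[{"age": 34, "income": 52000, "label": 1},'
    ' {"income": 61000, "age": 51, "label": 0}]'
)
csv_text = json_records_to_csv(sample, ["age", "income", "label"])
print(csv_text)
```

The second sample record lists its keys in a different order, but the output rows still follow the column order passed to the function.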