
Transforming IoT Sensor Data with Spark ML in AWS Glue

Question

You work for a retail athletic footwear company.

Your company has just completed production of a new running shoe with IoT sensors embedded in it.

These sensors enhance the runner's experience by capturing detailed data about foot plant, distance, acceleration, gait, and other data points for personal running performance analysis. You are on the machine learning team tasked with building a model that uses the shoe IoT sensor data to predict shoe life expectancy based on wear and tear.

Instead of just using raw running miles as the predictor of shoe life, your model will use all of the IoT sensor data to produce a much more accurate prediction of the remaining life of the shoes. You are in the process of building your dataset for training your model and running inferences from your model.

You need to clean the IoT sensor data before using it to train your model or to generate inferences from your inference endpoint.

You have decided to use Spark ML jobs within AWS Glue to build your feature transformation code.

Which machine learning packages/engines are the best choices for building your IoT sensor data transformer tasks in the simplest way possible? (Select THREE)

Answers

Explanations


A. MLeap
B. MLlib
C. SparkML Serving Container
D. Batch Transform
E. MLTransform
F. MapReduce

Answers: A, B, C.

Option A is correct.

AWS Glue serializes Spark ML jobs into MLeap containers.

You add these MLeap containers to your inference pipeline.
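As a rough sketch of that serialization step (assuming the mleap-pyspark package and a fitted Spark ML PipelineModel from an earlier step in the Glue job; the helper name and bundle path are illustrative):

```python
# Hypothetical helper: serialize a fitted Spark ML pipeline to an MLeap
# bundle. `model` and `transformed_df` are assumed to come from an earlier
# pipeline.fit(...) / model.transform(...) step in the Glue Spark job.
def serialize_to_mleap(model, transformed_df, bundle_path="/tmp/pipeline.zip"):
    # Importing this module attaches serializeToBundle() to Spark ML models.
    from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401
    model.serializeToBundle("jar:file:" + bundle_path, dataset=transformed_df)
    return bundle_path

print(callable(serialize_to_mleap))
```

The resulting bundle is what gets packaged for the inference pipeline.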

Option B is correct.

Apache Spark MLlib is Spark's machine learning library. It lets you build machine learning pipeline components that transform your data using the full suite of standard transformers, such as the Tokenizer, OneHotEncoder, and Normalizer.

Option C is correct.

The SparkML Serving Container allows you to deploy an Apache Spark ML pipeline in SageMaker.
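A minimal deployment sketch using the SageMaker Python SDK might look like the following (the model data URI, role ARN, schema, and instance type are placeholders; nothing is deployed unless the function is called):

```python
def deploy_sparkml_pipeline(model_data_s3_uri, role_arn, schema_json):
    """Sketch: host an MLeap-serialized Spark ML pipeline behind a
    real-time SageMaker endpoint via the SparkML Serving Container."""
    # Imported lazily so the sketch can be read without the SDK installed.
    from sagemaker.sparkml.model import SparkMLModel

    model = SparkMLModel(
        model_data=model_data_s3_uri,  # tar.gz containing the MLeap bundle
        role=role_arn,
        # The serving container reads the input/output schema from this env var.
        env={"SAGEMAKER_SPARKML_SCHEMA": schema_json},
    )
    return model.deploy(initial_instance_count=1, instance_type="ml.m5.large")

print(callable(deploy_sparkml_pipeline))
```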

Option D is incorrect.

Batch Transform is a feature of SageMaker that allows you to get inferences for an entire dataset.

Batch Transform is not an Apache SparkML feature.
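For context, Batch Transform is invoked through the SageMaker API rather than through Spark ML. The shape of such a request is sketched below as a plain dictionary of the kind passed to boto3's `create_transform_job`; all names, URIs, and sizes are placeholders, and nothing is submitted here:

```python
# Sketch of a SageMaker CreateTransformJob request body
# (boto3: sagemaker_client.create_transform_job(**request)).
# Job/model names and S3 URIs are placeholders.
request = {
    "TransformJobName": "shoe-life-batch-scoring",    # hypothetical
    "ModelName": "shoe-life-inference-pipeline",      # hypothetical
    "TransformInput": {
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://example-bucket/sensor-data/",  # placeholder
        }},
        "ContentType": "text/csv",
        "SplitType": "Line",
    },
    "TransformOutput": {"S3OutputPath": "s3://example-bucket/predictions/"},
    "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
}
print(sorted(request))
```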

Option E is incorrect.

There is no Apache SparkML feature called MLTransform.

Option F is incorrect.

There is no Apache SparkML feature called MapReduce; MapReduce is the Hadoop-era distributed processing paradigm that Spark itself largely supersedes.

Reference:

Please see the following resources:

- the Amazon SageMaker developer guide titled Feature Processing with Spark ML and Scikit-learn
- the MLeap page
- the SageMaker SparkML Serving Container GitHub repo
- the Apache Spark MLlib overview page
- the Apache Spark MLlib docs page titled Extracting, transforming, and selecting features
- the Amazon SageMaker developer guide titled Deploy a Model on Amazon SageMaker Hosting Services
- the Amazon SageMaker developer guide titled Get Inferences for an Entire Dataset with Batch Transform

As an AWS Machine Learning Specialist working for a retail athletic footwear company, you have been tasked with building a machine learning model that uses IoT sensor data to predict the remaining life of the new running shoe. To do this, you need to clean the IoT sensor data before using it for training your model or providing inferences from your inference endpoint. You have decided to use Spark ML jobs within AWS Glue to build your feature transformation code. There are several machine learning packages and engines available for this task, and you need to choose the best ones for the job.

The best choices for building your IoT sensor data transformer tasks in the simplest way possible are:

A. MLeap - AWS Glue serializes fitted Spark ML pipelines into MLeap containers, a lightweight serialization format for Spark ML models. You add these MLeap containers to your SageMaker inference pipeline so that the same feature transformations that prepared your training data also run at inference time.

B. MLlib - This is the machine learning library that is part of the Apache Spark project. It provides the algorithms and tools for data preprocessing and feature engineering, including the standard transformers (Tokenizer, OneHotEncoder, StandardScaler, Normalizer, etc.) you need to clean and transform the IoT sensor data within your Glue Spark jobs.

C. SparkML Serving Container - This is a SageMaker-provided container that hosts an MLeap-serialized Spark ML pipeline, allowing you to deploy your transformation pipeline behind a SageMaker endpoint for real-time inference.

Therefore, the correct answers are A. MLeap, B. MLlib, and C. SparkML Serving Container.

Option D, Batch Transform, is incorrect because it is a SageMaker feature for getting inferences on an entire dataset, not an Apache SparkML package for feature transformation. Option E, MLTransform, is incorrect because there is no Apache SparkML feature by that name. Option F, MapReduce, is incorrect because MapReduce is a distributed processing paradigm (as in Hadoop MapReduce), not an Apache SparkML package.