Optimizing Your Machine Learning Environment and Data Ingestion | AWS Certified Machine Learning - Specialty Exam Preparation

Building a Machine Learning Environment and Data Ingestion Solution using SageMaker Studio

Question

You are a machine learning specialist working for an online retail shopping site.

Your machine learning team is responsible for building a machine learning environment in SageMaker Studio to run models that predict online sales and optimize the product pipeline.

Your team also needs to optimize the solution that ingests data into your data lake, which is the primary source for your machine learning models.

Your ingestion solution will also support analytics (real-time analytics and interactive analytics of historical data), clickstream analysis, and product recommendations.

Which option best meets your team's requirements?

Answers

Explanations

A. B. C. D.

Correct Answer: D.

Option A is incorrect.

You cannot use Athena as a data catalog.

You need to use either the AWS Glue Data Catalog or an Apache Hive metastore as the data catalog.

Option B is incorrect.

The combination of Kinesis Data Streams and Kinesis Data Analytics is better suited to near-real-time analytics than to historical data insights.

Also, this option does not address your near-real-time analytics requirement.

Option C is incorrect.

You cannot use Athena as a data catalog.

You need to use either the AWS Glue Data Catalog or an Apache Hive metastore as the data catalog.

Also, Kinesis Data Firehose alone cannot give you clickstream analysis.

Option D is correct.

AWS Glue is the correct choice for your data catalog: the Glue Data Catalog holds the table metadata for your data lake files.
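
To illustrate how data lake files get registered in the Glue Data Catalog, here is a minimal sketch that builds the arguments for a Glue crawler over an S3 prefix. The crawler name, IAM role ARN, database name, and bucket path are all hypothetical, and the helper functions are not part of any AWS library. The actual API call is kept in a separate function that assumes boto3 is installed and AWS credentials are configured.

```python
import json


def build_crawler_request(name, role_arn, database, s3_path):
    """Build keyword arguments for the Glue CreateCrawler API call."""
    if not s3_path.startswith("s3://"):
        raise ValueError("s3_path must be an s3:// URI")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def register_data_lake(request):
    """Create and start the crawler. Assumes boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs without it

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values, for illustration only.
request = build_crawler_request(
    name="sales-data-lake-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="sales_lake",
    s3_path="s3://example-data-lake/clickstream/",
)
print(json.dumps(request, indent=2))
```

Once a crawler like this has populated the catalog, query engines such as Athena or Spark on EMR can read the table definitions from it instead of maintaining their own.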

Kinesis Data Streams combined with Kinesis Data Analytics satisfies your near-real-time analytics requirement.
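
As a rough picture of what near-real-time analytics on a stream means, the sketch below counts page views in fixed, non-overlapping (tumbling) time windows, the kind of aggregation a Kinesis Data Analytics application can run over records arriving from Kinesis Data Streams. This is a plain-Python illustration of the windowing idea, not the service's SQL syntax. The event fields `ts` (seconds) and `page` are assumptions made for the example.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window start, page) using fixed, non-overlapping windows."""
    counts = Counter()
    for event in events:
        # Round the timestamp down to the start of its window.
        window_start = event["ts"] - (event["ts"] % window_seconds)
        counts[(window_start, event["page"])] += 1
    return dict(counts)


# Hypothetical clickstream events with timestamps in seconds.
events = [
    {"ts": 0, "page": "/home"},
    {"ts": 15, "page": "/home"},
    {"ts": 59, "page": "/cart"},
    {"ts": 60, "page": "/home"},
    {"ts": 125, "page": "/cart"},
]
result = tumbling_window_counts(events)
for key in sorted(result):
    print(key, result[key])
```

Each event falls into exactly one window, so the counts for a window are final as soon as that window closes. That property is what makes this style of aggregation a good fit for dashboards that update every minute.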

Kinesis Data Firehose delivering to Elasticsearch satisfies your clickstream requirement, and EMR running Spark jobs satisfies your recommendation requirement at scale.
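
The clickstream leg of this pipeline could be fed along the following lines. The sketch serializes click events as JSON and groups them into batches of at most 500 records, the per-call limit of the Firehose PutRecordBatch API. The delivery stream name and event fields are hypothetical, the helpers are illustrative rather than part of any AWS library, and the sending function assumes boto3, AWS credentials, and a delivery stream already configured with an Elasticsearch destination.

```python
import json

MAX_RECORDS_PER_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_records(events):
    """Serialize each event as a JSON document in the shape Firehose expects."""
    return [{"Data": json.dumps(event).encode("utf-8")} for event in events]


def chunk(records, size=MAX_RECORDS_PER_BATCH):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def send_clickstream(events, delivery_stream_name):
    """Send events to Firehose and return how many records were rejected."""
    import boto3  # imported lazily so the helpers above run without it

    firehose = boto3.client("firehose")
    failed = 0
    for batch in chunk(to_firehose_records(events)):
        response = firehose.put_record_batch(
            DeliveryStreamName=delivery_stream_name,
            Records=batch,
        )
        failed += response["FailedPutCount"]
    return failed


# Hypothetical click events, for illustration only.
events = [{"page": "/home", "seq": i} for i in range(1200)]
batches = list(chunk(to_firehose_records(events)))
print([len(batch) for batch in batches])  # [500, 500, 200]
```

A production sender would also retry the records reported in `FailedPutCount`; that is left out here to keep the sketch short.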

References:

Please see the Amazon Kinesis Data Analytics for SQL Applications Developer Guide page titled What Is Amazon Kinesis Data Analytics for SQL Applications? (https://docs.aws.amazon.com/kinesisanalytics/latest/dev/what-is.html),

The Amazon Kinesis Data Analytics for SQL Applications Developer Guide page titled Amazon Kinesis Data Analytics for SQL Applications: How It Works (https://docs.aws.amazon.com/kinesisanalytics/latest/dev/how-it-works.html),

The AWS Glue Developer Guide page titled Populating the AWS Glue Data Catalog (https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html),

The AWS Quick Start reference article titled Clickstream Analytics on AWS (https://aws.amazon.com/quickstart/architecture/clickstream-analytics/)

The option that best meets the machine learning team's requirements is D, which combines the AWS Glue Data Catalog as the data catalog for your data lake files, Kinesis Data Streams and Kinesis Data Analytics for near-real-time data insights, Kinesis Data Firehose delivering to Elasticsearch for clickstream analytics, and Spark jobs on EMR for personalized product recommendations.

Here's why:

Data Catalog: The team needs a data catalog to manage metadata and facilitate querying data in the data lake. The Glue Data Catalog provides this by storing the schema of the data files held in S3. Athena is a query service, not a data catalog; it reads table definitions from the Glue Data Catalog so that the team can query historical data interactively using SQL.

Data Ingestion: Kinesis Data Streams is a service for ingesting and processing real-time streaming data. It can stream data from the website, mobile app, or any other source of streaming data. Kinesis Data Analytics can then analyze that data in near-real time, performing aggregations, filtering, and calculations.

Clickstream Analytics: Kinesis Data Firehose can ingest clickstream data from the website and deliver it to Elasticsearch. This allows the team to analyze clickstream data in near-real time using Kibana dashboards. Firehose on its own only delivers the data; the analysis happens at the destination.

Product Recommendations: EMR runs Spark jobs that generate personalized product recommendations at scale from the data in the data lake.

Overall, option D provides a comprehensive solution for the team's requirements, with each component handling a specific part of the data pipeline: the Glue Data Catalog for metadata, Kinesis for ingestion and near-real-time analytics, Elasticsearch for clickstream analysis, and Spark on EMR for recommendations. Option C falls short because Athena cannot serve as the data catalog and Kinesis Data Firehose alone cannot provide clickstream analysis.