
Efficient Data Transformation for Ingesting Real-Time Securities Pricing Data | AWS ML Exam Solution

Question

You work as a machine learning specialist for a financial services organization.

Your machine learning team is responsible for building models that predict index fund tracking errors for the various funds managed by your mutual fund portfolio management department.

You need to ingest data into your data lake for use in your machine learning models.

The required securities pricing data comes from varying sources that deliver the data you need for your model inferences in near real-time.

You need to perform data transformation, such as compression, of the data before writing it to your S3 data lake.

Which option gives you the most efficient solution for ingesting the data into your data lake?

Answers

Explanations


A. Ingest the data using a Kinesis Data Analytics application running Apache Flink, compress the data into the GZIP format, and write it to your S3 data lake using the streaming file sink.

B. Ingest the data using Kinesis Data Streams with Kinesis Producer Library (KPL) and Kinesis Client Library (KCL) applications running on EC2 instances, compress the data, and write it to your S3 data lake.

C. Ingest the data using Kinesis Data Firehose, compress the data into the GZIP format using a Lambda function, and have the Lambda function write the data to your S3 data lake.

D. Ingest the data using Kinesis Data Firehose, compress the data into the GZIP format using a Lambda function, and have Kinesis Data Firehose write the data to your S3 data lake.

Correct Answer: D.

Option A is incorrect.

Kinesis Data Analytics needs to be fed the streaming data by either Kinesis Data Streams or Kinesis Data Firehose.

Kinesis Data Analytics cannot ingest data directly.

Also, Apache Flink can write your data to S3 using its streaming file sink, but that sink writes data in formats such as Avro and Parquet, not GZIP.

Option B is incorrect.

The solution described in this option will technically work.

However, it is much less efficient than using Kinesis Data Firehose to ingest, compress using Lambda, and write your data to S3.

Option C is incorrect.

You can ingest your pricing data using Kinesis Data Firehose and use a Lambda function to compress your data into the GZIP format.

However, you should leverage the Kinesis Data Firehose capability to write your data directly to your S3 bucket.

This is more efficient than writing your own code in your Lambda function to write the data to S3.

Option D is correct.

Ingesting the data using Kinesis Data Firehose, using a Lambda function to compress the data into the GZIP format, and then having Kinesis Data Firehose write the data to S3 is a very common Kinesis Data Firehose pattern and the most efficient ingestion solution of the options presented.
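To make the flow concrete, here is a minimal sketch of how such a delivery stream might be configured with boto3. The stream name, role ARN, bucket ARN, and Lambda ARN are placeholders for illustration, not values taken from the question.

```python
import boto3

firehose = boto3.client("firehose")

# All names and ARNs below are placeholders for illustration only.
firehose.create_delivery_stream(
    DeliveryStreamName="securities-pricing-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-data-lake-bucket",
        "Prefix": "pricing/",
        # Buffer records for up to 60 seconds or 5 MB before delivery.
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 5},
        # The Lambda transform already GZIP-compresses the records,
        # so Firehose applies no additional compression of its own.
        "CompressionFormat": "UNCOMPRESSED",
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [
                {
                    "Type": "Lambda",
                    "Parameters": [
                        {
                            "ParameterName": "LambdaArn",
                            "ParameterValue": (
                                "arn:aws:lambda:us-east-1:123456789012"
                                ":function:gzip-transform"
                            ),
                        }
                    ],
                }
            ],
        },
    },
)
```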

References:

Please see the Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility AWS Whitepaper titled Data Ingestion Methods (https://docs.aws.amazon.com/whitepapers/latest/building-data-lakes/data-ingestion-methods.html),

The Investopedia page titled Tracking Error (https://www.investopedia.com/terms/t/trackingerror.asp#:~:text=Tracking%20error%20is%20the%20difference,and%20its%20corresponding%20risk%20level.),

The Apache Flink developer guide titled Streaming File Sink (https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/streamfile_sink.html),

The Amazon Kinesis Data Streams product page titled Getting started with Amazon Kinesis Data Streams (https://aws.amazon.com/kinesis/data-streams/getting-started/),

The Amazon Kinesis Data Analytics for SQL Applications Developer Guide titled Amazon Kinesis Data Analytics for SQL Applications: How It Works (https://docs.aws.amazon.com/kinesisanalytics/latest/dev/how-it-works.html)

The most efficient solution for ingesting the data into the data lake in this scenario is option D: use Kinesis Data Firehose to ingest the data, a Lambda function to compress it, and Kinesis Data Firehose to write the compressed data to the S3 data lake.

Here is a more detailed explanation of why this option is the best choice:

Kinesis Data Firehose is a fully managed service that makes it easy to prepare and load streaming data into data stores and analytics tools. With Kinesis Data Firehose, you don't need to write any application code or manage any infrastructure. You simply configure your data producers to send data to Kinesis Data Firehose, and the service takes care of the rest.
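As a rough illustration of the producer side, a data source can push each pricing record to a delivery stream with a single API call. The sketch below uses boto3; the stream name and record fields are assumptions made for the example.

```python
import json

import boto3

firehose = boto3.client("firehose")

# Hypothetical delivery stream name used for this example.
STREAM_NAME = "securities-pricing-stream"


def send_price_tick(symbol: str, price: float, timestamp: str) -> None:
    """Send a single securities pricing record to Kinesis Data Firehose."""
    record = {"symbol": symbol, "price": price, "timestamp": timestamp}
    firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )


if __name__ == "__main__":
    send_price_tick("VFINX", 412.37, "2023-01-01T14:30:00Z")
```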

In this case, the required securities pricing data comes from varying sources that deliver the data in near real-time. Kinesis Data Firehose is a good fit for this use case because it can handle high volumes of streaming data and can deliver that data to S3 in near real-time. Additionally, Kinesis Data Firehose can automatically transform the data into the desired format, such as compressing the data using GZIP.

Option D suggests using a Lambda function to compress the data into the GZIP format before it is written to the S3 data lake. This approach is ideal because it offloads the data transformation step to a purpose-built function while Kinesis Data Firehose focuses on delivering the data to S3 as quickly as possible. Kinesis Data Firehose buffers the incoming records and invokes the Lambda function with each batch; the function compresses the data and returns it to Kinesis Data Firehose, which then writes it to S3.
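A minimal sketch of such a transformation function, assuming the standard record format Kinesis Data Firehose uses when it invokes a transformation Lambda (base64-encoded data in, base64-encoded data plus a result status out), might look like this:

```python
import base64
import gzip


def lambda_handler(event, context):
    """Kinesis Data Firehose transformation handler: GZIP-compress each record."""
    output = []
    for record in event["records"]:
        # Firehose hands each record's data to the function base64-encoded.
        payload = base64.b64decode(record["data"])
        compressed = gzip.compress(payload)
        output.append(
            {
                "recordId": record["recordId"],
                "result": "Ok",
                # Transformed data must be returned base64-encoded as well.
                "data": base64.b64encode(compressed).decode("utf-8"),
            }
        )
    return {"records": output}
```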

Option A suggests using a Kinesis Data Analytics application with Apache Flink to compress the data into GZIP format before writing it to S3. This approach could work, but it adds an unnecessary layer of complexity to the data ingestion process. Kinesis Data Analytics is designed for performing complex real-time analytics on streaming data, which is not necessary for this use case. Additionally, using Apache Flink to compress the data requires more setup and management than using a simple Lambda function.

Option B suggests using Kinesis Data Streams with Kinesis Producer Library (KPL) and Kinesis Client Library (KCL) applications to ingest and compress the data. While this approach could work, it is more complex than using Kinesis Data Firehose and requires managing EC2 instances to run the KPL and KCL applications. This adds more overhead and complexity to the data ingestion process.

Option C suggests using Kinesis Data Firehose with a Lambda function to compress the data before writing it to S3. This approach is similar to option D but suggests having the Lambda function write the data to S3 instead of having Kinesis Data Firehose write the data. While this could work, it is less efficient than option D because it requires the Lambda function to handle the S3 write operation in addition to the data transformation. This could slow down the overall data ingestion process.