
Transforming Real-time CSV Data to Parquet for AWS Analytics

Question

You work as a machine learning specialist for a mobile network operator that is building an analytics platform to analyze and optimize its operations using machine learning. Your source systems send data in CSV format in real time.

You need to transform the data to the Parquet format before storing it on S3.

From there, you plan to use the data in SageMaker Autopilot to help you find the best machine learning pipeline for your analytics problem. Which option solves this machine learning data analysis problem in the most efficient manner?

Answers

A. Ingest the CSV data using Amazon MSK running on EC2 instances. Use Kafka Connect S3 to convert the data to the Parquet format and store it on S3.

B. Ingest the CSV data using Kinesis Data Streams. Use AWS Glue to convert the data to the Parquet format and store it on S3.

C. Ingest the CSV data using Spark Structured Streaming in an EMR cluster. Use Spark to convert the data to the Parquet format and store it on S3.

D. Ingest the CSV data using Kinesis Data Streams. Use Kinesis Data Firehose with a Lambda function to transform the data from CSV to JSON, then convert the data to the Parquet format and store it on S3.

Explanations

Answer: D.

Option A is incorrect.

Implementing the solution using MSK on EC2 instances requires more work than the other options.

Option B is incorrect.

Unless you are using Glue streaming ETL, which is not explicitly stated in the question, you should not use Glue on streaming data.

Option C is incorrect.

This option also requires more work (spinning up an EMR cluster) than simply using Kinesis Data Streams, Kinesis Data Firehose, and Lambda (all managed services).

Option D is CORRECT.

You can ingest the streaming CSV data using Kinesis Data Streams, a managed service.
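For illustration only, here is a minimal boto3 producer sketch; the stream name, CSV layout, and partition key are assumptions, not given in the question:

```python
# Hypothetical producer sketch: pushes one CSV record into a Kinesis
# data stream. Stream name and record layout are assumptions.
import boto3

kinesis = boto3.client("kinesis")

def send_csv_record(csv_line: str, device_id: str) -> None:
    """Put a single CSV line onto the stream, keyed by device ID."""
    kinesis.put_record(
        StreamName="network-metrics-stream",  # assumed stream name
        Data=(csv_line + "\n").encode("utf-8"),
        PartitionKey=device_id,
    )

send_csv_record("cell-042,2024-01-01T00:00:00Z,97.5", device_id="cell-042")
```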

Then use Kinesis Data Firehose and Lambda (also both managed services): a Lambda transformation function first converts the records from CSV to JSON (Kinesis Data Firehose record format conversion requires JSON input), and the Firehose Parquet format conversion then converts the data to Parquet before delivering it to S3.
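As a minimal sketch of such a transformation function: the column names are assumed for illustration, while the record contract (base64-encoded data with a recordId and result per record) is the standard Firehose transformation interface.

```python
# Sketch of a Kinesis Data Firehose transformation Lambda that converts
# incoming CSV records to JSON so Firehose's record format conversion
# (which requires JSON input) can then write Parquet.
import base64
import csv
import io
import json

COLUMNS = ["cell_id", "timestamp", "signal_quality"]  # assumed schema

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        csv_line = base64.b64decode(record["data"]).decode("utf-8")
        row = next(csv.reader(io.StringIO(csv_line)))
        payload = json.dumps(dict(zip(COLUMNS, row))) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # tells Firehose the transform succeeded
            "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```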

Reference:

Please see the AWS blog titled Stream Real-Time Data in Apache Parquet or ORC Format Using Amazon Kinesis Data Firehose.

Please refer to the Kinesis Data Firehose developer guide titled Converting Your Input Record Format in Kinesis Data Firehose.

To recap, the most efficient option for solving the mobile network operator's machine learning problem is option D: ingest the CSV data using Kinesis Data Streams, then use Kinesis Data Firehose with a Lambda transformation function to convert the records from CSV to JSON, and Firehose record format conversion to deliver the data to S3 as Parquet.

Option D is the best option because every component is a fully managed service: there are no servers or clusters to provision, patch, or scale. Kinesis Data Streams ingests the CSV data as it arrives, which is important because the source systems send data in real time, and Kinesis Data Firehose handles buffering, transformation, and delivery to S3 automatically.

Parquet is a columnar storage format optimized for big data environments. It is designed to be highly efficient for reading and writing large amounts of data, which makes it a good choice for storing data that will be used in machine learning. Kinesis Data Firehose has built-in support for converting incoming JSON records to Parquet using a schema defined in the AWS Glue Data Catalog, so no custom conversion code is needed beyond the CSV-to-JSON Lambda.
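A hedged sketch of wiring this together with boto3 follows: it creates a Firehose delivery stream that reads from the Kinesis stream, runs the CSV-to-JSON Lambda, and converts records to Parquet on delivery. All ARNs, bucket, and Glue Data Catalog names are placeholders, and the Glue table holding the schema is assumed to already exist.

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="network-metrics-to-parquet",  # assumed name
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:...:stream/network-metrics-stream",
        "RoleARN": "arn:aws:iam::...:role/firehose-read-role",
    },
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::...:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::example-analytics-bucket",
        # Run the CSV-to-JSON Lambda shown earlier on each batch of records.
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "Lambda",
                "Parameters": [{
                    "ParameterName": "LambdaArn",
                    "ParameterValue": "arn:aws:lambda:...:function:csv-to-json",
                }],
            }],
        },
        # Convert the JSON records to Parquet using a Glue-defined schema.
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::...:role/firehose-glue-role",
                "DatabaseName": "analytics_db",
                "TableName": "network_metrics",
            },
        },
    },
)
```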

Option A, ingesting the CSV data using MSK running on EC2 instances and using Kafka Connect S3 to convert the data to the Parquet format, is not the best option because it requires more setup and maintenance than option D. The Kafka Connect workers and the S3 sink connector are additional components that must be deployed and configured, which adds complexity to the solution.

Option B, ingesting the CSV data using Kinesis Data Streams and converting the data to the Parquet format using Glue, is not the best option because standard Glue ETL jobs are batch-oriented. Unless you use Glue streaming ETL, which the question does not state, Glue is not well suited to transforming data as it streams in.

Option C, ingesting the CSV data using Spark Structured Streaming in an EMR cluster and converting the data to the Parquet format using Spark, is not the best option because you must provision, configure, and maintain the EMR cluster yourself. Spark can certainly perform the conversion, as the sketch below shows, but that operational overhead makes it less efficient than the fully managed pipeline of option D.
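For comparison, here is a rough sketch of what option C would involve, assuming the CSV records land as files under an S3 prefix readable by the EMR cluster; paths and schema are placeholders.

```python
# PySpark Structured Streaming job (run on EMR) that reads arriving CSV
# files and writes Parquet. Unlike option D, the cluster itself must be
# provisioned and operated by you.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

schema = StructType([
    StructField("cell_id", StringType()),
    StructField("timestamp", StringType()),
    StructField("signal_quality", DoubleType()),
])

stream = (
    spark.readStream.schema(schema)
    .csv("s3://example-landing-bucket/incoming/")  # assumed input path
)

query = (
    stream.writeStream.format("parquet")
    .option("path", "s3://example-analytics-bucket/parquet/")
    .option("checkpointLocation", "s3://example-analytics-bucket/checkpoints/")
    .start()
)
query.awaitTermination()
```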