
Real-Time Data Transformation and Feature Engineering Jobs for Massive Scale Flight Data Processing

Question

You work as a machine learning specialist for the air traffic control agency of the federal government.

Your machine learning team is responsible for producing the models that process all air traffic in-flight data to produce recommended flight paths for the aircraft currently aloft.

The flight paths need to consider all of the prevailing conditions (weather, other flights in the path, etc.) that may affect an aircraft's flight path. The data that your models need to process is massive in scale and requires large-scale data processing.

How should you build the data transformation and feature engineering processing jobs so that you can process all of the flight data in real-time?

Answers

A. AWS Glue ETL jobs
B. Kinesis Data Firehose
C. Apache Spark Streaming jobs
D. Kinesis Data Analytics (SQL application)

Correct Answer: C.

Explanations

Option A is incorrect.

AWS Glue ETL jobs are built for batch processing, so they will not meet the real-time requirement of this scenario.
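For reference, a Glue ETL job is typically written as a run-to-completion script like the minimal sketch below: it reads a bounded dataset from the Glue Data Catalog, transforms it, writes the result, and exits, which is what makes it a batch tool rather than a real-time one. The database, table, and bucket names here are placeholders, not values from the question.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Minimal Glue batch ETL sketch with hypothetical names: the job reads a
# bounded dataset, transforms it once, writes the output, and terminates.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

flights = glue_context.create_dynamic_frame.from_catalog(
    database="flight_db",           # hypothetical catalog database
    table_name="flight_positions",  # hypothetical catalog table
)

# Example batch transformation: drop records with no altitude reading.
cleaned = flights.filter(lambda rec: rec["altitude_ft"] is not None)

glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/cleaned-flights/"},  # hypothetical bucket
    format="parquet",
)
job.commit()
```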

Option B is incorrect.

Kinesis Data Firehose is a near-real-time delivery service: it buffers incoming data according to its buffer size and buffer interval settings before delivering it.

It will not work in a real-time scenario.
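The buffering behavior is visible in how a delivery stream is configured. The boto3 sketch below is illustrative only; the stream name, IAM role, and bucket ARNs are hypothetical. The point is that Firehose flushes records when a size or interval threshold is reached rather than delivering each record the moment it arrives.

```python
import boto3

firehose = boto3.client("firehose")

# Sketch only: the delivery stream buffers incoming records and flushes them
# when either the size or the interval threshold is hit, so downstream
# consumers see the data with a delay rather than record by record.
firehose.create_delivery_stream(
    DeliveryStreamName="flight-telemetry",  # hypothetical stream name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",  # hypothetical
        "BucketARN": "arn:aws:s3:::example-flight-bucket",                   # hypothetical
        "BufferingHints": {
            "SizeInMBs": 1,           # flush after this much data accumulates
            "IntervalInSeconds": 60,  # or after this many seconds, whichever comes first
        },
    },
)
```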

Option C is correct.

Apache Spark is a distributed analytics engine built for large-scale data processing, and Spark Streaming runs those distributed jobs on continuous data streams.

You can apply data transformations and extract features (feature engineering) on the stream using the Spark framework.
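As an illustration of what such a job could look like, the PySpark Structured Streaming sketch below reads hypothetical flight telemetry from a Kafka topic, derives windowed features per flight, and writes the results to S3. The broker address, topic name, schema, and paths are all assumptions made for the example, and the Kafka connector package must be available on the Spark classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("flight-feature-engineering").getOrCreate()

# Hypothetical telemetry schema; the real payload layout is not given in the question.
schema = StructType([
    StructField("flight_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("altitude_ft", DoubleType()),
    StructField("ground_speed_kts", DoubleType()),
])

# Read the raw stream from a hypothetical Kafka topic.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
       .option("subscribe", "flight-telemetry")           # hypothetical topic
       .load())

telemetry = (raw
             .select(F.from_json(F.col("value").cast("string"), schema).alias("t"))
             .select("t.*"))

# Feature engineering on the fly: 5-minute average speed and max altitude per flight.
features = (telemetry
            .withWatermark("event_time", "10 minutes")
            .groupBy(F.window("event_time", "5 minutes"), "flight_id")
            .agg(F.avg("ground_speed_kts").alias("avg_speed_kts"),
                 F.max("altitude_ft").alias("max_altitude_ft")))

# Continuously write the engineered features to a hypothetical S3 location.
query = (features.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "s3a://example-bucket/flight-features/")
         .option("checkpointLocation", "s3a://example-bucket/checkpoints/flight-features/")
         .start())

query.awaitTermination()
```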

Option D is incorrect.

A Kinesis Data Analytics SQL application can't write its results directly to S3.

Also, Kinesis Data Analytics cannot match the large-scale data processing capabilities of Apache Spark jobs.

References:

Please see the Amazon SageMaker developer guide titled Data Processing with Apache Spark (https://docs.aws.amazon.com/sagemaker/latest/dg/use-spark-processing-container.html),

the Amazon SageMaker Examples GitHub repository notebook titled Distributed Data Processing using Apache Spark and SageMaker Processing (https://github.com/aws/amazon-sagemaker-examples/blob/master/sagemaker_processing/spark_distributed_data_processing/sagemaker-spark-processing.ipynb),

the Amazon Kinesis Data Firehose developer guide titled Configure Settings (https://docs.aws.amazon.com/firehose/latest/dev/create-configure.html),

and the Amazon Kinesis Data Analytics FAQs (https://aws.amazon.com/kinesis/data-analytics/faqs/).

The scenario describes a requirement for real-time processing of massive-scale data to produce flight paths based on various conditions affecting the aircraft. Therefore, the solution should leverage a distributed data processing approach that can handle real-time data streams and perform feature engineering and transformation.

Option A recommends using AWS Glue, a fully managed ETL service that performs distributed data processing and transformation on large-scale datasets and integrates easily with Amazon S3 for storing the transformed data. However, Glue ETL jobs are batch-oriented, so option A does not satisfy the real-time requirement of this scenario.

Option B suggests using Kinesis Data Firehose, a managed service that captures and delivers streaming data. Firehose can transform records in flight (for example, format conversion or record-level processing with a Lambda function) and deliver them to S3, Redshift, or Elasticsearch. However, Firehose buffers data by size and interval before delivery, making it near real-time rather than real-time, and it is not designed for the large-scale feature engineering this scenario requires. Therefore, option B is not an appropriate solution.

Option C recommends using Apache Spark Streaming, an open-source distributed computing framework for processing large-scale data streams. Spark Streaming can process real-time data streams and perform feature engineering operations, such as filtering, transformation, and aggregation, on the fly. It integrates with various data sources, such as Kafka, Flume, and HDFS, and can store the transformed data in S3. Therefore, option C is the most appropriate solution for this scenario.
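The SageMaker Processing integration referenced above gives one way to run such Spark transformation code at scale on AWS-managed infrastructure without standing up a cluster yourself. The following sketch uses the SageMaker Python SDK to launch a distributed Spark processing job; the role ARN, instance settings, script name, and S3 paths are placeholders, not values from the scenario.

```python
from sagemaker.spark.processing import PySparkProcessor

# Hedged sketch: run a Spark transformation script as a distributed
# SageMaker Processing job across several instances.
processor = PySparkProcessor(
    base_job_name="flight-feature-engineering",
    framework_version="3.1",
    role="arn:aws:iam::123456789012:role/SageMakerProcessingRole",  # hypothetical role
    instance_type="ml.m5.xlarge",
    instance_count=4,
)

processor.run(
    submit_app="preprocess_flights.py",  # hypothetical local path to the Spark script
    arguments=[
        "--input", "s3://example-bucket/raw-flights/",       # hypothetical input
        "--output", "s3://example-bucket/flight-features/",  # hypothetical output
    ],
)
```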

Option D suggests using Kinesis Data Analytics, a managed service for processing real-time data streams with SQL queries. It can perform streaming operations such as filtering, aggregation, and transformation, and it integrates with sources such as Kinesis Data Streams and Firehose. However, a Kinesis Data Analytics SQL application cannot write directly to S3, and it cannot scale to the kind of massive distributed processing that Apache Spark handles. Therefore, option D is not an appropriate solution for this scenario.

In conclusion, the scenario demands real-time processing of massive-scale data together with distributed transformation and feature engineering. Apache Spark Streaming (option C) is the only option that meets both requirements: Glue ETL (option A) is batch-oriented, Kinesis Data Firehose (option B) is only near real-time because it buffers data, and Kinesis Data Analytics (option D) cannot write directly to S3 or match Spark's processing scale. Option C is therefore the correct answer.