Orchestrating Big Data Processing with AWS: Apache Hadoop, Cleansing, and S3 Delivery

Using Apache Hadoop for Web Server Log Processing and Delivery to Amazon S3

Question

A company has decided to use AWS for their Big Data processing needs.

Their first assignment has the following requirements:

  1. Use Apache Hadoop to process web server logs.
  2. The logs then need to be cleansed.
  3. Finally, the logs need to be delivered to Amazon S3.

Which of the following can be used to orchestrate this process?

Answers

Explanations


A. AWS Step Functions
B. AWS Data Pipeline
C. Amazon SQS
D. AWS Lambda

Answer - B.

This use case is described in the AWS documentation:

########

ETL Processing Using AWS Data Pipeline and Amazon Elastic MapReduce.

This blog post shows you how to build an ETL workflow that uses AWS Data Pipeline to schedule an Amazon Elastic MapReduce (Amazon EMR) cluster to clean and process web server logs stored in an Amazon Simple Storage Service (Amazon S3) bucket.

AWS Data Pipeline is an ETL service that you can use to automate the movement and transformation of data.

It launches an Amazon EMR cluster for each scheduled interval, submits jobs as steps to the cluster, and terminates the cluster after tasks have completed.

In this post, you'll create the following ETL workflow:

########
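To make the orchestration concrete, the following is a minimal sketch of how such a pipeline could be defined and activated with boto3 (the AWS SDK for Python). The bucket names, IAM roles, schedule, EMR release, and the hadoop-streaming step arguments are hypothetical placeholders for illustration, not the exact definition from the referenced blog post.

import boto3

# Illustrative sketch only: bucket names, roles, instance types, and
# step arguments are hypothetical placeholders.
datapipeline = boto3.client("datapipeline", region_name="us-east-1")

# Register the pipeline; uniqueId makes the call idempotent.
pipeline_id = datapipeline.create_pipeline(
    name="WebLogCleansingPipeline",
    uniqueId="web-log-cleansing-pipeline-001",
)["pipelineId"]

# Pipeline objects: defaults, a daily schedule, an EMR cluster resource,
# and an EmrActivity whose hadoop-streaming step cleanses the raw logs
# and writes the result back to Amazon S3.
pipeline_objects = [
    {"id": "Default", "name": "Default", "fields": [
        {"key": "scheduleType", "stringValue": "cron"},
        {"key": "schedule", "refValue": "DailySchedule"},
        {"key": "failureAndRerunMode", "stringValue": "CASCADE"},
        {"key": "role", "stringValue": "DataPipelineDefaultRole"},
        {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        {"key": "pipelineLogUri", "stringValue": "s3://example-bucket/pipeline-logs/"},
    ]},
    {"id": "DailySchedule", "name": "DailySchedule", "fields": [
        {"key": "type", "stringValue": "Schedule"},
        {"key": "period", "stringValue": "1 Day"},
        {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
    ]},
    {"id": "LogProcessingCluster", "name": "LogProcessingCluster", "fields": [
        {"key": "type", "stringValue": "EmrCluster"},
        {"key": "releaseLabel", "stringValue": "emr-5.36.0"},
        {"key": "masterInstanceType", "stringValue": "m5.xlarge"},
        {"key": "coreInstanceType", "stringValue": "m5.xlarge"},
        {"key": "coreInstanceCount", "stringValue": "2"},
        {"key": "terminateAfter", "stringValue": "2 Hours"},
    ]},
    {"id": "CleanseWebLogs", "name": "CleanseWebLogs", "fields": [
        {"key": "type", "stringValue": "EmrActivity"},
        {"key": "runsOn", "refValue": "LogProcessingCluster"},
        # Comma-separated EMR step: jar followed by its arguments. A
        # hadoop-streaming job runs a cleansing mapper over the raw logs
        # in S3 and writes the cleansed output to another S3 prefix.
        {"key": "step", "stringValue": (
            "command-runner.jar,hadoop-streaming,"
            "-files,s3://example-bucket/scripts/cleanse_logs.py,"
            "-mapper,cleanse_logs.py,"
            "-numReduceTasks,0,"
            "-input,s3://example-bucket/raw-logs/,"
            "-output,s3://example-bucket/cleansed-logs/"
        )},
    ]},
]

datapipeline.put_pipeline_definition(pipelineId=pipeline_id,
                                     pipelineObjects=pipeline_objects)

# Activation starts the schedule: Data Pipeline launches the EMR cluster
# each interval, submits the step, and terminates the cluster when done.
datapipeline.activate_pipeline(pipelineId=pipeline_id)
print("Activated pipeline", pipeline_id)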

Option A is incorrect because AWS Step Functions is used when you have Lambda functions and other serverless components to orchestrate.

Option C is incorrect because Amazon SQS is a queue-based messaging service, not an orchestration service.

Option D is incorrect because AWS Lambda is a serverless compute service, not a workflow orchestration service.

For more information on this use case, please visit the URL below:

https://aws.amazon.com/blogs/big-data/etl-processing-using-aws-data-pipeline-and-amazon-elastic-mapreduce/
[Figure: ETL workflow diagram - Amazon EMR with EMRFS]

Out of the four options given, AWS Data Pipeline is the service that can be used to orchestrate the process of processing web server logs with Apache Hadoop, cleansing the logs, and delivering them to Amazon S3.

Here's why:

  1. AWS Step Functions: This service is used to coordinate microservices in a workflow. It is primarily used for building serverless applications and managing multi-step workflows. While it can be used to manage the flow of data, it is not an ideal solution for managing Big Data processing workflows involving Hadoop and Amazon S3.

  2. AWS Data Pipeline: This service is used to move data between different AWS services and to schedule and automate data processing activities. It is designed to handle large datasets and complex data processing pipelines. In this case, it can be used to orchestrate the flow of data from the web server logs to an Amazon EMR (Hadoop) cluster for processing and cleansing, and finally to Amazon S3 for storage.

  3. Amazon SQS: This is a messaging service used for decoupling the components of a cloud application. It allows components to communicate asynchronously and can be used to ensure that messages are delivered reliably between components. While it can be used as part of a larger data processing workflow, it is not well suited to orchestrating the entire data processing pipeline.

  4. AWS Lambda: This service is used to run code in response to events and can be used to process data in real-time. While it can be used to process web server logs and to cleanse the data, it is not an ideal solution for managing the entire data processing workflow.

In summary, AWS Data Pipeline is the best option for orchestrating the processing of web server logs with Hadoop, the cleansing of the logs, and their delivery to Amazon S3.
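For completeness, the cleansing step itself could be as simple as a hadoop-streaming mapper that drops malformed log lines and normalises the remaining fields before the output is written to Amazon S3. The sketch below is illustrative only: it assumes Apache combined-format logs, and the script name cleanse_logs.py matches the hypothetical step in the earlier pipeline sketch.

#!/usr/bin/env python3
# Illustrative hadoop-streaming mapper: cleanse Apache-style web server
# log lines read from stdin and emit only well-formed, trimmed records.
# The combined log format assumed here is an example, not a requirement.
import re
import sys

# Apache combined log format: IP, identity, user, [timestamp],
# "request", status, bytes, "referer", "user agent".
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)'
)

def main():
    for line in sys.stdin:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # drop malformed lines (the "cleansing")
        ip, timestamp, request, status, size = match.groups()
        size = "0" if size == "-" else size  # normalise missing sizes
        # Tab-separated output lands in the S3 output path of the step.
        print("\t".join([ip, timestamp, request, status, size]))

if __name__ == "__main__":
    main()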