AWS Data Lake and ETL Jobs for On-Premise and Cloud Environments | BDS-C00 Exam Answer

Question

A company has data stores in both its on-premises and AWS environments.

They need to first create a data lake in AWS and then orchestrate several ETL jobs.

Which of the following can be used to fulfill this requirement? Choose 2 answers from the options given below.

Answers

A. Use AWS S3 for storage of the data lake.
B. Use AWS EMR for storage of the data lake.
C. Use a combination of AWS Lambda and Step Functions.
D. Use SQS for the ETL jobs.

Explanations

Answer - A and C.

The AWS Documentation mentions the following.

Extract, transform, and load (ETL) operations collectively form the backbone of any modern enterprise data lake.

They transform raw data into useful datasets and, ultimately, into actionable insight.

An ETL job typically reads data from one or more data sources, applies various transformations to the data, and then writes the results to a target where data is ready for consumption.

The sources and targets of an ETL job could be relational databases in Amazon Relational Database Service (Amazon RDS) or on-premises, a data warehouse such as Amazon Redshift, or object storage such as Amazon Simple Storage Service (Amazon S3) buckets.

Amazon S3 as a target is especially commonplace in the context of building a data lake in AWS.

You can also use AWS Step Functions and AWS Lambda for orchestrating multiple ETL jobs involving a diverse set of technologies in an arbitrarily complex ETL workflow.
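As a rough illustration of that orchestration pattern, the sketch below starts an execution of a Step Functions state machine that coordinates several ETL Lambda functions. The state machine ARN, execution name, and input payload are hypothetical placeholders, not values from the question.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Kick off one run of an ETL workflow. The state machine itself would
# chain Lambda transform steps, retries, and error handling.
response = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:etl-workflow",
    name="nightly-etl-run",
    input=json.dumps({
        "source": "s3://example-raw-bucket/2024/01/15/",
        "target": "s3://example-curated-bucket/",
    }),
)
print(response["executionArn"])
```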

Option B is incorrect since Amazon S3, not EMR, is the appropriate storage option for a data lake; EMR is a data processing service rather than a storage service.

Option D is incorrect since SQS is a messaging service and provides no ETL capability of its own.

For more information on this use case, please refer to the URL below.

https://aws.amazon.com/blogs/big-data/orchestrate-multiple-etl-jobs-using-aws-step-functions-and-aws-lambda/

The correct answers are A and C.

A. Use AWS S3 for storage of the data lake: Amazon S3 (Simple Storage Service) is a highly durable, scalable, and secure object storage service offered by AWS. It is ideal for storing and retrieving large amounts of unstructured data such as photos, videos, log files, and backups. S3 can be used as a central repository to store data from various sources in a data lake architecture. It also provides versioning, lifecycle policies, and encryption features, making it a suitable storage solution for a data lake. Moreover, S3 integrates with other AWS services like AWS Glue, AWS Lambda, and Amazon EMR, which can be used for data processing, analytics, and machine learning.
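As a minimal sketch of the S3 features mentioned above (versioning, lifecycle policies, and encryption), the following boto3 calls configure a hypothetical data lake bucket; the bucket name, prefix, and transition rule are placeholder assumptions.

```python
import boto3

s3 = boto3.client("s3")
bucket = "example-data-lake-bucket"  # hypothetical bucket name

# Enable versioning to protect against accidental overwrites and deletes.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Apply default server-side encryption (SSE-S3) to all new objects.
s3.put_bucket_encryption(
    Bucket=bucket,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)

# Lifecycle rule: move raw data to Infrequent Access after 30 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "raw-to-ia",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            }
        ]
    },
)
```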

C. Use a combination of AWS Lambda and Step Functions: AWS Lambda is a serverless compute service that allows you to run code without provisioning or managing servers. It can be used to perform data transformations, aggregations, and filtering tasks in real-time or batch mode. AWS Step Functions is a serverless workflow service that enables you to coordinate multiple AWS services and Lambda functions into a serverless workflow. It provides visualization, monitoring, and error handling capabilities to orchestrate complex ETL workflows. By combining Lambda and Step Functions, you can create a flexible and scalable ETL pipeline that can handle various data sources and transformations.
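To make the Lambda side concrete, here is a minimal sketch of a transform step that a Step Functions state machine might invoke. The event fields (source_bucket, source_key, target_bucket) and the filtering rule are assumptions for illustration only.

```python
import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # The state machine passes source/target locations in the event payload.
    obj = s3.get_object(Bucket=event["source_bucket"], Key=event["source_key"])
    records = [json.loads(line) for line in obj["Body"].iter_lines()]

    # Example transformation: drop records without a customer_id.
    cleaned = [r for r in records if r.get("customer_id") is not None]

    s3.put_object(
        Bucket=event["target_bucket"],
        Key=event["source_key"],
        Body="\n".join(json.dumps(r) for r in cleaned).encode("utf-8"),
    )
    # The return value flows to the next state in the workflow.
    return {"records_in": len(records), "records_out": len(cleaned)}
```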

B. Use AWS EMR for storage of the data lake: Amazon EMR (Elastic MapReduce) is a managed Hadoop framework that simplifies big data processing by automating cluster provisioning, scaling, and monitoring. It provides a set of pre-configured big data tools such as Hive, Pig, Spark, and HBase that can be used for data processing and analysis. However, EMR is not a storage service, but rather a data processing service. It can read data from S3 or HDFS (Hadoop Distributed File System) and write the results back to S3 or HDFS. Therefore, using EMR for storage of the data lake is not a suitable option.
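The read-from-S3, write-back-to-S3 role of EMR can be illustrated with a sketch that submits a Spark step to an existing cluster via boto3; the cluster ID, script location, and bucket paths are hypothetical.

```python
import boto3

emr = boto3.client("emr")

# EMR processes data but does not store it: the step below reads raw data
# from S3 and writes the transformed output back to S3.
emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTER",  # hypothetical cluster ID
    Steps=[
        {
            "Name": "transform-raw-events",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://example-scripts/transform.py",
                    "--input", "s3://example-data-lake/raw/",
                    "--output", "s3://example-data-lake/curated/",
                ],
            },
        }
    ],
)
```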

D. Use SQS for the ETL jobs: Amazon SQS (Simple Queue Service) is a fully managed message queuing service that enables you to decouple and scale distributed systems and microservices. It can be used to store messages (or tasks) that need to be processed asynchronously by multiple workers. However, using SQS for ETL jobs is not a recommended approach, as it does not provide any built-in ETL functionality. SQS can be used to trigger ETL jobs, but the actual data processing and transformation logic must be implemented using other services like Lambda or EMR.
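To illustrate that distinction, the sketch below uses SQS only to queue and hand off ETL tasks; the actual transformation would run in a separate consumer such as Lambda or EMR. The queue URL and message shape are hypothetical.

```python
import json
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/etl-tasks"  # placeholder

# Producer: enqueue a pointer to the data that needs processing.
sqs.send_message(
    QueueUrl=queue_url,
    MessageBody=json.dumps({"bucket": "example-data-lake", "key": "raw/events.json"}),
)

# Consumer: receive the task and hand it off; SQS itself does no transformation.
messages = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
for msg in messages.get("Messages", []):
    task = json.loads(msg["Body"])
    # ... invoke Lambda/EMR here to perform the actual ETL work ...
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```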