AWS Glue Components and Functionalities


Question

MSP Bank, Limited is a leading Japanese monetary institution that provides a full range of financial products and services to both institutional and individual customers.

It is headquartered in Tokyo. MSP Bank hosts its existing infrastructure in an on-premises data center and on AWS, maintaining a hybrid environment. Multiple web applications, CRM, and ERP run on premises, while storage, compute, the data warehouse (DWH), and AI workloads run on AWS.

MSP Bank is also launching new applications on AWS.

MSP Bank hosts separate Development, Testing, and Production VPCs to keep its environments apart, and maintains VPN connectivity between the on-premises data center and AWS. It plans to build a data lake from the log files stored in S3, which are captured from applications running both on premises and on AWS, together with selected data sets from CRM, ERP, and other business applications.

MSP Bank is evaluating AWS Glue as a fully managed ETL service that makes it simple and cost-effective to categorize data, clean it, enrich it, and move it reliably between various data stores.

What are the key components and functionalities of AWS Glue? Select 3 options.

Answers

Explanations


A. B. C. D. E. F.

Answer: A, B, C.

Option A is correct. AWS Glue lets you set up crawlers that can scan data in many kinds of repositories, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog.

From there it can be used to guide ETL operations.

https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
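As a rough illustration of the crawler setup described for Option A, the sketch below assembles the arguments for the Glue `CreateCrawler` API as a plain dictionary. The crawler name, IAM role ARN, database name, and S3 path are hypothetical placeholders, not values from the scenario, and the schema-change policy is one possible choice rather than a recommendation.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Assemble keyword arguments for the Glue CreateCrawler API call."""
    request = {
        "Name": name,
        "Role": role_arn,
        # Data Catalog database that receives the tables the crawler creates.
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # On a schema change, update the catalog table; when source data
        # disappears, mark the table as deprecated instead of deleting it.
        "SchemaChangePolicy": {
            "UpdateBehavior": "UPDATE_IN_DATABASE",
            "DeleteBehavior": "DEPRECATE_IN_DATABASE",
        },
    }
    if schedule is not None:
        request["Schedule"] = schedule  # Glue cron syntax; omit to run on demand
    return request


# All names below are hypothetical placeholders.
request = build_crawler_request(
    name="app-logs-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="msp_data_lake",
    s3_path="s3://example-msp-logs/app-logs/",
    schedule="cron(0 2 * * ? *)",  # every day at 02:00 UTC
)

# With boto3 installed and AWS credentials configured, this could be sent as:
#   import boto3
#   boto3.client("glue").create_crawler(**request)
```

Keeping the request as a dictionary makes it easy to inspect or review before anything is created in an AWS account.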

Option B is correct. Using the metadata in the Data Catalog, AWS Glue can autogenerate Scala or PySpark (the Python API for Apache Spark) scripts with AWS Glue extensions that you can use and modify to perform various ETL operations.

For example, you can extract, clean, and transform raw data, and then store the result in a different repository, where it can be queried and analyzed.

https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
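Generated PySpark scripts typically include a mapping step in which each entry is a tuple of (source column, source type, target column, target type). Because the `awsglue` library is only available inside the Glue runtime, the sketch below imitates that rename-and-cast step on plain Python dictionaries. It is a local stand-in for illustration, not Glue's implementation, and the field names and sample records are made up.

```python
# Casts for the few type names used in this sketch (not Glue's full type system).
CASTS = {"string": str, "int": int, "double": float}


def apply_mapping(records, mappings):
    """Rename and cast fields; in this sketch, unmapped fields are dropped."""
    output = []
    for record in records:
        row = {}
        for source, _source_type, target, target_type in mappings:
            value = record.get(source)
            # Skip missing or empty values instead of failing the cast.
            if value not in (None, ""):
                row[target] = CASTS[target_type](value)
        output.append(row)
    return output


# Hypothetical raw log records, with every field read as a string.
raw_logs = [
    {"ts": "2024-01-05T10:00:00Z", "status": "200", "latency_ms": "12.5", "debug": "x"},
    {"ts": "2024-01-05T10:00:01Z", "status": "500", "latency_ms": "", "debug": "y"},
]

# (source column, source type, target column, target type)
mappings = [
    ("ts", "string", "event_time", "string"),
    ("status", "string", "http_status", "int"),
    ("latency_ms", "string", "latency_ms", "double"),
]

cleaned = apply_mapping(raw_logs, mappings)
```

In a real job the cleaned records would then be written to a target store such as another S3 location or a data warehouse table.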

Option C is correct. The AWS Glue Data Catalog is your persistent metadata store.

It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore.

https://docs.aws.amazon.com/glue/latest/dg/components-overview.html
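To make the Hive metastore comparison more concrete, here is a minimal sketch of a table definition in the shape accepted by the Glue `CreateTable` API (`TableInput`). The table name, columns, partition keys, and S3 location are hypothetical, and the JSON SerDe is just one plausible choice for log data.

```python
def build_table_input(name, location, columns, partition_keys=()):
    """Build a Hive-style table definition for the Glue CreateTable API."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",  # data stays in S3; only metadata is stored
        "Parameters": {"classification": "json"},
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Location": location,
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.openx.data.jsonserde.JsonSerDe"
            },
        },
    }


# Hypothetical table describing application logs partitioned by date.
table_input = build_table_input(
    name="app_logs",
    location="s3://example-msp-logs/app-logs/",
    columns=[("event_time", "string"), ("http_status", "int")],
    partition_keys=[("year", "string"), ("month", "string"), ("day", "string")],
)

# With boto3 installed and AWS credentials configured, this could be sent as:
#   import boto3
#   boto3.client("glue").create_table(
#       DatabaseName="msp_data_lake", TableInput=table_input
#   )
```

A crawler would normally produce an equivalent definition automatically; writing one by hand mainly helps show what kind of metadata the catalog holds.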

Option D is incorrect. The documented behavior is that AWS Glue lets you set up crawlers that can scan data in many kinds of repositories, classify it, extract schema information from it, and store the metadata automatically in the AWS Glue Data Catalog.

From there it can be used to guide ETL operations.

https://docs.aws.amazon.com/glue/latest/dg/components-overview.html

Option E is incorrect. The documented behavior is that the AWS Glue Data Catalog is your persistent metadata store.

It is a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore.

https://docs.aws.amazon.com/glue/latest/dg/components-overview.html

Option F is incorrect. The documented behavior is that, using the metadata in the Data Catalog, AWS Glue can autogenerate Scala or PySpark (the Python API for Apache Spark) scripts with AWS Glue extensions that you can use and modify to perform various ETL operations.

For example, you can extract, clean, and transform raw data, and then store the result in a different repository, where it can be queried and analyzed.

https://docs.aws.amazon.com/glue/latest/dg/components-overview.html

AWS Glue is a fully managed extract, transform, and load (ETL) service that provides an easy and cost-effective way to categorize, clean, enrich, and move data between various data stores. It enables the creation of data pipelines that ingest data from a wide range of sources, including on-premises and cloud-based data sources. AWS Glue consists of several key components and functionalities:

  1. Crawlers: Crawlers scan data in many kinds of repositories, classify it, extract schema information from it, and store the resulting metadata automatically in the AWS Glue Data Catalog. Crawlers can run on a schedule or on demand, and on each run they can detect changes in the underlying data sources and update the metadata accordingly.

  2. AWS Glue Data Catalog: The AWS Glue Data Catalog is a persistent metadata store that lets users store, annotate, and share metadata in the AWS Cloud. It holds metadata such as table definitions, partition information, and schema versions. Custom classifiers can also be defined so that crawlers recognize additional data formats when populating the catalog.

  3. AWS Glue Jobs: A job holds the ETL logic: a script, either autogenerated by AWS Glue or written by hand, together with its data sources and targets. Jobs can be created using the AWS Glue console or the AWS SDK. Spark ETL jobs run in a fully managed Apache Spark environment, and jobs can be scheduled to run on a recurring basis or started on demand. AWS Glue Jobs can be used to transform data in a variety of ways, including filtering, aggregating, joining, and pivoting.
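The job and scheduling behavior in the list above can be sketched as two API requests: one in the shape of the Glue `CreateJob` call and one in the shape of `CreateTrigger`. The job name, role ARN, script location, Glue version, and worker settings are assumptions chosen for illustration only.

```python
def build_job_request(name, role_arn, script_location, workers=2):
    """Assemble keyword arguments for the Glue CreateJob API call."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",  # assumed version for this sketch
        "WorkerType": "G.1X",
        "NumberOfWorkers": workers,
    }


def build_schedule_trigger(name, job_name, schedule):
    """Assemble keyword arguments for a scheduled Glue CreateTrigger call."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": schedule,  # Glue cron syntax
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


# All names below are hypothetical placeholders.
job_request = build_job_request(
    name="clean-app-logs",
    role_arn="arn:aws:iam::123456789012:role/GlueJobRole",
    script_location="s3://example-msp-scripts/clean_app_logs.py",
)
trigger_request = build_schedule_trigger(
    name="clean-app-logs-nightly",
    job_name=job_request["Name"],
    schedule="cron(0 3 * * ? *)",  # every day at 03:00 UTC
)

# With boto3 installed and AWS credentials configured, these could be sent as:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_job(**job_request)
#   glue.create_trigger(**trigger_request)
```

For a manual run, the trigger could be skipped and the job started directly once it exists.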

In addition to these key components and functionalities, AWS Glue also provides a range of other features, such as job monitoring and debugging, support for custom transformations and user-defined functions, and integration with a variety of AWS services such as Amazon S3, Amazon RDS, and Amazon Redshift. Overall, AWS Glue provides a flexible and scalable ETL solution that can be used to handle a wide range of data processing requirements.