Importing Oracle Data into Amazon EMR Cluster | BDS-C00 Exam Answer


Question

A company has an on-premises Oracle data store.

They need to import the data into an Amazon EMR cluster that uses HDFS.

Which of the following can be used to fulfill this requirement?

Answer Options

A. Apache Hive

B. Apache Sqoop

C. Hue

D. Jupyter Notebook

Explanation

Answer - B.

Excerpt from the AWS Big Data Blog post linked below:

Migrate RDBMS or On-Premise data to EMR Hive, S3, and Amazon Redshift using EMR - Sqoop.

This blog post shows how our customers can benefit by using the Apache Sqoop tool.

This tool is designed to transfer and import data from a Relational Database Management System (RDBMS) into the Hadoop Distributed File System (HDFS) on Amazon EMR, transform the data in Hadoop, and then export it to a data warehouse (e.g., Hive or Amazon Redshift).

To demonstrate the Sqoop tool, this post uses Amazon RDS for MySQL as a source and imports data in the following three scenarios:

Scenario 1 - Amazon EMR (HDFS -> Hive and HDFS)

Scenario 2 - Amazon S3 (EMRFS), and then to EMR-Hive

Scenario 3 - Amazon S3 (EMRFS), and then to Amazon Redshift
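To make scenario 2 concrete, a single Sqoop invocation run on the EMR master node can write an imported table directly to Amazon S3 through EMRFS. This is an illustrative sketch only, not a command from the blog post; the MySQL endpoint, credentials file, table, and bucket names are hypothetical placeholders:

  # Hypothetical example: import a MySQL table straight to S3 via EMRFS
  sqoop import \
    --connect jdbc:mysql://source-db-host:3306/salesdb \
    --username dbuser \
    --password-file hdfs:///user/hadoop/.db_pw \
    --table ORDERS \
    --target-dir s3://example-bucket/sqoop/orders \
    --num-mappers 4

Because EMRFS exposes S3 as a Hadoop file system, pointing --target-dir at an s3:// path is all that changes relative to an HDFS import; a Hive external table can then be created over that S3 location.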


Option A (Apache Hive) is incorrect since it is an open-source data warehouse and analytics package that runs on top of a Hadoop cluster.

Option C (Hue) is incorrect since it is an open-source, web-based graphical user interface for use with Amazon EMR and Apache Hadoop.

Option D (Jupyter Notebook) is incorrect since it is an open-source web application for creating and sharing documents that contain live code, equations, visualizations, and narrative text.

For more information on a use case that uses Apache Sqoop, please refer to the URL below.

https://aws.amazon.com/blogs/big-data/migrate-rdbms-or-on-premise-data-to-emr-hive-s3-and-amazon-redshift-using-emr-sqoop/

The correct answer is B. Apache Sqoop.

Explanation: Apache Sqoop is a tool designed to transfer bulk data between Apache Hadoop and structured data stores such as relational databases. Sqoop uses connectors to communicate with data sources and supports connectors for several databases, including Oracle.

In this scenario, the company needs to import data from an on-premises Oracle database into an Amazon EMR cluster, which uses HDFS. Apache Sqoop can transfer data from the Oracle database into HDFS on the EMR cluster.
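As a quick illustration of the connector model, Sqoop's list-tables subcommand can verify that the JDBC connection to Oracle works before any import is attempted. The host, service name, and credentials below are hypothetical placeholders, and the Oracle JDBC driver (ojdbc JAR) is assumed to be on Sqoop's classpath:

  # Hypothetical connectivity check against the on-premises Oracle database
  sqoop list-tables \
    --connect jdbc:oracle:thin:@//onprem-oracle-host:1521/ORCL \
    --username SCOTT \
    --password-file hdfs:///user/hadoop/.oracle_pw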

Here's how it works:

  1. Install Apache Sqoop on the Amazon EMR Cluster.

  2. Use Sqoop to create a connection to the Oracle database using the Oracle connector.

  3. Use Sqoop to import data from the Oracle database into HDFS in the EMR cluster (see the sketch after this list).

  4. The data is now available for processing using tools like Apache Hive, Apache Pig, or Spark.
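Below is a minimal sketch of steps 2-4, assuming Sqoop is installed on the cluster (on recent EMR release versions it can simply be selected as an application at cluster launch) and the Oracle JDBC driver (ojdbc JAR) has been copied into Sqoop's lib directory. The host name, service name, schema, table, and paths are hypothetical placeholders:

  # Steps 2-3: connect with the Oracle connector and import a table into HDFS
  sqoop import \
    --connect jdbc:oracle:thin:@//onprem-oracle-host:1521/ORCL \
    --username SCOTT \
    --password-file hdfs:///user/hadoop/.oracle_pw \
    --table SCOTT.EMPLOYEES \
    --target-dir /user/hadoop/oracle/employees \
    --num-mappers 4

  # Step 4 variant: land the same table directly in a Hive table instead
  sqoop import \
    --connect jdbc:oracle:thin:@//onprem-oracle-host:1521/ORCL \
    --username SCOTT \
    --password-file hdfs:///user/hadoop/.oracle_pw \
    --table SCOTT.EMPLOYEES \
    --hive-import \
    --hive-table default.employees

--num-mappers controls how many parallel map tasks perform the transfer; for tables without a usable primary key, --split-by <column> is also needed so Sqoop can partition the work.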

Let's look at the other answer options and see why they are not correct for this scenario:

A. Apache Hive is a data warehouse infrastructure that provides data summarization, query, and analysis. It is not designed for bulk data transfer from relational databases to HDFS.

C. Hue is a web-based graphical user interface for interacting with Hadoop. It does not provide functionality for importing data from relational databases.

D. Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. While it can be used for data analysis, it is not designed for bulk data transfer from relational databases to HDFS.

Therefore, the correct answer is B. Apache Sqoop.