AWS Certified Big Data - Specialty: Installing Hive on EMR Cluster and Persisting Hive Metastore

How to Install Hive on an EMR Cluster and Persist Hive Metastore

Question

A team is building an EMR Cluster and also wants to install Hive as an application on the EMR cluster.

They need the Hive metastore to persist even after the EMR cluster is terminated.

Which of the following can help fulfil this requirement? Choose 2 answers from the options given below.

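Answers

A. Create a MySQL RDS database to store the metastore records.

B. Create a DynamoDB table to store the metastore records.

C. Modify the JDBC configuration in the hive-site.xml file in the configuration for the cluster.

D. Modify the JDBC configuration in the Hue configuration setup.

Explanations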

Answer - A and C.

The AWS Documentation mentions the following.

By default, Hive records metastore information in a MySQL database on the master node's file system.

The metastore contains a description of the table and the underlying data on which it is built, including the partition names, data types, and so on.

When a cluster terminates, all cluster nodes shut down, including the master node.

When this happens, local data is lost because node file systems use ephemeral storage.

If you need the metastore to persist, you must create an external metastore that exists outside the cluster.

You have two options for an external metastore:

AWS Glue Data Catalog (Amazon EMR version 5.8.0 or later only).

Amazon RDS or Amazon Aurora.

For more information on the Hive metastore, refer to the following URL.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-metastore-external-hive.html
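For the first of the two external metastore options quoted above, the AWS Glue Data Catalog, the change is a single hive-site property supplied as an EMR configuration classification. The sketch below only builds that configuration as JSON. The function name is hypothetical, and the property and factory class names are the ones given in the Amazon EMR documentation, so check them against the release in use.

```python
import json

# Factory class named in the Amazon EMR documentation for using the
# AWS Glue Data Catalog as the Hive metastore (EMR 5.8.0 or later).
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)


def glue_metastore_configuration():
    """Return an EMR configuration list that points Hive at the Glue Data Catalog.

    The function name is illustrative; only the classification and
    property names come from the AWS documentation.
    """
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "hive.metastore.client.factory.class": GLUE_FACTORY,
            },
        }
    ]


if __name__ == "__main__":
    # Print the JSON so it can be saved to a file and supplied at cluster creation.
    print(json.dumps(glue_metastore_configuration(), indent=2))
```

The printed JSON can be saved to a file and passed when the cluster is created, for example through the --configurations option of aws emr create-cluster.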

The Hive metastore stores the metadata about the tables and databases created in Hive. When a Hive query is executed, Hive consults the metastore to understand the table's schema and other details. Stock Apache Hive defaults to an embedded Derby database, but on Amazon EMR the default is the MySQL database on the master node described above. In both cases the metastore lives on the cluster itself and is lost when the cluster terminates.

To persist the Hive metastore even after the EMR cluster is terminated, we need to store it in an external database or catalog service. The four options evaluate as follows:

A. Create a MySQL RDS database to store the metastore records: Amazon Relational Database Service (RDS) is a managed database service that provides an easy way to set up, operate, and scale a relational database in the cloud. We can create a MySQL RDS database and configure Hive to use this as the metastore. This will ensure that the metastore records are persisted even if the EMR cluster is terminated.

B. Create a DynamoDB table to store the metastore records: Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. However, Hive's metastore is backed by a relational database accessed through JDBC, and the AWS documentation lists only the AWS Glue Data Catalog and Amazon RDS or Amazon Aurora as external metastore options. A DynamoDB table cannot serve as the Hive metastore, so this option is incorrect.

C. Modify the JDBC configuration in the hive-site.xml file in the configuration for the cluster: We can modify the hive-site.xml file in the configuration for the cluster to specify an external database as the metastore. We need to update the JDBC connection string, username, and password in the hive-site.xml file to point to the external database. This will ensure that the metastore records are persisted even if the EMR cluster is terminated.

D. Modify the JDBC configuration in the Hue configuration setup: Hue is a web-based user interface for Hadoop that provides a graphical interface to interact with the Hadoop ecosystem. Its configuration controls how Hue itself connects to services and where Hue stores its own data. It does not determine where Hive keeps its metastore, so changing it does not persist the metastore and this option is incorrect.

Therefore, the correct answers are A and C: create a MySQL RDS database to hold the Hive metastore records, and point Hive at that database through the JDBC settings in the hive-site.xml configuration for the cluster.
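The hive-site.xml change described in option C can be supplied as an EMR configuration classification when the cluster is created. The following is a minimal sketch that builds such a configuration. The function name, host name, user, and password are hypothetical placeholders. The javax.jdo.option.* property names are standard Hive metastore settings, and the driver class follows the example in the AWS documentation, so verify it against the EMR release in use.

```python
import json


def external_metastore_configuration(host, user, password,
                                     database="hive", port=3306):
    """Build a hive-site classification pointing Hive at an external MySQL database.

    All argument values are supplied by the caller; nothing here contacts AWS.
    """
    # JDBC connection string for the external (for example, RDS MySQL) database.
    jdbc_url = (
        f"jdbc:mysql://{host}:{port}/{database}"
        "?createDatabaseIfNotExist=true"
    )
    return [
        {
            "Classification": "hive-site",
            "Properties": {
                "javax.jdo.option.ConnectionURL": jdbc_url,
                # Driver class as shown in the AWS documentation example (assumption:
                # confirm it matches the driver shipped with your EMR release).
                "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
                "javax.jdo.option.ConnectionUserName": user,
                "javax.jdo.option.ConnectionPassword": password,
            },
        }
    ]


if __name__ == "__main__":
    # Hypothetical endpoint and credentials, for illustration only.
    config = external_metastore_configuration(
        host="hive-metastore.example.us-east-1.rds.amazonaws.com",
        user="hive_user",
        password="replace-me",
    )
    print(json.dumps(config, indent=2))
```

Because the database sits outside the cluster, a new cluster launched with the same configuration would reconnect to the existing metastore. In practice the password would come from a secrets store rather than being written into the file.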