Enable Amazon S3 Consistency Checks

Ensure Cluster Uses Most Recent Data

Question

A company is planning on creating an EMR cluster for their Big Data needs.

To make use of the maximum available space, they want to use Amazon S3 as the underlying data store for the cluster.

Which of the following should also be enabled to ensure the cluster always works with the most recently updated data in S3?

Answers

Explanations

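A. Use Hive along with the EMR cluster
B. Enable consistent view
C. Enable Hue on the cluster
D. Use HDFS as the storage layer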

Answer - B.

The AWS documentation states the following:

EMRFS consistent view is an optional feature available when using Amazon EMR release version 3.2.1 or later.

Consistent view allows EMR clusters to check for list and read-after-write consistency for Amazon S3 objects written by or synced with EMRFS.

Consistent view addresses an issue that can arise due to the Amazon S3 Data Consistency Model.

For example, if you add objects to Amazon S3 in one operation and then immediately list objects in a subsequent operation, the list and the set of objects processed may be incomplete.

This is more commonly a problem for clusters that run quick, sequential steps using Amazon S3 as a data store, such as multi-step extract-transform-load (ETL) data processing pipelines.
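The check-and-retry idea behind this can be sketched in a few lines of Python. This is a minimal illustration, not EMRFS code: `list_with_consistency_check`, `LaggingStore`, and their parameters are hypothetical names, and the in-memory store only simulates a listing that lags behind writes.

```python
import time
from typing import Callable, Iterable, Set


def list_with_consistency_check(
    list_keys: Callable[[], Iterable[str]],
    expected_keys: Set[str],
    retry_count: int = 5,
    retry_period_seconds: float = 0.0,
) -> Set[str]:
    """Return the listing once it contains every expected key.

    expected_keys stands in for a record of what was written;
    list_keys stands in for a list call against the object store.
    Both are hypothetical stand-ins for illustration.
    """
    missing = set(expected_keys)
    for attempt in range(retry_count + 1):
        listed = set(list_keys())
        missing = expected_keys - listed
        if not missing:
            return listed
        if attempt < retry_count:
            # Wait before asking the store again.
            time.sleep(retry_period_seconds)
    raise RuntimeError(f"listing still missing keys after retries: {sorted(missing)}")


class LaggingStore:
    """Simulated store whose newest key is hidden for the first few list calls."""

    def __init__(self, keys, lag_calls):
        self._keys = list(keys)
        self._lag_calls = lag_calls
        self.calls = 0

    def list_keys(self):
        self.calls += 1
        if self.calls <= self._lag_calls:
            return self._keys[:-1]  # newest key not yet visible
        return list(self._keys)


if __name__ == "__main__":
    store = LaggingStore(["part-0", "part-1", "part-2"], lag_calls=2)
    keys = list_with_consistency_check(
        store.list_keys, {"part-0", "part-1", "part-2"}
    )
    print(sorted(keys), "after", store.calls, "list calls")
```

Without the check, a step that listed the store on its first call would process only two of the three objects. With it, the step either sees the full set or fails loudly once the retries are exhausted.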

Option A is incorrect since Hive is an open-source data warehouse and analytics package that runs on top of a Hadoop cluster. It does not affect S3 consistency.

Option C is incorrect since Hue is a web interface for the EMR cluster.

Option D is incorrect since the question clearly states that the underlying storage layer should be S3.

For more information on EMRFS consistent view, see the following URL:

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html

To ensure that an EMR cluster always works with the most recently updated data in S3, we need to enable consistent view. Therefore, the correct answer is B.

When an EMR cluster is launched, it creates a Hadoop Distributed File System (HDFS) instance, which is a distributed file system that stores data across multiple nodes in a cluster. However, HDFS capacity is bounded by the storage attached to the cluster's instances, and data kept only in HDFS is lost when the cluster terminates.

In contrast, Amazon S3 is a highly scalable and durable object store that can be used as the underlying data store for an EMR cluster. By using S3, we can avoid the limitations of HDFS and store vast amounts of data in a cost-effective manner. However, there is a potential issue with using S3 as the underlying data store for an EMR cluster: consistency.

Historically, Amazon S3 was eventually consistent for some operations (it has provided strong read-after-write consistency since December 2020). This meant there could be a delay between the time an object was written and the time it became visible to listings and reads. In the context of an EMR cluster, a step could therefore list or read S3 before the output of a previous step was fully visible, and so work with stale or incomplete data.

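To address this issue, we can enable consistent view. EMRFS then records metadata about the objects it writes or syncs in an Amazon DynamoDB table and, when it lists or reads those objects from S3, compares the result against that metadata. If S3 has not yet caught up, EMRFS retries the operation according to a configurable retry policy instead of silently proceeding with an incomplete listing. Objects written to S3 outside EMRFS are not tracked unless they are synced into the metadata.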
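As a rough sketch of what enabling the feature can look like in a cluster configuration, the Python below builds an `emrfs-site` classification of the kind EMR accepts as a configuration object. The property names are given as best recalled from the EMR documentation and should be verified there before use. The helper name `emrfs_consistent_view_config` and the retry and table-name values are illustrative assumptions, not recommended settings.

```python
import json


def emrfs_consistent_view_config(retry_count, retry_period_seconds, table_name):
    """Build an emrfs-site configuration that turns on consistent view.

    Hypothetical helper for illustration; all property values are strings,
    as is usual for EMR configuration classifications.
    """
    return [
        {
            "Classification": "emrfs-site",
            "Properties": {
                # Turn on EMRFS consistent view.
                "fs.s3.consistent": "true",
                # How many times to retry when S3 and the metadata disagree.
                "fs.s3.consistent.retryCount": str(retry_count),
                # Seconds to wait between retries.
                "fs.s3.consistent.retryPeriodSeconds": str(retry_period_seconds),
                # DynamoDB table that holds the EMRFS metadata.
                "fs.s3.consistent.metadata.tableName": table_name,
            },
        }
    ]


if __name__ == "__main__":
    # Illustrative values only.
    config = emrfs_consistent_view_config(5, 10, "EmrFSMetadata")
    print(json.dumps(config, indent=2))
```

The resulting JSON could be supplied when the cluster is created, for example as a configurations file. The DynamoDB table is billed separately, so its capacity is worth sizing to the workload.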

Using Hive along with the EMR cluster (Option A) and enabling Hue on the cluster (Option C) are not related to ensuring consistency between S3 and the EMR cluster. Hive is a data warehousing solution that provides a SQL-like interface to Hadoop, while Hue is a web-based user interface for Hadoop.

Using HDFS as the storage layer (Option D) contradicts the requirement, since the question states that S3 is the underlying data store for the cluster. HDFS is local to the cluster and does nothing to make the cluster's view of S3 objects more current.