Tiger Capital: EMR Cluster Storage Configuration

Question

Tiger Investments (TI) is a private equity trust manager specializing in border market investments.

The Group is considered a pioneer investor in Southeast Asia's Greater Sub-region and the Caribbean.

Tiger Capital creates private equity funds targeting pre-emerging, post-conflict or post-disaster economies that are undergoing transition and are poised for rapid growth.

The funds invest commercially in basic businesses, targeting attractive economic and social returns.

Tiger Capital invests through a diversity of financial instruments, including equity and debt.

TI is planning to launch an EMR cluster to complement its ETL workloads running on Data Pipeline. The team is looking for a storage configuration that supports storing temporary data that is continually changing, such as buffers, caches, scratch data, and other temporary content.

Select 2 options.

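Answers

A. HDFS storage launched on master and core nodes, with storage reclaimed when the cluster ends

B. EMRFS implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3

C. Master and core nodes running on EC2 that comes with a preconfigured block of pre-attached disk storage called an instance store

D. Master and core nodes running on the local file system or locally connected disks

Answer: C, D.

Explanations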

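Option A is incorrect - HDFS provides ephemeral storage that is reclaimed when the cluster terminates. It is useful for caching intermediate results during MapReduce processing and for workloads with significant random I/O, but it is not the storage that AWS describes for buffers, caches, and scratch data.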

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html

Option B is incorrect - EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop, along with features such as Amazon S3 server-side encryption, read-after-write consistency, and list consistency. It is intended for persistent data, not temporary data.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html

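Option C is correct - Each node is created from an EC2 instance that comes with a preconfigured block of pre-attached disk storage called an instance store.

Data on instance store volumes persists only during the life of its EC2 instance, which makes these volumes ideal for temporary data that is continually changing, such as buffers, caches, and scratch data.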

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html

Option D is correct - This is the same storage described in option C.

The local file system refers to a locally connected disk.

When you create a Hadoop cluster, each node is created from an Amazon EC2 instance that comes with a preconfigured block of pre-attached disk storage called an instance store.

Data on instance store volumes persists only during the lifecycle of its Amazon EC2 instance.

https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html
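The comparison of the four options can be summarized in a small lookup table. The following Python sketch is illustrative only: the `FileSystem` class, its field names, and the `suits_scratch_data` helper are hypothetical, and the boolean flags are a simplified, hand-written reading of the AWS documentation linked above, not values returned by any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileSystem:
    option: str       # quiz option letter
    name: str
    node_local: bool  # True if data sits on a disk attached to a single node
    persistent: bool  # True if data outlives the cluster


# Assumption: these flags are a simplified model of the four quiz options.
FILE_SYSTEMS = [
    FileSystem("A", "HDFS", node_local=False, persistent=False),
    FileSystem("B", "EMRFS", node_local=False, persistent=True),
    FileSystem("C", "instance store", node_local=True, persistent=False),
    FileSystem("D", "local file system", node_local=True, persistent=False),
]


def suits_scratch_data(fs: FileSystem) -> bool:
    """Scratch data fits node-local storage that need not outlive the node."""
    return fs.node_local and not fs.persistent


if __name__ == "__main__":
    picks = [fs.option for fs in FILE_SYSTEMS if suits_scratch_data(fs)]
    print(picks)  # ['C', 'D']
```

The filter keeps only the entries that are both node-local and non-persistent, which matches the two options marked correct above.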

Options C and D are the correct answers for this question.

Explanation:

The scenario describes a need for a storage configuration that supports storing temporary data that is continually changing, such as buffers, caches, scratch data, and other temporary content. Based on this requirement, the following options can be considered:

A. HDFS storage launched on master and core nodes, with storage reclaimed when the cluster ends

This option is not the best fit. On Amazon EMR, HDFS is ephemeral storage that is reclaimed when the cluster terminates, not permanent storage. HDFS is a distributed file system that stores large files across multiple nodes in a cluster, and it is useful for caching intermediate results during MapReduce processing or for workloads with significant random I/O. The AWS documentation applies the description in the question (buffers, caches, scratch data) to instance store volumes on the local file system, not to HDFS.

B. EMRFS implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3

This option is not suitable. EMRFS (EMR File System) is an implementation of HDFS that Amazon EMR uses for reading and writing regular files directly to Amazon S3. It provides the convenience of storing persistent data in Amazon S3 for use with Hadoop, along with features such as Amazon S3 server-side encryption. Because it targets persistent data, such as job input and output stored in S3, it does not match the requirement for temporary data that is continually changing.

C. Master and core nodes running on EC2 that comes with a preconfigured block of pre-attached disk storage called an instance store

This option is correct. Each EMR node is created from an EC2 instance that comes with a preconfigured block of pre-attached disk storage called an instance store. Data on instance store volumes persists only during the life of the instance, which makes the instance store ideal for temporary data that is continually changing. It is local storage that is directly attached to the instance.

D. Master and core nodes running on the local file system or locally connected disks

This option is also correct. The local file system refers to a locally connected disk, which on an EMR node is the instance store described in option C. Local storage is not replicated across nodes, and its data is lost when the instance terminates, but that is acceptable for temporary data that does not require high durability or availability.

In summary, options C and D are the correct answers because both describe the instance store on each node's local file system: temporary storage that is directly attached to the instance and suited to data that is continually changing. HDFS and EMRFS serve other purposes. HDFS caches intermediate results for the life of the cluster, while EMRFS stores persistent data in Amazon S3.
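As a loose local analogy for the storage lifecycle discussed above, the sketch below writes scratch data to a temporary directory that disappears once it is no longer needed, much as instance store data disappears with its EC2 instance. It uses only Python's standard `tempfile` module. The `write_scratch` helper is hypothetical, and the directory is whatever the operating system provides, not an actual EMR mount point.

```python
import os
import tempfile
from typing import Tuple


def write_scratch(payload: bytes) -> Tuple[str, bool, bool]:
    """Write payload to a temp dir; report whether it exists during and after."""
    with tempfile.TemporaryDirectory() as scratch_dir:
        path = os.path.join(scratch_dir, "buffer.bin")
        with open(path, "wb") as handle:
            handle.write(payload)
        existed_during = os.path.exists(path)
    # The directory and its contents are removed when the with-block exits.
    exists_after = os.path.exists(path)
    return path, existed_during, exists_after


if __name__ == "__main__":
    _, during, after = write_scratch(b"intermediate results")
    print(during, after)  # True False
```

Anything that must survive beyond the scratch directory's lifetime has to be copied elsewhere first, which is the same trade-off that separates temporary node storage from persistent storage such as Amazon S3.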