Amazon EMR: Big Data Analytics with Jupyter Notebooks

Quickly Create Jupyter Notebooks in EMR and Attach to Spark Clusters

Question

Allianz Financial Services (AFS) is a banking group that has served the financial community in South East Asia for the past five decades. It offers end-to-end banking and financial solutions through its consumer banking, business banking, Islamic banking, investment finance, and stockbroking businesses, as well as unit trust and asset administration. AFS has launched an EMR cluster to support its big data analytics requirements.

AFS has multiple data sources, including S3, SQL databases, MongoDB, Redis, RDS, and other file systems.

AFS is looking for an EMR component that lets users quickly create Jupyter notebooks, attach them to Spark clusters, and then open the Jupyter Notebook editor in the console to remotely run queries and code.

Which EMR components fulfill the requirements? Select 2 options.

Answers

Explanations

A. Hive B. EMR Notebooks C. JupyterHub D.

Answer: B and C.

Option A is incorrect.

Hive is an open-source, data warehouse, and analytic package that runs on top of a Hadoop cluster.

Hive scripts use an SQL-like language called Hive QL (query language) that abstracts programming models and supports typical data warehouse interactions.

Hive lets you avoid the complexity of writing Tez jobs based on directed acyclic graphs (DAGs) or MapReduce programs in a lower-level programming language such as Java.

Hive extends the SQL paradigm by including serialization formats.

You can also customize query processing by creating table schema that matches your data, without touching the data itself.

In contrast to SQL (which only supports primitive value types such as dates, numbers, and strings), values in Hive tables are structured elements, such as JSON objects, any user-defined data type, or any function written in Java.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html
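
The point about creating a table schema that matches the data, without touching the data itself, can be sketched with a small Python helper that builds a HiveQL statement. This is only an illustration: the table name, columns, and S3 location are hypothetical, and it assumes the JSON SerDe class named in the code is available on the cluster.

```python
def external_table_ddl(table, columns, location,
                       serde="org.apache.hive.hcatalog.data.JsonSerDe"):
    """Build a HiveQL statement that lays a schema over files already in storage."""
    # columns is a list of (name, hive_type) pairs, in the order they should appear.
    column_lines = ",\n".join(f"  {name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        f")\n"
        f"ROW FORMAT SERDE '{serde}'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, columns, and bucket, chosen for illustration only.
ddl = external_table_ddl(
    table="transactions",
    columns=[
        ("txn_id", "STRING"),
        ("amount", "DOUBLE"),
        # A structured column: Hive values are not limited to primitive types.
        ("branch", "STRUCT<city:STRING, country:STRING>"),
    ],
    location="s3://example-bucket/transactions/",
)
print(ddl)
```

The statement only registers metadata: the files under the given location are left as they are, and the serialization format (SerDe) tells Hive how to interpret them when a query reads the table.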

Option B is correct.

EMR Notebooks is a Jupyter Notebook environment built into the Amazon EMR console that allows you to quickly create Jupyter notebooks, attach them to Spark clusters, and then open the Jupyter Notebook editor in the console to remotely run queries and code.

An EMR notebook is saved in Amazon S3 independently from clusters for durable storage, quick access, and flexibility.

You can have multiple notebooks open, attach multiple notebooks to a single cluster, and re-use a notebook on different clusters.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyter-emr-managed-notebooks.html
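
Because a notebook is stored separately from any cluster, the same notebook can be pointed at different clusters. A rough sketch of that idea, assuming the boto3 EMR client's start_notebook_execution call is used to run the notebook programmatically; the notebook ID, file path, and cluster IDs below are hypothetical placeholders, and the region is an assumption.

```python
def notebook_execution_request(editor_id, relative_path, cluster_id, service_role):
    """Build keyword arguments for boto3's EMR start_notebook_execution call."""
    return {
        "EditorId": editor_id,            # ID of the EMR notebook
        "RelativePath": relative_path,    # notebook file within the notebook's storage
        "ExecutionEngine": {"Id": cluster_id, "Type": "EMR"},  # cluster to run on
        "ServiceRole": service_role,
    }


def run_on_cluster(request, region="ap-southeast-1"):
    """Submit one request. Needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the request builder works without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.start_notebook_execution(**request)["NotebookExecutionId"]


# Hypothetical identifiers: one notebook, two different clusters.
execution_requests = [
    notebook_execution_request(
        editor_id="e-EXAMPLE",
        relative_path="reports/daily.ipynb",
        cluster_id=cluster_id,
        service_role="EMR_Notebooks_DefaultRole",
    )
    for cluster_id in ("j-CLUSTERA", "j-CLUSTERB")
]
```

Calling run_on_cluster on each request would start one execution per cluster; only the ExecutionEngine entry differs between the two requests.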

Option C is correct.

Jupyter Notebook is an open-source web application that you can use to create and share documents that contain live code, equations, visualizations, and narrative text.

Amazon EMR offers you two options to work with Jupyter notebooks:

EMR Notebook.

JupyterHub.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyter.html

The EMR components that fulfill the requirements of quickly creating Jupyter notebooks, attaching them to Spark clusters, and then opening the Jupyter Notebook editor in the console to remotely run queries and code are:

B. EMR Notebooks
C. JupyterHub

EMR Notebooks is a managed notebook environment provided by Amazon EMR that lets data scientists, analysts, and developers create, edit, and run Jupyter notebooks against EMR clusters. The notebooks use Sparkmagic kernels, which submit code through Apache Livy to Spark running on the attached cluster, so the analysis executes on the cluster while the notebook file itself is stored in Amazon S3.
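
Running code remotely from a notebook goes through Apache Livy, the REST service that EMR Notebooks use to reach Spark on the cluster. The sketch below only builds the two basic Livy requests (create a session, then run a statement) with the Python standard library. The host name is hypothetical, port 8998 is Livy's default, and session ID 0 is assumed here; in practice the ID comes from the response to the first request.

```python
import json
import urllib.request


def livy_request(base_url, path, payload):
    """Build (but do not send) a JSON POST request for a Livy endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request and decode the JSON reply. Needs network access to Livy."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical address of the Livy server on the cluster's primary node.
LIVY_URL = "http://emr-primary-node.example.com:8998"

# Step 1: ask Livy for an interactive PySpark session.
create_session = livy_request(LIVY_URL, "/sessions", {"kind": "pyspark"})

# Step 2: submit code to that session (ID 0 assumed for illustration).
run_statement = livy_request(
    LIVY_URL, "/sessions/0/statements", {"code": "spark.range(10).count()"}
)
```

Passing each prepared request to send would perform the call. A notebook kernel does the equivalent behind the scenes, which is why the code in a cell executes on the cluster rather than in the browser.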

JupyterHub is an open-source, multi-user server for Jupyter notebooks. It authenticates users, starts a separate single-user notebook server for each of them on demand, and provides a central interface for managing those servers. On Amazon EMR, JupyterHub is installed as a cluster application, giving multiple users Jupyter notebooks that can run Spark code on the cluster.

Therefore, the correct answers are B (EMR Notebooks) and C (JupyterHub).