EMR SQL Query Engine for Fast Interactive Analytic Queries | AFS Case Study

EMR SQL Query Engine for AFS: Fast Interactive Analytic Queries

Question

Allianz Financial Services (AFS) is a banking group that offers end-to-end banking and financial solutions in South East Asia through its consumer banking, business banking, Islamic banking, investment finance, and stock broking businesses, as well as unit trust and asset administration. It has served the financial community for the past five decades. AFS launched an EMR cluster to support its big data analytics requirements.

AFS has multiple data sources, including S3, SQL databases, MongoDB, Redis, RDS, and other file systems.

AFS is looking for a fast SQL query engine designed for interactive analytic queries over large datasets from multiple sources. Which EMR Hadoop ecosystem application fulfills these requirements? Select 1 option.

Answers

A. Hive
B. HBase
C. HCatalog
D. Presto

Answer: D.

Explanations

Option A is incorrect - Hive is an open-source data warehouse and analytics package that runs on top of a Hadoop cluster.

Hive scripts use an SQL-like language called Hive QL (query language) that abstracts programming models and supports typical data warehouse interactions.

Hive enables you to avoid the complexities of writing Tez jobs based on directed acyclic graphs (DAGs) or MapReduce programs in a lower-level language such as Java.

Hive extends the SQL paradigm by including serialization formats.

You can also customize query processing by creating a table schema that matches your data, without touching the data itself.

In contrast to standard SQL, which mainly supports primitive value types such as dates, numbers, and strings, values in Hive tables can be structured elements, such as JSON objects, user-defined data types, or values produced by functions written in Java.
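The idea of laying a schema over data without touching it can be sketched as follows. This is a minimal illustration, not part of the original explanation: the helper `external_table_ddl`, the table name, the column names, and the bucket `s3://example-bucket/transactions/` are all hypothetical. The sketch only builds the HiveQL text for an external table over JSON files; it does not connect to a cluster or validate identifiers.

```python
from typing import Mapping


def external_table_ddl(name: str, columns: Mapping[str, str], location: str) -> str:
    """Build a HiveQL statement that lays a schema over files already in place.

    EXTERNAL means Hive records only metadata; the files at `location`
    are neither moved nor rewritten.
    """
    cols = ",\n  ".join(f"{col} {hive_type}" for col, hive_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {name} (\n  {cols}\n)\n"
        # JSON SerDe shipped with Hive's HCatalog module
        "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, columns, and bucket, used only for illustration.
ddl = external_table_ddl(
    "transactions",
    {"customer_id": "STRING", "amount": "DOUBLE", "tags": "ARRAY<STRING>"},
    "s3://example-bucket/transactions/",
)
print(ddl)
```

The `tags ARRAY<STRING>` column shows a structured (non-primitive) value type, and the SerDe line is where the serialization format is declared.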

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html

Option B is incorrect - HBase is an open-source, non-relational, distributed database developed as part of the Apache Software Foundation's Hadoop project.

HBase runs on top of the Hadoop Distributed File System (HDFS) to provide non-relational database capabilities for the Hadoop ecosystem.

HBase works seamlessly with Hadoop, sharing its file system and serving as a direct input and output to the MapReduce framework and execution engine.

HBase also integrates with Apache Hive, enabling SQL-like queries over HBase tables, joins with Hive-based tables, and support for Java Database Connectivity (JDBC).

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hbase.html

Option C is incorrect - HCatalog is a tool that allows you to access Hive metastore tables within Pig, Spark SQL, and/or custom MapReduce applications.

HCatalog has a REST interface and a command line client that allow you to create tables or perform other operations.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hcatalog.html

Option D is correct - Presto is a fast SQL query engine designed for interactive analytic queries over large datasets from multiple sources.

https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html

Among the options provided, Presto is the most suitable choice for AFS: it matches the requirement for a fast SQL query engine designed for interactive analytic queries over large datasets from multiple sources in their EMR cluster.

Apache Hive is a data warehousing tool in the Hadoop ecosystem that enables SQL-like queries over data in the Hadoop Distributed File System (HDFS) and Amazon S3. However, it is less suited to interactive queries over large datasets. Hive compiles each query into batch jobs that run on Tez or MapReduce, which adds latency to ad hoc queries, and it is not designed to federate queries across many heterogeneous sources.

Apache HBase is a NoSQL database that provides real-time read/write access to large datasets, but it is not a SQL query engine.

Apache HCatalog is a table and storage management layer for Hadoop that provides a metadata and schema abstraction for data stored in Hadoop. However, it is not a SQL query engine and doesn't directly support queries over non-Hadoop data sources.

Presto, on the other hand, is a distributed SQL query engine designed for fast, interactive queries over large and diverse datasets from multiple sources, including Hadoop, Cassandra, MongoDB, and relational databases. Its distributed architecture lets it scale to large datasets and high query concurrency. It also supports a variety of data formats, including JSON, Parquet, Avro, and ORC, and provides a JDBC driver for integration with BI and analytics tools. Presto is therefore the most suitable option for AFS.
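To make "multiple sources" concrete: Presto addresses every table as catalog.schema.table, where each catalog is a configured connector, so one query can join data held in different systems. The sketch below only composes such a query as text. The catalog names (`hive`, `mongodb`), the schemas, tables, and columns, and the helpers `Table` and `federated_join` are assumptions made for illustration; actual catalog names depend on how the cluster's connectors are configured.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Table:
    """A Presto table reference in catalog.schema.table form."""
    catalog: str  # name of a configured connector catalog (assumed here)
    schema: str
    table: str

    def fqn(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.table}"


def federated_join(
    left: Table,
    right: Table,
    key: str,
    left_cols: Sequence[str],
    right_cols: Sequence[str],
) -> str:
    """Compose one SQL statement that joins tables from two catalogs.

    Illustrative only: identifiers are not quoted or validated.
    """
    select = ", ".join(
        [f"l.{c}" for c in left_cols] + [f"r.{c}" for c in right_cols]
    )
    return (
        f"SELECT {select}\n"
        f"FROM {left.fqn()} AS l\n"
        f"JOIN {right.fqn()} AS r ON l.{key} = r.{key}"
    )


# Hypothetical sources: transactions stored on S3 behind a Hive catalog,
# customer profiles stored in MongoDB.
query = federated_join(
    Table("hive", "sales", "transactions"),
    Table("mongodb", "crm", "customers"),
    key="customer_id",
    left_cols=["customer_id", "amount"],
    right_cols=["segment"],
)
print(query)
```

In practice, the resulting statement would be submitted through the Presto CLI or the JDBC driver mentioned above; neither HBase nor HCatalog offers this kind of cross-source SQL on its own.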