Adding SQL Analytics to Kinesis Streams Architecture

Performing SQL Queries on Live Data for Analytical Purposes

Question

A team is currently using Kinesis Data Streams to stream web clicks from an application.

There is now a requirement to enable a data analyst team to perform SQL queries on the live data for analytical purposes.

Which of the following can be added to the architecture to achieve this requirement?

Answers

A. Create an EMR cluster with Spark. Stream the data from Kinesis Data Streams to Spark. Use Spark to perform the queries.

B. Use the KCL library to directly perform the SQL queries on the incoming data.

C. Embed the SQL queries while developing the application using the KPL library.

D. Use the Data Pipeline service to transfer the data to AWS RDS. Use normal SQL queries for the analysis.

Explanations

Answer - A.

An example of this architecture is given in the AWS Big Data Blog post linked below.

Options B and C are incorrect because the KCL and KPL are libraries for consuming from and producing to Kinesis streams; neither provides a way to run SQL queries for analysis.

Option D is incorrect because Amazon RDS is meant for hosting OLTP databases, not for analytics on live streaming data.

For more information on this use case, please refer to the URL below.

https://aws.amazon.com/blogs/big-data/querying-amazon-kinesis-streams-directly-with-sql-and-spark-streaming/

What if you could use your SQL knowledge to discover patterns directly from an incoming stream of data? Streaming analytics is a very popular topic of conversation around big data use cases. These use cases range from accumulating simple web transaction logs to capturing the high-volume, high-velocity, high-variety data emitted by billions of devices on the Internet of Things. Most of them introduce a data stream at some point in the data processing pipeline, and there is a plethora of tools available for managing such streams. Sometimes it comes down to choosing the tool you can adopt fastest with your existing skill set.

In this post, we focus on some key tools available within the Apache Spark ecosystem for streaming analytics. We cover how Spark Streaming, Spark SQL, and HiveServer2 can work together to deliver a data stream as a temporary table that understands SQL queries.
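
To make the "temporary table" idea concrete, here is a minimal, self-contained PySpark sketch; the view name clicks, the user/page columns, and the sample rows are all illustrative, not part of the question. A DataFrame registered as a temporary view becomes queryable with ordinary SQL:

```python
# Minimal sketch: register a DataFrame as a temporary view and query it with SQL.
# All names and sample data below are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("temp-view-demo").getOrCreate()

# Stand-in for one micro-batch of click events.
batch = spark.createDataFrame(
    [("u1", "/home"), ("u2", "/pricing"), ("u1", "/pricing")],
    ["user", "page"],
)
batch.createOrReplaceTempView("clicks")

# Analysts can now use plain SQL against the view.
spark.sql("SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page").show()
```

In the streaming case, the same registration step simply runs once per micro-batch; a Kinesis-specific sketch appears under option A below.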

The best option for this requirement is to create an EMR cluster with Spark, stream the data from Kinesis Data Streams to Spark, and use Spark to perform the queries. Here's why:

A. Create an EMR cluster with Spark. Stream the data from Kinesis Data Streams to Spark. Use Spark to perform the queries.

This option is the most suitable for the requirement, as Spark provides an efficient and flexible way to process large amounts of data and perform complex analytics. With Spark, the data can be ingested from Kinesis Data Streams and then processed and queried in near real time. The Spark cluster runs in a streaming mode that continuously processes micro-batches as they arrive from the stream.
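
As a rough sketch of this pipeline, the following assumes a Spark 2.x EMR cluster launched with the spark-streaming-kinesis-asl package; the stream name web-clicks, the region, and the comma-separated user,page record format are assumptions for illustration:

```python
# Sketch: consume a Kinesis stream with Spark Streaming and expose each
# micro-batch to SQL. Assumes Spark 2.x with spark-streaming-kinesis-asl
# on the classpath; stream name, region, and record format are illustrative.
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

spark = SparkSession.builder.appName("kinesis-sql-demo").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 10)  # 10-second micro-batches

# Each record arrives as a UTF-8 string under the default decoder.
clicks = KinesisUtils.createStream(
    ssc,
    kinesisAppName="kinesis-sql-demo",
    streamName="web-clicks",
    endpointUrl="https://kinesis.us-east-1.amazonaws.com",
    regionName="us-east-1",
    initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=10,
)

def register_and_query(rdd):
    if not rdd.isEmpty():
        # Turn the micro-batch into a DataFrame and expose it to SQL.
        # Assumes each record is a comma-separated "user,page" click event.
        df = spark.createDataFrame(rdd.map(lambda line: line.split(",")), ["user", "page"])
        df.createOrReplaceTempView("clicks")
        spark.sql("SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page").show()

clicks.foreachRDD(register_and_query)
ssc.start()
ssc.awaitTermination()
```

Each micro-batch is converted to a DataFrame, registered as the clicks view, and queried with SQL; in a real deployment the results would be written to a sink rather than printed.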

B. Use the KCL library to directly perform the SQL queries on the incoming data.

This option is not recommended, as the KCL (Kinesis Client Library) is not designed for SQL querying, but rather for consuming and processing data from Kinesis streams. While it is possible to use a third-party SQL library to query the data, this approach may not be as efficient or scalable as using Spark.
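
For illustration, here is roughly what a KCL consumer looks like with the amazon_kclpy package (the processor class and payload handling are illustrative). Everything is hand-written record handling; there is no query layer:

```python
# Rough sketch of a KCL record processor (amazon_kclpy), run under the KCL
# MultiLangDaemon. The library delivers raw records to your code; any
# filtering or aggregation is hand-written Python, not SQL.
import base64
from amazon_kclpy import kcl

class ClickProcessor(kcl.RecordProcessorBase):
    def initialize(self, shard_id):
        pass

    def process_records(self, records, checkpointer):
        for record in records:
            # Record payloads arrive base64-encoded.
            payload = base64.b64decode(record.get("data"))
            # "Querying" here means writing your own processing logic.
            print(payload)
        checkpointer.checkpoint()

    def shutdown(self, checkpointer, reason):
        if reason == "TERMINATE":
            checkpointer.checkpoint()

if __name__ == "__main__":
    kcl.KCLProcess(ClickProcessor()).run()
```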

C. Embed the SQL queries while developing the application using the KPL library.

The KPL (Kinesis Producer Library) is designed for publishing data to Kinesis streams, and not for querying data. Therefore, this option is not recommended.
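
The KPL itself is a C++/Java library with no official Python binding, so as a stand-in, this boto3 sketch shows the producer role it plays (the stream name and event fields are illustrative). Records are written into the stream; nothing on the producer side can query them:

```python
# Producer-side sketch using boto3 in place of the KPL (a C++/Java library).
# "web-clicks", the region, and the event fields are illustrative. Note that
# this only writes records to the stream -- no querying happens here.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

click = {"user": "u123", "page": "/home"}
kinesis.put_record(
    StreamName="web-clicks",
    Data=json.dumps(click).encode("utf-8"),
    PartitionKey=click["user"],
)
```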

D. Use the Data Pipeline service to transfer the data to AWS RDS. Use normal SQL queries for the analysis.

While this option might work, it adds an unnecessary step of transferring the data to an RDS database before it can be queried, which adds latency and cost. It also undermines the requirement to query live data: by the time records have been loaded into the database, they are no longer a real-time stream.

In summary, the best way to enable a data analyst team to perform SQL queries on live data from Kinesis Data Streams is to create an EMR cluster with Spark and stream the data from Kinesis Data Streams to Spark for processing and querying.