ZNews - Streaming Platform Artifacts for Analyzing Clickstream Data


Question

ZNews, one of the largest media and information companies supporting digital distribution, hosts their entire infrastructure on AWS. They want to analyze near real-time clickstream events such as readership statistics, impressions, and page views for their hundreds of web applications, which generate 30 TB of data every day; analyze trending content in order to promote cross-platform sharing and increase consumer engagement; and use clickstream data to perform data science, develop algorithms, and create visualizations and dashboards.

Hosting infrastructure at such massive scale, ZNews is looking for a streaming platform with the following capabilities:

Ingest data from a variety of streaming sources.
Auto-scaling.
Support delivery to new destinations without diminishing throughput.
Support aggregation and de-aggregation of data.
Ease of configuration, implementation, management, and maintenance.
Support data standardization in JSON format, record format conversion, and compression.

What kind of artifacts support the above requirements? Select 3 options.

Answers

Explanations


A. Kinesis Data Streams
B. KPL Library
C. Streams API
D. Kinesis Data Firehose
E. Lambda Blueprints, Record Format Conversion
F. Kinesis Client Library + Enhanced Consumers + Kinesis Connector Library

Answer: B, D, and E.

Option A is incorrect - Kinesis Data Streams is not the right platform to fulfil the requirements, since Kinesis Data Streams does not provide data transformation or record format conversion.

Besides, the customer is looking for a near real-time response in minutes.

Kinesis Data Streams is designed to collect and process large streams of data records in real time.

You can create data-processing applications, known as Kinesis Data Streams applications.

A typical Kinesis Data Streams application reads data from a data stream as data records.

These applications can use the Kinesis Client Library, and they can run on Amazon EC2 instances.

Use Kinesis Data Streams for rapid and continuous data intake and aggregation.

The type of data used can include IT infrastructure log data, application logs, social media, market data feeds, and web clickstream data.

Because the response time for the data intake and processing is in real time, the processing is typically lightweight.

The following are typical scenarios for using Kinesis Data Streams:

Accelerated log and data feed intake and processing.

Real-time metrics and reporting.

Real-time data analytics.

Complex stream processing.

Option B is correct - The KPL is an easy-to-use, highly configurable library that helps you write to a Kinesis data stream.

It acts as an intermediary between your producer application code and the Kinesis Data Streams API actions.

https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-kpl.html
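The real KPL is a Java library that aggregates many small user records into a single Kinesis record using a protobuf-based format, which the KCL later de-aggregates. The sketch below is a simplified, hypothetical length-prefixed version of that idea, to illustrate why aggregation matters for high-volume clickstream producers; it is not the actual KPL wire format.

```python
import struct

def aggregate(user_records):
    """Pack small user records into one blob, each prefixed with its
    length, mimicking the concept behind KPL record aggregation."""
    return b"".join(struct.pack(">I", len(r)) + r for r in user_records)

def deaggregate(blob):
    """Recover the original user records from an aggregated blob,
    mimicking the de-aggregation the KCL performs on the consumer side."""
    records, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        records.append(blob[offset:offset + length])
        offset += length
    return records
```

Aggregation lets a producer pay one per-record charge and one shard write for many small clickstream events, which is why the KPL suits high-throughput ingestion.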

Option C is incorrect - The Streams API does not provide aggregation of data.

It provides batch collection of multiple records in a single HTTP request.

With the Streams API, the PutRecords operation sends multiple records to Kinesis Data Streams in a single request.

By using PutRecords, producers can achieve higher throughput when sending data to their Kinesis data stream.

Each PutRecords request can support up to 500 records.

Each record in the request can be as large as 1 MB, up to a limit of 5 MB for the entire request, including partition keys.

The API also lets producers switch programmatically between submitting single records (PutRecord) and multiple records (PutRecords) in a single HTTP request.
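The PutRecords limits above (500 records per request, 1 MB per record, 5 MB per request including partition keys) mean a producer must batch client-side. A minimal pure-Python sketch of that batching logic, with hypothetical helper names (the real call would then be `put_records` via the AWS SDK):

```python
# Documented PutRecords limits: 500 records per request, 1 MB per record,
# 5 MB per request, with the partition key counting toward record size.
MAX_RECORDS_PER_REQUEST = 500
MAX_RECORD_BYTES = 1024 * 1024
MAX_REQUEST_BYTES = 5 * 1024 * 1024

def batch_records(records):
    """Split (partition_key, data_bytes) pairs into PutRecords-sized batches."""
    batches, current, current_bytes = [], [], 0
    for key, data in records:
        size = len(key.encode("utf-8")) + len(data)
        if size > MAX_RECORD_BYTES:
            raise ValueError("record exceeds the 1 MB per-record limit")
        # Flush the current batch when either limit would be exceeded.
        if (len(current) == MAX_RECORDS_PER_REQUEST
                or current_bytes + size > MAX_REQUEST_BYTES):
            batches.append(current)
            current, current_bytes = [], 0
        current.append((key, data))
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```

Each batch returned would map to one PutRecords call; a production producer would also retry the per-record failures PutRecords can report.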

https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-sdk.html

Option D is correct - Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), and Splunk.

Kinesis Data Firehose can invoke your Lambda function to transform incoming source data and deliver the transformed data to destinations.

You can enable Kinesis Data Firehose data transformation when you create your delivery stream.

Amazon Kinesis Data Firehose can convert the format of your input data from JSON to Apache Parquet or Apache ORC before storing the data in Amazon S3.

Parquet and ORC are columnar data formats that save space and enable faster queries compared to row-oriented formats like JSON.

If you want to convert an input format other than JSON, such as comma-separated values (CSV) or structured text, you can use AWS Lambda to transform it to JSON first.

https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html
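A Firehose transformation Lambda receives batches of base64-encoded records and must return each record with a `recordId`, a `result` status, and the transformed `data`. The sketch below assumes a hypothetical CSV clickstream layout of `user_id,page,timestamp` and converts it to JSON, the step the text above describes for non-JSON input:

```python
import base64
import json

def handler(event, context=None):
    """Sketch of a Firehose data-transformation Lambda: decodes each
    record, converts an assumed "user_id,page,timestamp" CSV line to
    JSON, and returns it re-encoded with a status for Firehose."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        user_id, page, timestamp = line.split(",")
        doc = json.dumps({"user_id": user_id, "page": page,
                          "timestamp": timestamp})
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # "Dropped" / "ProcessingFailed" also allowed
            "data": base64.b64encode(doc.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```

Once the records are JSON, Firehose's built-in record format conversion can turn them into Parquet or ORC on the way to S3.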

Option E is correct - You can enable Kinesis Data Firehose data transformation when you create your delivery stream, using the provided Lambda blueprints as a starting point for your transformation function.

As noted above, Firehose record format conversion turns JSON input into Apache Parquet or Apache ORC before storing the data in Amazon S3, and AWS Lambda can first transform non-JSON input, such as comma-separated values (CSV) or structured text, into JSON.

https://docs.aws.amazon.com/firehose/latest/dev/data-transformation.html

https://docs.aws.amazon.com/firehose/latest/dev/record-format-conversion.html
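A hedged sketch of what enabling record format conversion looks like in the extended S3 destination configuration of a Firehose CreateDeliveryStream request. The field names follow the Firehose API shape; the ARNs, Glue database, and table names are hypothetical placeholders:

```python
# Sketch of the DataFormatConversionConfiguration portion of a Firehose
# extended S3 destination. ARNs and Glue names below are hypothetical.
extended_s3_config = {
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",   # hypothetical
    "BucketARN": "arn:aws:s3:::znews-clickstream-archive",       # hypothetical
    "DataFormatConversionConfiguration": {
        "Enabled": True,
        "InputFormatConfiguration": {
            # Deserialize incoming JSON records
            "Deserializer": {"OpenXJsonSerDe": {}},
        },
        "OutputFormatConfiguration": {
            # Write columnar Parquet (OrcSerDe would produce ORC instead)
            "Serializer": {"ParquetSerDe": {}},
        },
        "SchemaConfiguration": {
            # Schema is read from the AWS Glue Data Catalog
            "DatabaseName": "znews_analytics",    # hypothetical
            "TableName": "clickstream_events",    # hypothetical
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
        },
    },
}
```

The schema reference is required because Parquet and ORC are typed columnar formats, so Firehose needs a table definition to map JSON fields onto columns.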

Option F is incorrect - The KCL provides de-aggregation and enhanced fan-out consumers, but it does not provide auto-scaling.

The KCL needs the Kinesis Connector Library to write data into Redshift, so ease of maintenance cannot be achieved.

Consumers based on the KCL use a push mechanism.

Use the record processor support provided by the Kinesis Client Library (KCL) to retrieve stream data in consumer applications.

This is a push model, where you implement the code that processes the data.

The KCL retrieves data records from the stream and delivers them to your application code.

Enhanced fan-out consumers provide scaling for additional destinations, but the process is manual and requires a clear idea of the data throughput.

https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html
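The KCL push model described above can be sketched as follows: you implement a record processor, and the library delivers batches of records to it and hands you a checkpointer to mark progress. This is a minimal pure-Python illustration of the pattern with hypothetical class names, not the actual KCL API:

```python
class ClickstreamProcessor:
    """Hypothetical record processor illustrating the KCL push model:
    the library calls process_records; your code only handles the data."""

    def __init__(self):
        self.page_views = {}

    def process_records(self, records, checkpointer):
        # The library pushes batches of records here.
        for record in records:
            page = record["page"]
            self.page_views[page] = self.page_views.get(page, 0) + 1
        # Checkpoint so a restarted worker resumes after this batch.
        checkpointer.checkpoint()

class InMemoryCheckpointer:
    """Stand-in for the checkpointer the KCL would supply."""

    def __init__(self):
        self.checkpoints = 0

    def checkpoint(self):
        self.checkpoints += 1
```

In the real KCL, checkpoints are persisted (in DynamoDB) and the library also handles shard leases and load balancing across workers.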

ZNews, a media and information company, wants to analyze near real-time clickstream events to improve consumer engagement and increase cross-platform sharing. They generate 30 TB of data daily from their 100+ web applications and are hosted entirely on AWS. To achieve their goal, they need a streaming platform that can ingest data from various streaming sources, autoscale, aggregate and de-aggregate data, standardize data, and be easy to configure, implement, manage and maintain. There are several AWS services that can meet their requirements.

A. Kinesis Data Streams:

Amazon Kinesis Data Streams is a scalable and durable real-time data streaming service that can ingest data from various sources such as web clickstreams, social media feeds, and application logs. Kinesis Data Streams can handle high-velocity data and retains records for a configurable period (24 hours by default, extendable), during which the data can be analyzed, processed, and used for real-time dashboards, alerts, and other applications. However, Kinesis Data Streams itself does not provide data transformation, record format conversion, or managed delivery to destinations, which is why it does not satisfy ZNews' requirements on its own.

B. KPL Library:

The Kinesis Producer Library (KPL) is an open-source library that can be used to publish data to Kinesis Data Streams. KPL is optimized for high-throughput and low-latency data streams, making it an ideal choice for ZNews to ingest data from multiple sources at scale.

D. Kinesis Data Firehose:

Amazon Kinesis Data Firehose is another streaming service that can ingest and process real-time data at scale. Kinesis Data Firehose can capture and transform data from various sources, including Kinesis Data Streams, and deliver the data to data stores such as Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service. Kinesis Data Firehose provides a managed service that handles scaling, management, and delivery of streaming data, making it easy to configure, implement, manage, and maintain.

E. Lambda Blueprints, Record Format Conversion:

AWS Lambda is a serverless compute service that can be used to process and analyze data in real-time. AWS Lambda provides several blueprints and templates that enable you to quickly create functions that can transform and analyze data from Kinesis Data Streams or Kinesis Data Firehose. Lambda can also support record format conversion, for example by converting CSV or structured text into JSON ahead of Firehose's built-in conversion to Parquet or ORC.

F. Kinesis Client Library + Enhanced Consumers + Kinesis Connector Library:

The Kinesis Client Library (KCL) is a set of libraries that can be used to build consumer applications that process and analyze data from Kinesis Data Streams. KCL provides several features such as automatic load balancing, checkpointing, and de-aggregation of data. Additionally, enhanced fan-out consumers and the Kinesis Connector Library can be used to integrate Kinesis Data Streams with other AWS services such as Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service.

In summary, the AWS services that meet ZNews' streaming platform requirements are the KPL Library, Kinesis Data Firehose, and Lambda blueprints with record format conversion (options B, D, and E).