AWS Big Data - Specialty Exam: Data Ingestion and Consumer Mechanisms for ProcessFin | AWS Certified BDS-C00

Data Ingestion and Consumer Mechanisms for ProcessFin

Question

ProcessFin, a financial services company, uses a multi-shard Kinesis stream to ingest database streams from applications running on Aurora databases, process the data, and load it into Redshift, which provides the DWH layer.

The size of the data records from different data elements varies between 5 KB and 5 MB (de-normalized records of complex business transactions).

Based on the amount of data being ingested, the producer needs to switch programmatically between submitting a single record and multiple records in one HTTP request to maintain throughput.

The processing of data and loading into Redshift needs to be done using a pull model in which the data is accessed directly from the shards of the stream. Which of the data ingestion and consumer mechanisms below could fulfil these requirements? Select 2 options.

Answers

Explanations


A. KPL library with data aggregation, for data ingestion
B. Streams API with the PutRecords operation, for data ingestion
C. Kinesis Agent, for data ingestion
D. Consumers based on the KCL library
E. Consumers based on the Kinesis Data Streams API
F. KPL library with data collection, for data ingestion

Answer: B, E.

Option A is incorrect - Data ingested through the KPL is subject to the per-shard limits of 1,000 records/sec and 1 MB/sec of throughput, and the maximum size of a user record is capped at 1 MB.

Though the KPL supports both aggregation and collection, records in this scenario can be as large as 5 MB, which exceeds the 1 MB user-record limit.

https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-kpl.html

Option B is correct - The Streams API PutRecords operation sends multiple records to Kinesis Data Streams in a single request.

By using PutRecords, producers can achieve higher throughput when sending data to their Kinesis data stream.

Each PutRecords request can support up to 500 records.

Each record in the request can be as large as 1 MB, up to a limit of 5 MB for the entire request, including partition keys.

The API also supports switching programmatically between submitting a single record (PutRecord) and multiple records (PutRecords) in one HTTP request.

https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-sdk.html
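To make the batching limits concrete, here is a minimal producer sketch in Python using boto3 (the AWS SDK for Python). The stream name `processfin-stream` and the record payloads are hypothetical; the helper simply enforces the 500-record and 5 MB PutRecords limits described above:

```python
import json

# PutRecords limits from the Kinesis Data Streams API.
MAX_RECORDS_PER_REQUEST = 500
MAX_BYTES_PER_REQUEST = 5 * 1024 * 1024  # 5 MB, including partition keys

def chunk_records(records):
    """Split records into batches that respect the PutRecords limits."""
    batch, batch_bytes = [], 0
    for rec in records:
        size = len(rec["Data"]) + len(rec["PartitionKey"].encode("utf-8"))
        if batch and (len(batch) == MAX_RECORDS_PER_REQUEST
                      or batch_bytes + size > MAX_BYTES_PER_REQUEST):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(rec)
        batch_bytes += size
    if batch:
        yield batch

def send(client, stream_name, records):
    """Send records in batched PutRecords calls, retrying partial failures."""
    for batch in chunk_records(records):
        pending = batch
        while pending:
            resp = client.put_records(StreamName=stream_name, Records=pending)
            if resp["FailedRecordCount"] == 0:
                break
            # Retry only the records Kinesis rejected (e.g. throttled shards).
            pending = [rec for rec, result in zip(pending, resp["Records"])
                       if "ErrorCode" in result]

# Sample de-normalized transaction records (hypothetical payloads).
sample = [{"Data": json.dumps({"txn_id": i}).encode(), "PartitionKey": str(i)}
          for i in range(1200)]

# Usage (requires AWS credentials and an existing stream):
#   import boto3
#   kinesis = boto3.client("kinesis")
#   send(kinesis, "processfin-stream", sample)
```

Switching to a single-record PutRecord call when volume is low is then a decision the producer code can make at run time, which is the "single vs multiple records" behaviour the scenario asks for.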

Option C is incorrect - Kinesis Agent is a stand-alone Java software application that offers an easy way to collect and send data to Kinesis Data Streams.

The agent continuously monitors a set of files and sends new data to your stream.

The agent handles file rotation, checkpointing, and retry upon failures.

It delivers all of your data in a reliable, timely, and simple manner; however, it collects data from files on disk rather than from database streams, so it does not fit this scenario.

https://docs.aws.amazon.com/streams/latest/dev/writing-with-agents.html
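For reference, the agent is driven by a JSON configuration file (by default /etc/aws-kinesis/agent.json); a minimal sketch, with a hypothetical file pattern and stream name, looks like this:

```json
{
  "flows": [
    {
      "filePattern": "/var/log/processfin/app.log*",
      "kinesisStream": "processfin-stream"
    }
  ]
}
```

Because the agent tails files matched by `filePattern`, it suits log shipping rather than the database-sourced records in this scenario.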

Option D is incorrect - Consumers based on the KCL use a push mechanism.

Consumer applications use the record-processor support provided by the Kinesis Client Library (KCL) to retrieve stream data.

This is a push model, where you implement the code that processes the data.

The KCL retrieves data records from the stream and delivers them to your application code.

https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html

Option E is correct - The Kinesis Data Streams API provides the getShardIterator and getRecords methods to retrieve data from a stream.

This is a pull model, where your code draws data directly from the shards of the stream.

https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html
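A minimal pull-model consumer sketch in Python, assuming boto3; the stream name is hypothetical, and the Redshift load itself (typically an S3 stage followed by a COPY command) is left as a callback:

```python
import time

def read_shard(client, stream_name, shard_id, handle_batch, poll_interval=1.0):
    """Pull records directly from one shard via GetShardIterator/GetRecords."""
    resp = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start from the oldest available record
    )
    iterator = resp["ShardIterator"]
    while iterator:
        batch = client.get_records(ShardIterator=iterator, Limit=1000)
        if batch["Records"]:
            # e.g. stage the records to S3, then COPY them into Redshift.
            handle_batch(batch["Records"])
        iterator = batch.get("NextShardIterator")  # absent once the shard is closed
        time.sleep(poll_interval)  # stay under the per-shard GetRecords rate limit

# Usage (requires AWS credentials and an existing stream):
#   import boto3
#   kinesis = boto3.client("kinesis")
#   for shard in kinesis.list_shards(StreamName="processfin-stream")["Shards"]:
#       read_shard(kinesis, "processfin-stream", shard["ShardId"],
#                  lambda recs: print(len(recs), "records"))
```

Because the application code asks each shard for its records, this is the pull model the scenario requires.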

Option F is incorrect for the same reason as Option A - KPL ingestion is subject to the per-shard limits of 1,000 records/sec and 1 MB/sec of throughput, and the maximum size of a user record is capped at 1 MB.

Though the KPL supports both aggregation and collection, records of up to 5 MB exceed the 1 MB user-record limit.

https://docs.aws.amazon.com/streams/latest/dev/developing-producers-with-kpl.html

ProcessFin, a financial services company, needs to ingest database streams from applications running on Aurora databases, process the data, and load it into Redshift for their data warehouse layer. The data records from different data elements vary in size from 5 KB to 5 MB. To maintain the required throughput, the producer must be able to switch programmatically between submitting single and multiple records in one HTTP request. The processing of data and loading into Redshift needs to be done using a pull model in which the data is accessed directly from the shards of the stream.

To fulfill these requirements, the following data ingestion and consumer mechanisms can be used:

  1. Kinesis Data Streams API for data ingestion: Kinesis Data Streams is a managed AWS service for real-time data streaming, and its API lets developers create and manage data streams and publish and consume data records. The API can ingest data records from various sources, including databases, sensors, and application logs. In this scenario, it can be used to ingest the data records from the Aurora databases into Kinesis streams. The API gives producers fine-grained control over the ingestion process, including the rate at which records are submitted, and it supports switching programmatically between sending a single record (PutRecord) and multiple records (PutRecords) in one HTTP request to maintain the required throughput.

  2. Consumers based on the Kinesis Data Streams API: The Streams API provides the GetShardIterator and GetRecords operations, which let a consumer pull data records directly from the shards of the stream. The consumer code obtains a shard iterator for each shard, reads batches of records with GetRecords, processes them, and loads the processed data into Redshift (typically by staging it in S3 and issuing a COPY command). Because the application code draws the data from the shards itself, this is the pull model the scenario requires.

Option A: KPL Library with data aggregation, for data ingestion: Kinesis Producer Library (KPL) is a set of libraries that makes it easy to produce data records to Kinesis streams. KPL provides features such as data aggregation and batching, which can be used to optimize the ingestion process and reduce the number of requests made to Kinesis streams. However, in this scenario, records can be as large as 5 MB, which exceeds the KPL's 1 MB maximum user-record size, and therefore, KPL with data aggregation is not a valid option.

Option C: Kinesis Agent for data ingestion: Kinesis Agent is a pre-built Java application that simplifies the process of ingesting data from various sources into Kinesis streams. Kinesis Agent can monitor specified directories or files and automatically ingest data records into Kinesis streams. However, in this scenario, the data is coming directly from Aurora databases, and therefore, Kinesis Agent is not the best option.

Option D: Consumers based on KCL Library: The Kinesis Client Library (KCL) simplifies building scalable, fault-tolerant consumer applications and provides features such as automatic load balancing and checkpointing. However, the KCL is a push model: the library retrieves the data records from the stream and delivers them to your application code. Since the scenario explicitly requires a pull model in which the data is accessed directly from the shards, consumers based on the KCL library are not the right option.

Option F: KPL Library with data collection, for data ingestion: The KPL does support collection, batching multiple user records into a single PutRecords request, but a KPL user record is still capped at 1 MB, so the records of up to 5 MB in this scenario cannot be ingested through it.