Building an Efficient Data Lake for CloudTrail VPC Flow Logs and Application Load Balancer Logs | AWS Certified Big Data - Specialty Exam Prep

Building an Efficient Data Lake

Question

A company wants to build a data lake comprising the following: logs generated from CloudTrail, VPC Flow Logs, and logs from Application Load Balancers hosted in AWS. They need to stream the logs and also ensure an expansive data lake is available for storage purposes.

Which of the following can be used to fulfil this requirement in the most efficient manner? Choose 2 answers from the options given below.

Answers

Explanations


A. B. C. D.

Answer - A and C.

The AWS Documentation mentions the following.

Amazon Kinesis Data Firehose is the easiest way to capture and stream data into a data lake built on Amazon S3.

This data can be anything, from AWS service logs like AWS CloudTrail log files, Amazon VPC Flow Logs, and Application Load Balancer logs, to IoT events, game events, and much more.

To efficiently query this data, a time-consuming ETL (extract, transform, and load) process is traditionally required to convert the data into an optimal file format, which increases the time to insight.

This situation is less than ideal, especially for real-time data that loses its value over time.

Option B is incorrect since this option should not be used for persistence of data.

Option D is incorrect since this would be less appropriate than Kinesis Firehose.

Kinesis Firehose can automatically ingest data from multiple sources and deliver it directly into S3.

For more information on a use case for this, please refer to the URL below.

https://aws.amazon.com/blogs/big-data/analyzing-apache-parquet-optimized-data-using-amazon-kinesis-data-firehose-amazon-athena-and-amazon-redshift/

The most efficient way to fulfill the requirement of building a data lake from CloudTrail logs, VPC Flow Logs, and Application Load Balancer logs is to use AWS S3 as the data lake storage layer and AWS Kinesis Firehose to stream the various log files into it.

Explanation:

AWS S3 is a highly scalable, secure, and durable object storage service that provides virtually unlimited storage space for data of any type, including logs. S3 also provides features such as versioning, lifecycle policies, and encryption that ensure the data is always available, secure, and compliant.
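As one illustration, versioning and a lifecycle rule can be applied to a log bucket via the AWS SDK for Python (boto3). This is a minimal sketch, not a definitive setup: the bucket name, prefix, and 90-day Glacier transition below are assumptions.

```python
# Sketch: configuration bodies for enabling versioning and a lifecycle
# rule on an S3 log-lake bucket. Bucket name, prefix, and the 90-day
# Glacier transition are illustrative assumptions.

def versioning_config():
    """Request body for S3 put_bucket_versioning."""
    return {"Status": "Enabled"}

def lifecycle_config(prefix="logs/", days_to_glacier=90):
    """Request body for S3 put_bucket_lifecycle_configuration."""
    return {
        "Rules": [
            {
                "ID": "archive-old-logs",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }

# With boto3, these bodies would be applied along these lines:
#   s3 = boto3.client("s3")
#   s3.put_bucket_versioning(
#       Bucket="my-log-lake",
#       VersioningConfiguration=versioning_config())
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="my-log-lake",
#       LifecycleConfiguration=lifecycle_config())
```

The lifecycle rule keeps recent logs in the Standard storage class for active querying while transitioning older objects to cheaper archival storage.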

AWS Kinesis Firehose is a fully managed service that can load streaming data into AWS S3, Redshift, and Elasticsearch in near real time. It can capture, transform, and deliver data from sources such as CloudTrail, VPC Flow Logs, and Application Load Balancers in a reliable, scalable, and cost-effective manner. Kinesis Firehose automatically scales to handle any volume of streaming data and can batch, compress, and encrypt the data before loading it into the destination.
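For example, a Firehose delivery stream that buffers, GZIP-compresses, and delivers log records into S3 could be created with boto3 roughly as follows. The stream name, ARNs, prefix, and buffering hints are hypothetical values, not ones prescribed by the question.

```python
# Sketch: destination configuration for a Firehose delivery stream
# that writes compressed log batches to S3. All names, ARNs, and
# buffering thresholds are illustrative assumptions.

def s3_destination_config(bucket_arn, role_arn):
    """ExtendedS3DestinationConfiguration for create_delivery_stream."""
    return {
        "BucketARN": bucket_arn,
        "RoleARN": role_arn,
        "Prefix": "logs/",
        # Firehose buffers incoming records and delivers one S3 object
        # whenever either threshold is reached.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        "CompressionFormat": "GZIP",
    }

# With boto3, the stream would be created along these lines:
#   firehose = boto3.client("firehose")
#   firehose.create_delivery_stream(
#       DeliveryStreamName="log-lake-stream",
#       DeliveryStreamType="DirectPut",
#       ExtendedS3DestinationConfiguration=s3_destination_config(
#           "arn:aws:s3:::my-log-lake",
#           "arn:aws:iam::123456789012:role/firehose-delivery-role"))
```

Once the stream exists, producers only call `put_record`; Firehose handles the buffering, compression, and delivery to S3.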

Using AWS S3 for storage of the data lake and AWS Kinesis Firehose to stream the various files has several advantages:

  1. Scalability: AWS S3 and Kinesis Firehose are highly scalable services that can handle any amount of data.

  2. Cost-Effective: AWS S3 and Kinesis Firehose are pay-as-you-go services, which means you only pay for the storage and data transfer you use.

  3. Durability: AWS S3 is designed for 99.999999999% (11 nines) durability, protecting your data against loss.

  4. Security: AWS S3 and Kinesis Firehose provide multiple layers of security, including encryption at rest and in transit, access control, and audit logging.

Using AWS Kinesis Streams is not the most efficient way to fulfill the requirement because it requires you to build and manage consumer applications to process and store the streaming data. Kinesis Streams is a lower-level service for real-time processing of streaming data, and it demands more development effort and cost than using AWS S3 with Kinesis Firehose.

Therefore, the two most efficient answers to fulfill the requirement are A (use AWS S3 for storage of the data lake) and C (use AWS Kinesis Firehose to stream the various files).