Aggregating Clickstream Data: Scalable and Reliable Solution | AWS Certified DevOps Engineer - Professional

Designing a High-Scale Clickstream Data Aggregation System

Question

You design a service that aggregates clickstream data in batch and delivers reports to subscribers via email only once per week.

Data is extremely spiky, geographically distributed, high-scale, and unpredictable.

How should you design this system?

Answers

Explanations

A. B. C. D.

Answer - B.

When you build reports from or analyze a large data set, consider Amazon EMR: it runs the Hadoop framework, which is designed for processing large data sets.

The usual way to get data into EMR is to stage it in S3.

Because the data is extremely spiky and geographically distributed, collecting it at edge locations through a CloudFront distribution is the best way to capture it.
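As a rough illustration of the batch step, the sketch below counts clicks from CloudFront standard access logs that have landed in S3. It assumes clicks are recorded as query-string parameters on requests to the distribution; the `page` parameter, the function name, and the sample lines are hypothetical, and the `#Fields` header in the sample is trimmed to a few columns for readability. A real EMR job would apply the same logic at scale with Spark or Hive rather than in a single Python process.

```python
from collections import Counter
from urllib.parse import parse_qs


def count_clicks(log_lines, param="page"):
    """Count clicks per value of one query-string parameter.

    Expects CloudFront standard log lines: '#'-prefixed header lines,
    then tab-separated rows whose columns are named by '#Fields:'.
    """
    fields = []
    counts = Counter()
    for line in log_lines:
        line = line.rstrip("\n")
        if line.startswith("#Fields:"):
            # The header names the columns of every row that follows.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        row = dict(zip(fields, line.split("\t")))
        query = row.get("cs-uri-query", "-")
        if query == "-":  # CloudFront writes '-' for an empty value
            continue
        for value in parse_qs(query).get(param, []):
            counts[value] += 1
    return counts


# Hypothetical sample with a shortened field list.
sample = [
    "#Version: 1.0",
    "#Fields: date time x-edge-location cs-method cs-uri-stem sc-status cs-uri-query",
    "2024-01-01\t00:00:01\tFRA2\tGET\t/click.gif\t200\tpage=home&user=u1",
    "2024-01-01\t00:00:02\tNRT5\tGET\t/click.gif\t200\tpage=pricing&user=u2",
    "2024-01-01\t00:00:03\tFRA2\tGET\t/click.gif\t200\tpage=home&user=u3",
    "2024-01-01\t00:00:04\tFRA2\tGET\t/index.html\t200\t-",
]
clicks = count_clicks(sample)
```

Reading column positions from the `#Fields` header, rather than hardcoding them, keeps the parser working if the set of logged fields differs from this sample.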

Option A is invalid because Amazon Redshift is a petabyte-scale data warehouse cluster rather than a service for collecting spiky, distributed traffic.

Option C is invalid because using both Kinesis and EMR for the analysis is redundant.

Option D is invalid because Elasticsearch is not suited to processing the records.

For more information on Amazon EMR, see:

https://aws.amazon.com/emr/

An alternative reading of the question favors Option C for a service that aggregates clickstream data in batch and emails reports to subscribers once per week, where traffic is extremely spiky, geographically distributed, high-scale, and unpredictable:

Use API Gateway invoking Lambdas which PutRecords into Kinesis, and EMR running Spark performing GetRecords on Kinesis to scale with spikes. Spark on EMR outputs the analysis to S3, and the reports are sent out via email.

Option C is the best choice because it offers the most flexibility and scalability for the requirements specified. Here is a breakdown of why this option is the best:

  1. API Gateway: API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. It can handle any type of API traffic and has features such as caching, throttling, and monitoring built-in. API Gateway integrates with a wide range of AWS services, including AWS Lambda, which makes it a great choice for this use case.

  2. Lambda: Lambda is a serverless computing service that lets you run code without provisioning or managing servers. It can scale automatically based on traffic, making it perfect for handling spikes in traffic. In this option, Lambda functions will receive the clickstream data and PutRecords into Kinesis.

  3. Kinesis: Amazon Kinesis is a fully managed service that makes it easy to collect, process, and analyze real-time, streaming data. A Kinesis data stream in on-demand capacity mode adjusts its capacity to traffic automatically; in provisioned mode, you change the shard count yourself. In this option, Kinesis is used to receive and buffer the clickstream data, which is then processed by Spark on EMR.

  4. EMR: Amazon EMR is a managed service for processing large amounts of data using open-source tools such as Apache Spark, Apache Hive, and Apache Hadoop. When managed scaling or an automatic scaling policy is configured, EMR can resize the cluster based on the workload, which helps it absorb spikes. In this option, Spark on EMR performs GetRecords on Kinesis to process the clickstream data and generate reports, which are then stored in S3.

  5. S3: Amazon S3 is a highly scalable and durable object storage service. It can store and retrieve any amount of data from anywhere on the web. In this option, S3 is used to store the reports generated by Spark on EMR.

  6. Email: Finally, the reports are sent out via email to subscribers. This can be done using a simple email client or a more advanced service like Amazon SES.
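The ingestion half of the steps above (Lambda writing into Kinesis) can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function names, the `session_id` partition key, and the event shape are assumptions made for illustration. It relies on the boto3 Kinesis client's `put_records` call and on the PutRecords limit of 500 records per request, and it only counts failed records where a production handler would retry them.

```python
import json

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def to_kinesis_entries(events, key_field="session_id"):
    """Turn click events (dicts) into PutRecords entries."""
    return [
        {
            "Data": json.dumps(event).encode("utf-8"),
            # Records with the same partition key land on the same shard.
            "PartitionKey": str(event[key_field]),
        }
        for event in events
    ]


def batches(entries, size=MAX_RECORDS_PER_CALL):
    """Yield consecutive slices of at most `size` entries."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]


def put_clicks(events, stream_name, client=None):
    """Send events to the stream and return how many records failed."""
    if client is None:
        import boto3  # imported lazily so the helpers work without AWS access
        client = boto3.client("kinesis")
    failed = 0
    for batch in batches(to_kinesis_entries(events)):
        response = client.put_records(StreamName=stream_name, Records=batch)
        # PutRecords can partially fail; a real handler should retry these.
        failed += response.get("FailedRecordCount", 0)
    return failed
```

Inside a Lambda function, the handler would extract click events from the API Gateway request and pass them to `put_clicks` together with the stream name, for example from an environment variable.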

In summary, Option C chains API Gateway, Lambda, Kinesis, EMR, and S3 to ingest and process the clickstream data, then delivers the resulting reports to subscribers by email.
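The final delivery step could be sketched with Python's standard `email` package, as below. The function name, addresses, week label, and report layout are hypothetical; the sketch only builds the message, on the assumption that the aggregated counts have already been read back from S3.

```python
from email.message import EmailMessage


def build_report_email(sender, recipient, week_label, page_counts):
    """Build a plain-text weekly report, most-clicked pages first."""
    ranked = sorted(page_counts.items(), key=lambda item: (-item[1], item[0]))
    lines = [f"{page}: {count}" for page, count in ranked]

    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"Clickstream report for {week_label}"
    message.set_content("\n".join(lines))
    return message


# Hypothetical values for illustration only.
report = build_report_email(
    "reports@example.com",
    "subscriber@example.com",
    "2024-W01",
    {"pricing": 1, "home": 2},
)
```

The resulting message could then be handed to `smtplib.SMTP.send_message`, or serialized with `report.as_bytes()` for a raw-email API such as the one Amazon SES provides.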