Search Engine for Public Legal Documents: Ingestion and Optimization

Flexible and Cost-Efficient Design for Ingesting and Searching a 150 TB Dataset


Question

Your company has landed a contract to build a search engine for public legal documents.

The dataset is around 150 TB in size and is available at the customer's data center in various formats.

Part of the dataset is stored on tapes, and the rest is stored on disks.

Some of the dataset is very old, dating back nearly 15 years, and is stored in compressed format to save disk space.

Management has assigned you the task of coming up with a flexible and cost-efficient design to ingest the data and make it available for the front-end application to search through efficiently.

Which two sequential steps should you choose to accomplish this task?

Answers

Explanations


A. Set up Snowball Edge Storage Optimized devices to migrate the data to S3.
B. Configure AWS Batch to process the data from S3, send it to Kinesis Data Firehose, and save it to Amazon OpenSearch Service.
C. Set up a VPN connection and transfer the data to AWS S3 over the weekend.
D. Load the data into EFS and create Auto Scaling EC2 instances to read through the data and save it into Amazon RDS for querying.
E. Set up a Direct Connect connection to transfer the data from on-premises servers to S3.

Correct Answer: A and B.

Option A is CORRECT because Snowball Edge Storage Optimized devices are designed to transfer large amounts of data into AWS data centers, for either one-time or periodic transfers.

Option B is CORRECT because AWS Batch lets you run custom code to decompress the data and deliver the output to Amazon OpenSearch Service via Kinesis Data Firehose.
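As a concrete illustration, a Batch job for this pipeline could be kicked off programmatically. The sketch below uses boto3 and assumes a job queue and job definition with the placeholder names shown are already registered in the account:

```python
import boto3

batch = boto3.client("batch", region_name="us-east-1")

# Queue, job definition, bucket, and stream names are hypothetical
# placeholders for illustration only.
response = batch.submit_job(
    jobName="decompress-legal-docs",
    jobQueue="legal-docs-queue",
    jobDefinition="decompress-and-index",
    containerOverrides={
        "environment": [
            {"name": "SOURCE_BUCKET", "value": "legal-docs-raw"},
            {"name": "FIREHOSE_STREAM", "value": "legal-docs-to-opensearch"},
        ]
    },
)
print("Submitted Batch job:", response["jobId"])
```

In practice this would likely be an array job, so each child job processes its own slice of the S3 keys in parallel.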

Option C is INCORRECT because a VPN rides the company's standard internet line; moving 150 TB that way would take weeks, not a weekend.

Option D is INCORRECT because RDS is a relational database and is not suited to full-text search across a 150 TB document corpus.

Option E is INCORRECT because a Direct Connect setup involves significant provisioning lead time and ongoing port charges, so it is not cost-effective for a one-time transfer.

References:

https://docs.aws.amazon.com/opensearch-service/latest/developerguide/integrations.html
https://aws.amazon.com/blogs/compute/orchestrating-an-application-process-with-aws-batch-using-aws-cloudformation/
https://aws.amazon.com/blogs/developer/orchestrating-an-application-process-with-aws-batch-using-aws-cdk/

To ingest the legal documents and make them available for the front-end application to search through efficiently, follow these sequential steps:

Step 1: Transfer the data to Amazon S3 with Snowball Edge

The first step is to get the dataset into Amazon S3. The data sits in the customer's data center in various formats, partly on tapes, and is far too large to push over the public internet in any reasonable time. Snowball Edge Storage Optimized devices are purpose-built for this: AWS ships the devices to the customer, the data is loaded locally, and AWS imports it into S3 when the devices are returned. This moves 150 TB securely without saturating the customer's network or disrupting daily operations.

Option A is the correct first step: Set up Snowball Edge Storage Optimized devices to migrate the data to S3.
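A Snowball import job can be created through the API as well as the console. The sketch below is a minimal boto3 example; the bucket name, role ARN, and address ID are placeholders, and the device-type and capacity strings are my best guesses at the Storage Optimized values, so verify them against the current Snowball API documentation:

```python
import boto3

snowball = boto3.client("snowball", region_name="us-east-1")

# All identifiers below are placeholders. A 150 TB dataset exceeds one
# device's capacity, so this call would be repeated once per device.
response = snowball.create_job(
    JobType="IMPORT",
    SnowballType="EDGE_S",              # assumed: Storage Optimized device
    SnowballCapacityPreference="T80",   # assumed capacity tier
    Resources={
        "S3Resources": [
            {"BucketArn": "arn:aws:s3:::legal-docs-raw"}  # destination bucket
        ]
    },
    AddressId="ADID00000000-0000-0000-0000-000000000000",  # shipping address
    RoleARN="arn:aws:iam::123456789012:role/SnowballImportRole",
    ShippingOption="SECOND_DAY",
    Description="Legal documents import, device 1 of N",
)
print("Created Snowball job:", response["JobId"])
```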

Step 2: Process and index the data with AWS Batch and Amazon OpenSearch Service

The next step is to process the data and load it into the search cluster. AWS Batch can run containerized jobs that read the objects from S3, decompress the older archives, and send the extracted documents to Kinesis Data Firehose. Firehose then delivers the processed records to Amazon OpenSearch Service, where the front-end application can search them efficiently.

Option B is the correct second step: Configure AWS Batch to process the data from S3, send it to Kinesis Data Firehose, and save it to Amazon OpenSearch Service.
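The worker code each Batch container runs could look roughly like the following sketch. It assumes gzip-compressed objects, a placeholder bucket, and a placeholder Firehose delivery stream that is already configured with an OpenSearch Service destination:

```python
import gzip
import json
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

BUCKET = "legal-docs-raw"            # placeholder source bucket
STREAM = "legal-docs-to-opensearch"  # placeholder Firehose stream

def process_object(key: str) -> None:
    """Download one compressed document, decompress it, and stream it
    to Kinesis Data Firehose for delivery to OpenSearch Service."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    text = gzip.decompress(body).decode("utf-8", errors="replace")
    record = {"source_key": key, "content": text}
    # Firehose caps each record at 1,000 KiB; real code would chunk
    # larger documents before sending.
    firehose.put_record(
        DeliveryStreamName=STREAM,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )

# Each Batch array-job child would iterate over its own slice of keys.
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        if obj["Key"].endswith(".gz"):
            process_object(obj["Key"])
```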

Rejected option: VPN transfer (Option C)

Transferring the data over a VPN connection is not viable at this scale. A VPN runs over the company's existing internet line, and even a fully saturated 1 Gbps link would need roughly two weeks to move 150 TB, so a weekend transfer is impossible. Part of the dataset is also on tapes, which would first need to be restored to disk before any network transfer could begin.

Option C is therefore incorrect: Set up a VPN connection and transfer the data to AWS S3 over the weekend.
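A quick back-of-the-envelope calculation makes the point. It assumes an idealized link running at full saturation, ignoring protocol overhead, retries, and tape-restore time:

```python
# Idealized transfer time for 150 TB over a saturated network link.
DATASET_BITS = 150e12 * 8  # 150 TB (decimal) in bits

for label, gbps in [("1 Gbps VPN", 1), ("10 Gbps Direct Connect", 10)]:
    days = DATASET_BITS / (gbps * 1e9) / 86_400
    print(f"{label}: ~{days:.1f} days")  # ~13.9 and ~1.4 days
```

Note that even the 10 Gbps figure matters for the Direct Connect discussion below: the link itself would be fast enough, so the objection there is cost and lead time, not throughput.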

Rejected option: EFS and RDS (Option D)

Another alternative is to load the data into EFS and use Auto Scaling EC2 instances to read through it and write it into Amazon RDS for querying. This fails on two counts: a relational database is a poor fit for full-text search over unstructured legal documents, and keeping 150 TB in EFS alongside an RDS deployment of that scale would cost far more than the S3/OpenSearch pipeline.

Option D is not the right answer: Load the data into EFS and create Auto Scaling EC2 instances to read through the data and save it into Amazon RDS for querying.

Rejected option: Direct Connect (Option E)

Finally, a Direct Connect link could move the data from the on-premises servers to S3. Direct Connect provides a dedicated network connection from the customer's data center to AWS with high, consistent throughput. However, it takes time to provision, incurs ongoing port charges, and requires proximity to a Direct Connect location, so it is not cost-effective for a one-time migration.

Option E is not the best answer: Set up a Direct Connect connection to transfer the data from on-premises servers to S3.

In conclusion, the best sequence of steps to ingest the legal documents and make them available for the front-end application to search efficiently is:

Step 1: Set up Snowball Edge Storage Optimized devices to migrate the data to S3.

Step 2: Configure AWS Batch to process the data from S3, send it to Kinesis Data Firehose, and save it to Amazon OpenSearch Service.