AWS Certified Big Data - Specialty: Efficient Implementation Step for Redshift and PostgreSQL Data Sync


Question

A company currently has a database cluster set up in Amazon Redshift.

They also have an Amazon RDS PostgreSQL database in place.

A table has been set up in PostgreSQL that stores data based on a timestamp.

The requirement now is to ensure that the data from the PostgreSQL table is copied into the Redshift database.

For this, a staging table has been set up in Redshift.

It needs to be ensured that the data lag between the staging and PostgreSQL tables is not greater than 4 hours.

Which of the following is the most efficient implementation step you would use for this requirement?

Answer - C.

Explanation

Such a use case is clearly mentioned in the documentation.


For further reading, check out Bob Strahan's blog post, Query Routing and Rewrite: Introducing pgbouncer-rr for Amazon Redshift and PostgreSQL.

RDS PostgreSQL includes two extensions to execute queries remotely.

The first extension is the PostgreSQL foreign-data wrapper, postgres_fdw.

The postgres_fdw module enables the creation of external tables.

External tables can be queried in the same way as a local native table. However, the query is not currently executed entirely on the remote side, because postgres_fdw doesn't push down aggregate functions and LIMIT clauses.

When you perform an aggregate query through an external table, all the data is pulled into PostgreSQL for an aggregation step.

This is unacceptably slow for any meaningful number of rows.
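As a rough illustration of the limitation described above, the postgres_fdw setup on the RDS PostgreSQL side might look like the sketch below. The server host, credentials, and table columns are placeholders invented for this example, not details from the question.

```sql
-- Hypothetical postgres_fdw setup on the RDS PostgreSQL instance.
-- Host, credentials, and columns are placeholders.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER redshift_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'example-cluster.abc123.us-east-1.redshift.amazonaws.com',
             port '5439', dbname 'analytics');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER redshift_server
    OPTIONS (user 'rs_user', password 'rs_password');

-- The foreign table can be queried like a local table ...
CREATE FOREIGN TABLE staging_events_remote (
    event_id   bigint,
    created_at timestamp
)
    SERVER redshift_server
    OPTIONS (schema_name 'public', table_name 'staging_events');

-- ... but an aggregate like this pulls every row back into PostgreSQL
-- and aggregates locally, which is the slowness described above.
SELECT count(*) FROM staging_events_remote;
```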

The second extension is dblink, which includes a function also called dblink.

The dblink function allows the entire query to be pushed to Amazon Redshift.

This lets Amazon Redshift do what it does best: query large quantities of data efficiently and return the results to PostgreSQL for further processing.
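A minimal sketch of the dblink pattern, reusing the hypothetical redshift_server definition from the previous example, could look like the following. The inner query text is shipped to Amazon Redshift and executed there in full, so only the aggregated result travels back to PostgreSQL.

```sql
-- Hypothetical dblink query from RDS PostgreSQL against Amazon Redshift.
-- The server name, table, and columns are placeholders.
CREATE EXTENSION IF NOT EXISTS dblink;

SELECT *
FROM dblink(
         'redshift_server',
         'SELECT count(*), max(created_at) FROM staging_events'
     ) AS remote(row_count bigint, latest_ts timestamp);
```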


Since this is clearly mentioned in the documentation, all other options are invalid.

For more information on this use case, please refer to the below URL.

https://aws.amazon.com/blogs/big-data/join-amazon-redshift-and-amazon-rds-postgresql-with-dblink/

Option A: Create a trigger in the PostgreSQL table to send new data to a Kinesis stream. Ensure the data is transferred from the Kinesis Stream to the staging table in Redshift.

This option sets up a trigger on the PostgreSQL table that pushes every new row to a Kinesis stream, from which the data is then loaded into the Redshift staging table. A streaming pipeline like this could certainly keep the lag far below 4 hours, but it means maintaining a trigger on every write, a Kinesis stream, and a consumer that loads Redshift, which is considerably more infrastructure than the requirement calls for. It is therefore not the most efficient implementation.
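For context, the PostgreSQL side of this option already needs extra machinery before Kinesis is even involved. The following is a hedged sketch (all names are placeholders) of a trigger-based capture using an outbox table; the producer that reads the outbox and writes to the Kinesis stream is not shown.

```sql
-- Hypothetical trigger-based capture on RDS PostgreSQL. A separate producer
-- process (not shown) would read source_events_outbox and put the records
-- onto the Kinesis stream before they reach the Redshift staging table.
CREATE TABLE source_events_outbox (LIKE source_events);

CREATE OR REPLACE FUNCTION capture_new_event() RETURNS trigger AS $$
BEGIN
    INSERT INTO source_events_outbox SELECT NEW.*;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER source_events_to_outbox
    AFTER INSERT ON source_events
    FOR EACH ROW EXECUTE FUNCTION capture_new_event();
```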

Option B: Create a SQL query that is run every hour to check for new data. Use the query results to send the new data to the staging table.

This option schedules a SQL query every hour to check the PostgreSQL table for new rows and then sends those rows to the staging table in Redshift. An hourly schedule comfortably satisfies the 4-hour lag requirement, but the option says nothing about how the query results actually reach Redshift, so a separate transfer mechanism would still have to be built and operated. It is workable, but not the most efficient choice.

Option C: Use the extensions available in PostgreSQL and use the dblink facility.

This option uses the dblink extension available in RDS PostgreSQL to connect directly to Amazon Redshift, so new rows from the timestamped table can be pushed into the Redshift staging table with a simple scheduled statement run comfortably within the 4-hour window. Because it relies only on extensions that already ship with RDS PostgreSQL and needs no additional infrastructure, it is the most efficient implementation for this requirement.
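The following is a hedged sketch of such a periodic sync, run on the RDS PostgreSQL instance (for example from an hourly scheduled job). The table names source_events and staging_events, the columns, and the redshift_server connection are assumptions made for illustration; in practice, very large batches would more typically be staged through Amazon S3 and a COPY command.

```sql
-- Hypothetical periodic sync: push rows newer than the last synced timestamp
-- from the local PostgreSQL table into the Redshift staging table via dblink.
DO $$
DECLARE
    last_sync  timestamp;
    insert_sql text;
BEGIN
    -- Ask Redshift for the newest timestamp already present in the staging table.
    SELECT ts INTO last_sync
    FROM dblink('redshift_server',
                'SELECT coalesce(max(created_at), ''1970-01-01''::timestamp) FROM staging_events')
         AS t(ts timestamp);

    -- Build a single multi-row INSERT from the local rows that are still missing.
    SELECT 'INSERT INTO staging_events (event_id, created_at) VALUES '
           || string_agg(format('(%s, %L)', event_id, created_at), ', ')
    INTO insert_sql
    FROM source_events
    WHERE created_at > last_sync;

    -- Run the INSERT on Amazon Redshift only when there is something to send.
    IF insert_sql IS NOT NULL THEN
        PERFORM dblink_exec('redshift_server', insert_sql);
    END IF;
END;
$$;
```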

Option D: Create a trigger in the PostgreSQL table to send new data to a Kinesis Firehose stream. Ensure the data is transferred from the Kinesis Firehose Stream to the staging table in Redshift.

This option is similar to Option A, but uses an Amazon Kinesis Data Firehose delivery stream instead of a plain Kinesis stream. Firehose can deliver directly to Redshift (it buffers the data in S3 and issues COPY commands), so the 4-hour target could be met, but a database trigger feeding a managed streaming service is still far more machinery and operational overhead than a 4-hour lag requirement demands. It is therefore also not the most efficient implementation.

In summary, Option C is the most efficient implementation: the dblink facility lets RDS PostgreSQL and Amazon Redshift exchange data directly with a simple scheduled statement, easily keeping the staging table within the 4-hour lag and without any additional infrastructure. Options A and D would work but add triggers, streaming services, and extra operational overhead, and Option B still requires a separate mechanism to load the query results into Redshift.