Best Practices for Querying Redshift on AWS

Data Modeling for Redshift Tables

Question

Allianz Financial Services (AFS) is a banking group that offers end-to-end banking and financial solutions in South East Asia through its consumer banking, business banking, Islamic banking, investment finance, and stock broking businesses, as well as unit trust and asset administration. It has served the financial community for the past five decades. AFS uses Redshift on AWS for its data warehousing needs and S3 as the staging area to host files.

AFS uses other services like DynamoDB, Aurora, and Amazon RDS on remote hosts to fulfill other needs.

The data modeling team is designing the tables on Redshift and wants to adopt best practices for querying.

Please advise.

Select 4 options.

Answers

Explanations

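A. Choose the Best SORT key

B. Choose the Best Distribution Style

C. Specify compression encodings when table is created

D. Use Automatic Compression

E. Define primary key and foreign key constraints between tables wherever appropriate, even though they are only informational

F. Use CHAR/VARCHAR for Date Columns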

Answer: A, B, D, E.

Option A is correct - Amazon Redshift stores your data on disk in sorted order according to the sort key.

The Amazon Redshift query optimizer uses sort order when it determines optimal query plans.

If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key.

Queries are more efficient because they can skip entire blocks that fall outside the time range.

If you do frequent range filtering or equality filtering on one column, specify that column as the sort key.

Amazon Redshift can skip reading entire blocks of data for that column.

It can do so because it tracks the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range.
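The block-skipping behaviour described above can be illustrated with a small simulation. This is only a sketch of the idea, not Redshift's actual storage engine: the `Block` class, the block size of 10, and the sample day numbers are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A hypothetical storage block holding values of one column."""
    values: List[int]

    @property
    def min_value(self) -> int:
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def make_blocks(values: List[int], block_size: int) -> List[Block]:
    """Split a column into fixed-size blocks, preserving the given order."""
    return [Block(values[i:i + block_size]) for i in range(0, len(values), block_size)]


def blocks_to_read(blocks: List[Block], low: int, high: int) -> int:
    """Count blocks whose [min, max] range overlaps the predicate range [low, high]."""
    return sum(1 for b in blocks if b.max_value >= low and b.min_value <= high)


# Hypothetical column: day numbers 0..99, stored shuffled vs. sorted.
shuffled_days = [(i * 37) % 100 for i in range(100)]
sorted_days = sorted(shuffled_days)

shuffled_blocks = make_blocks(shuffled_days, block_size=10)
sorted_blocks = make_blocks(sorted_days, block_size=10)

# Predicate: WHERE day BETWEEN 90 AND 99 (the most recent data).
reads_sorted = blocks_to_read(sorted_blocks, 90, 99)
reads_shuffled = blocks_to_read(shuffled_blocks, 90, 99)
print(reads_sorted, reads_shuffled)
```

With this sample data, the sorted layout needs only 1 of the 10 blocks, while the shuffled layout needs 9, because recent values are scattered across almost every block.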

If you frequently join a table, specify the join column as both the sort key and the distribution key.

Doing this enables the query optimizer to choose a sort merge join instead of a slower hash join.

Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.

https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-sort-
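As a rough illustration of using the join column as both the sort key and the distribution key, the following sketch renders `CREATE TABLE` statements with matching `DISTKEY` and `SORTKEY` clauses. The helper `create_table_sql` and the `accounts` and `transactions` tables with their columns are hypothetical and not taken from the question; only the `DISTKEY (...)` and `SORTKEY (...)` table attributes are Redshift syntax.

```python
from typing import Dict


def create_table_sql(table: str, columns: Dict[str, str], join_column: str) -> str:
    """Render CREATE TABLE with join_column as both DISTKEY and SORTKEY."""
    if join_column not in columns:
        raise ValueError(f"{join_column!r} is not a column of {table!r}")
    column_lines = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE TABLE {table} (\n"
        f"{column_lines}\n"
        f")\n"
        f"DISTKEY ({join_column})\n"
        f"SORTKEY ({join_column});"
    )


# Hypothetical tables that are frequently joined on account_id.
accounts_sql = create_table_sql(
    "accounts",
    {"account_id": "BIGINT", "opened_at": "TIMESTAMP"},
    join_column="account_id",
)
transactions_sql = create_table_sql(
    "transactions",
    {"account_id": "BIGINT", "amount": "DECIMAL(18,2)", "posted_at": "TIMESTAMP"},
    join_column="account_id",
)

print(accounts_sql)
print(transactions_sql)
```

Because both tables are distributed and sorted on `account_id`, matching rows sit on the same slice in the same order, which is the situation in which a sort merge join can skip its sort phase.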

Option B is correct - When a query runs, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations.

The goal in selecting a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed.

Distribute the fact table and one dimension table on their common columns.

Choose the largest dimension based on the size of the filtered dataset.

Choose a column with high cardinality in the filtered result set.

Change some dimension tables to use ALL distribution.

https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-best-dist-key.html
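To make the effect of the distribution style concrete, here is a toy model that counts how many fact rows would have to move to another slice before a join. It is a sketch under stated assumptions: the four slices, the modulo function standing in for Redshift's internal hash, and the `customers` and `transactions` sample rows are all invented for illustration.

```python
from collections import defaultdict

NUM_SLICES = 4  # hypothetical number of slices in the cluster


def slice_for(key: int) -> int:
    """Stand-in for Redshift's internal hash: the same key always maps to the same slice."""
    return key % NUM_SLICES


def distribute_key(rows, key_index):
    """KEY distribution: place each row on the slice chosen by its key column."""
    slices = defaultdict(list)
    for row in rows:
        slices[slice_for(row[key_index])].append(row)
    return slices


def distribute_even(rows):
    """EVEN distribution: place rows round-robin, ignoring their values."""
    slices = defaultdict(list)
    for i, row in enumerate(rows):
        slices[i % NUM_SLICES].append(row)
    return slices


def distribute_all(rows):
    """ALL distribution: every slice holds a full copy of the table."""
    return {slice_id: list(rows) for slice_id in range(NUM_SLICES)}


def rows_to_redistribute(fact_slices, dim_slices):
    """Count fact rows whose matching dimension row is not on the same slice."""
    dim_locations = defaultdict(set)
    for slice_id, rows in dim_slices.items():
        for customer_id, _name in rows:
            dim_locations[customer_id].add(slice_id)
    return sum(
        1
        for slice_id, rows in fact_slices.items()
        for _txn_id, customer_id in rows
        if slice_id not in dim_locations[customer_id]
    )


# Hypothetical dimension (customer_id, name) and fact (txn_id, customer_id) rows.
customers = [(cid, f"customer-{cid}") for cid in range(1, 9)]
fact_customer_ids = [3, 1, 4, 1, 5, 2, 6, 5, 3, 5, 8, 7]
transactions = list(enumerate(fact_customer_ids))

both_on_key = rows_to_redistribute(distribute_key(transactions, 1), distribute_key(customers, 0))
fact_even = rows_to_redistribute(distribute_even(transactions), distribute_key(customers, 0))
dim_all = rows_to_redistribute(distribute_even(transactions), distribute_all(customers))
print(both_on_key, fact_even, dim_all)
```

In this toy data set, distributing both tables on the common column leaves nothing to move, an EVEN-distributed fact table leaves 8 of 12 rows on the wrong slice, and an ALL-distributed dimension removes the problem again at the cost of storing a copy on every slice.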

Option C is incorrect - The recommended practice is to let COPY choose compression encodings rather than specifying them manually when the table is created.

Automatic compression balances overall performance when choosing compression encodings.

Range-restricted scans might perform poorly if sort key columns are compressed much more highly than other columns in the same query.

As a result, automatic compression chooses a less efficient compression encoding to keep the sort key columns balanced with other columns.

https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-use-auto-compression.html

Option D is correct - Let COPY choose compression encodings.

Automatic compression balances overall performance when choosing compression encodings.

Range-restricted scans might perform poorly if sort key columns are compressed much more highly than other columns in the same query.

As a result, automatic compression chooses a less efficient compression encoding to keep the sort key columns balanced with other columns.

https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-use-auto-compression.html
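A load that lets COPY pick the encodings might look roughly like the statement rendered below. The helper name, table name, S3 path, and IAM role ARN are placeholders invented for this sketch; `COMPUPDATE ON` is the COPY parameter that requests automatic compression, which takes effect when the target table is empty.

```python
def copy_with_auto_compression(table: str, s3_path: str, iam_role_arn: str) -> str:
    """Render a COPY statement that asks Redshift to choose column encodings."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS CSV\n"
        f"COMPUPDATE ON;"
    )


# All three arguments are hypothetical placeholders, not values from the question.
copy_sql = copy_with_auto_compression(
    table="transactions",
    s3_path="s3://example-staging-bucket/transactions/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(copy_sql)
```

For this to have the intended effect, the table would be created without per-column ENCODE clauses, so that the choice is left to COPY.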

Option E is correct - Define primary key and foreign key constraints between tables wherever appropriate.

Even though they are informational only, the query optimizer uses those constraints to generate more efficient query plans.

Do not define primary key and foreign key constraints unless your application enforces the constraints.

Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints.

https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-defining-constraints.html
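Since the constraints are informational only, the application or load process has to guarantee them. One possible pre-load check is sketched below; the function names and the sample `accounts` and `transactions` rows are hypothetical, and a real pipeline might run equivalent checks in SQL against the staging data instead.

```python
from collections import Counter
from typing import Dict, List


def duplicate_keys(rows: List[Dict], key: str) -> List:
    """Return key values that appear more than once (primary key violations)."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def orphaned_keys(child_rows: List[Dict], fk: str, parent_rows: List[Dict], pk: str) -> List:
    """Return foreign key values with no matching parent row."""
    parent_keys = {row[pk] for row in parent_rows}
    return sorted({row[fk] for row in child_rows} - parent_keys)


# Hypothetical staged rows, checked before they are loaded.
accounts = [{"account_id": 1}, {"account_id": 2}, {"account_id": 2}]
transactions = [
    {"txn_id": 10, "account_id": 1},
    {"txn_id": 11, "account_id": 3},
]

dupes = duplicate_keys(accounts, "account_id")
orphans = orphaned_keys(transactions, "account_id", accounts, "account_id")
print(dupes, orphans)
```

If either list is non-empty, declaring the corresponding constraint would tell the optimizer something that is not true of the data, so the load should be rejected or cleaned first.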

Option F is incorrect - Amazon Redshift stores DATE and TIMESTAMP data more efficiently than CHAR or VARCHAR, which results in better query performance.

Use the DATE or TIMESTAMP data type, depending on the resolution you need, rather than a character type when storing date/time information.

https://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-timestamp-date-columns.html
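Beyond storage efficiency, character columns also compare dates as text, which can give wrong orderings and range filters. The short sketch below shows the effect using hypothetical M/D/YYYY strings; it demonstrates the general pitfall in Python rather than Redshift's own comparison code.

```python
from datetime import date, datetime

# Hypothetical date values as they might be stored in a VARCHAR column.
raw = ["9/15/2023", "10/02/2023", "1/30/2024"]

# Sorting as text compares character by character, not chronologically.
as_text_sorted = sorted(raw)

# Sorting as real dates gives chronological order.
as_dates_sorted = sorted(datetime.strptime(s, "%m/%d/%Y").date() for s in raw)

print(as_text_sorted)
print(as_dates_sorted)
```

Here the text ordering puts the 2024 date first and the earliest date last, whereas the typed values sort chronologically.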

A detailed explanation of each option follows:

A. Choose the Best SORT key: The sort key determines the order in which data is physically stored on disk, which directly affects query performance. A good sort key is a column that queries frequently use in range or equality filters (for example, a timestamp column when recent data is queried most often) or in joins. Because Redshift tracks the minimum and maximum values in each block, a well-chosen sort key lets it skip blocks that cannot match the predicate.

B. Choose the Best Distribution Style: The distribution style determines how rows are distributed across the compute nodes in the Redshift cluster. The right choice depends on query and join patterns. If two tables are frequently joined, distribute both on the join column so that matching rows are collocated. For dimension tables that cannot be collocated with the fact table this way, ALL distribution places a full copy on every node, which avoids redistribution during joins at the cost of extra storage and slower loads.

C. Specify compression encodings when table is created: Compression reduces the storage a table requires and improves query performance by reducing the I/O needed to read data from disk. Encodings can be specified manually per column, but that is not the recommended practice here: hand-picked encodings can leave sort key columns compressed far more heavily than other columns, which hurts range-restricted scans. To review encodings, ANALYZE COMPRESSION reports suggested encodings for a table that already contains data.

D. Use Automatic Compression: Redshift can choose a compression encoding for each column automatically. When COPY loads data into an empty table with automatic compression enabled, it samples the data and picks encodings that balance overall performance. This is the recommended approach, particularly when the optimal encoding is difficult to determine by hand.

E. Define primary key and foreign key constraints between tables wherever appropriate, even though they are only informational: Defining primary key and foreign key constraints can improve query performance because the query optimizer uses them to generate more efficient query plans. Redshift does not enforce these constraints, so define them only when the application or load process guarantees that they hold. The constraints also provide useful metadata for understanding the structure of the data.

F. Use CHAR/VARCHAR for Date Columns: This is not recommended. Redshift stores DATE and TIMESTAMP data more efficiently than CHAR or VARCHAR, which results in better query performance. Use DATE or TIMESTAMP for date/time columns, depending on the resolution you need.

In summary, the best practices for querying in Redshift include choosing the best sort key and distribution style, letting COPY choose compression encodings automatically, defining primary key and foreign key constraints where the application enforces them, and using DATE or TIMESTAMP rather than character types for date/time columns.