AWS Redshift Distribution Design for Big Data Analytics | BDS-C00 Exam Answer

Designing Distribution for AWS Redshift to Optimize Big Data Analytics

Question

Hymutabs Ltd (Hymutabs) is a global environmental solutions company running its operations in Asia Pacific, the Middle East, Africa and the Americas.

It maintains more than 10 exploration labs around the world, including a knowledge centre, an "innovative process development centre" in Singapore, a materials and membrane products development centre as well as advanced machining, prototyping and industrial design functions. Hymutabs hosts their existing enterprise infrastructure on AWS and runs multiple applications to address the product life cycle management. The datasets are available in Aurora, RDS and S3 in file format.

Hymutabs Management team is interested in building analytics around product life cycle and advanced machining, prototyping and other functions. The IT team proposed Redshift to fulfill the EDW and analytics requirements.

They adapt the modeling approaches laid out by Bill Inmon and Ralph Kimball to design the solution efficiently.

The team understands that the data loaded into Redshift will run to terabytes. They have identified multiple massive dimensions, facts, and summaries with millions of records, and are working on establishing best practices to address the design concerns. There are 6 tables that they are currently working on:

ORDER_FCT is a fact table with billions of rows related to orders.

SALES_FCT is a fact table with billions of rows related to sales transactions. This table is specifically used to generate EOD (End of Day), EOW (End of Week), and EOM (End of Month) reports, and also for sales queries.

CUST_DIM is a dimension table with billions of rows related to customers. It is a Type 2 dimension table.

PART_DIM is a part dimension table with billions of records that defines the materials that were ordered.

DATE_DIM is a dimension table.

SUPPLIER_DIM holds information about the suppliers Hymutabs works with.

One of the key requirements is that ORDER_FCT and PART_DIM are joined together in most order-related queries.

ORDER_FCT has many other dimensions to support analysis. How would you design the distribution? Select 1 option.

Answers

Explanations


A. B. C. D. E.

Answer : D.

Option A is incorrect - KEY DISTRIBUTION distributes the rows according to the values in one column.

Queries trigger a lot of data redistribution when ORDER_FCT and PART_DIM are not distributed on the same key.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html

Option B is incorrect - ALL distribution makes a copy of the entire table in every compute node.

With billions of records in each table, this is not the right design approach.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
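The storage cost of ALL distribution can be estimated with simple arithmetic. The sketch below is only an illustration: the table size (500 GB) and the cluster size (8 nodes) are hypothetical figures, not values from the question.

```python
# Rough storage estimate for ALL distribution versus a distributed table.
# The node count and table size are hypothetical, chosen only for illustration.

def all_distribution_storage_gb(table_size_gb: float, node_count: int) -> float:
    """ALL keeps one full copy of the table on every node."""
    return table_size_gb * node_count


def distributed_storage_gb(table_size_gb: float) -> float:
    """KEY and EVEN store each row once, spread across the cluster."""
    return table_size_gb


table_size_gb = 500.0  # assumed size of one billion-row table
node_count = 8         # assumed cluster size

all_total = all_distribution_storage_gb(table_size_gb, node_count)
spread_total = distributed_storage_gb(table_size_gb)
print(f"ALL: {all_total:.0f} GB total, KEY/EVEN: {spread_total:.0f} GB total")
```

Under these assumed numbers, the ALL copy takes eight times the space of the same table stored once, and every load or update has to be applied to each copy.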

Option C is incorrect - EVEN DISTRIBUTION evenly distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.

EVEN distribution is appropriate when a table does not participate in joins.

ORDER_FCT and PART_DIM are joined in most queries, so this is not the right approach.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
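A toy model can show why round-robin placement works against joins. The sketch below is not Redshift code. The slice count, the rows, and the column name part_id are all hypothetical.

```python
# Toy model of EVEN distribution: rows go to slices round-robin,
# ignoring column values. Slice count and row data are hypothetical.
from itertools import cycle


def distribute_even(rows, slice_count):
    """Assign rows to slices in round-robin order."""
    slices = [[] for _ in range(slice_count)]
    for row, slice_id in zip(rows, cycle(range(slice_count))):
        slices[slice_id].append(row)
    return slices


# Six order rows that all reference the same part (illustrative column name).
orders = [{"order_id": i, "part_id": 42} for i in range(6)]
slices = distribute_even(orders, slice_count=3)

sizes = [len(s) for s in slices]
slices_holding_part_42 = sum(
    1 for s in slices if any(r["part_id"] == 42 for r in s)
)
print(sizes, slices_holding_part_42)
```

The rows are balanced across the slices, but the orders for a single part end up on every slice, so joining them to that part's dimension row means moving data.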

Option D is correct - KEY DISTRIBUTION distributes the rows according to the values in one column.

With both tables distributed on the same key, matching rows are stored on the same slice, so the join needs no redistribution.

This is the best design approach.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
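The collocation effect can be sketched with a small simulation. This is a toy model under stated assumptions: the modulo function stands in for Redshift's internal hash, and the slice count, rows, and column names (part_id, order_id) are hypothetical.

```python
# Toy model of KEY distribution: a row's slice is derived from its
# distribution key value. The modulo "hash" is a stand-in, not Redshift's
# actual hash function; table contents and column names are hypothetical.

SLICE_COUNT = 4


def slice_for(key_value: int) -> int:
    """Map a key value to a slice (stand-in for the real hash)."""
    return key_value % SLICE_COUNT


def distribute_by_key(rows, key):
    """Place each row on the slice chosen by its key column."""
    slices = {i: [] for i in range(SLICE_COUNT)}
    for row in rows:
        slices[slice_for(row[key])].append(row)
    return slices


part_dim = [{"part_id": p, "name": f"part-{p}"} for p in range(1, 9)]
order_fct = [{"order_id": o, "part_id": (o % 8) + 1} for o in range(100, 120)]

part_slices = distribute_by_key(part_dim, "part_id")
# Both tables on the same key: join partners share a slice.
orders_by_part = distribute_by_key(order_fct, "part_id")
# Fact table on its own primary key instead: join partners are scattered.
orders_by_pk = distribute_by_key(order_fct, "order_id")


def rows_needing_movement(order_slices):
    """Count order rows whose matching part row sits on another slice."""
    part_location = {
        r["part_id"]: s for s, rows in part_slices.items() for r in rows
    }
    return sum(
        1
        for s, rows in order_slices.items()
        for r in rows
        if part_location[r["part_id"]] != s
    )


moved_same_key = rows_needing_movement(orders_by_part)
moved_own_pk = rows_needing_movement(orders_by_pk)
print(moved_same_key, moved_own_pk)
```

When both tables use part_id as the key, no order row has to move to meet its part row. When the fact table is keyed on its own primary key, the join partners no longer line up.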

Option E is incorrect - EVEN DISTRIBUTION evenly distributes the rows across the slices in a round-robin fashion, regardless of the values in any particular column.

EVEN distribution is appropriate when a table does not participate in joins.

ORDER_FCT and PART_DIM are joined in most queries, so this is not the right approach.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html

The design of distribution in Amazon Redshift plays a crucial role in determining the performance and scalability of the system. The distribution key defines how data is distributed across the nodes in the Redshift cluster, which impacts how data is accessed and joined. Therefore, it is important to carefully select the distribution style for each table based on the usage patterns and query requirements.

In this scenario, the ORDER_FCT and PART_DIM tables are being used in most of the order related queries, so they need to be distributed on the same key.

Option A is incorrect because it suggests distributing ORDER_FCT and PART_DIM with KEY distribution on their respective primary keys. Because the two primary keys are different columns, matching rows land on different slices, and every join between the tables has to redistribute data across the cluster.

Option B is incorrect because it suggests distributing ORDER_FCT and PART_DIM with ALL distribution on their respective primary keys. ALL distribution does not use a key at all: it copies the entire contents of a table to every node in the cluster. For tables with billions of rows, this multiplies the storage required by the number of nodes and makes loads and updates much slower.

Option C is incorrect because it suggests distributing ORDER_FCT and PART_DIM with EVEN distribution on their respective primary keys. EVEN distribution ignores column values and spreads rows round-robin across the slices, so matching rows are not stored together and joins between tables with billions of rows require heavy redistribution.

Option D is correct because it suggests distributing ORDER_FCT and PART_DIM on the same key with KEY distribution. Rows with the same key value are stored on the same slice, so the frequent join between the two tables can run without moving data between nodes. The join key should have high cardinality and a reasonably even spread of values so that KEY distribution does not skew the data.

Option E is incorrect because it suggests distributing ORDER_FCT and PART_DIM on the same key with EVEN distribution. EVEN distribution ignores column values, so naming a common key has no effect: rows are spread round-robin, matching ORDER_FCT and PART_DIM rows are not collocated, and the join still requires redistribution.

Therefore, the correct answer is option D: distribute ORDER_FCT and PART_DIM with KEY distribution on their common join key. This collocates the rows that are joined most often, which minimizes network traffic and improves query performance.
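As a minimal sketch of how this design could be written down, the Python helper below builds Redshift CREATE TABLE statements that put both tables on one distribution key. Only the table names come from the scenario. The column names, column types, and the helper create_table_sql are hypothetical.

```python
# Builds CREATE TABLE statements that place both tables on the same
# distribution key. Column names and types are hypothetical; only the
# table names come from the scenario.

def create_table_sql(table: str, columns: dict, distkey: str) -> str:
    """Return a CREATE TABLE statement using KEY distribution on distkey."""
    if distkey not in columns:
        raise ValueError(f"{distkey} is not a column of {table}")
    cols = ",\n".join(f"    {name} {sqltype}" for name, sqltype in columns.items())
    return (
        f"CREATE TABLE {table} (\n{cols}\n)\n"
        f"DISTSTYLE KEY\nDISTKEY ({distkey});"
    )


part_dim_sql = create_table_sql(
    "part_dim",
    {"part_key": "BIGINT NOT NULL", "part_name": "VARCHAR(200)"},
    distkey="part_key",
)
order_fct_sql = create_table_sql(
    "order_fct",
    {
        "order_key": "BIGINT NOT NULL",
        "part_key": "BIGINT NOT NULL",  # join column shared with part_dim
        "order_qty": "INTEGER",
    },
    distkey="part_key",
)

print(part_dim_sql)
print(order_fct_sql)
```

Both generated statements name the same DISTKEY column, which is what lets the join between the two tables run slice-locally.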