AWS Redshift Table Design for Big Data Analytics | Best Practices for Hymutabs Ltd

Redshift Table Distribution Styles for SALES_FCT and DATE_DIM Tables

Question

Hymutabs Ltd (Hymutabs) is a global environmental solutions company running its operations in Asia Pacific, the Middle East, Africa and the Americas.

It maintains more than 10 exploration labs around the world, including a knowledge centre, an "innovative process development centre" in Singapore, a materials and membrane products development centre as well as advanced machining, prototyping and industrial design functions. Hymutabs hosts its existing enterprise infrastructure on AWS and runs multiple applications to address product life cycle management.

The datasets are available in Aurora, RDS and S3 in file format.

The Hymutabs management team is interested in building analytics around the product life cycle, advanced machining, prototyping and other functions. The IT team proposed Redshift to fulfill the EDW and analytics requirements.

They adopt the modeling approaches laid out by Bill Inmon and Ralph Kimball to design the solution efficiently.

The team understands that the data loaded into Redshift would be in terabytes. They have identified multiple massive dimensions, facts, and summaries of millions of records, and are working on establishing best practices to address the design concerns. There are 6 tables that they are currently working on:

ORDER_FCT is a fact table with billions of rows related to orders.

SALES_FCT is a fact table with billions of rows related to sales transactions. This table is specifically used to generate EOD (End of Day), EOW (End of Week), and EOM (End of Month) reports, and also for sales queries.

CUST_DIM is a dimension table with billions of rows related to customers. It is a TYPE 2 dimension table.

PART_DIM is a part dimension table with billions of records that defines the materials that were ordered.

DATE_DIM is a dimension table.

SUPPLIER_DIM holds the information about the suppliers that Hymutabs works with.

SALES_FCT and DATE_DIM are joined together frequently since EOD sales reports are generated every day.

Please suggest a distribution style for both tables.

Select 1 option.

Answers

Explanations


A. B. C. D. E.

Answer: C.

Option A is incorrect - KEY distribution places rows according to the values in one column.

This is a sound approach for a large fact table, but DATE_DIM holds very few records, and with KEY distribution on it a lot of data is still copied between nodes when it is joined.

This approach is acceptable but not the best design for the solution.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
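To make the idea of KEY distribution concrete, here is a minimal Python sketch that places rows on slices by hashing one column. It is only an illustration: the slice count, the `sale_id` and `date_key` column names, and the use of `zlib.crc32` are assumptions, and Redshift's actual hash function is internal.

```python
import zlib
from collections import defaultdict

NUM_SLICES = 4  # hypothetical slice count for the illustration


def slice_for(value, num_slices=NUM_SLICES):
    """Map a key value to a slice with a stable hash.

    zlib.crc32 is only a stand-in; Redshift's real hash function is internal.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_slices


def distribute_by_key(rows, key, num_slices=NUM_SLICES):
    """Group rows by the slice their distribution-key value hashes to."""
    placement = defaultdict(list)
    for row in rows:
        placement[slice_for(row[key], num_slices)].append(row)
    return placement


# Hypothetical rows: three sales on one date, two on another.
sales_rows = [
    {"sale_id": 1, "date_key": 20240101},
    {"sale_id": 2, "date_key": 20240101},
    {"sale_id": 3, "date_key": 20240101},
    {"sale_id": 4, "date_key": 20240102},
    {"sale_id": 5, "date_key": 20240102},
]

placement = distribute_by_key(sales_rows, "date_key")
for slice_id in sorted(placement):
    print(slice_id, [row["sale_id"] for row in placement[slice_id]])
```

Rows that share a key value always land on the same slice. With only a handful of distinct key values, some slices may receive nothing at all, which hints at why KEY distribution tends to suit a very small table poorly.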

Option B is incorrect - EVEN distribution spreads the rows across the slices in a round-robin fashion, regardless of the values in any particular column.

EVEN distribution is appropriate when a table does not participate in joins.

For a fact table like SALES_FCT, all the nodes participate in every query even though an EOD report covers only that particular day.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
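The round-robin behaviour can be sketched in a few lines of Python. The example below is a toy model under stated assumptions (4 slices, 8 hypothetical dates, 40 hypothetical sales rows, invented column names); it shows that EVEN distribution balances row counts but leaves a fact row next to its matching dimension row only by chance.

```python
NUM_SLICES = 4  # hypothetical slice count


def distribute_even(rows, num_slices=NUM_SLICES):
    """Assign rows to slices round-robin, ignoring column values."""
    slices = [[] for _ in range(num_slices)]
    for index, row in enumerate(rows):
        slices[index % num_slices].append(row)
    return slices


# Hypothetical data: 8 dates, 5 sales rows per date.
date_dim = [{"date_key": d} for d in range(8)]
sales_fct = [{"sale_id": i, "date_key": (i // 5) % 8} for i in range(40)]

dim_slices = distribute_even(date_dim)
fact_slices = distribute_even(sales_fct)

# Slice on which each DATE_DIM row lives.
dim_home = {
    row["date_key"]: slice_id
    for slice_id, rows in enumerate(dim_slices)
    for row in rows
}

# Fact rows whose matching dimension row happens to be on the same slice.
local_rows = sum(
    1
    for slice_id, rows in enumerate(fact_slices)
    for row in rows
    if dim_home[row["date_key"]] == slice_id
)

print("rows per slice:", [len(rows) for rows in fact_slices])
print("fact rows joined locally:", local_rows, "of", len(sales_fct))
```

Every fact row that is not counted as local would need its matching dimension row moved between slices at query time, which is the join cost the explanation above refers to.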

Option C is correct - ALL distribution makes a copy of the entire table on every compute node.

For billion-record tables this is not the right approach, but it is the ideal design for the DATE_DIM table, which has a very low number of records and can be copied to all nodes.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
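As a rough illustration of how such a design might be written down, the Python sketch below builds Redshift-style `CREATE TABLE` statements with a `DISTSTYLE` clause. The helper `build_create_table`, the table layouts, and the column names are hypothetical; the choice of `sale_id` as the distribution key for SALES_FCT is an assumption standing in for whichever column the team selects. `DISTSTYLE AUTO` also exists in Redshift but is left out here for brevity.

```python
VALID_STYLES = {"EVEN", "KEY", "ALL"}


def build_create_table(name, columns, diststyle, distkey=None):
    """Return a CREATE TABLE statement with a distribution clause.

    columns is a list of (column_name, column_type) pairs.
    """
    style = diststyle.upper()
    if style not in VALID_STYLES:
        raise ValueError(f"unsupported distribution style: {diststyle}")
    if style == "KEY" and distkey is None:
        raise ValueError("KEY distribution needs a distkey column")
    if style != "KEY" and distkey is not None:
        raise ValueError("distkey only applies to KEY distribution")
    if distkey is not None and distkey not in [col for col, _ in columns]:
        raise ValueError(f"distkey {distkey} is not a column of {name}")

    column_sql = ",\n".join(f"    {col} {col_type}" for col, col_type in columns)
    ddl = f"CREATE TABLE {name} (\n{column_sql}\n)\nDISTSTYLE {style}"
    if style == "KEY":
        ddl += f"\nDISTKEY ({distkey})"
    return ddl + ";"


# Small dimension table: a full copy on every node.
date_dim_ddl = build_create_table(
    "date_dim",
    [("date_key", "INTEGER NOT NULL"), ("calendar_date", "DATE NOT NULL")],
    "ALL",
)

# Large fact table: rows placed by the hash of one column (assumed: sale_id).
sales_fct_ddl = build_create_table(
    "sales_fct",
    [
        ("sale_id", "BIGINT NOT NULL"),
        ("date_key", "INTEGER NOT NULL"),
        ("amount", "DECIMAL(18,2)"),
    ],
    "KEY",
    distkey="sale_id",
)

print(date_dim_ddl)
print(sales_fct_ddl)
```

The generated statements are meant to be reviewed before use; sort keys, constraints, and encodings are deliberately omitted from this sketch.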

Option D is incorrect - ALL distribution makes a copy of the entire table on every compute node.

For billion-record tables this is not the right approach; ALL distribution is not suitable for a massive table like SALES_FCT.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html
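A quick back-of-the-envelope calculation shows why copying a table to every node is harmless for a small dimension but prohibitive for a large fact table. The node count and row counts below are hypothetical figures chosen only for illustration.

```python
def stored_rows(table_rows, num_nodes, diststyle):
    """Total rows held across the cluster for a given distribution style."""
    style = diststyle.upper()
    if style == "ALL":
        # One full copy of the table per compute node.
        return table_rows * num_nodes
    if style in ("KEY", "EVEN"):
        # Each row is stored once, somewhere in the cluster.
        return table_rows
    raise ValueError(f"unsupported distribution style: {diststyle}")


NUM_NODES = 8                    # hypothetical cluster size
DATE_DIM_ROWS = 3_650            # roughly ten years of daily rows (assumed)
SALES_FCT_ROWS = 2_000_000_000   # "billions of rows" (assumed figure)

date_dim_all = stored_rows(DATE_DIM_ROWS, NUM_NODES, "ALL")
sales_fct_all = stored_rows(SALES_FCT_ROWS, NUM_NODES, "ALL")
sales_fct_key = stored_rows(SALES_FCT_ROWS, NUM_NODES, "KEY")

print("DATE_DIM with ALL: ", date_dim_all)
print("SALES_FCT with ALL:", sales_fct_all)
print("SALES_FCT with KEY:", sales_fct_key)
```

Under these assumptions the dimension grows by a few tens of thousands of rows, while the fact table would grow by billions, and every load or update would have to be repeated on each node.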

Option E is incorrect - EVEN distribution spreads the rows across the slices in a round-robin fashion, regardless of the values in any particular column.

EVEN distribution is appropriate when a table does not participate in joins.

For a fact table like SALES_FCT, all the nodes participate in every query even though an EOD report covers only that particular day.

The SALES_FCT table needs to be designed with a suitable distribution key in mind.

https://docs.aws.amazon.com/redshift/latest/dg/tutorial-tuning-tables-distribution.html

The distribution style of tables in Amazon Redshift is a crucial factor in determining the performance of a data warehouse. It affects how the data is stored, distributed, and retrieved across the nodes of the Redshift cluster. The right distribution style depends on the nature of the data, its size, and the type of queries that will be executed.

In the given scenario, the SALES_FCT and DATE_DIM tables are the focus of attention, and their distribution styles need to be determined.

SALES_FCT is a fact table with billions of rows related to sales transactions. It is used to generate reports EOD, EOW, and EOM, and also sales queries. This table is accessed frequently and contains a large volume of data.

DATE_DIM is a dimension table that holds the information related to date and time. It is also used frequently in conjunction with SALES_FCT to generate EOD sales reports.

Considering the characteristics of these two tables, the best distribution style for SALES_FCT is KEY DISTRIBUTION on its own Primary KEY (one of the columns). This distribution style allows for the data to be distributed across the nodes based on the values in the primary key column. Since SALES_FCT is frequently accessed and contains a large volume of data, distributing it with KEY DISTRIBUTION can help minimize data movement during query execution, which can result in faster query performance.

For DATE_DIM, the best distribution style is ALL distribution. This style places a full copy of the table on every compute node, so a join between DATE_DIM and SALES_FCT can be resolved locally without moving dimension rows between nodes. Since DATE_DIM holds very few records and EOD sales reports are generated every day, the extra storage for the copies is small compared with the benefit to the join.

Based on the above analysis, the correct option is C: distribute SALES_FCT with KEY distribution on its own primary key (one of the columns), while DATE_DIM is distributed with ALL distribution.