Dedicated SQL Pool - Data Engineering on Microsoft Azure | Exam DP-203

Most Suitable Distribution Column for Hash Distributed Table

Question

You work for a famous electronics company that is starting an online store for devices.

You are using a dedicated SQL pool.

Your team members have a statement that will create a hash distributed table as follows.

CREATE TABLE [dbo].[OnlineSales] ( [ProductID]intNOT NULL , [DateKey]intNOT NULL , [CustomerID] intNOT NULL , [DiscountCode]intNOT NULL , [SalesOrderNumber]nvarchar(20) NOT NULL , [QuantityNumber] smallint NOT NULL , [UnitPrice] moneyNOT NULL , [TotalAmount] moneyNOT NULL ) WITH ( CLUSTERED COLUMNSTORE INDEX ,DISTRIBUTION = HASH([XXXXX]) ) ;
Which of the following will be the most suitable distribution column?

Answers

Explanations

Click on the arrows to vote for the correct answer

A. B. C. D.

Correct Answer: C.

While selecting a distribution key, there are certain points which we should look into.

->The column should have many unique values

->The column should not have null values

->Shouldn't be date

This will help improve parallel processing.

ProductID will be the best choice here.

Option A correct: It is a date value.

Option B is incorrect: It may not have much unique values.

Option C is incorrect: This will be the best choice by considering the points listed above.

Option D is incorrect: It may not have much unique values.

To know more about Distribution column selection, please refer to the doc below:

In a hash distributed table, the data is divided across multiple distributions (or nodes) based on a hash function applied to the distribution column. This is done to ensure that the data is distributed evenly across all the nodes, which helps in achieving better performance and scalability.

Choosing the right distribution column is crucial for achieving optimal performance. The distribution column should have a high level of cardinality (i.e., a large number of unique values), which helps in evenly distributing the data across all the nodes.

In the given table, the most suitable distribution column would be the column with the highest level of cardinality. Let's evaluate each of the options:

A. DateKey: This column represents the date of the sale. While it may have a moderate level of cardinality, it is not the best choice for a distribution column as there are only 365 days in a year, which means that the data may not be distributed evenly across all the nodes.

B. DiscountCode: This column represents the discount code applied to the sale. While it may have a moderate level of cardinality, it is not the best choice for a distribution column as there may be only a few unique discount codes, resulting in uneven data distribution.

C. ProductID: This column represents the ID of the product sold. This column is likely to have a high level of cardinality, as there may be many different products sold. Therefore, this column could be a suitable choice for the distribution column.

D. QuantityNumber: This column represents the quantity of the product sold. This column is unlikely to have a high level of cardinality, as there may be only a few different values for the quantity sold. Therefore, this column is not a suitable choice for the distribution column.

Based on the above analysis, the most suitable distribution column for this table would be ProductID.