Azure Databricks Workspace Tiered Structure

Creating Databricks Clusters for Workloads

Question

Note: This question is part of a series of questions that present the same scenario. Each question in the series contains a unique solution that might meet the stated goals. Some question sets might have more than one correct solution, while others might not have a correct solution.

After you answer a question in this section, you will NOT be able to return to it. As a result, these questions will not appear in the review screen.

You plan to create an Azure Databricks workspace that has a tiered structure. The workspace will contain the following three workloads:

-> A workload for data engineers who will use Python and SQL

-> A workload for jobs that will run notebooks that use Python, Scala, and SQL

-> A workload that data scientists will use to perform ad hoc analysis in Scala and R

The enterprise architecture team at your company identifies the following standards for Databricks environments:

-> The data engineers must share a cluster.

-> The job cluster will be managed by using a request process whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster.

-> All the data scientists must be assigned their own cluster that terminates automatically after 120 minutes of inactivity. Currently, there are three data scientists.

You need to create the Databricks clusters for the workloads.

Solution: You create a Standard cluster for each data scientist, a Standard cluster for the data engineers, and a High Concurrency cluster for the jobs.

Does this meet the goal?

Answers

A. Yes
B. No

Correct Answer: B

Explanations

The proposed solution assigns the wrong cluster type to two of the three workloads: the data engineers, who must share a cluster, need a High Concurrency cluster rather than a Standard cluster, and the jobs, whose notebooks use Scala, need a Standard cluster because High Concurrency clusters do not support Scala.

Note:

Standard clusters are recommended for a single user. Standard clusters can run workloads developed in any language: Python, R, Scala, and SQL.

A High Concurrency cluster is a managed cloud resource. The key benefits of High Concurrency clusters are that they provide Apache Spark-native fine-grained sharing for maximum resource utilization and minimum query latencies. However, High Concurrency clusters support only SQL, Python, and R; they do not support Scala.

https://docs.azuredatabricks.net/clusters/configure.html
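
As a rough sketch, such a cluster can be requested through the Databricks Clusters API 2.0. Everything below is a placeholder assumption rather than part of the scenario: the workspace URL, token, cluster name, node type, and runtime version. The "serverless" profile shown is the legacy spark_conf switch that enables high concurrency mode.

    import requests

    # Placeholder workspace URL and access token -- substitute your own.
    HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
    TOKEN = "dapi-example-token"

    # High Concurrency cluster for the shared data engineering workload.
    # The legacy "serverless" profile turns on fine-grained sharing; note
    # that it limits the allowed languages to SQL, Python, and R (no Scala),
    # which is fine here because the engineers use Python and SQL.
    cluster_spec = {
        "cluster_name": "data-engineering-shared",   # assumed name
        "spark_version": "7.3.x-scala2.12",          # assumed runtime
        "node_type_id": "Standard_DS3_v2",           # assumed node size
        "num_workers": 4,
        "spark_conf": {
            "spark.databricks.cluster.profile": "serverless",
            "spark.databricks.repl.allowedLanguages": "sql,python,r",
        },
    }

    resp = requests.post(
        f"{HOST}/api/2.0/clusters/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=cluster_spec,
    )
    resp.raise_for_status()
    print("created cluster:", resp.json()["cluster_id"])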

The proposed solution meets some of the requirements and goals of the scenario, but not all of them.

Firstly, it satisfies the requirement that each data scientist must have their own cluster that automatically terminates after 120 minutes of inactivity. The solution proposes creating a Standard cluster for each data scientist, which can be configured with automatic termination after a set period of inactivity. This means the company can optimize costs by paying for compute resources only when they are actually in use.
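
For instance, a minimal sketch using the Clusters API 2.0 could create one auto-terminating Standard cluster per data scientist; the user names, workspace URL, token, node type, and runtime version are hypothetical.

    import requests

    HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
    TOKEN = "dapi-example-token"                                  # placeholder

    # One Standard cluster per data scientist (three today), each set to
    # terminate automatically after 120 minutes of inactivity. Standard
    # mode supports Scala and R, which the scientists use for ad hoc work.
    for scientist in ["ds-alice", "ds-bob", "ds-carol"]:  # hypothetical names
        spec = {
            "cluster_name": f"{scientist}-adhoc",
            "spark_version": "7.3.x-scala2.12",   # assumed runtime
            "node_type_id": "Standard_DS3_v2",    # assumed node size
            "num_workers": 2,
            "autotermination_minutes": 120,       # the required idle timeout
        }
        resp = requests.post(
            f"{HOST}/api/2.0/clusters/create",
            headers={"Authorization": f"Bearer {TOKEN}"},
            json=spec,
        )
        resp.raise_for_status()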

Secondly, however, the solution mishandles the requirement that the data engineers must share a cluster. Although a Standard cluster can technically be shared, Standard clusters are recommended for a single user and do not provide the fine-grained sharing a multi-user workload needs. A High Concurrency cluster is the appropriate choice for the shared data engineering cluster, and because the engineers work only in Python and SQL, the lack of Scala support on High Concurrency clusters is not a problem for them.

Thirdly, the solution proposes creating a High Concurrency cluster for the jobs workload. This does not work: the job notebooks use Python, Scala, and SQL, and High Concurrency clusters do not support Scala, so the Scala notebooks could not run on that cluster. The jobs workload should run on a Standard cluster instead.

The requirement that the job cluster be managed by using a request process, whereby data scientists and data engineers provide packaged notebooks for deployment to the cluster, is an operational practice rather than a cluster setting. It does not by itself rule the solution in or out, but it is naturally implemented by deploying reviewed notebooks as scheduled jobs.
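
One way to picture that request process, sketched here under invented assumptions (the notebook path, job name, schedule, and cluster settings are all illustrative): once a packaged notebook passes review, an operator registers it as a scheduled job whose runs spin up their own job cluster via the Jobs API 2.0.

    import requests

    HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
    TOKEN = "dapi-example-token"                                  # placeholder

    # Deploy a reviewed, packaged notebook as a scheduled job. The job gets
    # a new Standard job cluster per run, so Scala notebooks are supported.
    job_spec = {
        "name": "nightly-etl",                        # assumed job name
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",       # assumed runtime
            "node_type_id": "Standard_DS3_v2",        # assumed node size
            "num_workers": 4,
        },
        "notebook_task": {"notebook_path": "/Deployments/nightly_etl"},
        "schedule": {
            "quartz_cron_expression": "0 0 2 * * ?",  # 02:00 daily
            "timezone_id": "UTC",
        },
    }

    resp = requests.post(
        f"{HOST}/api/2.0/jobs/create",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json=job_spec,
    )
    resp.raise_for_status()
    print("job_id:", resp.json()["job_id"])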

Additionally, the data scientists' language needs are not a concern: Standard clusters run workloads in Python, R, Scala, and SQL without any special configuration, so per-scientist Standard clusters fully support ad hoc analysis in Scala and R.

In conclusion, the proposed solution assigns the wrong cluster types to two of the three workloads: the shared data engineering workload calls for a High Concurrency cluster, and the Scala-based jobs workload requires a Standard cluster. Therefore, the correct answer to the question is B. No.