Azure DevOps Pipelines for Data Ingestion and Validation with Data Factory and Databricks on Azure

Data Pipeline Steps and Task Names for Data Factory, Databricks, and Azure DevOps

Question

You want to take advantage of the DevOps pipelines provided by Azure.

You need to use Data Factory to ingest data and then run a notebook on a Databricks cluster that checks whether the data has been ingested correctly and validates the resulting data file.

The steps of your pipeline look like this:

# run pipeline
- job: "test job"
  displayName: "Test job"
  dependsOn: [Deploy_to_Databricks, Deploy_to_ADF]
  pool:
    vmImage: 'ubuntu-latest'
  timeoutInMinutes: 0
  steps:
  - task: <.......task1........>@4
    displayName: 'DF Pipeline'
    inputs:
      azureSubscription: $(AZURE_RM_CONNECTION)
      ScriptPath: '$(Build.SourcesDirectory)/adf/temp/My_DFPipeline.ps1'
      ScriptArguments: '-ResourceGroupName $(RESOURCE_GROUP) -DataFactoryName $(DATA_FACTORY_NAME) -PipelineName $(PIPELINE_NAME)'
      azurePowerShellVersion: LatestVersion
  - task: <.......task2........>@0
    inputs:
      versionSpec: '3.x'
      addToPath: true
      architecture: 'x64'
    displayName: 'Python3.x'
  - task: <.......task3........>@0
    inputs:
      url: '$(DATABRICKS_URL)'
      token: '$(DATABRICKS_TOKEN)'
    displayName: 'Databricks config'
  - task: <.......task4........>@0
    inputs:
      notebookPath: '/Shared/devops-ds/test-data-ingestion'
      existingClusterId: '$(DATABRICKS_CLUSTER_ID)'
      executionParams: '{"bin_file_name":"$(bin_FILE_NAME)"}'
    displayName: 'Ingest data'
  - task: waitexecution@0
    displayName: 'Wait until the testing is done'
Match the names of the pipeline steps with the task names in the script above:

Answers

A. B. C. D.

Correct Answer: C.

Explanations

Option A is incorrect because running the notebook must be preceded by ingesting the data with Data Factory and setting up the environment.

Option B is incorrect because task2 sets the Python version, while task3 configures the Databricks environment.

Option C is CORRECT because the script first runs a Data Factory pipeline from PowerShell, then sets the Python version and configures Databricks, and finally executes a notebook on a Databricks cluster.

Option D is incorrect because executenotebook is the last of the blank tasks in the sequence (task4).

Reference:

The pipeline steps in the script are:

  1. A PowerShell task to run the Azure Data Factory (ADF) pipeline using a PowerShell script (a sketch of the underlying call follows this list).
  2. A task to set up the Python environment in the pipeline.
  3. A task to configure the Databricks cluster.
  4. A task to execute the notebook on the Databricks cluster.
  5. A task to wait for the execution of the notebook to finish.
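For step 1, the referenced My_DFPipeline.ps1 is not shown, but triggering an ADF pipeline from Azure PowerShell typically comes down to a single Az.DataFactory cmdlet. Below is a minimal sketch written as an inline variant of the same AzurePowerShell@4 task; the inline form, the Write-Host logging, and the exact parameter plumbing are illustrative assumptions, not the actual contents of the script:

- task: AzurePowerShell@4
  displayName: 'DF Pipeline (inline sketch)'
  inputs:
    azureSubscription: $(AZURE_RM_CONNECTION)
    ScriptType: InlineScript
    Inline: |
      # Trigger the ADF pipeline and capture the run ID it returns
      $runId = Invoke-AzDataFactoryV2Pipeline `
        -ResourceGroupName "$(RESOURCE_GROUP)" `
        -DataFactoryName "$(DATA_FACTORY_NAME)" `
        -PipelineName "$(PIPELINE_NAME)"
      Write-Host "Started ADF pipeline run: $runId"
    azurePowerShellVersion: LatestVersion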

The names of the tasks in the script are not explicitly given, but we can deduce them from the task types and inputs; the completed script after the list below puts the deduced names back in place.

  • Task 1: The first task is a PowerShell task that runs the Azure Data Factory (ADF) pipeline. Its displayName is "DF Pipeline", but the azurePowerShellVersion, ScriptPath, and ScriptArguments inputs identify the task type as AzurePowerShell.

  • Task 2: The second task sets up the Python environment in the pipeline. The versionSpec, addToPath, and architecture inputs identify it as the UsePythonVersion task.

  • Task 3: The third task configures the connection to the Databricks workspace. The url and token inputs identify it as the configuredatabricks task.

  • Task 4: The fourth task executes the notebook on the Databricks cluster. The notebookPath, existingClusterId, and executionParams inputs identify it as the executenotebook task.

  • Task 5: The fifth task waits for the execution of the notebook to finish. Its name is already given explicitly in the script: waitexecution.
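Putting the deduced names back into the blanks gives the completed job below. Note that the three Databricks tasks are not part of the built-in Azure DevOps task catalog; they come from a marketplace extension (commonly the Databricks Script Deployment Task extension), so the exact identifiers and version numbers shown should be treated as assumptions tied to that extension:

# run pipeline
- job: "test job"
  displayName: "Test job"
  dependsOn: [Deploy_to_Databricks, Deploy_to_ADF]
  pool:
    vmImage: 'ubuntu-latest'
  timeoutInMinutes: 0
  steps:
  - task: AzurePowerShell@4          # task1: runs the ADF pipeline via PowerShell
    displayName: 'DF Pipeline'
    inputs:
      azureSubscription: $(AZURE_RM_CONNECTION)
      ScriptPath: '$(Build.SourcesDirectory)/adf/temp/My_DFPipeline.ps1'
      ScriptArguments: '-ResourceGroupName $(RESOURCE_GROUP) -DataFactoryName $(DATA_FACTORY_NAME) -PipelineName $(PIPELINE_NAME)'
      azurePowerShellVersion: LatestVersion
  - task: UsePythonVersion@0         # task2: pins the Python version on the agent
    inputs:
      versionSpec: '3.x'
      addToPath: true
      architecture: 'x64'
    displayName: 'Python3.x'
  - task: configuredatabricks@0      # task3: points the agent at the Databricks workspace
    inputs:
      url: '$(DATABRICKS_URL)'
      token: '$(DATABRICKS_TOKEN)'
    displayName: 'Databricks config'
  - task: executenotebook@0          # task4: runs the validation notebook on the cluster
    inputs:
      notebookPath: '/Shared/devops-ds/test-data-ingestion'
      existingClusterId: '$(DATABRICKS_CLUSTER_ID)'
      executionParams: '{"bin_file_name":"$(bin_FILE_NAME)"}'
    displayName: 'Ingest data'
  - task: waitexecution@0            # named in the script: blocks until the notebook finishes
    displayName: 'Wait until the testing is done'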

Therefore, the correct answer is option C: AzurePowerShell; UsePythonVersion; ConfigureDatabricks; ExecuteNotebook.