Optimizing Performance of Azure ML Pipelines

Combining Python Scripts for Consistent and Repeatable Machine Learning Tasks

Question

You already have the Python scripts for several steps (including data ingestion, data cleansing, dividing data into train and test sets etc.) of your machine learning tasks but you want to combine them into a consistent, repeatable flow.

You want to make use of the orchestration services offered by Azure ML pipelines.

You have defined three steps, each of them referencing a piece of your Python code:

step1 = PythonScriptStep(name="train_step",
                         script_name="train.py",
                         compute_target=aml_compute,
                         source_directory=source_directory,
                         allow_reuse=False)

step2 = PythonScriptStep(name="compare_step",
                         script_name="compare.py",
                         compute_target=aml_compute_cluster2,
                         source_directory=source_directory,
                         allow_reuse=False)

step3 = PythonScriptStep(name="extract_step",
                         script_name="extract.py",
                         compute_target=aml_compute,
                         source_directory=source_directory,
                         runconfig=run_config)
Which part of the code should be changed to ensure optimal performance?

Answers

Explanations


A. allow_reuse should be set to True for the 1st step of a pipeline.

B. source_directory should reference different folders for each step.

C. compute_target should be the same compute for each step in a pipeline.

D. The runconfig parameter should be set for each step.

Answer: B.

Option A is incorrect because there is no such constraint for allow_reuse.

This is an optional parameter with default value of True.

It determines whether the step should reuse the outputs of its previous run, when its settings and inputs are unchanged, in order to save execution time.
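As a minimal sketch (Azure ML SDK v1, assuming aml_compute and source_directory are already defined as in the question code), reuse is enabled simply by leaving allow_reuse at its default or setting it explicitly:

```python
from azureml.pipeline.steps import PythonScriptStep

# With allow_reuse=True (the default), subsequent pipeline runs skip this
# step and reuse its earlier output as long as the script, the contents of
# source_directory, and the step's inputs/parameters are unchanged.
step1 = PythonScriptStep(name="train_step",
                         script_name="train.py",
                         compute_target=aml_compute,
                         source_directory=source_directory,
                         allow_reuse=True)
```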

Either True or False can be correct.

Option B is CORRECT because, in order to optimize the behavior of the pipeline during runs, the recommended practice is to use a separate folder per step for storing that step's script and its dependent files.

These folders should be the source_directory for the steps.

This way, the size of the snapshot created for the step can be reduced.
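A project layout along these lines illustrates the idea (the folder names here are only examples):

```
project/
├── train_step/        # source_directory for step1: train.py plus only its helpers
│   └── train.py
├── compare_step/      # source_directory for step2
│   └── compare.py
└── extract_step/      # source_directory for step3
    └── extract.py
```

Each step then snapshots only its own folder, so a change to one step's files does not invalidate the snapshots (and reusable outputs) of the other steps.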

Option C is incorrect because a compute target can be assigned at the pipeline level, and it is used by the steps unless a different one is specified at step level.

Steps with individual compute requirements can define their own compute target.

Option D is incorrect because the runconfig parameter can be used to specify additional requirements for the run, such as conda dependencies (e.g. scikit-learn).

When missing, a default runconfig will be created.
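A brief sketch of building such a run configuration (Azure ML SDK v1; the package list is illustrative):

```python
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

# Create a run configuration and attach a conda environment that
# includes scikit-learn; pass this as runconfig= to a PythonScriptStep.
run_config = RunConfiguration()
run_config.environment.python.conda_dependencies = \
    CondaDependencies.create(pip_packages=["scikit-learn"])
```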

See step parameters below:

PythonScriptStep(script_name, name=None, arguments=None, compute_target=None,
                 runconfig=None, runconfig_pipeline_params=None, inputs=None,
                 outputs=None, params=None, source_directory=None,
                 allow_reuse=True, version=None, hash_paths=None)


To ensure optimal performance, the following changes should be made:

A. allow_reuse should be set to True for the 1st step of a pipeline.
By setting allow_reuse=True for the first step of the pipeline, subsequent runs of the pipeline will reuse the previous run's output if the input data and parameters have not changed. This can save time and resources by avoiding unnecessary computation.

B. source_directory should reference different folders for each step.
The source_directory should reference a different folder for each step, so that each step has access only to the required files and dependencies. This can help avoid version conflicts and ensure reproducibility.

C. compute_target should be the same compute for each step in a pipeline.
Using the same compute_target for each step in the pipeline ensures consistency and avoids unnecessary data transfers between different computes. This can save time and resources.

D. The runconfig parameter should be set for each step.
Setting the runconfig parameter for each step ensures consistency in the environment and execution settings. This can help avoid version conflicts and ensure reproducibility.

In summary, to ensure optimal performance, the allow_reuse parameter should be set to True for the first step, the source_directory should reference different folders for each step, the compute_target should be the same for each step, and the runconfig parameter should be set for each step.