Azure Data Science Solution: Optimizing ML Pipeline Execution Times


Question

You have created an ML pipeline of eight steps using the Python SDK.

While tuning the scripts of steps 3 and 6, you submit the pipeline for execution several times.

The scripts and data definitions of the other steps didn't change, yet you notice that all the steps rerun each time, and you experience long execution times.

You decide to separate the scripts and other configuration files into a different folder for each step and to set each step's source_directory parameter accordingly.

Will this likely solve the problem?

Answers

A. Yes

B. No

Answer: A. Yes.

Option A is CORRECT because if you experience unexpected reruns of pipeline steps whose underlying code didn't change, you should put each step's scripts and configuration items into its own folder and point that step's source_directory at it.

This should solve the problem.

Option B is incorrect because the unexpected rerun of pipeline steps is a clear indicator of a problem caused by storing the code for multiple steps in one common location.

Any time one of the scripts changes, all the steps referencing that shared source_directory will rerun, consuming extra time and resources.

This practice should be avoided.
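
As a minimal sketch of that layout using the v1 Python SDK (azureml-core and azureml-pipeline-steps), where the folder names, script names, and compute target name are illustrative assumptions, each step gets its own source_directory:

from azureml.core import Workspace
from azureml.core.runconfig import RunConfiguration
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Connect to the workspace and define a run configuration for the steps.
ws = Workspace.from_config()
run_config = RunConfiguration()

# One folder per step: editing the files under steps/step3/ no longer
# invalidates the snapshot of any other step.
step3 = PythonScriptStep(
    name="step3",
    script_name="step3.py",            # assumed script name
    source_directory="steps/step3",    # only this step's files live here
    compute_target="cpu-cluster",      # assumed compute target name
    runconfig=run_config,
    allow_reuse=True,                  # default: reuse the output if nothing changed
)

step6 = PythonScriptStep(
    name="step6",
    script_name="step6.py",            # assumed script name
    source_directory="steps/step6",
    compute_target="cpu-cluster",
    runconfig=run_config,
    allow_reuse=True,
)

# ...the remaining six steps are defined the same way, each with its own folder...
pipeline = Pipeline(workspace=ws, steps=[step3, step6])

Note that allow_reuse already defaults to True, so no extra setting is needed; the per-step folders are what keep each step's snapshot stable.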



Explanation:

When you submit an ML pipeline for execution, Azure ML takes a snapshot of each step's source_directory and uses it, together with the step's parameters, inputs, and environment, to decide whether the step needs to run again. With the default allow_reuse=True, a step whose snapshot and settings are unchanged since the last execution reuses its previous output; if anything in the snapshot has changed, the step is rerun.

In this case, even though the scripts and data definitions of the other steps didn't change, every step was rerun because all the steps referenced the same source_directory, so editing the scripts of steps 3 and 6 changed the one snapshot every step depends on. By separating the scripts and other configuration files into a different folder for each step and setting each step's source_directory parameter accordingly, you ensure that each step depends only on the scripts and files in its own folder.

This means that if you change only the scripts or files of a specific step, only that step's snapshot changes, so Azure ML reruns just that step (and any steps that consume its output) while reusing the cached results of the unchanged steps, rather than rerunning the entire pipeline. This saves time and resources and results in faster execution times.
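
Continuing the sketch above, resubmitting looks roughly like this (the experiment name is an assumption); with allow_reuse left at its default, only the steps whose folders actually changed run again:

from azureml.core import Experiment

# Resubmitting after editing only steps/step3/step3.py: steps whose snapshot,
# parameters, and inputs are unchanged reuse their previous output instead of rerunning.
experiment = Experiment(workspace=ws, name="pipeline-tuning")  # assumed experiment name
run = experiment.submit(pipeline)
run.wait_for_completion(show_output=True)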

Therefore, the correct answer is A. Yes: separating the scripts and configuration files into a different folder for each step and setting each step's source_directory parameter accordingly will solve the problem of long execution times when only some of the steps are modified.