Designing and Implementing a Data Science Solution on Azure: How to Access Files from Azure Blob Storage

Accessing Files from Azure Blob Storage

Question

You are storing your training data (JPG image files) in Azure Blob Storage, which you have registered as a datastore and a file dataset.

You want your training script to access the files from this datastore.

How should you do that?

Answers

A. Include the list of file names in the training script.
B. Pass a reference to the datastore as an input parameter for the script.
C. Store the training script in the same location as the data.
D. Define a global parameter for the path name of the data and let the script use it to access the files.

Answer: B.

Explanations

Option A is incorrect because the training script loads its input data from a datastore reference passed to it as an estimator parameter. Hard-coding the list of files in the script is not a valid solution.

Option B is CORRECT because to use a datastore in an experiment script, you must pass a reference to the datastore as an input parameter for the script (via the estimator).

The training script can then use the data at the referenced location as local files.

Option C is incorrect because a reference to the location of the data must be passed as a parameter to the training script.

Option D is incorrect because runs are configured to access data in a datastore by passing a reference to the data as an input parameter for the script.

Global parameters are not the way to solve this problem.

Example:

# Configure an estimator whose script receives the datastore reference.
from azureml.train.sklearn import SKLearn

# as_download() copies the referenced files onto the compute target;
# as_mount() would stream them instead.
data_ref = blob_ds.path('input_data/training_files').as_download(path_on_compute='training_data')

my_estimator = SKLearn(source_directory='experiment_folder',
                       entry_script='training_script.py',
                       compute_target='local',
                       script_params={'--data_folder': data_ref})

# Submit the script with my_estimator.
...

# In training_script.py, define a matching parameter for the script:
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--data_folder', type=str, dest='data_folder', help='data folder reference')
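
Since the estimator downloads the referenced files onto the compute target, the training script sees them as an ordinary local folder. Below is a minimal sketch of what the consuming side of training_script.py might look like end to end; the JPG-listing logic is illustrative, not prescribed by the question.

# training_script.py -- minimal sketch of the consuming side.
import argparse
import glob
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data_folder', type=str, dest='data_folder', help='data folder reference')
args = parser.parse_args()

# The datastore reference resolves to a local path because as_download()
# copied the files to the compute target before the script started.
image_paths = glob.glob(os.path.join(args.data_folder, '**', '*.jpg'), recursive=True)
print('Found {} training images'.format(len(image_paths)))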


The best way to access files from Azure Blob Storage that has been registered as a datastore and file dataset is to create a script parameter and pass the reference to the datastore through that parameter, which is what option B describes.

Option A, which suggests including the list of file names in the training script, is not a recommended approach. It would make the script brittle and difficult to maintain, especially if the list of files were to change frequently.

Option C, which recommends storing the script in the same location as the data, may work in some scenarios but is not a scalable solution. In a real-world scenario, the data may be located in different geographic regions, and the script may need to be executed from different locations.

Option D, which suggests defining a global parameter for the path name of the data and letting the script use this for accessing the files, is not a recommended approach either. This option would require the script to have knowledge of the physical location of the data, which would make it less portable and less flexible.

Therefore, option B is the best approach. It involves creating a script parameter, such as "--data_folder", and passing the reference to the datastore to the training script within this parameter. This approach provides the following benefits:

  1. It keeps the script flexible and scalable. The script can be executed from anywhere, and the data can be located in different regions.

  2. It makes the script maintainable. If the location of the data changes, only the reference passed to the script changes; the script itself requires no modification.

  3. It ensures that the data is stored securely in Azure Blob Storage. By referencing the datastore, the script can access the data without requiring access to the storage account keys.

Overall, passing the reference to the datastore to the training script within a script parameter is the recommended approach for accessing data stored in Azure Blob Storage that has been registered as a datastore and file dataset.
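
As a side note, the estimator classes used above were later deprecated in favor of ScriptRunConfig in the Azure ML Python SDK v1. Because the question also registers the data as a file dataset, the same pattern can be expressed by passing a dataset consumption config as a script argument. The sketch below assumes a hypothetical registered dataset named 'training-images'; the script itself is unchanged.

from azureml.core import Dataset, Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()
# 'training-images' is a hypothetical dataset name used for illustration.
file_ds = Dataset.get_by_name(ws, name='training-images')

# as_download() again surfaces the files as a local folder on the compute
# target; the script reads them through the same --data_folder argument.
src = ScriptRunConfig(source_directory='experiment_folder',
                      script='training_script.py',
                      arguments=['--data_folder',
                                 file_ds.as_named_input('training_images').as_download()],
                      compute_target='local')

run = Experiment(ws, 'train-images').submit(src)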