Designing and Implementing a Data Science Solution on Azure - Exam DP-100: Answer Verification

Machine Learning Pipeline Data Separation using Python SDK | Exam DP-100

Question

You are developing an ML pipeline with the Azure Machine Learning Python SDK, and you need to split your data into one set for training the model and another for testing the trained model.

Your teammate has shared a code snippet with you, which will be a great help, provided it works.

You have to check whether it does the job.

According to its description, the script loads data from the default datastore and sets aside 70% of the observations for training and the rest for testing, using the scikit-learn package, in a reproducible way.

from azureml.core import Run
from sklearn.model_selection import train_test_split

# Get the experiment run context
run = Run.get_context()

# Load data
print("Loading Data...")
diabetes_data = run.input_datasets['diabetes_train'].to_pandas_dataframe()

# Separate features and labels
X, y = diabetes_data[['Pregnancies', 'PlasmaGlucose', 'DiastolicBloodPressure', 'BMI', 'Age']].values, diabetes_data['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
After reviewing the code, do you think it does its job as described?

Answers

Explanations


A. Yes, the code does its job as described.

B. No, the code does not do its job as described.

Answer: A.

Option A is CORRECT because the code does its job exactly as described.

Option B is incorrect because the code does work as described.


Based on the code snippet provided, the script does separate the data for training and testing as described.

Here's a detailed explanation of the code:

  1. The script imports the necessary packages, including "train_test_split" from "sklearn.model_selection", to split the data into training and testing sets.

  2. The script then gets the experiment run context using the "Run.get_context()" method.

  3. The script loads the diabetes_train dataset from the default datastore using "run.input_datasets['diabetes_train'].to_pandas_dataframe()" and converts it into a pandas DataFrame; the dataset is made available to the run as a named input (see the submission-side sketch after this list).

  4. The script separates the features and labels with the line "X, y = diabetes_data[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','BMI','Age']].values, diabetes_data['Diabetic'].values".

  5. Finally, the script splits the data into training and test sets using "train_test_split(X, y, test_size=0.30, random_state=0)", where 70% of the data is allocated for training and 30% for testing, and the random state is fixed at 0 so the split is reproducible (see the sketch after this list).
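
For context on steps 2 and 3, here is a minimal sketch of how the 'diabetes_train' named input could be wired up on the submission side, assuming the Azure ML Python SDK v1 (azureml-core). The workspace dataset name, folder, script name, and environment are illustrative assumptions, not part of the original snippet.

from azureml.core import Workspace, Experiment, ScriptRunConfig, Environment, Dataset

# Connect to the workspace (reads config.json by default)
ws = Workspace.from_config()

# Get a tabular dataset registered in the workspace (dataset name is illustrative)
diabetes_ds = Dataset.get_by_name(ws, name='diabetes dataset')

# Pass the dataset as the named input 'diabetes_train', which is what
# run.input_datasets['diabetes_train'] resolves to inside the training script
script_config = ScriptRunConfig(
    source_directory='training_folder',      # hypothetical folder containing the script
    script='diabetes_training.py',           # hypothetical script name
    arguments=['--input-data', diabetes_ds.as_named_input('diabetes_train')],
    environment=Environment.get(ws, 'AzureML-sklearn-1.0-ubuntu20.04-py38-cpu')  # curated environment; name may vary
)

run = Experiment(workspace=ws, name='diabetes-training').submit(script_config)
run.wait_for_completion()

Inside the submitted run, "Run.get_context()" then returns the run object, and the dataset is accessible through "run.input_datasets" under the name given to "as_named_input".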
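To see why "test_size=0.30" with a fixed "random_state" gives a reproducible 70/30 split (step 5), here is a small self-contained scikit-learn check; the toy data below is made up purely for illustration.

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 observations with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same random_state produce identical splits
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.30, random_state=0)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.30, random_state=0)

print(len(X_train1), len(X_test1))        # 7 3  -> 70% training, 30% testing
print(np.array_equal(X_test1, X_test2))   # True -> the split is reproducible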

Therefore, based on the above analysis, we can conclude that the code does its job as described.