Designing and Implementing a Data Science Solution on Azure - Exam DP-100: Answer Verification

Machine Learning Pipeline Data Separation using Python SDK | Exam DP-100

Question

You are developing an ML pipeline with the Azure Machine Learning Python SDK, and you need to split your data into one set for training the model and another for testing the trained model.

Your teammate has shared a code snippet with you, which will be a great help, provided it works.

You have to check whether it does the job.

According to its description, the script loads data from the default datastore and sets aside 70% of the observations for training and the rest for testing, using the scikit-learn package, in a reproducible way.

from azureml.core import Run
from sklearn.model_selection import train_test_split

# Get the experiment run context
run = Run.get_context()

# Load data
print("Loading Data...")
diabetes_data = run.input_datasets['diabetes_train'].to_pandas_dataframe()

# Separate features and labels
X, y = diabetes_data[['Pregnancies', 'PlasmaGlucose', 'DiastolicBloodPressure', 'BMI', 'Age']].values, diabetes_data['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
After reviewing the code, do you think it does its job as described?

Answers

Explanations


A. Yes, the code does its job as described.

B. No, the code does not do its job as described.

Answer: A.

Option A is CORRECT because the code does its job exactly as described.

Option B is incorrect because the code does work as described.


Based on the code snippet provided, the script does separate the data for training and testing as described.

Here's a detailed explanation of the code:

  1. The script imports the necessary packages, including "train_test_split" from "sklearn.model_selection", to split the data into training and testing sets.

  2. The script then gets the experiment run context using the "Run.get_context()" method.

  3. The script loads the diabetes_train dataset from the default datastore using "run.input_datasets['diabetes_train'].to_pandas_dataframe()" and converts it into a pandas DataFrame; the dataset is made available to the run as a named input (see the submission-side sketch after this list).

  4. The script separates the features and labels with the line "X, y = diabetes_data[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','BMI','Age']].values, diabetes_data['Diabetic'].values".

  5. Finally, the script splits the data into training and test sets using "train_test_split(X, y, test_size=0.30, random_state=0)", where 70% of the data is allocated for training and 30% for testing, and the random state is fixed at 0 so the split is reproducible (see the sketch after this list).
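
For context on steps 2 and 3, here is a minimal sketch of how the 'diabetes_train' named input could be wired up on the submission side, assuming the Azure ML Python SDK v1 (azureml-core). The workspace dataset name, folder, script name, and environment are illustrative assumptions, not part of the original snippet.

from azureml.core import Workspace, Experiment, ScriptRunConfig, Environment, Dataset

# Connect to the workspace (reads config.json by default)
ws = Workspace.from_config()

# Get a tabular dataset registered in the workspace (dataset name is illustrative)
diabetes_ds = Dataset.get_by_name(ws, name='diabetes dataset')

# Pass the dataset as the named input 'diabetes_train', which is what
# run.input_datasets['diabetes_train'] resolves to inside the training script
script_config = ScriptRunConfig(
    source_directory='training_folder',      # hypothetical folder containing the script
    script='diabetes_training.py',           # hypothetical script name
    arguments=['--input-data', diabetes_ds.as_named_input('diabetes_train')],
    environment=Environment.get(ws, 'AzureML-sklearn-1.0-ubuntu20.04-py38-cpu')  # curated environment; name may vary
)

run = Experiment(workspace=ws, name='diabetes-training').submit(script_config)
run.wait_for_completion()

Inside the submitted run, "Run.get_context()" then returns the run object, and the dataset is accessible through "run.input_datasets" under the name given to "as_named_input".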
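To see why "test_size=0.30" with a fixed "random_state" gives a reproducible 70/30 split (step 5), here is a small self-contained scikit-learn check; the toy data below is made up purely for illustration.

import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 observations with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same random_state produce identical splits
X_train1, X_test1, y_train1, y_test1 = train_test_split(X, y, test_size=0.30, random_state=0)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.30, random_state=0)

print(len(X_train1), len(X_test1))        # 7 3  -> 70% training, 30% testing
print(np.array_equal(X_test1, X_test2))   # True -> the split is reproducible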

Therefore, based on the above analysis, we can conclude that the code does its job as described.