Optimizing Machine Learning Pipeline Efficiency - PMLE Exam Question and Answer

Minimizing Computation Time and Manual Intervention

Question

You have a demand forecasting pipeline in production that uses Dataflow to preprocess raw data prior to model training and prediction.

During preprocessing, you employ Z-score normalization on data stored in BigQuery and write it back to BigQuery.

New training data is added every week.

You want to make the process more efficient by minimizing computation time and manual intervention.

What should you do?

Answers

A. Normalize the data using Google Kubernetes Engine
B. Translate the normalization algorithm into SQL for use with BigQuery
C. Use the normalizer_fn argument in TensorFlow's Feature Column API
D. Normalize the data with Apache Spark using the Dataproc connector for BigQuery

Correct Answer: C

Explanations

The goal is to make the demand forecasting pipeline more efficient by minimizing computation time and manual intervention. The pipeline uses Dataflow to preprocess raw data prior to model training and prediction. During preprocessing, Z-score normalization is employed on data stored in BigQuery, and the normalized data is written back to BigQuery. New training data is added every week.
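For reference, Z-score normalization rescales each feature value by the mean and standard deviation computed over the training data:

```latex
z = \frac{x - \mu}{\sigma}
```

where \mu is the feature's mean and \sigma its standard deviation on the training data.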

Option A: Normalize the data using Google Kubernetes Engine

Google Kubernetes Engine (GKE) is a managed Kubernetes service that provides an efficient, scalable, and highly available platform for deploying containerized applications. However, using GKE to normalize the data may not be the best approach for this scenario because it involves additional overhead and complexity, such as containerizing the normalization algorithm and setting up the Kubernetes environment. Therefore, option A is not the best choice for this scenario.

Option B: Translate the normalization algorithm into SQL for use with BigQuery

This option translates the normalization algorithm into SQL so that it runs directly on the data stored in BigQuery. This approach could be efficient because it minimizes data movement and reduces the complexity of the pipeline. However, implementing the normalization algorithm in SQL requires manual translation work up front, and SQL may not be as flexible as other approaches. Therefore, option B may not be the best choice for this scenario.
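As a rough sketch of this option, the Z-score can be expressed with window functions in BigQuery Standard SQL and triggered from the Python client. The project, dataset, table, and column names below are hypothetical, not taken from the question:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and column names, used purely for illustration.
sql = """
CREATE OR REPLACE TABLE `my-project.demand.features_normalized` AS
SELECT
  * EXCEPT (units_sold),
  (units_sold - AVG(units_sold) OVER ())
    / STDDEV(units_sold) OVER () AS units_sold_z
FROM `my-project.demand.features_raw`
"""

# The query executes entirely inside BigQuery, so no data leaves the warehouse.
client.query(sql).result()
```

Because the transform is plain SQL, a statement like this could also be run as a BigQuery scheduled query, so the weekly arrival of new training data would need no manual trigger.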

Option C: Use the normalizer_fn argument in TensorFlow's Feature Column API

The normalizer_fn argument in TensorFlow's Feature Column API allows normalization to be performed inside the TensorFlow training pipeline itself. This approach could be more efficient because it removes the separate preprocessing round trip: data no longer has to be moved between BigQuery and Dataflow and written back after normalization. It also makes the normalization step more flexible, since arbitrary transformation functions, including more complex normalization algorithms, can be plugged in. Therefore, option C is a good choice for this scenario.
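A minimal sketch of this option, assuming the feature statistics have already been computed from the training data; the column name and statistic values are hypothetical:

```python
import tensorflow as tf

# Hypothetical, precomputed training-set statistics for the feature.
UNITS_MEAN = 120.0
UNITS_STD = 35.0

# normalizer_fn applies the Z-score transform inside the input pipeline,
# so raw values can be fed to the model without a separate preprocessing job.
units_sold = tf.feature_column.numeric_column(
    "units_sold",
    normalizer_fn=lambda x: (x - UNITS_MEAN) / UNITS_STD,
)

# The feature column plugs into a Keras model via DenseFeatures.
feature_layer = tf.keras.layers.DenseFeatures([units_sold])
```

Note that the mean and standard deviation still have to be computed somewhere (for example, with a single aggregate query over the training table), so this option moves the transform into the model rather than eliminating the statistics computation.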

Option D: Normalize the data with Apache Spark using the Dataproc connector for BigQuery

Apache Spark is a popular distributed computing framework that can process large datasets efficiently, and the Dataproc connector for BigQuery enables Spark to read data from and write data to BigQuery. Using Apache Spark to normalize the data could be an efficient approach, but it may be more complex to set up and maintain than the other options. Additionally, Spark would require extra compute resources that may not be necessary for this scenario. Therefore, option D may not be the best choice for this scenario.
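For completeness, a sketch of this option using PySpark with the spark-bigquery connector. The table names, column name, and staging bucket are hypothetical, and the connector JAR is assumed to be available on the Dataproc cluster:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("zscore-normalization").getOrCreate()

# Read the raw features from BigQuery (hypothetical table name).
df = spark.read.format("bigquery") \
    .option("table", "my-project.demand.features_raw") \
    .load()

# Compute the feature statistics once, then apply the Z-score column-wise.
stats = df.agg(
    F.mean("units_sold").alias("mu"),
    F.stddev("units_sold").alias("sigma"),
).first()

normalized = df.withColumn(
    "units_sold_z", (F.col("units_sold") - stats["mu"]) / stats["sigma"]
)

# Write the result back to BigQuery; the connector's default write method
# stages data through a GCS bucket.
normalized.write.format("bigquery") \
    .option("table", "my-project.demand.features_normalized") \
    .option("temporaryGcsBucket", "my-temp-bucket") \
    .save()
```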

Overall, the best option for this scenario is option C, which involves using the normalizer_fn argument in TensorFlow's Feature Column API. This approach would minimize data movement, reduce pipeline complexity, and enable more flexibility in the normalization process.