Minimize Effort and Cost: NVIDIA Tesla P100 GPU Access for ML Team on Google Kubernetes Engine (GKE) Cluster

Access to Nvidia Tesla P100 GPUs for Machine Learning Team on GKE Cluster

Question

You are operating a Google Kubernetes Engine (GKE) cluster for your company where different teams can run non-production workloads.

Your Machine Learning (ML) team needs access to Nvidia Tesla P100 GPUs to train their models.

You want to minimize effort and cost.

What should you do?

Answers

Explanations

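A. Ask the ML team to add the accelerator: gpu annotation to their pod specification.

B. Recreate all the nodes of the GKE cluster to enable GPUs on all of them.

C. Create a separate Kubernetes cluster on Compute Engine with nodes that have GPUs, and dedicate it to the ML team.

D. Add a new, GPU-enabled node pool to the GKE cluster, and ask the ML team to add the cloud.google.com/gke-accelerator: nvidia-tesla-p100 nodeSelector to their pod specification.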

D.

Option A asks the Machine Learning (ML) team to add an accelerator: gpu annotation to their pod specification. This does not solve the problem. Annotations are free-form metadata that the Kubernetes scheduler does not use when placing pods, and the question gives no indication that the cluster has any GPU-enabled nodes to begin with. Without GPU nodes in the cluster, no change to the pod specification can give the workload a P100.

Option B recreates every node in the GKE cluster with GPUs attached. This would give the ML team their GPUs, but it replaces the nodes (not the cluster itself) that every other team's workloads run on, which is disruptive, and the company would pay for GPUs on every node even though only the ML team needs them. It fails both the effort goal and the cost goal.

Option C creates a separate, self-managed Kubernetes cluster on Compute Engine with GPU nodes and dedicates it to the ML team. This is feasible, but it gives up what GKE manages for you: you would have to install, upgrade, and operate the cluster yourself. That is considerably more effort than extending the existing cluster, and a second cluster adds its own running cost.

Option D adds a new GPU-enabled node pool to the existing GKE cluster and asks the ML team to add the cloud.google.com/gke-accelerator: nvidia-tesla-p100 nodeSelector to their pod specification. GKE sets this label on nodes that have P100 GPUs attached, so the nodeSelector steers the ML pods onto the GPU nodes. Only the new pool carries the GPU cost, the other teams' nodes are untouched, and all workloads stay in one managed cluster, which keeps the management overhead low. If the pool is configured to autoscale, GPU nodes run only while GPU workloads need them.
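As a rough illustration of the node pool half of Option D, the sketch below builds a gcloud container node-pools create command in Python. The cluster name, pool name, zone, machine type, and autoscaling limits are all placeholder assumptions and do not come from the question. P100 availability varies by zone, and installing the NVIDIA drivers on the new nodes is a separate step not shown here.

```python
import shlex


def gpu_node_pool_command(cluster, pool, zone, gpu_count=1, max_nodes=2):
    """Build a gcloud command that adds a P100 node pool to an existing cluster.

    All argument values are illustrative; adjust them to your environment.
    """
    return [
        "gcloud", "container", "node-pools", "create", pool,
        "--cluster", cluster,
        "--zone", zone,
        # Assumption: an N1 machine type, since P100 GPUs attach to N1 VMs.
        "--machine-type", "n1-standard-8",
        "--accelerator", f"type=nvidia-tesla-p100,count={gpu_count}",
        "--num-nodes", "1",
        # Autoscaling lets the pool shrink when no GPU pods are pending.
        "--enable-autoscaling", "--min-nodes", "0", "--max-nodes", str(max_nodes),
    ]


# Hypothetical names: "shared-nonprod", "ml-p100-pool", "us-central1-c".
cmd = gpu_node_pool_command("shared-nonprod", "ml-p100-pool", "us-central1-c")
print(shlex.join(cmd))
```

Printing the command rather than running it keeps the sketch side-effect free; the printed line can be reviewed and then pasted into a shell with the appropriate project selected.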

Therefore, the recommended solution is to add a new, GPU-enabled node pool to the GKE cluster and ask the ML team to add the cloud.google.com/gke-accelerator: nvidia-tesla-p100 nodeSelector to their pod specification.
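For the ML team's side of the change, a minimal sketch of a pod manifest with the nodeSelector might look like the following. It is generated as JSON from Python, which kubectl apply -f accepts alongside YAML. The pod name, container name, and image are hypothetical placeholders. The nvidia.com/gpu resource limit is not mentioned in the question, but a container has to request GPUs this way to actually be allocated one.

```python
import json


def gpu_training_pod(name, image, gpu_count=1):
    """Return a Pod manifest that targets P100 nodes and requests GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Schedule only onto nodes labeled as having P100 GPUs.
            "nodeSelector": {
                "cloud.google.com/gke-accelerator": "nvidia-tesla-p100"
            },
            "containers": [
                {
                    "name": "trainer",  # hypothetical container name
                    "image": image,
                    # Request the GPUs themselves; the selector alone
                    # only controls which nodes are eligible.
                    "resources": {"limits": {"nvidia.com/gpu": gpu_count}},
                }
            ],
        },
    }


# Hypothetical pod name and image, for illustration only.
manifest = gpu_training_pod("p100-training", "example.com/ml-team/trainer:latest")
print(json.dumps(manifest, indent=2))
```

The output can be saved to a file and applied with kubectl apply -f. Teams that use Deployments or Jobs would place the same nodeSelector and resource limit inside the pod template instead.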