Resolve Incident in a Specific Region on Google Kubernetes Engine (GKE) | PCDE Exam Answer

Question

You support a popular mobile game application deployed on Google Kubernetes Engine (GKE) across several Google Cloud regions.

Each region has multiple Kubernetes clusters.

You receive a report that none of the users in a specific region can connect to the application.

You want to resolve the incident while following Site Reliability Engineering practices.

What should you do first?

Answers

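A. Reroute the user traffic from the affected region to other regions that don't report issues.

B. Use Stackdriver Monitoring to check for a spike in CPU or memory usage for the affected region.

C. Add an extra node pool that consists of high memory and high CPU machine type instances to the cluster.

D. Use Stackdriver Logging to filter on the clusters in the affected region, and inspect error messages in the logs.

Correct answer: D.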

https://cloud.google.com/error-reporting/docs/viewing-errors

As a DevOps engineer, your primary objective is to keep the application running on Google Kubernetes Engine (GKE) reliable, available, and performant across multiple Google Cloud regions. During an incident, following Site Reliability Engineering (SRE) practices helps you resolve the issue effectively and efficiently.

The first step to resolve the incident would be to identify the root cause of the issue. To do this, you should start with the least disruptive and most informative method, which is to use logging and monitoring tools to gather data about the affected region.

Option A, which suggests rerouting the user traffic from the affected region to other regions that don't report issues, should not be the first step. Until the cause is known, shifting a whole region's load onto the remaining regions could overload them, causing a cascading failure and more issues elsewhere.
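The cascading-failure risk can be made concrete with a rough headroom check. The sketch below is illustrative only: the function name `can_absorb`, the region names, and the request rates are hypothetical and do not come from the question.

```python
def can_absorb(shifted_rps, healthy_regions):
    """Return True if the healthy regions have enough spare capacity.

    healthy_regions maps a region name to (current_rps, capacity_rps).
    All names and numbers here are hypothetical.
    """
    spare = sum(max(capacity - current, 0)
                for current, capacity in healthy_regions.values())
    return shifted_rps <= spare


# Hypothetical load figures for two healthy regions.
healthy = {
    "europe-west1": (800, 1000),  # 200 rps of headroom
    "asia-east1": (700, 1000),    # 300 rps of headroom
}

print(can_absorb(400, healthy))  # fits within the 500 rps of headroom
print(can_absorb(600, healthy))  # exceeds the headroom, risks overload
```

If the shifted load exceeds the combined headroom, rerouting would push the healthy regions past their capacity, which is the cascading failure described above.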

Option B, which suggests using Stackdriver Monitoring to check for a spike in CPU or memory usage for the affected region, is a reasonable diagnostic step. It tells you whether the issue is related to resource utilization. If there is a spike, you can investigate further to identify the root cause. However, it only covers resource-related causes, so on its own it may not explain why no user in the region can connect.
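As a rough illustration of what "a spike" could mean in practice, the sketch below flags a utilization sample that sits far above a recent baseline. The helper `is_spike`, the threshold of three standard deviations, and the sample values are assumptions made for illustration; they are not part of Stackdriver Monitoring or of the question.

```python
from statistics import mean, stdev


def is_spike(baseline, latest, threshold=3.0):
    """Return True if `latest` is more than `threshold` standard
    deviations above the mean of `baseline` (hypothetical rule)."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # Flat baseline: any increase counts as a spike.
        return latest > mu
    return (latest - mu) / sigma > threshold


# Hypothetical CPU utilization samples (fraction of requested CPU).
baseline = [0.40, 0.42, 0.38, 0.41, 0.39]

print(is_spike(baseline, 0.95))  # far above the baseline
print(is_spike(baseline, 0.43))  # within normal variation
```

A check like this only answers whether resource usage changed; it does not say why users cannot connect.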

Option C, which suggests adding an extra node pool that consists of high memory and high CPU machine type instances to the cluster, may help to increase the resources available for the application. However, it should not be the first step as it is not clear whether the issue is related to resource utilization or not.

Option D, which suggests using Stackdriver Logging (now called Cloud Logging) to filter on the clusters in the affected region and inspect error messages in the logs, is a good option to start with. It is non-disruptive and shows the specific errors occurring in the application, which may provide insights into the root cause of the issue.
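A minimal sketch of such a filter, written as a small Python helper that builds a Logging query string. The helper name `build_error_filter` and the cluster names are hypothetical; the fields `resource.type`, `resource.labels.cluster_name`, and `severity` belong to the Logging query language for GKE container logs.

```python
def build_error_filter(cluster_names, min_severity="ERROR"):
    """Build a Logging query that matches container log entries at or
    above `min_severity` from the given GKE clusters."""
    if not cluster_names:
        raise ValueError("at least one cluster name is required")
    clusters = " OR ".join(f'"{name}"' for name in cluster_names)
    # Lines in a Logging query are implicitly joined with AND.
    return "\n".join([
        'resource.type="k8s_container"',
        f"resource.labels.cluster_name=({clusters})",
        f"severity>={min_severity}",
    ])


# Hypothetical names for the clusters in the affected region.
query = build_error_filter(["game-us-1", "game-us-2"])
print(query)
```

The resulting string could be pasted into the Logs Explorer query box or passed to `gcloud logging read`, assuming the cluster names are replaced with the real ones for the affected region.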

In conclusion, the best first step to resolve the incident while following Site Reliability Engineering practices is option D: use Stackdriver Logging to filter on the clusters in the affected region and inspect error messages in the logs. If the logs point to resource pressure, you can then use Stackdriver Monitoring to check for a spike in CPU or memory usage for the affected region. Once the root cause of the issue has been identified, you can take appropriate action to resolve the incident.