Incident Management Protocol for Infrastructure Service Failure

Key Steps for Handling Service Failures and Incident Management

Question

You are on-call for an infrastructure service that has a large number of dependent systems.

You receive an alert indicating that the service is failing to serve most of its requests and that all of its dependent systems, with hundreds of thousands of users, are affected.

As part of your Site Reliability Engineering (SRE) incident management protocol, you declare yourself Incident Commander (IC) and pull in two experienced people from your team as Operations Lead (OL) and Communications Lead (CL).

What should you do next?

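Answers

A. Look for ways to mitigate user impact and deploy the mitigations to production.

B. Contact the affected service owners and update them on the status of the incident.

C. Establish a communication channel where incident responders and leads can communicate with each other.

D. Start a postmortem, add incident information, circulate the draft internally, and ask internal stakeholders for input.

Explanations

Correct answer: C.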

In this scenario, the service is failing to serve most of its requests, and all of its dependent systems, with hundreds of thousands of users, are affected. This is a critical incident that requires an immediate response.

As part of your Site Reliability Engineering (SRE) incident management protocol, you have declared yourself Incident Commander (IC) and pulled in two experienced people from your team as Operations Lead (OL) and Communications Lead (CL). At this point, your next step is option C. The other options are valid actions, but they come later, in the order described below:

C. Establish a communication channel where incident responders and leads can communicate with each other.

Establishing a communication channel is the first step towards incident resolution. It is essential to ensure that everyone involved in the incident response is aware of the current situation and can communicate with each other effectively. This communication channel could be a dedicated chat room, conference call, or any other method that allows for quick and efficient communication between team members.
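As a rough illustration, the roles and the shared channel can be captured in a small record so that every responder knows who holds which role and where to talk. This is a minimal sketch in Python; the `Incident` class, its fields, the people's names, and the channel name are all hypothetical and do not correspond to any particular incident-management tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Incident:
    """Hypothetical record of an incident's roles and shared channel."""
    title: str
    commander: str
    ops_lead: str
    comms_lead: str
    channel: str = ""  # empty until a channel has been set up
    responders: List[str] = field(default_factory=list)

    def open_channel(self, name: str) -> None:
        # Record the channel and make sure all three leads are in it.
        self.channel = name
        for person in (self.commander, self.ops_lead, self.comms_lead):
            self.join(person)

    def join(self, person: str) -> None:
        # Adding the same responder twice has no effect.
        if person not in self.responders:
            self.responders.append(person)


incident = Incident("Service failing most requests", "alice", "bob", "carol")
incident.open_channel("#inc-service-outage")
incident.join("dave")  # a responder pulled in later
```

Keeping the channel and role assignments in one place means anyone joining later can see at a glance who the IC, OL, and CL are.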

A. Look for ways to mitigate user impact and deploy the mitigations to production.

Once the communication channel has been established, you should focus on mitigating the impact on users. This could involve implementing a temporary fix, such as rolling back a recent change, to get the service back up and running. It is essential to prioritize user impact mitigation over finding the root cause of the issue, as the goal at this point is to get the service back to a stable state as quickly as possible. Root-cause analysis can wait until the service is stable.

B. Contact the affected service owners and update them on the status of the incident.

In parallel with user impact mitigation, you should also be in contact with the affected service owners and update them on the status of the incident. This will help them understand the scope and impact of the incident on their systems and can help them prepare for any necessary follow-up actions once the incident has been resolved.
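A status update to service owners is easier to read when it follows the same shape every time. The helper below is a sketch of one possible format; the function name, the field names, and the example values are assumptions for illustration, not a standard template.

```python
def format_status_update(service: str, impact: str, action: str,
                         next_update_minutes: int) -> str:
    """Build a short plain-text update for owners of dependent services."""
    return (
        f"[INCIDENT] {service}\n"
        f"Impact: {impact}\n"
        f"Current action: {action}\n"
        f"Next update in {next_update_minutes} minutes"
    )


message = format_status_update(
    service="infrastructure service",
    impact="most requests failing; dependent systems affected",
    action="mitigation in progress",
    next_update_minutes=30,
)
print(message)
```

Stating when the next update will arrive lets service owners plan without having to ask the responders directly.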

D. Start a postmortem, add incident information, circulate the draft internally, and ask internal stakeholders for input.

Only after the incident has been resolved and the service is stable again should you start a postmortem to review what happened, identify the root cause of the incident, and develop a plan to prevent similar incidents from occurring in the future.

In summary, when responding to a critical incident, it is essential to establish a communication channel between incident responders and leads, mitigate user impact, update the affected service owners, and defer a postmortem until the incident has been resolved.
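The ordering described above can be sketched as a small function that reports which steps are appropriate for a given incident state. The `Step` enum and `allowed_steps` function are hypothetical names chosen for this example; they encode only the sequence given in this explanation.

```python
from enum import Enum
from typing import List


class Step(Enum):
    OPEN_CHANNEL = "establish a communication channel"
    MITIGATE = "mitigate user impact"
    UPDATE_OWNERS = "update affected service owners"
    POSTMORTEM = "start a postmortem"


def allowed_steps(channel_open: bool, resolved: bool) -> List[Step]:
    """Return the steps that are appropriate in the given incident state."""
    if not channel_open:
        # Nothing else is coordinated until responders can talk to each other.
        return [Step.OPEN_CHANNEL]
    if not resolved:
        # Mitigation and owner updates proceed in parallel.
        return [Step.MITIGATE, Step.UPDATE_OWNERS]
    # The postmortem is deferred until the service is stable again.
    return [Step.POSTMORTEM]


# State in the question: roles are assigned, but no channel exists yet.
next_steps = allowed_steps(channel_open=False, resolved=False)
```

For the state in the question, the function returns only the channel step, which matches answer C.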