Preventing Severe Incidents: Strategies for Site Reliability Engineering

Best Practices for Preventing Severe Incidents in Site Reliability Engineering

Question

Your company follows Site Reliability Engineering principles.

You are writing a postmortem for an incident, triggered by a software change, that severely affected users.

You want to prevent severe incidents from happening in the future.

What should you do?

Answers

Explanations


A. B. C. D.

B.

In Site Reliability Engineering, the goal is to keep systems reliable. When an incident does occur, the postmortem exists to determine its root cause and prevent a recurrence, not to assign blame. The following steps help prevent severe incidents in the future:

  1. Conduct a postmortem: Run a thorough postmortem aimed at identifying the root cause of the incident. Gather data from all relevant sources, including system logs, user reports, and incident reports, and involve all stakeholders: developers, operations staff, and other key personnel.

  2. Identify the cause: Use the gathered data to identify the root cause. This may involve reviewing the code that was changed, analyzing system logs, or investigating the environment in which the incident occurred. Be as thorough as possible, because corrective actions only prevent similar incidents if they address the actual root cause.

  3. Implement corrective actions: Based on the postmortem findings, implement corrective actions that prevent similar incidents. These may include modifying the code, changing the environment, or adding new monitoring. Involve the same stakeholders in the implementation so that the actions are practical and effective.

  4. Monitor and evaluate: After the corrective actions are in place, verify that they work. This may involve monitoring system logs, running regular tests, or analyzing user feedback. If the corrective actions are not effective, revise them or implement new ones.
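As an illustration of step 4, the following is a minimal sketch of how the effectiveness of a corrective action might be evaluated by comparing error rates before and after it was deployed. The `Window` type, the `corrective_action_effective` function, and the 0.1% threshold are all hypothetical and chosen for the example; a real evaluation would use whatever metrics and targets the service already tracks.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request and error counts over one observation window (hypothetical type)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid division by zero when the window saw no traffic.
        return self.errors / self.requests if self.requests else 0.0


def corrective_action_effective(
    before: Window,
    after: Window,
    max_error_rate: float = 0.001,  # assumed target, not taken from the source
) -> bool:
    """Return True if the error rate improved and is within the target."""
    improved = after.error_rate < before.error_rate
    within_target = after.error_rate <= max_error_rate
    return improved and within_target


# Example: counts observed before and after the fix was rolled out.
before_fix = Window(requests=10_000, errors=250)
after_fix = Window(requests=10_000, errors=5)
print(corrective_action_effective(before_fix, after_fix))
```

If the check returns `False`, that is the signal described in step 4 to revise the corrective action or implement a new one.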

Given the answers provided, option B is the most appropriate. Ensuring that test cases that catch errors of the type that caused the incident run successfully before new software releases directly prevents similar incidents: such errors are caught before they reach the production environment and affect users.

Options A, C, and D are not appropriate because they do not focus on preventing similar incidents. Identifying the engineers responsible for the incident and escalating to senior management may create a blame culture, which is counterproductive to preventing incidents. Following up with the employees who reviewed the changes is not a proactive approach to prevention. Designing a policy that requires on-call teams to immediately call engineers and management when an incident occurs is reactive and does not address the root cause.
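The approach in option B can be sketched as a regression test that reproduces the incident's failure class, plus a gate that blocks a release unless the test passes. Everything below is hypothetical and for illustration only: `parse_timeout` stands in for the code whose change caused the incident, and a blank configuration value is the assumed trigger. It uses Python's standard `unittest` module.

```python
import unittest


def parse_timeout(raw: str, default: float = 30.0) -> float:
    """Hypothetical function whose faulty change caused the incident.

    Assumed failure mode: a blank value crashed the service.
    The fixed version falls back to a default instead.
    """
    text = raw.strip()
    if not text:
        return default
    value = float(text)
    if value <= 0:
        raise ValueError(f"timeout must be positive, got {value}")
    return value


class TimeoutRegressionTests(unittest.TestCase):
    """Tests that encode the type of error seen in the incident."""

    def test_blank_value_uses_default(self):
        # The input assumed to have triggered the incident.
        self.assertEqual(parse_timeout("   "), 30.0)

    def test_valid_value_is_parsed(self):
        self.assertEqual(parse_timeout("2.5"), 2.5)

    def test_non_positive_value_is_rejected(self):
        with self.assertRaises(ValueError):
            parse_timeout("0")


def release_gate() -> bool:
    """Return True only if every regression test passes."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(TimeoutRegressionTests)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

A release pipeline could call `release_gate()` and refuse to deploy when it returns `False`, so that a change reintroducing the same class of error is stopped before it reaches users.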