SQLServerCentral Editorial

Incident Review


Something will always go wrong with our software systems. Issues might happen regularly or rarely, but at some point something will break and a client will complain. Often the way we handle these situations determines whether our customers continue to do business with us or look for an alternative.

In my career, I've been a part of many incidents, many of which were service outages for customers. I've attended some as a technical person working to diagnose the issue. In others, I've been the developer who had to fix code. During many, my role was as a manager trying to ensure information moved smoothly between the people involved and that everyone double-checked that our "fix" wouldn't cause another problem. After a problem, I've also had to take part in a post-incident review. Unfortunately, that has happened far less often than it should.

When I read this description of a post-incident review, it was unlike many of the after-action meetings I've attended. In my experience, there is one meeting and someone gets blamed. Senior management is often there, putting pressure on everyone in attendance to "never let this happen again." I haven't known anyone who wanted to go through another outage or another post-incident meeting, but with complex systems, and humans managing them, something is bound to go wrong. We should expect to have incidents in the future.

What we want is for the same type of incident not to be repeated, which comes about when we learn from our mistakes and design better protocols that help us catch errors, often ones meant to keep a human from making a simple slip. We should accept that mistakes will happen and try to find ways to detect problems quickly, limit the scope of impact, and share this knowledge with other workers. Depending on humans to be more perfect in the future isn't likely to be successful.

These days, when I read the post-incident reviews my company publishes after an outage, I find them fact-based, focusing on what went wrong without blaming a person. They include analysis not only of the actual issue but of the conditions that led to the hardware or software failure or the decision that was made. There are learnings about how we might have prevented the problem if we'd had a time machine, while still assuming that humans will make mistakes or a component might fail. There are also suggestions for improvements in hardware, software, training, or monitoring that might help us recover more quickly in the future.

Coming out of an incident with a positive mindset is the best way to try to prevent a repeat of the same incident in the future. This requires that we not only avoid blaming someone for an error but also take steps to limit the potential for future errors. If the issue is someone clicking the wrong selection in a drop-down or pressing "OK" when they meant to press "Cancel", there are limited ways to prevent those issues. However, we can adopt the mindset that an outage is a team failure and build a habit of double-checking each other's work. That's much better than blaming one person and giving the job to another human, who might easily make the same mistake.

Many humans struggle to avoid placing blame on others and to accept that some mistakes will happen. A DevOps mindset, with blameless reviews, instead focuses on how we can do better as a group rather than how we failed as individuals. This little change helps us build a better team, one that often performs better in the future.
