Incidents Will Happen – Prepare Your Technology Team to Address Them

Engineering failures are unavoidable, but effective risk management ensures that incidents do not become recurring problems. Through rigorous testing, structured incident response, and a blameless culture, organizations can turn unexpected failures into opportunities for long-term system improvement and operational resilience.


If the CEO of any company announces that they’re going to bring in a new CTO to finally put an end to incidents, I must warn you that they’ll be wasting their money. Such an assignment is unrealistic, and even the best CTO in Silicon Valley wouldn’t be able to live up to it.

There isn’t a single company in the world that doesn’t have incidents, because incidents are part of the game, and part of life.

Incidents are an inevitable part of engineering, and the role of a strong CTO is not to eliminate them but to build a culture of resilience, collaboration, and continuous improvement. By embracing principles like the matching principle of risk, post-mortem analysis, and the “5 whys” strategy, companies can mitigate the impact of unforeseen events, ensuring that failures lead to learning rather than blame.

Those who have never had to face incidents won’t be prepared to deal with them when they inevitably arise. That doesn’t mean, however, that we should sit back and wait for them to happen. A good CTO, then, is one who establishes a culture that values collaborative work and plans a system of technical checks and balances to deal with incidents, not one who promises to put an end to them.

I really like the book The Black Swan: The Impact of the Highly Improbable, by Nassim Nicholas Taleb, for its idea that the unexpected is an integral part of life and we can, therefore, take advantage of it. The book’s title refers to the fact that, just because we’ve never seen one, we can’t assume that black swans don’t exist. This kind of thinking is also fundamental to the analysis of computer systems. We can’t believe that there aren’t bugs and that the system is correct. We must assume that there are bugs and they just haven’t manifested themselves in production yet. 

Incidents often occur because someone changed some part of the system, sending execution down a different code path and triggering the problem: the manifestation of a bug that had been sitting there for a while. A principle of software engineering is that if a piece of code hasn’t been tested, it has bugs. In large-scale systems, it’s impossible to test every interaction, which means that dormant bugs exist and will eventually appear.
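
As a minimal sketch of how such a dormant bug survives, consider the hypothetical function below (the function name and the discount rule are my own illustration, not from any real system): every existing test exercises only the default path, so the broken branch never runs until an unusual input arrives in production.

```python
def order_total(prices, discount_rate=0.0):
    """Sum an order's item prices and apply an optional percentage discount."""
    total = sum(prices)
    if discount_rate:
        # Untested branch with a dormant bug: the rate is subtracted as if it
        # were an absolute amount, so a 10% coupon only takes off 0.10.
        total = total - discount_rate  # intended: total * (1 - discount_rate)
    return round(total, 2)

# The only path the existing tests exercise, so the bug stays invisible.
assert order_total([10.0, 5.0]) == 15.0

# The first discounted order in production finally executes the broken branch.
print(order_total([10.0, 5.0], discount_rate=0.10))  # prints 14.9, not the intended 13.5
```

Nothing fails loudly here; the defect only shows up when a different input finally steers execution down the path no test ever covered.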

The culture of testing is extremely important, but it’s impossible to predict every situation. Say, for example, you know it always takes you 10 minutes to get your child from home to school based on the experience of the hundreds of journeys you’ve made on the same route. But one day it takes you 50 minutes because a tree fell in the middle of the road and stopped all traffic. Events like this happen, and there’s no way of predicting them. 

Problems arise without anyone anticipating them. Engineering processes are designed to minimize incidents, and these processes should be rigorous enough to reduce their scale. The worst-case scenario is ignoring an incident that happened and allowing it to manifest again. This type of situation is no longer an incident, but a recurring problem.

As an engineering leader, I don’t want recurring problems — only one-off incidents that happen from time to time. When an incident is identified, short-, medium-, and long-term structuring actions must be activated so that it doesn’t happen again. It’s important that we assimilate the logic of the black swan and prepare ourselves to mitigate incidents quickly when they happen, minimizing the impact on our customers. 

The late Google engineer Luiz André Barroso used to tell the story (I’m not sure if it’s true or legend) of a large-scale incident in which one of Google’s data centers became totally disconnected. When people investigated the cause of the service interruption, they discovered that it was…a horse. Somewhere near a Google data center in South America, a horse died, and a deep hole was dug to bury it. In the process of digging, they ended up hitting the underground network cables connecting the data center to Google’s network, taking it completely offline.

What can we learn from this story? Could this horse incident have been avoided? Probably not, but once it was discovered that this possibility existed, it became necessary to question whether the cables should be installed even deeper, or whether they should be encased in some kind of cut-proof metal.

The main goal during an incident is to get the system back up and running. Afterwards, the follow-up usually involves a meeting of the engineers involved, who clean up anything that’s been left behind, investigate and identify the root cause, and write the post-mortem, a report describing what happened and proposing improvements for the future.
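
As a minimal sketch (the structure and field names below are my own assumptions, not a prescribed template), a post-mortem can be reduced to a handful of fields that force the report to cover impact, cause, and follow-up:

```python
from dataclasses import dataclass, field

@dataclass
class Postmortem:
    """A minimal post-mortem record; the field names are illustrative, not a standard."""
    title: str                   # one-line description of the incident
    impact: str                  # who or what was affected, and for how long
    timeline: list[str]          # key events from detection to recovery
    root_cause: str              # the underlying cause, not just the trigger
    action_items: list[str] = field(default_factory=list)  # short-, medium-, and long-term fixes
```

Whatever format is used, the point is that the report captures both what happened and the structuring actions that keep it from happening again.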


For the post-mortem, I use the “5 whys” strategy developed at Toyota to solve quality problems with its products. The technique involves asking (one way of writing down the resulting chain is sketched after the list):

  1. What was the problem? A horse died! So what?
  2. Why did it happen? The hole they dug to bury the horse cut the server cable. Was the incident a result of the complexity of the system? No.
  3. Why did the next thing happen? Ah, so the root cause is the cut cable, not the horse’s death.
  4. How likely is it that this will happen again? It may not be a horse burial, but it could be someone digging another deep hole.
  5. How can we make sure the problem won’t happen again? From there, we start thinking about some actions so that similar problems don’t happen in data centers built in the future.
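
Purely as an illustration (the wording below is my paraphrase of the chain above, not a fixed format), the answers can be recorded as simple question-and-answer pairs and dropped straight into the post-mortem:

```python
# A hypothetical way to record the "5 whys" chain from the horse story.
five_whys = [
    ("What was the problem?", "One of the data centers went completely offline."),
    ("Why did it happen?", "The hole dug to bury a horse cut the cables feeding the site."),
    ("Why did the next thing happen?", "The root cause is the cut cable, not the horse's death."),
    ("How likely is it to happen again?", "Maybe not another horse burial, but someone could dig another deep hole."),
    ("How do we prevent it?", "Bury the cables deeper or encase them in cut-proof material."),
]

for question, answer in five_whys:
    print(f"{question}\n  -> {answer}")
```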


The purpose of looking for the root cause and writing the post-mortem is to generate learning and prevent similar problems from recurring. An important point is that all of this must take place in a blameless environment. Leadership must constantly reinforce blamelessness as part of the company culture; otherwise, no one will believe it. When holding a post-mortem forum where an incident is reviewed with the company’s engineering team, it’s important to state, “Don’t forget that we’re presenting the post-mortem here, and it’s for learning, not for pointing fingers.”

In principle, no one is responsible for the incident; it just happened. And the more checks and balances there are in the engineering life cycle, the more the focus is taken away from looking for culprits. It’s true that, in most cases, an incident is caused by new code someone wrote, but the validations in place should have ensured that it wouldn’t become a problem.

Human failures are to be expected. This needs to be integrated into every company’s risk management system. 

About Marcus Fontoura
Marcus Fontoura has spent more than 20 years in big tech companies and has been at the forefront of industry-shaping technology innovations, from computational advertising to cloud computing to fintech. He is now the CTO at Stone, a leading provider of financial technology and software solutions, and has played a role in the company’s technological transformation. His new book, A Platform Mindset: Building a Culture of Collaboration (8080 Books, Feb. 11, 2025), shares how companies can expand and scale processes to bring about competitive advantages. Learn more at fontoura.org/.
