The Psychology of Failure Reporting
- Alex Cowhig
- Jul 9, 2021
- 8 min read

Key thoughts discussed
· Ensure your alerting process isn’t noisy as people will find a way to silence the noise and your alerts will become meaningless
· Consider how the recipient of the alerts will react in the long term and how you want them to react
· Consider the level of independence of your notification service from your business processes so that they do not fail at the same time or under the same conditions
· Don’t target full automation but rather drive out unnecessary inefficiency and achieve scalability
Summary
When an IT process fails, we usually need to intervene to get things back on track, or at the very least let our customers know that there will be a delay in the service we are providing to them. It's therefore important to have appropriate alerting in place, and to think carefully about how we do it, because how people react to different alerting styles really matters.
Some Detail
In theory, the requirement to report IT failures is simple and may read something like:
· As a <business operations manager>,
· I want to be notified every time the <customer phone number update process> fails,
· So that I can <ensure that we hold off making calls to new customers until it is run>.
In the above user story, the elements in brackets could be any role, process and action which fits your operation.
Essentially, the requirement states… “I want to know every time something goes wrong with the named process”.
The problem comes when we consider the person receiving the notifications and some of the real-world scenarios that occur.
For example, what happens when the process works well for a few weeks and doesn’t suffer a failure?
This is good news, right?
In reality, the person we're alerting to these failures starts getting twitchy and thinking things like "are the alerts getting blocked?" or "has the alerting process stopped working?" To be fair, while frustrating for those who have built and are running such processes, these are perfectly reasonable questions. Either of those things may actually be happening. How can someone tell whether no notification of a failure means that the process is working perfectly, or that something is indeed blocking the reporting of the errors? The truth is that they can't tell, so these concerns are both reasonable and rational.
If we subsequently prove that over the next few months everything is ok and that no report of a failure is good news, we risk complacency. Now, if the system reporting the failures really does suffer from an issue, no one is going to raise a concern as we have conditioned them to think that no news is good news.
Let’s look at our options on how to deal with this challenge when designing failure notifications.
The most common and immediate reaction is to report successes as well as failures each time the process runs; a rough sketch of how this often looks in practice follows the revised user story below.
· As a <business operations manager>,
· I want to be notified of the success or failure state of the process every time the <customer phone number update process> runs,
· So that I can <ensure that we hold off making calls to new customers until it is run successfully>.
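To make this concrete, here is a minimal sketch of how this "alert on every run" style is often wired up: a thin wrapper around the job that sends a notification whichever way the run goes. The process name, the job body and the send_alert delivery stub are all hypothetical placeholders, not a description of any particular system.

```python
# Minimal sketch of "alert on every run" – illustrative only.
# run_phone_number_update and send_alert are hypothetical stand-ins
# for your real job and your real delivery channel (email, SMS, report).
import datetime
import logging

logging.basicConfig(level=logging.INFO)


def run_phone_number_update():
    """Placeholder for the real customer phone number update process."""
    pass  # the real work would happen here


def send_alert(subject, body):
    """Stand-in for email/SMS delivery so the sketch stays self-contained."""
    logging.info("ALERT: %s | %s", subject, body)


def run_with_alerting():
    started = datetime.datetime.now().isoformat(timespec="seconds")
    try:
        run_phone_number_update()
    except Exception as exc:
        # Scenario 2 below: the process failed, so say so.
        send_alert("Phone number update FAILED", f"started {started}: {exc}")
        raise
    # Scenario 1 below: the process succeeded – and we alert on that too.
    send_alert("Phone number update succeeded", f"started {started}")


if __name__ == "__main__":
    run_with_alerting()
```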
In theory, this deals with the challenge, but I don't personally think this is a great idea. Three scenarios now exist:
1 – You receive a success alert: All is good in the world.
2 – You receive a failure alert: You know there is a problem with your customer phone number update process.
3 – You don’t receive an alert: You know there is a problem with your alerting process.
This is shown on the grid below.

Notice though that a few things have happened.
Firstly, we are now required to factor the failure reporting mechanism into our thinking. As a business operations manager, we don't want to care about this, but we're having to. It also complicates things for us: all we really want to know is whether our process has run and whether our business has been impacted, so that we can take the necessary action.
Secondly, in scenario 3, all we know is that there is an issue with the alerting process. We’re still left not knowing if there is an issue with our main process so while this information is somewhat useful to prompt action, we actually don’t have the information we’re most interested in. We don’t know if we can run business as usual and call our new customers or if we have to ask the operations team to focus on other things for the time being.
Lastly, and definitely not least, we're now receiving an alert every time the process runs. Consider what this means for a daily process (and bear in mind that many processes today run much more frequently): around 30 notifications per month. How are these notifications being delivered? Email is common, text messages too, or perhaps reports which the operations manager needs to access.
What is the user experience if they're responsible for several such processes? How many emails are they now receiving on an ongoing basis?
If most of these processes are stable and run correctly 99% of the time, the user has an incredible amount of 'background noise' – information that has to be filtered out because it never calls for any action.
The result is that the operations manager finds a way to silence this distraction – for example, an email rule that directs these alerts and notifications into a folder to be checked every once in a while, or checking the reports less frequently when under other pressures, because they rarely contain actionable information.
Ultimately, this silencing drives all alerts – positive and negative alike – into the background, and then when the system fails, despite all the effort invested in alerting, no one actually notices.
If this all sounds very theoretical, I can assure you that it isn't – I've seen it happen time and again, and I've seen the real-world consequences for businesses when alerts are missed. The problem is exacerbated when people change roles: the further into the background the alerts have faded, the less likely they are to feature in a handover to the new operations manager, and in some cases they are lost completely.
The alternative
Here's what we need: a process for reporting failures that…
· Tells us clearly when a process has failed, so action can be taken.
· Isn't so noisy that people want to silence it.
· People trust hasn't itself failed when it doesn't report a failure.
Let’s look a bit deeper into the challenge of automated notifications
Firstly, we need to be wary of dependencies-in-common between the business process and the notification process. An example here might be that both processes are running on the same server or cloud service. In this instance, if the server or service is unavailable, we get the worst type of scenario – a silent failure. This kind of failure is not reported, and we carry on with business oblivious to the fact that things are not as they should be. Not only can the harm increase over time because we haven't taken corrective action, but it can actually compound the initial failure. Consider, for example, a process reporting a change in medication, where we continue to send out the old medication or the wrong dosage, or a process notifying us of a customer bereavement or change of address. Acting when we're oblivious to these kinds of changes in customer circumstances can have real consequences.
So, we need to ask "To what extent is our notification service independent from the services it is monitoring?" and "How will our monitoring and notification service be affected in the event of certain types of failure?"
In thinking about this independence, you could have the notification service on a separate dedicated server, for example, or perhaps keep it in-house when you move business processes to a cloud-based solution, or outsource it to a different supplier. There are lots of things that can be done here, and they will all depend on the set-up in your organisation and the level of risk posed by an unnoticed failure. The point here isn't to provide a solution for every scenario but to highlight the importance of these questions.
I'm going to push the concept towards its extremes now, so this might feel a little absurd, but pushing an idea to extremes can help us look at it from different angles and is a useful tool for thinking about this topic – so please stay with me.
Let's say we now have our business process, and we have a monitoring and notification service watching over it completely independently.
What is to say that we still haven't had a failure of both? If each of these processes fails 1% of the time (approximately 3.5 times a year each for a daily process), there is a chance that they both fail on the same day and we don't get notified. For a single process that chance is small – around 0.01% on any given day, or a few percent over a year, assuming the failures are independent – but if we have 100 business processes or so, then one of them failing on the same day as our notification process becomes close to a certainty.
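For anyone who wants the back-of-the-envelope arithmetic behind those figures, here it is, with the assumptions stated plainly: both processes run daily, each fails on 1% of runs, and the failures are independent.

```latex
% Assumptions: daily runs, each process fails on 1% of runs, failures independent.
P(\text{both fail on the same day}) = 0.01 \times 0.01 = 10^{-4} \approx 0.01\%

P(\text{at least one silent failure in a year}) = 1 - \left(1 - 10^{-4}\right)^{365} \approx 3.6\%

P(\text{at least one silent failure across 100 such processes}) = 1 - \left(1 - 10^{-4}\right)^{365 \times 100} \approx 97\%
```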
Do we then need a separate process monitoring our monitoring process? And what about the possibility of that higher-level monitoring process failing? How can we ever be truly sure that everything is working correctly?
This question is similar in nature to "who polices the police?": if you add a further, higher authority, you can simply repeat the question again – who is monitoring the super-police?
Let's revisit what we're trying to accomplish here, dispense with the absurdity and bring it back to business basics. What we're trying to get to is confidence that things are running correctly, so here's what I would do:
I'd have my business processes running on whatever servers they need to run on, and I'd have a single, separate 'listener' service looking for the completion of that whole range of services.
I'd then create a non-IT-based, non-automated process – yes, I'm talking about a good old-fashioned manual check – on the listener service, at whatever frequency is appropriate to what is being run (say daily, or hourly).
What this does is allow you to automate the listening for, and notification of, failures in a structured way for your services at scale – through your listener service – and then have a single point where manual checks are required. You've automated any number of check-ins on your IT processes and converted them into a single check on the listener service. This is what scalability is all about. It's not about eliminating the manual check altogether; it's about changing your problem from one where manual checks exist for your processes on a 1:1 basis to one where a single manual check covers any number of processes, and that number can grow almost without limit for very little additional investment. The investment becomes proportionally smaller the more processes you add. Ultimately, this is the equation that drives scalability – invest once and reap the rewards over and over again at scale – and in this case, an investment in some old-fashioned manual checking pays dividends.
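As a concrete illustration of the shape of this, here is a minimal sketch of such a listener. The job names, the check-in windows and the send_alert stub are hypothetical placeholders; in practice the completions would come from wherever your processes already record them (a database table, a message queue, a log). The point is the structure: many processes check in, one service notices who is missing, and one human checks that this one service is alive.

```python
# Minimal sketch of a centralised 'listener' service – illustrative only.
# Job names, windows and delivery are hypothetical placeholders; in practice
# completions would be read from a database table, message queue or log.
import datetime
import logging

logging.basicConfig(level=logging.INFO)

# Processes the listener watches, and how recently each should have completed.
EXPECTED = {
    "customer_phone_number_update": datetime.timedelta(days=1),
    "address_change_sync": datetime.timedelta(days=1),
    "bereavement_flag_import": datetime.timedelta(hours=6),
}

# Last successful completion seen for each process.
last_completed = {}


def record_completion(process_name):
    """Called (or fed from a queue/table) whenever a process finishes successfully."""
    last_completed[process_name] = datetime.datetime.now()


def send_alert(message):
    """Stand-in for the real delivery channel (email, SMS, chat)."""
    logging.warning("ALERT: %s", message)


def check_overdue(now=None):
    """Alert only on processes that have NOT checked in within their window."""
    now = now or datetime.datetime.now()
    overdue = []
    for name, window in EXPECTED.items():
        seen = last_completed.get(name)
        if seen is None or now - seen > window:
            overdue.append(name)
            send_alert(f"{name} has not completed within the last {window}")
    return overdue


if __name__ == "__main__":
    record_completion("customer_phone_number_update")
    check_overdue()  # flags the two processes that have not checked in
```

The single manual check the operations team keeps is then on this service itself: is the listener running, and has it been able to send recently?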
There are improvements that can be made to this model, such as adding a 'heartbeat' to servers and services so that they report that they are up and running on a regular basis, but in my view and experience these are secondary to the core principles described above and work alongside them.
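In that spirit, a heartbeat is just another thing the listener can watch for: each server or service periodically says "I'm alive", and silence beyond some window becomes a reason to alert. A minimal, hypothetical sketch of the sending side follows; the listener URL and the interval are placeholders, and you would use whatever transport your listener already accepts.

```python
# Minimal heartbeat sender sketch – illustrative only.
# The listener URL and the interval are hypothetical placeholders.
import time
import urllib.request

LISTENER_URL = "http://listener.internal.example/heartbeat/batch-server-01"
INTERVAL_SECONDS = 300  # report "still alive" every five minutes


def send_heartbeat():
    # A simple HTTP GET; the listener records the time of the last beat
    # and alerts if a server goes quiet for longer than its window.
    urllib.request.urlopen(LISTENER_URL, timeout=10)


if __name__ == "__main__":
    while True:
        try:
            send_heartbeat()
        except OSError:
            # If the listener can't be reached, keep trying; the manual check
            # on the listener itself is what catches a dead listener.
            pass
        time.sleep(INTERVAL_SECONDS)
```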
In summary, my advice is to only report failures to the people who need to take action.
Do this from a centralised and independent monitoring and notification engine.
Have a comparatively small team of people looking after this centralised service.
It is my belief that this set-up gives end users the confidence to react – and not to react – appropriately, and does so in a scalable way.
First Published 09/07/2021
All views expressed in this article are solely those of the author
© Alex Cowhig 2021