Failsafe
- Alex Cowhig
- Jul 24, 2021
- 10 min read
Applying a key principle learned from many years of mechanical engineering to software processes

Image by Alex Cowhig © 2021
Key thoughts on this topic
· What a failsafe is and how it works
· The Otis Safety Elevator
· The Deadman’s Handle
· How to apply a failsafe in IT processes
Summary
In some industries and applications, when things go wrong the results can be disastrous and even result in death. In today’s computer-connected world, this can also be the case in software. Even when it isn’t, we should consider the consequences of a system failure and determine if and how we should put mechanisms in place to ensure that systems fail in a way that is acceptable and which protects users from harm.
We have been doing this for much longer in mechanical engineering, and there is much we can learn by studying these systems, drawing out the key principles, and seeing how those same principles can be implemented in software processes.
The Otis Elevator
I’m sure most of you reading this right now have used an elevator – or a lift. We step into these and ride many floors up in a large building. We think of strong cables attached to a pulley system and motor at the top of a large shaft, connected to the car we’re riding in. While many of us remain nervous about riding in them, we trust them with our lives in the knowledge that safety systems are in place.
I’m equally sure we’ve all seen the movies, where someone cuts these cables and the lift plummets to smash down many floors below.
The reality, however, is that if you wanted to see a lift plummet like this you would have to override the safety systems first, as there are several ways that lifts are protected from this kind of failure. The mechanisms in place don’t require someone to keep watch on what is going on, or to take action when signs of a failure appear – they kick in automatically if a failure occurs, and precisely because a failure has occurred. They are built inherently into the fabric of the design in such a way as to control the failure itself, mitigate its worst effects, and ensure that any failure is safe.
Let’s look at a well-known example of this type of mechanism in the Otis Safety Elevator:
Before lifts became popular, they were used almost exclusively for moving goods, as they were considered unsafe for transporting people. Elisha Otis designed a system (shown in the picture accompanying this article) in which arms connected to the elevator would, by the action of a tensioned spring, be forced into the sides of the rails. The way to stop these brakes from being applied was to apply an upward force counteracting that of the springs, pulling the arms away from the rails and allowing the elevator to move up and down. What this meant was that by default – without a cable counteracting the spring with an upward pull – the elevator was going nowhere. Thus, if the cable were ever to break, the default of ‘being stuck’ would arise. While no one wants to be stuck in an elevator which can’t move up or down, this is a whole world better than being stuck in one with no cable and no brake. The system was designed so that it would fail safely.
The development of this failsafe system and the ones that came after it generated trust and created the possibility for the much taller buildings we see in cities around the world today.
Deadman’s Handle
Let’s look at another example of a mechanical failsafe – the deadman’s handle.
The principle of a deadman’s handle or deadman’s switch is that for a piece of machinery to keep operating, the operator needs to continually confirm that they want the operation to continue. These types of failsafe mechanisms are so common that they often go completely unnoticed yet we’ve probably all used them. Any time you’ve used a power tool, lawnmower, even a hand blender for mixing food, you’ve likely been using a deadman’s handle. Any time you’ve operated something that requires you to hold a trigger to keep the device running, you’ve used a deadman’s handle.
In a deadman’s handle, if you release the trigger, either deliberately or because you are no longer capable, the motor stops. This prevents the user from causing harm (or greater harm) to themselves or others should they become incapacitated. It’s interesting to note that a deadman’s handle also protects, to a large degree, against operator failure such as fatigue, since a constant input of energy is a requirement for the continued operation of the machinery.
When we look at the examples above, it’s clear that failsafe mechanisms have played a critical part in creating a trusted environment for the world we live in. This trust has led to wider use of the devices in which these failsafes exist and has spurred progress. Without the safety systems in place in elevators, it is claimed that the cities we know today would simply not have been possible and if power tools and heavy plant machinery had on-off switches rather than triggers, it seems certain that there would have been many more accidents and the use of automated power tools would be less prevalent too.
Attributes of a failsafe
Looking at the examples above, here’s what I would say characterises a good failsafe mechanism.
A Failsafe:
1. Determines how a process will fail – it is not designed to prevent failure
· In the Otis Safety Elevator, the cable may still break
· In a deadman’s handle, the operator can still get fatigued
2. Establishes a default state for the system which is safe (remember that safe here means that it avoids the worst-case scenario and does not mean a perfect outcome - consider being trapped in a lift vs plummeting 30 floors…)
· In the Otis Safety Elevator – the brakes are on by default stopping the lift
· In a deadman’s handle – the motor is off
3. Requires the system to receive constant input in some way – a form of energy or force – to allow operation
· In the Otis Safety Elevator – the cable needs to apply an upward pull to counter the spring
· In a deadman’s handle – the operator must always hold the handle closed
A Failsafe isn’t:
· An extra process that you run if something fails
· A notification for someone to do something in the event of a failure
· Something that requires the input of additional energy or force in the event of a failure (people time, processing power, etc).
Applying a failsafe in Software and IT
When we build IT software, we’re often looking for ways to reduce failures, to improve the user experience and improve efficiency – for example, by reducing the time spent fixing bugs or restarting processes. While this is all good, failures can still occur, and the implementation of failsafe mechanisms can play an important part in preventing damage and avoiding the worst-case scenarios.
Let’s look at an example and work through what a failsafe may look like in this scenario:
We will all have had to enter a username and password when accessing systems. We do this every day to access our bank account, book a holiday online or buy groceries.
Behind the scenes, the IT departments in all organisations use usernames and passwords to control access to administration functions, preventing most of the company staff from accessing these systems and giving only a few people full ‘system administrator’ access.
With critical systems, even this system administrator access is blocked, to prevent malicious damage to the system and to harden it against cyber-attacks. When someone does need to access the system with an elevated level of security permissions, a process to grant them additional access on a temporary basis is required. This is sometimes referred to as a “break-glass process”, in reference to breaking the glass to reach an emergency key, or pressing a button behind a glass panel that prevents accidental use.
If we don’t think carefully about how this process will work, though, it is very easy to build it in a way that allows the break-glass process to fail, leaving our system protections weakened long after the engineer has completed the required work.
What would failure look like in a break-glass process?
Here’s the process:
· Aim: To ensure that we don’t give anyone additional access to the system unless they truly need it.
· Operation: Grant someone access once they have been approved as needing it; remove it once they no longer need it.
· Potential failure under operation: We leave people with access that they shouldn’t have, which risks data leakage or system damage
Looking at the above, there are lots of ways to protect against this kind of failure, including a weekly audit of accounts with elevated privileges, or having two people check that accounts have been removed or reset. These don’t meet the three criteria of a failsafe, however, as they all require operator involvement to work. So we’re looking for a failsafe process: something which, if we or the system operator fail to act, ensures that the process fails in a way that prevents the identified scenario from being the outcome.
When we apply our tests, we can see ways to change the process to create our failsafe.
Here’s a very simple example of what we could do:
Have a system process that runs automatically, say once per day, which removes all protected privileges from accounts. In this way, you can add access for a system engineer and, if you fail to do anything else, that access is automatically removed at the end of the day.
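A minimal sketch of this sweep might look like the following. The identity-directory API (list_users, revoke_privilege) and the privilege names here are hypothetical placeholders for whatever your identity provider actually offers:

    # Minimal sketch of the daily sweep; the directory API and
    # privilege names are illustrative, not a real library.
    PROTECTED_PRIVILEGES = {"sysadmin", "db_admin"}

    def daily_privilege_sweep(directory):
        """Unconditionally remove all protected privileges.

        Scheduled to run once per day. Because removal is the default,
        forgetting to do anything else still leaves the system safe.
        """
        for user in directory.list_users():
            for privilege in PROTECTED_PRIVILEGES & set(user.privileges):
                directory.revoke_privilege(user.id, privilege)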
This passes our first test: in the event of failure or inaction, or if we grant access to the wrong person, the access will still be removed.
It passes our second test, as the default for the system is that it will return to a predetermined, safe state.
And it passes our third test, as the permissions will be removed unless we actively re-add them.
While this passes all the tests, we can improve upon it a lot. For example, the engineer may require only a few minutes of access to the system to complete the required work, or they may require several days. In either case, we either allow the access to remain in place much longer than necessary or have it removed partway through, causing unnecessary inconvenience.
A good improvement may then be to create an approved expanded-permissions table that records the user, the approved privileges, and the time period for which we are approving them. I’d suggest that the time period have some sensible maximum depending on the business process or expectations – a few days out from the current date at most.
In this scenario, we could have our ‘remove privileges’ code running on a regular basis (every 15 minutes or every hour, say), or set up event listening to trigger on reaching the date and time set on the permission record. The process would remove access wherever there is no active record in the table (a record whose permission date is in the future) showing that the user should have access.
This would allow a very short or a much longer time period to be set for the permissions and, if longer were required, would allow the record to be updated before it expires (or another record added with additional approvals), ensuring that access is maintained.
If you ever want to remove access early, you can also use this same process to trigger the removal simply by expiring the record.
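To make this table-driven version concrete, here is a rough sketch under the same assumptions, with a hypothetical approvals store whose records carry a user, a privilege, and an expiry timestamp:

    from datetime import datetime, timezone

    PROTECTED_PRIVILEGES = {"sysadmin", "db_admin"}  # illustrative names

    def expiry_sweep(directory, approvals):
        """Revoke any protected privilege with no active approval record.

        Run every 15 minutes, every hour, or on an expiry event. An
        approval is active while its expiry timestamp is in the future,
        so updating or expiring a record changes the next pass's outcome.
        """
        now = datetime.now(timezone.utc)
        active = {(grant.user_id, grant.privilege)
                  for grant in approvals.all_grants()
                  if grant.expires_at > now}
        for user in directory.list_users():
            for privilege in PROTECTED_PRIVILEGES & set(user.privileges):
                if (user.id, privilege) not in active:
                    directory.revoke_privilege(user.id, privilege)

Note that removal still requires no input at all – the constant input here is the presence of an unexpired approval record, which preserves the failsafe character of the simpler version.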
This is just one example, but I think the principles of a failsafe mechanism described here will prove useful in thinking about processes and how to protect against the worst outcome of failures in software and process design far more broadly.
Some further factors
Here are a few other things to consider in your IT processes when thinking about how a process fails and how to make it failsafe…
Listen for success, not failure
Most processes have dependency chains – a process waits on one or more other processes to complete, or for a certain date or time to pass. When thinking about these chains, are you checking that nothing has failed, or are you looking for confirmation that things have worked correctly upstream? The latter is far more likely to give you a process that fails safely: you’re looking for positive confirmation before the process can continue, and in the event of a failure that confirmation will not arrive, preventing the process from running. Listening for a failure message could be disastrous if that message arrives late or simply doesn’t get through for some reason.
This is similar to the principle used when trying to communicate clearly in a rescue situation – you’re taught never to shout things like “Don’t grab me”, because if part of the message is missing the receiver may simply hear “Grab me”. Use this principle and ensure your process listens for positive confirmation before progressing where it has important dependencies, rather than listening for an error.
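As a sketch of how this might look in code – with an illustrative marker file standing in for whatever success signal your upstream job actually emits – the downstream step runs only when positive confirmation exists:

    from pathlib import Path

    # Marker written by the upstream job on success (path is illustrative).
    SUCCESS_MARKER = Path("/var/run/nightly_extract.SUCCESS")

    def run_if_upstream_succeeded(downstream_step):
        """Run the downstream step only on positive confirmation."""
        if SUCCESS_MARKER.exists():
            downstream_step()
        else:
            # No confirmation means no action: a lost or late failure
            # message cannot trick us into running, because we never
            # listen for failure in the first place.
            print("Upstream success not confirmed; skipping downstream step.")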
Does everything check out?
Consider the default behaviour in the event of something not being right. I was once really annoyed with my car as it went into what’s sometimes called ‘limp mode’. In this mode, the car continues to drive but has very little power. This happens when something doesn’t check out and the engine management system can’t work out what it is, but it isn’t a disastrous failure. Limp mode is employed, restricting the throttle, to ensure that if there is a problem the car’s pace is slowed and the risk reduced. Consider what your process should do in the case where one of the inputs doesn’t conform to expectations.
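A software analogue might look like this sketch, where a non-conforming input drops the process into a restricted default – quarantine for human review – rather than failing hard or charging ahead. The order fields and handlers are purely illustrative:

    def validate(order: dict) -> bool:
        """Hypothetical check that the expected fields are present and sane."""
        return {"id", "amount"} <= order.keys() and order["amount"] > 0

    def quarantine(order: dict) -> None:
        """The safe default: park the order for human review."""
        print(f"Order {order.get('id')} quarantined for review")

    def fulfil(order: dict) -> None:
        print(f"Order {order['id']} fulfilled")

    def process_order(order: dict) -> str:
        # If anything doesn't check out, drop into 'limp mode' rather
        # than crashing or processing suspect data at full speed.
        if not validate(order):
            quarantine(order)
            return "limp"
        fulfil(order)
        return "ok"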
Hard to undo
Failsafes can be used even when the consequences to customers aren’t obvious, so consider whether you need failsafes in place for processes where something may be hard to rectify. Consider what happens if you pay all staff or refund your customers twice – getting that money back is going to be very difficult, hugely disproportionate to the effort required to stop the process and require an additional input of energy to confirm everything is right before releasing the brake.
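One hedged sketch of such a brake: the payment run executes only when a fresh, explicit approval exists and the batch hasn’t already been paid, so doing nothing – or accidentally running it twice – leaves the money where it is. The approvals store and ledger here are hypothetical:

    from datetime import datetime, timedelta, timezone

    APPROVAL_MAX_AGE = timedelta(hours=1)  # illustrative freshness window

    def run_payment_batch(batch, approvals, ledger):
        """Pay the batch only with a fresh approval and no prior run."""
        approval = approvals.latest_for(batch.id)
        now = datetime.now(timezone.utc)
        if approval is None or now - approval.timestamp > APPROVAL_MAX_AGE:
            return "blocked: no fresh approval"   # brake stays on by default
        if ledger.already_paid(batch.id):
            return "blocked: batch already paid"  # guards the hard-to-undo step
        ledger.pay(batch)
        return "paid"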
Final Thoughts
Changing how we think about failure in IT systems isn’t straightforward, and I hope this helps some of you start that journey or improve your processes.
Failsafes are of course not a substitute for other process improvements designed to reduce failure frequency or mitigate risks in other ways, but they are a useful way to think about failure – accepting that when it does occur, we want to decide now how our systems will fail.
The three principles I suggest for thinking about a failsafe mechanism are that it must:
1. Determine how a process will fail – it is not designed to prevent failure
2. Establish a default state for the system which is safe
3. Require the system to receive constant input in some way – a form of energy or force – to allow operation
If you’re interested in the Otis Elevator design and my explanation here isn’t clear, why not check out the first few minutes of this video which shows well how the Otis Elevator worked: “The History and Evolution of the modern Elevator”: https://www.youtube.com/watch?v=lOjSs_MhIsI
or this rather dramatic one from the Discovery Channel: “Death-Proof Elevator - How We Invented the World”: https://www.youtube.com/watch?v=yNL094jbrGU
Note: These videos are not created by me and I have no affiliation in any way to the creators.
First Published 24/07/2021
All views expressed in this article are solely those of the author
© Alex Cowhig 2021