The failure was described as a “one in 15 million chance”: NATS had processed 15 million flight plans without ever hitting the problem before. Two identically named waypoints forced the system into “fail-safe” mode. And so, flights were grounded.
Countless organisations suffer operational mayhem due to an unexpected single point of failure in their systems. For example, in 2020, Citibank sent out $900 million to all creditors instead of $7.8 million to one creditor.
They were relying on a dated piece of in-house software and missed important manual checks. In 2021, Facebook, Instagram, and WhatsApp were disconnected from the internet globally. This was due to a Border Gateway Protocol error – a single point of failure.
So, what can a single point of failure look like in your own business? And how can you spot them before they cause chaos?
What is a single point of failure?
When everything depends on one thing, you have a single point of failure (SPOF). It’s like having only one key to open a door. Or only one person knowing the password.
This happened to QuadrigaCX, a Canadian cryptocurrency firm. The CEO died and took the passwords for $137 million in customer funds to his grave.
A single point of failure can set you up for a software disaster in many ways. Here are some common scenarios:
- You depend on a single cloud service for data storage
- You use a single firewall for network security
- You manage projects with a single software tool
- You use a single payment gateway for online transactions
- Only one software developer understands your system
You can also risk single points of failure due to over-reliance on key employees, suppliers, or pieces of equipment. The theme is always the same. If that one thing failed, there could be dire consequences.
The consequences of a single point of failure
However you look at it, the consequences of failure would not be good. They’d impact your operations, cost you money, and risk your reputation. Let’s look at that in more detail with software SPOFs in mind.
When your system unexpectedly fails, your operations will feel a hit. In turn, that damages your customer service as they experience delays or limited services. Low productivity and missed deadlines are common symptoms.
Downtime always costs money. You lose sales while incurring costs to recover the situation. Depending on your IT support this can take days – even weeks. And if you break service level agreements you may have to pay compensation.
Reputational damage can be a long-term consequence that affects how much people trust your brand. A high-profile failure can also prompt negative media coverage seen by millions. Recovering from such damage – as UK air traffic control will know – takes time.
The consequences of suffering a single point of failure can go deeper. Your team will feel the strain as they rush to put things right. Relationships may become strained with some feeling disillusioned about the business.
Meanwhile, you could now lag behind your competitors or become embroiled in legal challenges about data breaches. There’s nothing good about a single point of failure being triggered.
Case study: The “Black Book” scenario
Back when social media meant two people reading from the same newspaper, a business analyst was on a quest. He wanted to understand how used or damaged goods were repackaged in a large pharmaceutical company.
Joined by two managers, the analyst stepped into a tiny office in one corner of the plant. He wanted to chat with the employee responsible for the process.
The analyst discovered they used a lot number (unique identifier) for each batch. They keyed this, plus the number of days since the start of the year, into the IT system ahead of repackaging.
But the employee jotted down all the assigned lot numbers in a little black notebook, along with the days (which she counted out on her fingers).
For security, she kept the notebook in her drawer. And this had become the regulatory requirement…
“Are the numbers available anywhere else?” asked the analyst.
“Oh no”, replied the employee, “That’s why I take the notebook home with me!”.
Jaws recovered from the floor, the company developed a custom software tool that managed all assigned lot numbers (and the days of the year). No need to rely on their employee carrying a black notebook around with her.
“Black book” scenarios pose great risk. And they’re incredibly common. Far better to build your exact requirements into software and avoid a single point of failure.
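As a rough illustration of what a tool like the one in the case study might do, the sketch below keeps lot numbers in one shared register and works out the day count automatically. The `LotRegister` class and its method names are hypothetical, and it assumes “days since the start of the year” means the ordinal day (1 January is day 1). The company’s actual tool is not described in that detail.

```python
from datetime import date


class LotRegister:
    """Hypothetical shared register of assigned lot numbers."""

    def __init__(self):
        # Maps each lot number to the day of the year it was assigned.
        self._entries = {}

    def assign(self, lot_number, assigned_on):
        """Record a lot number and return the day of the year for that date."""
        # Refuse duplicates so each batch keeps a unique identifier.
        if lot_number in self._entries:
            raise ValueError(f"lot {lot_number} is already assigned")
        # Assumption: 1 January counts as day 1 (no counting on fingers).
        day_of_year = assigned_on.timetuple().tm_yday
        self._entries[lot_number] = day_of_year
        return day_of_year

    def lookup(self, lot_number):
        """Return the day of the year recorded for a lot number."""
        return self._entries[lot_number]
```

In a real system the register would live in a backed-up database rather than in memory, so the records no longer depend on one notebook or one person.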
How to spot (and avoid) single points of failure in your software
Prevention is always better than cure. Identifying your single points of failure, and reducing the risk or having a plan B, can protect your business from many consequences.
But don’t leave this to Steve in IT. Give it the attention it deserves. Create a high-level, cross-functional team. Allocate time and resources to investigate potentially vulnerable areas in your business. With knowledge, you can determine the best way to eliminate your single points of failure, or at least reduce the risk.
Start by identifying systems that are essential to your business operations. That could include your CRM system, your finance system, or your production planning software. And get granular. What are the specific areas of concern? The areas that, if they failed, would cause the greatest harm.
Map out dependencies
Few systems run in isolation. So, consider how each one interacts with others. It’s useful to visualise this. For example, should your inventory management fail, would that harm your online sales system?
By understanding the dependencies between your critical systems, you can see where your priorities lie.
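If a whiteboard diagram gets unwieldy, a dependency map can also be written down as data and queried. The sketch below is one possible approach, not a prescribed method: the system names in `DEPENDS_ON` are made up for illustration, and `impacted_by` simply walks the map to find everything that relies, directly or indirectly, on a failed system.

```python
from collections import deque

# Hypothetical dependency map: each system lists the systems it relies on.
DEPENDS_ON = {
    "online_sales": ["inventory", "payment_gateway"],
    "inventory": ["database"],
    "crm": ["database"],
    "payment_gateway": [],
    "database": [],
}


def impacted_by(failed, depends_on):
    """Return every system that directly or indirectly depends on `failed`."""
    # Invert the map so we can look up "who relies on this system?".
    dependents = {name: [] for name in depends_on}
    for system, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(system)

    # Breadth-first walk outwards from the failed system.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for system in dependents.get(current, []):
            if system not in seen:
                seen.add(system)
                queue.append(system)
    return seen


def rank_by_impact(depends_on):
    """List systems from most to least damaging if they failed."""
    ranked = [(name, sorted(impacted_by(name, depends_on))) for name in depends_on]
    return sorted(ranked, key=lambda item: (-len(item[1]), item[0]))
```

With this example data, `rank_by_impact(DEPENDS_ON)` puts `database` first, because three other systems stop working without it. That is the kind of system to look at first.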
From here, establish whether these critical systems have redundancy. Should a component fail, do you have a backup? For example, should you lose access to your CRM, do you have a backup of the data elsewhere?
Think about redundancy for your IT people and suppliers too. Should a key person become ill, or a supplier cease to trade, do you have a backup?
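A redundancy check can start as a simple inventory: for each critical need, list what (or who) currently provides it, then flag anything with only one provider. The sketch below assumes a hand-maintained mapping. The entries in `PROVIDERS` and the `find_spofs` helper are illustrative names, not part of any particular tool.

```python
# Hypothetical inventory: each critical need and what currently provides it.
# People and suppliers can be listed in the same way as systems.
PROVIDERS = {
    "crm_data": ["primary_cloud", "offsite_backup"],
    "payment_processing": ["gateway_a"],
    "system_knowledge": ["developer_1"],
}


def find_spofs(providers_by_need, minimum=2):
    """Return the needs that have fewer than `minimum` distinct providers."""
    return sorted(
        need
        for need, providers in providers_by_need.items()
        if len(set(providers)) < minimum
    )
```

Here `find_spofs(PROVIDERS)` returns `['payment_processing', 'system_knowledge']`: one payment gateway and one developer who understands the system, both scenarios from the list earlier in this article.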
Simulations and tests
Another way to get clear on the “What happens if” question is to set up a simulation. Assess your system’s resilience by creating scenarios where different SPOFs fail. What happens next?
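A simulation does not have to be elaborate. One lightweight option is to swap a real component for a deliberately broken stand-in and see whether the rest of the system copes. The sketch below assumes a read path that tries each data store in order. `StoreDown`, `read_record`, `healthy_store`, `failed_store`, and `simulate` are all hypothetical names for illustration.

```python
class StoreDown(Exception):
    """Raised when a data store cannot be reached."""


def read_record(key, stores):
    """Try each store in order and return the first successful read."""
    errors = []
    for store in stores:
        try:
            return store(key)
        except StoreDown as exc:
            errors.append(str(exc))
    raise StoreDown("all stores failed: " + "; ".join(errors))


def healthy_store(data):
    """Build a stand-in store that answers from an in-memory dict."""
    def read(key):
        return data[key]
    return read


def failed_store(name):
    """Build a stand-in store that always fails, to simulate an outage."""
    def read(key):
        raise StoreDown(f"{name} is unavailable")
    return read


def simulate(scenarios, key):
    """Run each failure scenario and report whether the read survived."""
    results = {}
    for label, stores in scenarios.items():
        try:
            read_record(key, stores)
            results[label] = "survived"
        except StoreDown:
            results[label] = "outage"
    return results
```

Running `simulate` with a failed primary store plus a healthy backup should report “survived”, while the same failure with no backup reports “outage”. The second case is the single point of failure you are looking for.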
Monitors and alerts
Consider whether you have effective monitoring processes in place. If you can automate monitoring, it runs in the background and alerts you when something isn’t right. Catching problems early makes a huge difference to how much damage they cause.
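One common form of automated monitoring is a heartbeat check: each system reports in regularly, and anything that goes quiet for too long triggers an alert. The sketch below shows only the detection step, under the assumption that you already record the time each system was last seen. The `stale_systems` function and the five-minute default are illustrative choices.

```python
from datetime import datetime, timedelta


def stale_systems(last_seen, now, max_silence=timedelta(minutes=5)):
    """Return the systems whose last heartbeat is older than `max_silence`.

    `last_seen` maps a system name to the datetime of its latest heartbeat.
    `now` is passed in explicitly so the check is easy to test.
    """
    return sorted(
        name for name, seen in last_seen.items() if now - seen > max_silence
    )
```

A scheduled job could call this every minute and send an email or chat message for each name it returns. How the alert is delivered is left out here because it depends on the tools you already use.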
Whether you do it in-house or use a third party, regularly assess how your system might be vulnerable. Things change over time, so a robust system in January might be at risk by July. Newly implemented processes might introduce conflicts, for example.
Invest in training
Nobody in an IT team should be irreplaceable. And everyone should keep up to date with new developments. Prioritising this strengthens your armour. You might want to cross-train employees too. Commit to this investment and know it will serve you well over time.
Have a disaster recovery plan
Should the worst happen, know how you’d handle it. Who would be responsible? What would your plan B look like?
It’s useful to document your disaster recovery plan for all critical systems. Some actions might be straightforward whereas others could be complex. By creating a documented plan, you all have something to focus on should disaster strike. It could save you days of downtime and significant reputational damage.
Nobody wants to experience a single point of failure. And yet, even the largest organisations have suffered due to software weaknesses. The risk is real. Don’t overlook the importance of allocating resources and budget to help manage and mitigate your SPOF risks. And keep doing this over time. It’s not a one-off job.
Never considered doing this before? Start today. Introduce some simple steps such as identifying your critical systems and whether they have redundancies.
Should you want to discuss what you discover, we’d be happy to help. Please get in touch for an informal chat about your business software.