Carnies get nervous when an accident happens, because they say that “Bad luck happens in threes.” Even if you’ve killed someone whose first name is Darth, there’s always another Sith Lord running around. When you first see an ant crawling across your kitchen floor, you know there are thousands more, lurking unseen. The darkening of the horizon as storm clouds appear foreshadows rain. The small tip of an iceberg peeking above the water hints at a much larger threat below the surface.
These leading-edge moments of discovery are first encounters with things yet to come. When troubleshooting, we also have these initial contacts with circumstances that may foretell larger disasters. As my troubleshooting skills grew, I became increasingly skeptical of “one-off” breakdowns. There was usually more to the story. If you’re curious and tuned into what’s going on, you may be able to mitigate an impending disaster.
This section isn’t just for professional troubleshooters who work in high-risk industries (like nuclear or petroleum), because machine-related “disasters” can be lurking anywhere in your life. If the brakes failed on your car or your house caught fire because of a missed warning sign, it might not be featured on the front page of The New York Times, but it would still be a big deal to you!
Look For The Signs
After something has been successfully repaired, it’s a natural instinct to take it easy. The excitement is over, so you grab a cold one, pat yourself on the back, and put your feet up on the desk. But, you may want to reconsider that habit. That’s because large disasters often start out with small, “normal” failures. Here’s some insight from the Deepwater Horizon accident, the largest marine oil spill in the history of the petroleum industry:
About seven hours before the Gulf of Mexico oil well blowout of 2010, a group of four company VIPs helicoptered onto the drilling rig in question, the Deepwater Horizon. They had come on a “management visibility tour” and were actively touring the rig when disaster struck.
There were several indications in the hours before the blowout that the well was not under control, in fact that it was “flowing”, that is, that oil and gas were forcing their way upwards from several kilometers below the sea floor. These indicators were all either missed or misinterpreted by the rig staff. The touring VIPs, two from BP and two from the rig owner, Transocean, had all worked as drilling engineers or rig managers in the past and had a detailed knowledge of drilling operations. Had they focused their attention on what was happening with the well, they would almost certainly have recognized the warning signs for what they were, and called a halt to operations. But their attention was focused elsewhere, and an opportunity to avert disaster was lost.
Andrew Hopkins, “Management Walk-Arounds: Lessons from the Gulf of Mexico Oil Well Blowout”
Maybe you’re thinking, “Wait a second, how can this be practical advice? There are so many little things that go wrong in the course of a day. Not all of them lead to deadly disasters!”
You’re right, not everything that goes wrong is the harbinger of a catastrophe. If you’re in charge of an operation that carries the risk of a major incident, the likes of which would show up on the evening news (I’m talking about sugar mills, oil rigs, nuclear power plants, etc.), you need to be engaged with the field of process safety. Any industry with significant risks of a major human or environmental disaster should have safety as a core part of its culture, woven into every process. That’s beyond the scope of this work, so I invite you to do your own reading in this area.
Instead, I want to make the connection with you in the role of a troubleshooter. Because catastrophes frequently begin with small breakdowns, they’re exactly the kind of things that front-line troubleshooters will be first on the scene to investigate. There are certain kinds of failures which should prompt you to dig deeper, as explained in the book Recognizing Catastrophic Incident Warning Signs:
There are many types of warning signs [for catastrophic incidents], including the following:
- Early indicators of failure that provide opportunities to take appropriate action. Process equipment that is not functioning properly may be prone to failure. Organizations sometimes ignore these on the basis that they will address the issues later (or if the problem escalates).
- Suggestions that a major incident may be imminent. An example might include a piece of process equipment reaching its end-of-cycle or retirement limit.
- Indicators that are less obvious and require detailed analysis. For this reason, a practical follow-up option is to conduct an audit to help ensure that programs and systems are managed effectively.
- Seemingly insignificant issues that, when combined with other warning signs, suggest a breakdown of management systems.
- Actual incidents with measurable consequences. If we ignore these, they can increase in frequency or magnitude and contribute to a catastrophic incident.
Center for Chemical Process Safety, “Recognizing Catastrophic Incident Warning Signs in the Process Industries”
I’ll add a few more items to this list of warning signs:
- New failure conditions: whenever something breaks down in a way that you haven’t encountered before, watch out.
- Malfunctions that span multiple systems: problems that strike in multiple places simultaneously are a big red flag for a coming catastrophe.
- Weird stuff: intermittent problems, issues that seem to magically resolve themselves (i.e., you investigate and everything seems to be fine), failures that are difficult to quantify because of conflicting or incomplete reports, etc.
That little bit of ice that sticks above the water is the phone call from a field technician about an unexplained interruption at a customer site. It’s a couple of identical bug reports from your best clients. It’s the “check engine” light coming on in your car. It’s that “something weird” you’ve never seen before in the plant.
Perhaps a “major incident” in your line of work might not be the end of the world, but it might inconvenience customers, put you out of business, or cause injury. Smaller organizations may have the same operational risks as large ones, but few or no staff dedicated to “process safety.” In these cases, those called on to troubleshoot may be the only ones with the visibility to take that extra step and flag a situation for further review, before it gets out of hand. Awareness is key: I just want to plant the seed in your mind that you can ask, “Is this part of something much bigger?”
Stay Awake A Little Longer And Run The Numbers
Whenever you see something that matches the “warning signs” list above, it’s time to take action. When I was in charge, I always liked to reserve extra time after troubleshooting for a brief audit of our infrastructure: a quick once-over to make sure everything was in order.
Easy-to-access historical records and automated monitoring are extremely helpful in this regard and can help put a suspicious failure in context. If you’re only working with external symptoms and have no way to know if the rest of your infrastructure is okay, you’re flying blind. A malfunction could be a one-off, or it could be a bellwether of terrible things to come. If you have good data, you have a shot at telling the difference.
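To make “running the numbers” a little more concrete, here is a minimal sketch of how a failure log might be checked against a few of the warning signs listed earlier. Everything in it is an assumption made for illustration: the `Failure` record, the `warning_signs` function, the system and symptom names, and the 24-hour window are all hypothetical, and your own records will almost certainly look different.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Failure:
    """One entry in a hypothetical failure log."""
    when: datetime
    system: str   # e.g. "pump-3" (made-up name)
    symptom: str  # short, normalized description, e.g. "pressure drop"


def warning_signs(new, history, window=timedelta(hours=24)):
    """Return the warning signs a new failure matches, given past failures.

    The 24-hour default window is an arbitrary assumption; tune it to
    whatever "at about the same time" means for your equipment.
    """
    signs = []

    # New failure condition: this symptom has never been logged before.
    if all(f.symptom != new.symptom for f in history):
        signs.append("new failure condition")

    # Spans multiple systems: a different system failed shortly before this one.
    recent = [f for f in history if timedelta(0) <= new.when - f.when <= window]
    if any(f.system != new.system for f in recent):
        signs.append("spans multiple systems")

    # Repeat failure: the same symptom on the same system has been seen before,
    # which argues against writing it off as a one-off.
    if any(f.system == new.system and f.symptom == new.symptom for f in history):
        signs.append("repeat failure")

    return signs


history = [
    Failure(datetime(2024, 1, 1, 9, 0), "pump-3", "pressure drop"),
    Failure(datetime(2024, 1, 5, 14, 0), "valve-7", "stuck open"),
]
new = Failure(datetime(2024, 1, 5, 15, 30), "pump-3", "pressure drop")
print(warning_signs(new, history))  # ['spans multiple systems', 'repeat failure']
```

An empty result doesn’t prove a failure was harmless; it only means the records on hand don’t contradict the “one-off” story. The “weird stuff” category is deliberately left out, since intermittent or conflicting reports usually need a human to judge.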
- Andrew Hopkins, “Management Walk-Arounds: Lessons from the Gulf of Mexico Oil Well Blowout,” February, 2011, pg. 3.
- Center for Chemical Process Safety (CCPS), Recognizing Catastrophic Incident Warning Signs in the Process Industries. (Hoboken: John Wiley & Sons, 2011), pg. 2.