One thing I really hate is making the same mistake twice. In my tenure as a CTO, I always wanted to move forward, never making the same error again. I tried to clear the way for our team to make new mistakes.
Some failures happen over and over again. As you gain experience with a machine, you’ll begin to see all the different ways it can break down. As discussed in “Same Symptom, Different Causes,” there will be times when a machine appears to be broken in the same way as before, but actually has a completely different underlying cause (and associated fix). Figuring this out can be costly, so let me introduce a way of preserving and communicating your hard-won troubleshooting knowledge: a troubleshooting tree.
A troubleshooting tree is a formal description of the troubleshooting process for a particular problem. The tree walks you through a rehearsed fix-it routine, with branches along the way where you stop, gather information, and make choices about which way to proceed. Let’s look at an example:
In this excerpt, you can see standard operating procedures, symptoms of trouble, and the associated possibilities for remedying them. For example, if the locomotive’s engine won’t start (“Engine Starting → If Defective”), the tree has four branches leading to different solutions:
- Weak battery → Check → Battery
- Start contactors don’t close → Check → Interlocks
- Faulty fuel supply → Check → Overspeed trip; Clogged strainer or filter; Relief valves stuck; Shutdown valve.
- Uneven speed engine hunts → Check → Oil level in governor.
Looking over this locomotive troubleshooting tree, you can see some symptoms have been identified as the result of a single cause (e.g., “Start contactors don’t close → Check → Interlocks”). Others, like the engine not starting, can have multiple causes, which need to be serially checked until the problem is remedied. Some of the fixes in the tree are very quick (e.g., flipping a switch), while others refer to detailed procedures elsewhere in the guide that might take several pages to describe.
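If you want to keep a tree like this somewhere software can search it, one simple option is a mapping from each symptom to an ordered list of checks. The Python sketch below transcribes the four branches listed above. The names `ENGINE_STARTING_DEFECTS`, `checks_for`, and `first_fault` are hypothetical, and `is_faulty` stands in for whatever inspection the technician actually performs.

```python
# Symptom -> ordered list of things to check, transcribed from the excerpt.
ENGINE_STARTING_DEFECTS = {
    "Weak battery": ["Battery"],
    "Start contactors don't close": ["Interlocks"],
    "Faulty fuel supply": [
        "Overspeed trip",
        "Clogged strainer or filter",
        "Relief valves stuck",
        "Shutdown valve",
    ],
    "Uneven speed engine hunts": ["Oil level in governor"],
}


def checks_for(symptom):
    """Return the checks listed for a symptom, in the order to try them."""
    return ENGINE_STARTING_DEFECTS.get(symptom, [])


def first_fault(symptom, is_faulty):
    """Work through the checks one at a time; return the first faulty item, else None."""
    for check in checks_for(symptom):
        if is_faulty(check):
            return check
    return None


# Example: pretend the technician finds the relief valves stuck.
found = first_fault("Faulty fuel supply", lambda item: item == "Relief valves stuck")
print(found)
```

A symptom with a single cause is just a list with one entry, so the same lookup handles both kinds of branch.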
For the operator, information like this is solid gold. The time savings from a guide like this can be immense: even just one branch on this tree might have taken days of trial and error to figure out!
Planting The Tree
When planting a new tree in your backyard, there are certain things you must make sure are present: the right soil, sunlight, water, and climate. Likewise, there are things you should make available to the reader in order for a troubleshooting tree to be useful:
- Necessary operating conditions, useful tools, easily missed assumptions, and setup requirements.
- The path of investigation, with all the likely possibilities for each decision node explained.
In any kind of documentation you write, you must specify the conditions under which your methods are valid. Should a test be performed with a warm or cold engine? Is a certain software version required? Are there specific tools or resources that must be on hand to perform the operation as described? S-p-e-l-l it out, because if you don’t, you can be sure that someone, somewhere will misunderstand what’s needed. Also, don’t count on “normal” conditions to exist. After all, the fact that the reader of your documentation is troubleshooting is a strong indication that the conditions present are not normal!
Finally, when it comes to any kind of test, remember to include the full range of possibilities in your decision nodes (unless, of course, it violates the Laws of Physics or the possibility is otherwise covered in your prerequisites). If you’ve asked the troubleshooter to take a reading from a meter that reads from 0-100, you shouldn’t just list decision arrows for the ranges of “81-90” and “91-100”. What if the meter is reading “26”?
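The meter example can be checked mechanically. Here is a minimal Python sketch, assuming the meter reports whole numbers and each decision arrow is written as an inclusive `(start, end)` range. The function names `uncovered_readings` and `as_ranges` are made up for illustration.

```python
def uncovered_readings(branches, low=0, high=100):
    """Return the whole-number readings in [low, high] that no branch handles.

    branches: list of inclusive (start, end) ranges, one per decision arrow.
    """
    covered = set()
    for start, end in branches:
        covered.update(range(start, end + 1))
    return [r for r in range(low, high + 1) if r not in covered]


def as_ranges(readings):
    """Collapse a sorted list of integers into inclusive (start, end) ranges."""
    gaps = []
    for r in readings:
        if gaps and r == gaps[-1][1] + 1:
            gaps[-1] = (gaps[-1][0], r)  # extend the current gap
        else:
            gaps.append((r, r))          # start a new gap
    return gaps


# The two arrows from the example leave everything from 0 to 80 unhandled.
arrows = [(81, 90), (91, 100)]
print(as_ranges(uncovered_readings(arrows)))
```

Running a check like this before publishing a tree will flag the reading of “26” long before a technician in the field does.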
After you’ve described the prerequisites, you can start to create the troubleshooting tree itself. Lead the reader from start to finish, passing through the points needed to make the diagnosis and then finally to remedial actions:
The “boxes and lines” layout is optional: you could represent the same information using words in a list. However, the visual structure of connected boxes makes it easier to understand and follow a particular troubleshooting path.
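To show that the two forms carry the same information, here is a small Python sketch that stores a tree as nested dictionaries and prints it as an indented list. The tree contents are a guess at a one-question Whoozy-Whatzit tree (the exact question wording is an assumption), and `as_list` is a hypothetical helper, not part of any library.

```python
# Hypothetical one-question tree; the question wording is an assumption.
TREE = {
    "question": "Is the Flum-flumpeter Switch ON?",
    "branches": {
        "no": "Turn the Flum-flumpeter Switch to ON",
    },
}


def as_list(node, depth=0):
    """Return the tree as indented list lines: questions, answers, and remedies."""
    indent = "  " * depth
    if isinstance(node, str):  # a plain string is a leaf, i.e. a remedy
        return [f"{indent}- Remedy: {node}"]
    lines = [f"{indent}- {node['question']}"]
    for answer, child in node["branches"].items():
        lines.append(f"{indent}  - If {answer}:")
        lines.extend(as_list(child, depth + 2))
    return lines


print("\n".join(as_list(TREE)))
```

Storing the tree as data also means the boxes-and-lines drawing and the list version can both be generated from one source, so they never drift apart.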
Troubleshooting trees, like real trees, are living and evolving organisms. Whenever a new cause is discovered, be sure to add a branch to your troubleshooting tree. Let’s return to our fictional Whoozy-Whatzit example above. After deploying the troubleshooting tree to technicians, it’s determined that the proposed remedy (i.e., turning the Flum-flumpeter Switch to “ON”) doesn’t always fix the problem. In fact, a new cause for the same symptom is discovered. When that happens, you need to grow your troubleshooting tree:
Ideally, you’d have a recommended course of action for every possibility (i.e., for every “yes” branch there would be a “no” branch and vice versa). In this example, you can see that the reader is left hanging if the Flum-flumpeter Switch is “ON” and the Whoozy-Whatzit is in fact plugged in: there’s a missing branch for “yes” off the “Is the Whoozy-Whatzit plugged in?” decision node. What then?
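Dead ends like this are easy to find automatically once the tree is stored as data. The Python sketch below encodes the grown tree as nested dictionaries and lists every decision node that lacks a “yes” or “no” branch. The encoding is an assumption based on the description above: in particular, the “Plug in the Whoozy-Whatzit” remedy and the name `missing_branches` are invented for illustration.

```python
# Hypothetical encoding of the grown tree; the "Plug in" remedy is an assumption.
TREE = {
    "question": "Is the Flum-flumpeter Switch ON?",
    "branches": {
        "no": "Turn the Flum-flumpeter Switch to ON",
        "yes": {
            "question": "Is the Whoozy-Whatzit plugged in?",
            "branches": {"no": "Plug in the Whoozy-Whatzit"},
        },
    },
}


def missing_branches(node, expected=("yes", "no")):
    """Return (question, answer) pairs for each expected answer that has no branch."""
    if isinstance(node, str):  # a remedy has no branches to check
        return []
    gaps = [(node["question"], a) for a in expected if a not in node["branches"]]
    for child in node["branches"].values():
        gaps.extend(missing_branches(child, expected))
    return gaps


for question, answer in missing_branches(TREE):
    print(f'No "{answer}" branch for: {question}')
```

A report like this does not tell you what the missing remedy should be, only where the reader will get stuck.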
Don’t feel like you have to cover every possibility to create effective documentation. While every branch you could take might be theoretically possible, there are a variety of reasons why you might choose not to include one:
- The probability of the failure is low (compared to other failures) or has never been observed.
- A desire to keep the documentation brief: maybe you only get a page or two for troubleshooting information in the back of a product manual, and so you can only cover the most common failure types.
- The cost of researching a failure condition is prohibitive or would take too long.
- The information is geared towards a certain user group that is not likely to encounter the failure condition. For example, maybe you have a product that is installed in airplanes and boats, and so you have separate troubleshooting trees for mariners and aviators.
From An Acorn To A Mighty Oak
If you are vigilant about updating it, over time a troubleshooting tree will represent the sum of all the knowledge you have for a particular set of symptoms and causes. Back to the point made at the beginning: always be moving forward with your troubleshooting knowledge. If you’ve already solved the problem and haven’t documented it, you’re blowing it!
Creating and maintaining documentation is a necessary step for growing from “small” to “large.” Freeing information in this way allows an entire team to support your systems. Around Discovery Mining, we used to joke about our “bus number.” This is a morbid (and hilarious!) way to think about redundancy: how many people on your team could get plowed under by local public transit, and life would still go on? If the answer is nobody, you have a “bus number” of zero. Of course, people step away from their responsibilities for less dramatic reasons. They get sick, retire, receive a promotion, go on vacation, or leave to take a job somewhere else. Speaking of teams, documentation also allows you to introduce specialization by breaking your team into those who find the problems and those who fix them. Better yet, you can push your documentation all the way out to your customers and have them solve their own problems!
Even if you’re troubleshooting alone with no team to support you, documentation like troubleshooting trees can be extremely useful. If you have to maintain a lot of systems, memorizing all the possible symptoms and remedies might be impossible. Also, documentation is a calming influence in a crisis. High-pressure situations favor the simplicity of following something like a troubleshooting tree over having to figure it all out again.