After you’ve fixed it, be sure to learn from the experience and make improvements for the future.
- Is This Normal?: an ode to data collection.
- Zen And The Art of Routine Maintenance: eliminate trouble before it arises.
- Storm’s A-comin’: averting disasters.
- Troubleshooting Trees: a method to document and communicate your hard-won repair knowledge.
- Is It Really Fixed?: how to know when you’ve nailed it for good.
- Down To The Roots: using root cause analysis and the 5 Whys method to prevent failures from recurring.
- Moral Authority: using the emotion of a disaster to make meaningful changes.
- Making A List, Checking It Off: the checklist is a simple, yet powerful, way to guide the fix-it process. As a means to prevent trouble, it puts the best way of doing something in the hands of machine operators and designers.
- Failure Most Foul: the unfortunate reality of failures caused by fraud and sabotage.
- Release The Chaos Monkeys: learning important things about machines through stress testing and seeing them “at their worst.”
- You’re Not Done Until You Tell Someone Else: the final step is communicating what was learned.
- The Boy Who Cried Wolf: improperly tuned alerting systems will eventually be ignored.
- Found Objects: my Dad finds a mysterious object in the driveway…
- Did It Ever Work?: questioning a machine’s past performance.
- On Selfies and Showboating: Troubleshooting the Imminent Dangers of “Look at me!”: social causes of accidents, with strategies for prevention.
Why You Should Go Beyond Troubleshooting And “Clean Up”
Troubleshooting is a reactive response to a failure. The cause and solution may be unknown, but the strategies are designed to provide the quickest path to a resolution. If all you did was use the strategies and thereby gain reactive skills, it would be an improvement to your life. However, as you grow, your gaze will turn to the proactive side of troubleshooting. That’s the focus of the “cleaning up” material presented here.
My concept of troubleshooting continues after the crisis has passed and whatever was broken is fixed. The good dentist doesn’t want to find cavities in his patients’ mouths; the good firefighter would prefer to never see a house engulfed in flames. Likewise, the good troubleshooter would rather not have a system failure lead to a crisis (if we’re talking about what he or she would rather do, it would likely involve a beachside view and a fruity drink with a little umbrella). Yes, the virtues and strategies can help you be the hero when needed. But better yet is to never need to be a hero at all.
To achieve this ideal, you will be vigilant about learning from failures (both human and machine) and feeding that information back into your processes and procedures. You will use the moral authority from the aftermath of a crisis to make needed changes. You will collect data so that you understand what is happening at every level of your systems and infrastructure. You will probe deeply into breakdowns to understand the root cause. You will anticipate failures, freeing up spare resources and creating procedures so that you are prepared for meltdowns. In short, you will transcend mere troubleshooting.
The argument for the reactive side of troubleshooting is self-evident: something is broken and needs to be fixed. The proactive side requires greater advocacy because its benefits are harder to see and slower to appear. These rewards require delayed gratification: making investments that may be slow to bear fruit, taking action based on incomplete information, putting things in order before they exist, not taking action, considering alternatives. Is all this starting to sound a little philosophical?
Know the reactive, but keep to the proactive.