You’ll recall that the standard troubleshooting exercise has two steps:
- Identifying the problem
- Executing the fix
This simple problem-solving model is useful because a mature understanding of these two steps leads to the possibility of forgoing either one.
Less and less do you need to force things,
until finally you arrive at non-action.
When nothing is done,
nothing is left undone.
Tao Te Ching (Verse 48)
Learning to refrain from harmful or inefficient action is a difficult skill to cultivate, but it separates the amateur from the Troubleshooter. Not every malfunction can be fixed with a wrench, nor with money, and there are conundrums which are best left alone and worked around instead.
When it comes to the value of non-action, I’ve discussed several strategies that allow you to be ignorant of a malfunction’s specific cause, effectively skipping step #1. Many solutions do not require that you understand the why behind the failure. Under this banner we include strategies like the restart and “shotgunning” (replacing several components at once). While the restart often works, when it’s chosen in lieu of an investigation, the technique illuminates little about the origin of a problem. In effect, it’s a substitution that favors the pragmatism of getting back to work over furthering your understanding.
Likewise, fixing something is also optional: always keep in mind that pursuing step #2 is not required. Knowledge gained during the problem discovery phase may show that a repair would be too slow, too costly, or perhaps even impossible. These scenarios provide the rationale for choosing to replace a failed machine instead of fixing it. Or, even better, you discover a workaround or optimization that completely obsoletes the failed component. That is, neither repairing nor replacing, but instead reconfiguring a particular workflow to do without.
With these concepts in mind, let’s go all the way down the path of blissful ignorance. Within the context of repair, starting over is the strategy that forgoes costly knowledge in favor of expediency. Let’s examine its costs and benefits in both the digital and analog realms.
The Price Of Knowledge
I’ve previously discussed repair as the movement from the current broken state back to an ideal operational one. However, when making this transition, troubleshooters often rely on the natural advantage conferred by the nearly-working state of a recently broken machine. If a car stopped running just yesterday, or you were able to send an email 5 minutes ago (but not now), then it’s likely that you are looking for just a single problem to fix.
In my experience, this is the core of troubleshooting: looking for that lone thing that is preventing a machine from working. The nature of the almost working machine leads to a bias for minimalism in your repairs, and also to the key principle of Change Just One Thing At A Time.
While repair may be an attempt to restore a machine back to the manufacturer’s ideal, we typically stop work much sooner than that, choosing instead to focus on whether the machine can continue doing useful work. The alternative would be costly madness: to be absolutely sure that a machine conformed to the ideal, you would need to check every single component for correctness.
For example: if your car won’t start, and replacing the battery makes it run again, you don’t then check every screw, spark plug, fuse, hose, and weld. Although we may desire to know that everything else is in perfect working order, the price of that knowledge is simply too high. As long as a machine is meeting our needs, we’re often quite happy to infer that the majority of the components are okay.
The Domestic Deleter
The digital world presents a different set of tools for ensuring the correctness of a system, allowing us to achieve a perfection that would be costly and repetitive in the analog world. It also means that starting over can be a viable problem-solving recipe.
To illustrate, let me tell you a story from the trenches of IT. While I was looking for a job after I graduated from college, I did some freelance consulting. I worked with a number of small businesses, which led to referrals to help people out with their home computers. As I was invited into these people’s homes in the evening, these jobs were always very interesting. During these house calls, not only did I get a chance to hone my tech chops, but I also got an interesting “slice of life” view into the local community.
One client I worked with had that perfect blend of frustration and comedy that made you laugh while shaking your fist at the heavens. She also made me question aspirations I had about turning my little consulting business into a full-time career; after this episode, I kicked my job search into high gear. At first glance, this woman’s to-do list for me was nothing special: installing some software, troubleshooting her Internet connection (dial-up!), etc. However, something about her computer was…strange. I hacked away in vain for a while and then sat her down for a little Q&A. We had the following conversation, which I was not prepared for:
Me: “I don’t understand what’s going on with your computer. It’s very unstable and I think files are missing…”
Client: “Well, whenever I come across a file and I don’t know what it does, I delete it.”
Me: “What?! That must happen a lot.”
Client: “Not really, I’m quite busy and don’t have time to look in every folder.”
Me: “For your computer’s sake, that’s a relief.”
I then calmly explained why her “deleting what I don’t understand” strategy was problematic. Convinced by my air-tight logic (and exasperated tone), she promised to change her eradicating ways.
But what was I supposed to do now? What exactly had this lady deleted? She had been at it for a while and couldn’t remember which folders she had touched. The default installation of a modern operating system can include hundreds of thousands of files. Applications can easily contribute an equal number. I knew the work required to figure out precisely which files were missing would be beyond tedious (yes, I checked the “Recycle Bin,” but she was thorough in her efforts). There was only one good solution, and that was to start fresh. So, I convinced her to let me reinstall Windows from scratch.
Reinstalling an operating system may be time-consuming, but it requires little effort: the installer does all the hard work for you. Beyond supplying some basic information like your language, location, and keyboard type, it takes care of the rest, making sure that all those files end up in the right place. I awkwardly watched TV with her husband and the family dog in the living room while waiting for the install to finish.
Husband: “So, you want a beer?”
Me: “Oh no…”
Husband: (shouting to the next room) “Honey, can he have a beer?!”
Client: (shouting back) “No, he’s working on my computer!”
Me: “That’s okay, I’m fine.”
Welcome to the glamorous world of IT.
Levels Of Rebuilding
If you’re having a problem with your refrigerator, and don’t know what’s wrong, it would be folly to replace every part. It might fix the problem, but pursuing this strategy would come at a terrible price in terms of time and materials. In the digital domain, however, the costs are different. Replacing everything might be expensive for a refrigerator, but this is exactly what reinstalling software accomplishes. The nature of ones and zeros, which can be copied perfectly and automatically with ease, means you can rebuild something from scratch with just a keystroke.
In the abstract, when is starting over a good option for repair? We need the following conditions to be present:
- The cost of problem discovery is high.
- The machine’s model is easy to replicate or to enforce correctness upon.
You can see that my situation with the woman who liked to delete files fit this mold perfectly. Manually figuring out which files were missing would have been costly, boring, and time-consuming. I wasn’t about to move in and become a regular on their couch, even if I did get a say about what we watched on TV and the privilege of having a beer! At the same time, Windows’ automated installer ensured that the operating system could be restored perfectly and with little effort, fulfilling the second requirement.
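To make the second condition concrete, here is a minimal Python sketch of what “enforcing correctness” can look like when a known-good manifest is available. The manifest (a mapping of relative paths to SHA-256 hashes) and the function names `sha256_of` and `audit` are assumptions for illustration; nothing like this was on hand during the house call, and a real operating system installer does far more than this.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def audit(root: Path, manifest: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare the files under `root` against a known-good manifest.

    `manifest` maps relative file paths to their expected SHA-256 digests
    (a hypothetical format chosen for this sketch).
    """
    missing: List[str] = []
    altered: List[str] = []
    for rel_path, expected in sorted(manifest.items()):
        target = root / rel_path
        if not target.is_file():
            missing.append(rel_path)   # the file was deleted
        elif sha256_of(target) != expected:
            altered.append(rel_path)   # the file exists but has changed
    return {"missing": missing, "altered": altered}
```

With a trustworthy manifest, discovery is cheap and a targeted repair becomes possible. Without one, as in the story above, the installer’s ability to lay down every file correctly is what makes starting over the better bargain.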
Examples in the analog world are harder to come by: that is, repair situations that favor wholesale rebuilding over diagnosis. A Formula One racing team may completely rebuild an engine after only 500 miles of driving, but that’s done in the name of preventative maintenance. Purely as a troubleshooting strategy, rebuilding an entire physical machine from scratch is typically not cost effective because buying a replacement or skillfully isolating a problem using logic will usually win the cost battle. The reason why gets back to the economics of troubleshooting: repair labor is simply not as efficient as factory labor.
What is relevant to both the digital and analog worlds is that you have a decision to make about which level you will choose to rebuild. Using a malfunctioning computer as an example, you could start over at any of these various levels:
- Hardware
- Filesystem
- Operating system
- Applications
- Configurations
In terms of the extent of the rebuild, these are listed in order of most-to-least, with each level being inclusive of the next. If you reformat your disk (filesystem level), then you must also reinstall the operating system, plus applications, which subsequently must be configured, etc. I’ve done rebuilds at each of these levels, depending on the situation.
To show you that this concept is universal, we could also make the same type of list for rebuilding a house. Imagine a home that was damaged, perhaps in a storm or fire. You would also face the same dilemma about which level to begin rebuilding:
- Footings
- Foundation
- Framing
- Plumbing/electrical/HVAC
- Finish work (drywall, trim, etc.)
Just like in the computer example, going from top to bottom, each of these levels is inclusive of the ones below. Rebuilding from the footings would imply that everything else would also need to be restored as well: the foundation, framing, plumbing/electrical/HVAC, etc.
Because the scope of the rebuild expands as you go up through these levels, the cost-effective way to proceed is to see if rebuilding at the smallest level will meet your needs. In our computer example, that would be to consider the list in reverse order: Configurations → Applications → Operating System → Filesystem → Hardware. Indeed, experience confirms this tack: refreshing configuration files is often a quick fix for application-level problems!
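The smallest-level-first approach can be sketched as a simple escalation loop. This is an illustrative Python sketch, not a real tool: `rebuild` and `is_working` are hypothetical callbacks that you would supply, and `simulate` is a toy stand-in for a machine whose fault lives at one particular level.

```python
from typing import Callable, Optional, Sequence, Tuple

# Levels ordered from smallest rebuild scope to largest, as in the text.
REBUILD_LEVELS = (
    "configurations",
    "applications",
    "operating system",
    "filesystem",
    "hardware",
)


def smallest_sufficient_rebuild(
    levels: Sequence[str],
    rebuild: Callable[[str], None],
    is_working: Callable[[], bool],
) -> Optional[str]:
    """Rebuild at each level in turn, stopping at the first one that works."""
    for level in levels:
        rebuild(level)
        if is_working():
            return level
    return None  # even the largest rebuild did not help


def simulate(fault_level: str) -> Tuple[Callable[[str], None], Callable[[], bool]]:
    """Toy machine: a rebuild fixes it only if its scope reaches the fault."""
    state = {"fixed": False}

    def rebuild(level: str) -> None:
        # A rebuild at one level also restores every level smaller in scope.
        if REBUILD_LEVELS.index(level) >= REBUILD_LEVELS.index(fault_level):
            state["fixed"] = True

    def is_working() -> bool:
        return state["fixed"]

    return rebuild, is_working


rebuild, is_working = simulate("operating system")
print(smallest_sufficient_rebuild(REBUILD_LEVELS, rebuild, is_working))
```

The loop pays for the cheap rebuilds first and only escalates when they fail, which mirrors the cost argument above: each step up the list widens the scope of work that must be redone.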
Clean-Sheet Innovation And Rebuilding…Trust
There’s a fine line between engineering and troubleshooting, so a rebuild can also be an invitation to rethink a machine’s design. Especially if you’re working on a custom-built system, a rebuild is an opportunity to make improvements that have been discovered after a machine has been in use. This happened all the time in my software career: a decision to remake a faulty or bottlenecked component opened up the door to big breakthroughs. Rebuilding is a great way to free yourself from the constraints of prior decisions, which were made in a different context that likely has changed.
Finally, there’s one more aspect of rebuilding that transcends optimization. In my tenure as CTO, I managed many incidents affecting security: from virus infections on employee computers, all the way to a malicious hacker intrusion that nearly bankrupted the company. In these cases, rebuilding went beyond an advantageous troubleshooting strategy that leveraged the automated correctness of software. Instead, it was a vital bridge that got us back to a system we could trust.
- Lao Tzu and Stephen Mitchell, Tao Te Ching: An Illustrated Journey (New York: HarperCollins, 1999).