Any device that is configurable is vulnerable to errors caused by its settings being misconfigured. Did a switch or bit inadvertently get flipped, resulting in a malfunction? On a machine with many options, this might be hard to figure out. Therefore, a crude but useful way to determine if the machine is functional is to restore the default settings. On digital devices, this is usually very easy: you select one option and BAM! The device reverts to how it came from the factory on the day you bought it. Mechanical devices also have “default settings”, but you might have to look in the manual to find them (i.e., there may be no magic switch that restores them like on a computer).
Once you’ve verified that the machine works with the default settings, you can begin to change the configuration back to your desired settings. If you do this using my “change-just-one-thing-at-a-time” philosophy, you should be able to identify the particular setting that is causing your machine to malfunction.
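To make the one-at-a-time idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the settings are a plain dictionary, and `works` is a hypothetical test you supply that reports whether the machine functions with a given configuration.

```python
from typing import Callable, Dict, Optional

Config = Dict[str, object]


def find_bad_setting(defaults: Config, desired: Config,
                     works: Callable[[Config], bool]) -> Optional[str]:
    """Reapply desired settings one at a time, starting from the defaults.

    Returns the name of the first setting whose change makes the machine
    fail, or None if every desired setting can be applied without trouble.
    """
    config = dict(defaults)  # work on a copy so the defaults stay untouched
    if not works(config):
        # Failing on defaults suggests the cause lies outside the configuration.
        raise RuntimeError("machine fails even with the default settings")
    for name, value in desired.items():
        if config.get(name) == value:
            continue  # nothing changes for this setting, so nothing to test
        config[name] = value  # change just one thing...
        if not works(config):  # ...then test before touching anything else
            return name
    return None


# Hypothetical example: a device that breaks when duplex is forced to "half".
defaults = {"mtu": 1500, "duplex": "auto", "vlan": 1}
desired = {"mtu": 9000, "duplex": "half", "vlan": 1}
culprit = find_bad_setting(defaults, desired, lambda c: c["duplex"] != "half")
print(culprit)
```

Note that this only catches the first setting that breaks things on top of the ones already applied; a failure that needs two settings in combination will be blamed on whichever of the pair was applied second.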
Alternatively, you may have created your own “default settings”: think of a network router that has been configured to work on your network (with your IP addresses, netmasks, routing table entries, etc.). A scheme like this usually means these settings are automatically loaded when the device starts up. Again, these could have been inadvertently changed, so restoring the machine to your desired settings (i.e., your own personal “defaults”) will at least eliminate the possibility that a configuration problem is the cause.
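Before restoring your personal defaults, it can help to see what actually drifted. The sketch below assumes you can export both the saved baseline and the running configuration as simple key/value pairs; the setting names and addresses are made up for illustration.

```python
from typing import Dict, Tuple

UNSET = "<unset>"  # placeholder for a setting missing from one side


def config_drift(baseline: Dict[str, str],
                 running: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {setting: (baseline value, running value)} for every difference."""
    drift = {}
    for name in sorted(set(baseline) | set(running)):
        want = baseline.get(name, UNSET)
        have = running.get(name, UNSET)
        if want != have:
            drift[name] = (want, have)
    return drift


# Hypothetical router settings: the netmask was changed, the gateway was
# lost, and a DNS entry appeared that is not in the saved baseline.
baseline = {"ip": "192.168.1.1", "netmask": "255.255.255.0",
            "gateway": "192.168.1.254"}
running = {"ip": "192.168.1.1", "netmask": "255.255.0.0",
           "dns": "8.8.8.8"}

for name, (want, have) in config_drift(baseline, running).items():
    print(f"{name}: baseline={want} running={have}")
```

An empty result does not prove the device is healthy, but it does let you cross configuration drift off the list of suspects.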
Off And On Again
Turn it off. Turn it back on. Whether known as rebooting, power cycling or restarting, this is such a simple yet powerful troubleshooting trick that it deserves its own section. Actually, given how often it works, a shrine would be a more fitting tribute.
Even as my troubleshooting skills have grown over the years, the reboot technique still has its place among my top strategies. Since I’m really into troubleshooting theory, sometimes I want to make things more complicated than they really are. But turning it off and then turning it on again has solved so many problems in so many different contexts that you always have to ask yourself: “Why am I not using this strategy?”
But Why Does It Work?
How the on/off switch returns a machine to normal brings together several concepts that we’ve previously discussed. A restart usually restores a system to a simplified state, along two dimensions we’ve covered:
- Just the basics: many machines will start up in their most primitive state, with additional modules or subsystems deactivated. As noted, fewer subsystems in use lessens the chance of unwanted interactions among them.
- Configuration reset: usually, the longer a machine is on and used, the more its configuration will be changed. This means that, over time, the probability of choosing an error-prone configuration will increase. Since most machines will have a default “startup” configuration (usually designed by the manufacturer to always work), the on/off strategy can be shorthand for removing a bad configuration.
Lastly, restarting can solve another class of problems: corruption that occurs as a result of use. Over time, operators and circumstances will put a machine through its paces. Buffers and reservoirs will fill up or empty out, and cruft will accumulate. Because many machines have automated startup procedures that restore a “clean” configuration, the on/off switch may bring back a simpler state and clear away these issues.
Every troubleshooting strategy has a context in which it’s not to be used. While we may agree that the on/off switch has perhaps the highest return on investment of all available options, there are times when it shouldn’t be used. It may seem obvious, but I must point out that not every machine can be turned off without serious consequences. A respirator keeping a patient alive or the lone engine on an airplane in flight are bad candidates for the restart strategy. Besides the obvious reasons of interrupting someone’s breathing or cutting off the only source of thrust for a plane in motion, restarting a machine has another big risk: IT MAY NOT COME BACK TO LIFE AFTER A RESTART.
This is a big factor for machines that have been running for a long time. I’ve personally observed computers that had been humming away happily for years, only to die on reboot. As long as they remained in motion, things were fine; however, as soon as forward progress was interrupted, they stopped working. Humans may be the same way: our lives require a forward momentum that is sometimes not advisable to interrupt. Think of all the people who have died shortly after retiring…
So, always consider the context in which a system lives before reaching for that power switch. If the machine has been continuously running for a long time and is critical to your operations, be sure to have a good Plan B before restarting.
A Long-term Workaround…?
You may encounter problems that can be consistently solved by restarting. Would it be acceptable to use this as a workaround for the long term? Would this exclude you from considering yourself a Master Troubleshooter? Doesn’t the Master Troubleshooter always want to know why something doesn’t work?
The answer? No, relying on a restart doesn’t disqualify you. Remember that all troubleshooting decisions have an economic component. The cost of figuring out the why behind a failure may be prohibitive. If a simple reboot will right things and can be incorporated into your workflow, go for it! I’ve seen people automate restarts, rebooting a system at the beginning of every workday, before every shift, etc. Of course, make sure you take into account the full cost of unexpected interruptions before you make restarting a permanent part of your routine.
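The economic comparison can be as simple as a few lines of arithmetic. The sketch below is only a rough model with made-up numbers: it assumes each restart costs a fixed number of minutes of downtime, priced at an hourly rate, and compares that to a one-time estimate for finding the root cause.

```python
def restart_workaround_cost(restarts_per_year: int, minutes_per_restart: float,
                            hourly_cost: float, years: float) -> float:
    """Total cost of living with routine restarts over the given horizon."""
    minutes_per_year = restarts_per_year * minutes_per_restart
    return minutes_per_year * hourly_cost / 60 * years


# Hypothetical figures: one 5-minute reboot per workday (250 a year),
# downtime valued at 60 per hour, over a 2-year horizon.
workaround = restart_workaround_cost(250, 5, 60.0, 2)

# Hypothetical estimate: 80 hours of investigation at 100 per hour.
investigation = 80 * 100.0

cheaper = "keep restarting" if workaround < investigation else "find the root cause"
print(f"workaround={workaround:.2f} investigation={investigation:.2f} -> {cheaper}")
```

The model leaves out the costs that are hardest to estimate, such as a restart that interrupts something important or a machine that fails to come back, so treat the result as a starting point rather than a verdict.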