The gold standard for troubleshooting is duplication of the problem. Simply put, duplication is the ability to reliably recreate a particular failure condition. The word “reliable” is key: it should be something you can summon forth on demand every time. Duplication is the beginning of a simple but powerful strategy that includes changing variables one at a time until you find the offending part, subsystem, or configuration setting. Reliable duplication gives purchase to your attempts at isolating the cause, allowing you to verify after each change whether or not you’ve fixed the problem. Reliable failures (seems like an oxymoron, eh?) make certain that, when you make a change and it begins to work, you’ve found the cause!
Please note that the machine must be nearly working for duplication to be useful. To illustrate why, imagine a car that has sat unused for 50 years. When you go to start it, nothing happens. You try again, with the same result. Feeling good, you give it a third try and a smile creeps across your face. “Aha!” you think, “I’ve just duplicated the problem!” Yes, you have, but I’m afraid that it won’t help you very much in the context of a car that hasn’t run in 50 years. That’s because the number of things likely to be wrong is so large that a duplication-plus-isolation strategy will only tell you that…nearly every single component needs to be repaired or replaced. A project like that is much closer to manufacturing (making something work for the first time) than troubleshooting.
Contrast the above scenario with a car that stopped working 5 minutes ago and whose state of not working is 100% reliable. Now we’re talking about a situation where duplication will be useful. A machine that stopped working recently is likely to have only a few things wrong. This is the type of scenario where you can start to make progress by combining duplication with a “change just one thing at a time” strategy.
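For readers who think in code, the strategy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `isolate_one_at_a_time`, the dictionary of settings, and the `reproduces` callback (a stand-in for physically retrying the machine) are all hypothetical and not part of any real tool.

```python
from typing import Any, Callable, Dict, List


def isolate_one_at_a_time(
    failing: Dict[str, Any],
    known_good: Dict[str, Any],
    reproduces: Callable[[Dict[str, Any]], bool],
) -> List[str]:
    """Return the settings whose change, on its own, makes the failure go away."""
    # Confirm the failure can be duplicated before changing anything.
    if not reproduces(failing):
        raise RuntimeError("failure does not reproduce; isolation would be guesswork")

    culprits = []
    for name, good_value in known_good.items():
        if failing.get(name) == good_value:
            continue  # this setting already matches, nothing to change
        trial = dict(failing)      # fresh copy, so exactly one variable differs
        trial[name] = good_value
        if not reproduces(trial):  # verify after each single change
            culprits.append(name)
    return culprits


# Hypothetical example: a car that stopped working recently.
failing = {"fuel": "old", "battery": "weak", "spark_plug": "fouled"}
known_good = {"fuel": "fresh", "battery": "charged", "spark_plug": "clean"}


def reproduces(state: Dict[str, Any]) -> bool:
    # Stand-in for turning the key; in this made-up case only the plug matters.
    return state["spark_plug"] == "fouled"


print(isolate_one_at_a_time(failing, known_good, reproduces))  # ['spark_plug']
```

Note that the sketch refuses to proceed unless the failure reproduces first. Without reliable duplication, a trial that “works” tells you nothing about whether the change was responsible.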
Dimensions For Duplication
If the failure scenario is robust, simply “trying it again” will be the only thing required to achieve duplication. However, let’s say that your attempts at duplication are initially thwarted. Before you jump to the conclusion that it’s an intermittent problem, give some thought to recreating the conditions that were present during the original failure. This may include the same:
- Time of day/week/month/year.
- Temperature, humidity, and other environmental conditions.
- Settings/configuration options.
- For mechanical machines: levels of fuel, fluids, batteries, etc.
- For digital devices: buffer levels, memory consumption, amount of network activity, other programs running alongside, etc.
- Speed/throughput/usage conditions.
- Operators: if the problem only happens when Joe is at the controls, then have Joe present during troubleshooting to do everything the same way (or have him show you so you can imitate).
Looking over this list, you may realize that you don’t know what was happening within these categories during the failure. You may also come to the conclusion that you don’t even know what the normal operating range is for these parameters in your environment. That may be an indication that it’s time to start collecting data.
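If you do start collecting data, one way to use it is to record the conditions at the time of a failure and compare them with the conditions during your duplication attempt. The sketch below assumes a handful of fields drawn from the list above; the `Conditions` class, its field names, and the 10% tolerance are illustrative choices, not a standard.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class Conditions:
    """A snapshot of the dimensions that might matter (hypothetical fields)."""
    hour_of_day: int
    temperature_c: float
    config_profile: str
    fuel_level_pct: float
    throughput_per_min: float
    operator: str


def differing_conditions(
    at_failure: Conditions, now: Conditions, tolerance: float = 0.1
) -> Dict[str, Tuple[Any, Any]]:
    """Map each mismatched field to (value at failure, value now)."""
    then, current = asdict(at_failure), asdict(now)
    diffs = {}
    for name, old in then.items():
        new = current[name]
        if isinstance(old, float):
            # Float readings count as matching within a relative tolerance.
            scale = max(abs(old), abs(new), 1e-9)
            same = abs(old - new) / scale <= tolerance
        else:
            # Everything else (hour, profile, operator) must match exactly.
            same = old == new
        if not same:
            diffs[name] = (old, new)
    return diffs


# Made-up readings: the original failure versus today's attempt to duplicate it.
at_failure = Conditions(14, 35.0, "production", 10.0, 120.0, "Joe")
now = Conditions(14, 21.0, "production", 10.5, 40.0, "me")

for name, (old, new) in differing_conditions(at_failure, now).items():
    print(f"{name}: was {old!r}, now {new!r}")
```

In this invented case the temperature, throughput, and operator differ, so those are the dimensions to recreate before concluding that the problem is intermittent.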
For Better Or Worse
While attempting to duplicate, also take notice of the things that make a problem better or worse. Especially when a failure is qualitative (i.e., “too much” or “not enough”), you need to pay attention to this aspect of problem replication. In my interview with Jamie, a former motorcycle mechanic, this technique came up while discussing electrical systems:
If it’s an electrical problem, I’ll try to cordon off circuits that might affect what I’m focusing on. Like if it’s an ignition problem and I’m not getting the spark I want. Well, let’s disconnect the headlight, because that draws a fair amount of juice. Let’s turn off the turn signals, or let’s turn them on and see if the problem increases. Removing, but also adding.
You may be drawn to things that make a symptom better, but things that make a symptom worse are just as valuable to discover! Both provide useful information and show a dependency in action. When you find a knob you can twist that affects the problem, good or bad, you are well on your way to understanding the issue.
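The “knob” idea can be written down as a small sweep: try several settings of one variable, measure the symptom at each, and see which way it moves. This is a rough sketch, assuming you can put a number on the symptom; the function names and the misfire figures are invented for illustration.

```python
from typing import Callable, Dict, Sequence


def sweep_knob(
    settings: Sequence[float], measure_symptom: Callable[[float], float]
) -> Dict[float, float]:
    """Measure the symptom at each knob setting, changing nothing else."""
    return {setting: measure_symptom(setting) for setting in settings}


def knob_effect(readings: Dict[float, float], threshold: float = 0.0) -> str:
    """Compare the symptom at the lowest and highest settings tried."""
    low, high = min(readings), max(readings)
    change = readings[high] - readings[low]
    if change > threshold:
        return "worse"      # turning the knob up aggravates the symptom
    if change < -threshold:
        return "better"     # turning the knob up relieves the symptom
    return "no effect"


def misfires_per_minute(extra_load_amps: float) -> float:
    # Stand-in for a real measurement: made-up numbers in which added
    # electrical load (headlight, turn signals) aggravates a weak spark.
    return 2.0 + 3.0 * extra_load_amps


readings = sweep_knob([0, 5, 10], misfires_per_minute)
print(readings)               # {0: 2.0, 5: 17.0, 10: 32.0}
print(knob_effect(readings))  # worse
```

Either answer, “better” or “worse”, reveals a dependency. Only “no effect” tells you to move on to a different knob.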
Time, An Important Dimension
What is recent is easy to correct.
Tao Te Ching (Verse 64)
If a long period of time passes after a machine last worked, duplication will likely be an inefficient troubleshooting strategy. Entropy will transform any working system into a pile of worthless parts over the long run:
I really want to impress upon you that time is a major factor in all aspects of troubleshooting. As a limited resource, time is something you need to budget carefully while searching for a solution. Time also affects the likelihood of problems (they increase with the age of the machine, as shown in the graph above) and the resources available to you (these decrease with the passage of time, as shown in the graph below):
Over time, the resources available to help you make a fix can vary widely. These resources include community knowledge, technician know-how, manuals, tools, and spare parts. See the graph above for a visual representation of this phenomenon: with some lag, shortly after a machine is first produced there will be a peak in the resources available to help you troubleshoot. Take the Ford Model T as an example: between 1908 and 1927, over 15 million units of this iconic car were produced. At the very moment the first one rolled off the assembly line in 1908, there probably weren’t that many people who knew how to fix them. However, by 1927, when there were millions of these cars on the road, it’s safe to say that nearly every auto mechanic working in the USA knew something about how to repair the Model T. They were everywhere! Not only that, but the tools and spare parts required were likewise ubiquitous: the simple economics of millions of these cars requiring maintenance meant that the market met this demand with abundant resources.
Fast-forward to today and how many people know how to repair a Model T? How do you get parts? It’s possible, but only a very small group of enthusiasts (antique car collectors and restorers) possess these resources. A hundred years from now, you can bet that this group will be even smaller! Anything can be repaired if you can throw enough resources at the problem, but realize that “fixing” a very old system will put you in a role similar to the original design engineers: prepare to pay the cost of discovering how to make it work, just like they did.
(Continued in Part 2: Failing To Fail)
- Header image: Mydans, C., photographer. (1935) Row of identical houses off Eastern Avenue, in Cincinnati, Ohio, showing backyard outhouses. Ohio River Valley is in the distance. United States, Ohio, Cincinnati, Hamilton County, 1935. Dec. [Photograph] Retrieved from the Library of Congress, https://www.loc.gov/item/2017759064/.
- Lao Tzu and Stephen Mitchell, Tao Te Ching: An Illustrated Journey (New York: HarperCollins, 1999), verse 64.