What’s Changed?

1830 et 1833 - Periodical La Caricature, 15 August 1833, Honoré Daumier (1808-1879)
Notice anything different?
(image: Wikimedia Commons, license: Public domain)

The buildup to a machine failure can involve normal wear and tear, neglected maintenance, misuse, abuse, and Acts of God. Add one more to this list: the unintended consequences flowing from changes to how a machine is used, configured, or to its environment.

Lao Tzu says:

“What is recent is easy to correct.”

The Troubleshooter says:

“What is recently changed is likely to have caused the problem.”

Making modifications to a system or its environment can be the unintentional catalyst for a meltdown. Perhaps an increased workload pushes a machine to its breaking point. Maybe a recent software upgrade contains a bug that disables a critical feature. Perchance a machine gets too hot when someone accidentally turns off the air conditioning. Experience says that looking for what’s recently changed is a great starting point for any troubleshooting exercise. Look for these types of situations:

The Floodgates Have Been Opened (Or Closed)

Change type: Workload

If a machine does the same job, day in and day out, its behavior will be fairly predictable. However, when you significantly change a system’s workload, it can be the setup for a breakdown. Even though a machine may be designed for the additional burden, an increased workload often exposes wear and tear and forgone maintenance. It’s a chipper that’s been used to mince light brush, but then is fed a small tree and has a meltdown. It’s an aging SUV that has never left the pavement, then cracks an axle on its first off-road adventure.

I can think of numerous examples from my own life where increased usage has tipped a machine over the edge. I had a paper shredder that was just fine with a single sheet of paper at a time, but died a sudden death when I tried to do 5 at a time. I owned an old car that was fine for driving around town, but wasn’t up for a long road trip. And on and on.

Surprisingly, decreasing a system’s workload can also have negative consequences. When storing a car long-term, it’s recommended that you start the car occasionally to charge the battery and keep key components lubricated. You may never have considered it, but using a car is part of its maintenance, just like an oil change! Our bodies are the same way: a minimum amount of activity is required to keep them functioning. Astronauts living in a weightless environment have to battle severe atrophy of their muscles because of their lessened usage. Waterless urinals are another great example of the unintended consequences of decreased throughput: the high concentration of acids, normally diluted by water when flushing a traditional urinal, will eventually corrode copper pipes.

Finally, the work done by a machine doesn’t have to be any different than before: simply doing more of it can cause trouble. For example, if your employer is thinking about adding a second or third shift, your machines’ usage could double or triple. Increased usage shortens the required maintenance interval: if you run a single 8-hour shift and do scheduled maintenance every 3 months, then expanding to 3 shifts (24-hour usage) will require maintenance on a monthly basis. If you don’t recognize this and stick to the old schedule, prepare for some problems!
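The shift arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and it assumes that wear scales linearly with operating hours, which won't hold for every machine.

```python
def scaled_maintenance_interval(base_interval_days, base_hours_per_day, new_hours_per_day):
    """Scale a calendar maintenance interval so the machine still sees the
    same number of operating hours between services.

    Assumption: wear is proportional to hours of use (a simplification).
    """
    if base_hours_per_day <= 0 or new_hours_per_day <= 0:
        raise ValueError("hours per day must be positive")
    operating_hours_between_services = base_interval_days * base_hours_per_day
    return operating_hours_between_services / new_hours_per_day


# One 8-hour shift serviced roughly every 3 months (90 days),
# expanded to three shifts (24 hours a day).
print(scaled_maintenance_interval(90, 8, 24))  # 30.0 days, about monthly
```

If the manufacturer publishes a service interval in operating hours, that figure is a better input than a calendar-based guess.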

A Switch (Or Bit) Was Flipped

Change types: Configuration, Maintenance, Software Version

How a machine is set up makes the difference between it doing useful work…and it doing nothing. Some systems have multiple work “modes,” and perhaps only one is useful for your purposes. This means that a well-meaning, but errant, alteration of a machine’s configuration can render it non-operational. Sometimes the erroneous modification will target a specific function while, confusingly, other features continue to work. A classic example of this is accidentally pressing the “mute” button on a TV. While the sound may be off, this doesn’t affect your ability to change channels.

You should also be suspicious of breakdowns immediately following routine maintenance. It wouldn’t be the first time a technician failed to properly reassemble a machine after a service visit!

Software upgrades should also be on your radar when looking at recent changes. Many a troubleshooting investigation has begun with these words: “So, I just got done running the updater and then it stopped working…” Reverting to a prior version is sometimes the only way to fix an issue while the programmers try to find the cause.

Don’t Be A Test Pilot

Change type: Overloading

Pushing a machine beyond its design limits will result in failures. The pressure behind these types of incidents usually comes from trying to handle too much business with too few resources. Feeling the pressure of clients and deadlines, people may respond by pushing equipment beyond its breaking point. Whenever we won a large contract, my mind would race ahead, thinking about all of the things likely to be broken as our engineers stretched our infrastructure to its limits.

Given that many machines have a built-in “margin of safety,” failures from overloading may take a while to fully manifest themselves. When the line is first crossed, nothing may happen immediately, and so further transgressions are seemingly justified. A truck designed to haul 5,000 pounds might make repeated trips loaded with 6,000 pounds before something bad happens, because damage from overuse can take a while to reveal itself. When the chaos finally rains down, the machine might even be properly loaded, causing even more confusion about the origin of the issue.

When I was studying for my pilot’s license, my instructor and I were reviewing how to calculate the crosswind component for various headings and wind speeds. The Cessna 172, the airplane on which I was training, has a “maximum demonstrated crosswind velocity” of 15 knots. “What if you exceed that?” I asked. He said, “Congratulations, you’re now a test pilot! But, don’t ever be a test pilot.”
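For the curious, the crosswind calculation mentioned above is basic trigonometry: the component of the wind blowing across your path is the wind speed times the sine of the angle between the wind direction and your heading. Here is a rough Python sketch. The function names are mine, the 15-knot figure is the one quoted above, and this is an illustration, not a flight-planning tool.

```python
import math

MAX_DEMONSTRATED_CROSSWIND_KT = 15  # figure quoted above for the Cessna 172


def crosswind_component(wind_speed_kt, wind_direction_deg, heading_deg):
    """Return the wind component perpendicular to the heading, in knots."""
    angle = math.radians(wind_direction_deg - heading_deg)
    return abs(wind_speed_kt * math.sin(angle))


def is_test_pilot(wind_speed_kt, wind_direction_deg, heading_deg):
    """True if the crosswind exceeds the demonstrated value."""
    crosswind = crosswind_component(wind_speed_kt, wind_direction_deg, heading_deg)
    return crosswind > MAX_DEMONSTRATED_CROSSWIND_KT


# A 20-knot wind from 30 degrees off the nose is about a 10-knot crosswind.
print(round(crosswind_component(20, 30, 0), 1))
# The same wind blowing straight across the path is a 20-knot crosswind.
print(is_test_pilot(20, 90, 0))
```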

Pipes
Whether you use pipes, cables, or conveyor belts, when you connect machines together, a change in one might cause unintended consequences in another.
(image: Bob Duran, license: CC BY 2.0)

The Environment

Change type: Context

A change that causes a failure doesn’t necessarily have to originate within a machine itself. Rather, it can come from the environment in which the machine is installed.

The Weakest Link: this is where a system in a workflow chain has changed its output, affecting the input fed to a machine downstream. For instance, imagine two machines connected by a conveyor belt: one crushes gravel and the other puts the gravel into bags. The bagging machine requires the stones to be less than a certain size or else it will become clogged. However, the crushing machine can be set to produce gravel of any size. If the crushing machine is accidentally set to produce gravel that is too large, it will operate normally. However, the problem will appear downstream, at the bagging machine.

Machines that are connected have implicit expectations about inputs and outputs. If there’s no enforcement mechanism that ensures compatibility between machine interfaces, a problem can easily propagate. These kinds of failures happen in the software world all the time, where an unintentional API change can render other programs that rely on the API non-operational.
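In software, an “enforcement mechanism” can be as simple as checking an upstream setting against what the downstream consumer can accept. The sketch below borrows the gravel example. The 20 mm limit and all of the names are invented for illustration.

```python
# Hypothetical limit: the largest stone the bagging machine can handle.
MAX_BAGGER_STONE_MM = 20.0


def check_crusher_setting(crusher_output_mm, max_input_mm=MAX_BAGGER_STONE_MM):
    """Reject a crusher setting that the downstream bagger can't accept.

    Failing here, at the point of change, is easier to troubleshoot than
    a clogged bagging machine at the far end of the conveyor belt.
    """
    if crusher_output_mm > max_input_mm:
        raise ValueError(
            f"crusher set to {crusher_output_mm} mm, "
            f"but the bagger accepts at most {max_input_mm} mm"
        )
    return crusher_output_mm


check_crusher_setting(15.0)  # fine: within the bagger's limit

try:
    check_crusher_setting(35.0)
except ValueError as err:
    print(f"Change rejected: {err}")
```

The design choice is to make the implicit expectation explicit, so the error shows up next to the change that caused it instead of downstream.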

It’s Getting Hot In Here: also be on the watch for changes in environmental conditions. Changes to factors like humidity, temperature, and exposure to the elements can degrade performance and sometimes (you guessed it) cause a breakdown.

Spot The Difference

You need to identify the information sources in your workplace that can answer the question, “What’s different?” A good place to start is with the people closest to the problem: ask them if they can think of anything that has changed recently. Other repositories may include things like change logs, maintenance records, and reports summarizing work that’s ongoing or recently finished. Some companies have procedures that mandate recording the time, personnel involved, and nature of modifications made to key systems. For the digital world, there’s typically a logging option for most programs, operating systems, or devices. In software development, diff’ing two versions of a program is a great way to figure out what has changed and will turn up candidates for further investigation.
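As a small illustration of diff’ing, here is a sketch using Python’s standard difflib module to compare two versions of a configuration. The settings shown are made up; in practice you would read the two versions from files or from version control.

```python
import difflib

# Hypothetical "before" and "after" versions of a machine's configuration.
old_config = ["mode = production", "max_stone_mm = 20", "log_level = info"]
new_config = ["mode = production", "max_stone_mm = 35", "log_level = info"]


def changed_lines(old, new):
    """Return only the removed (-) and added (+) lines between two versions."""
    diff = difflib.unified_diff(old, new, fromfile="before", tofile="after", lineterm="")
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


for line in changed_lines(old_config, new_config):
    print(line)
# -max_stone_mm = 20
# +max_stone_mm = 35
```

Each line in the output is a candidate for further investigation, not a verdict.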

The best case scenario for the “What’s changed?” strategy is this: you will discover a difference, use that as a starting point for your investigation, and eventually be led to the underlying cause. However, it doesn’t always work out like that. If you’re dealing with a Black Box, your understanding might be limited to “this change is causing the failure.” That is, you won’t know why the change is having the effect it does. Knowledge has a price: the cost of fully understanding cause and effect might be prohibitive. The implication of “spot the difference” is that the change can be put back to its former state: that might be your only workaround if the why is elusive.

Caveats

I’ve seen troubleshooters obsess over recent changes with the zeal of an Inquisitor. You’ve heard the phrase “correlation is not causation” and this is a prime example. Recent changes are a great starting point for your investigation, but don’t get so fixated on them that you miss other relevant factors. Until proven guilty, they should be treated as coincidental, not definitive proof of a particular cause.

Also, be careful to manage the human side of investigating recent changes, as uncovering information in this vein usually requires interviewing people. Be sure to have a positive “I’m just trying to understand what happened” attitude, lest you be mistaken for an interrogator. I’ve seen people fixate on a recent change as a cause for a breakdown and then proceed to adopt a blaming tone. After that drama, the investigation frequently found that the cause had nothing to do with the recent change. Oops. Playing the blame game is a bad idea if you want people to be forthcoming in the future. The changes described in the above categories usually result in unforeseen consequences (i.e., no one maliciously anticipated that changing X would cause Y to fail). The person who made the decision to make a change was probably acting with the best of intentions. Most people will gladly change their behavior going forward if they are calmly made aware of how their actions were responsible for a breakdown.

