Zen And The Art Of Routine Maintenance

Fighting a fire
As a skilled troubleshooter, it’s great when you can step in and be the hero. Better still is to prevent trouble from ever starting.
(image: Andrew Magill, license: CC BY 2.0)

I’ve heard that being a doctor can sometimes be a downer because all day long you are seeing people at their worst. No one randomly schedules an appointment with their physician to enthusiastically reveal how good they’re feeling!

Likewise, if you’re called on to troubleshoot, it’s because something has gone wrong. Your job or business may be on the line. Troubleshooting can be fun, especially if you bring a sense of curiosity and discovery to the enterprise. However, the baseline experience usually involves someone who has been inconvenienced by a machine failure. We only bother to fix things that matter to someone.

All this drama can be an invitation to heroism, but it’s best to not get hooked on the excitement. If you find yourself in crisis mode all the time, you’re either an adrenaline junkie or haven’t done enough preventative maintenance. You can look at expert troubleshooting skills like having a black belt in Kung Fu: it’s nice to know that you’ll be able to kick some butt when the time comes. However, if you find yourself in drunken bar brawls all the time, you’ve probably tuned out when your sifu emphasized self-control and peaceful ways to avoid confrontation.

In his introduction to The Art of War, Thomas Cleary relates this parable relevant to our topic:

According to an old story, a lord of ancient China once asked his physician, a member of a family of healers, which of them was the most skilled in the art. The physician, whose reputation was such that his name became synonymous with medical science in China, replied,

“My eldest brother sees the spirit of sickness and removes it before it takes shape, so his name does not get out of the house.

“My elder brother cures sickness when it is still extremely minute, so his name does not get out of the neighborhood.

“As for me, I puncture veins, prescribe potions, and massage skin, so from time to time my name gets out and is heard among the lords.”
The Art Of War, pg. xi

When it comes to troubleshooting, you want to be like the eldest brother. Routine maintenance is your opportunity to prevent trouble before it has a chance to take shape.

Maintenance Windows

A maintenance window is a prearranged time when preventative maintenance is done. Because a machine is idled, work will be slowed for the duration of the window. You wouldn’t agree to this without some upside: the benefit gained is that you will reduce costly unscheduled downtime. As clients come first, don’t expect that you’ll be able to have your maintenance windows at convenient times (for you), like on a Monday morning from 9am-11am. If you can convince management of the benefits, expect something more like a Sunday night when the business is closed. Again, tradeoffs. Would you rather give up a Sunday night once a month, or see your co-workers in their pajamas because the team had to be called in on a Tuesday at 3am for a meltdown? Unless you have co-workers who look good in their pajamas, the choice is clear. For some companies, the cost of unscheduled downtime can be in the millions of dollars per hour. All it will take is a couple of high-profile incidents to sell your “radical” notion of using maintenance windows as a preventative measure.

Shiroi Koibito cookie production line - Sapporo, Japan
Any modern production environment will have more opportunities for maintenance than time will allow. You’ll need to prioritize…
(image: © Jason Maxham)

Planning A Maintenance Window

There is an art to planning and executing a maintenance window. Typically, the time you have to work with is finite and fixed on the calendar. If you are fastidious about collecting all of the possible maintenance tasks you could pursue, you may find the list exceeds your allotted window duration many times over. That’s okay: it will simply force you to decide what’s most important. Over time, you’ll get an intuitive sense of what to prioritize by this formula:

Chance of Failure × Cost of the Failure = Expected Cost of the Failure

Let’s work through an example of choosing between some possibilities for a maintenance window. You are employed at Widget Inc. and, after some costly downtime, you have negotiated a monthly 1-hour maintenance window from management. On the top of your list are the following two items:

  1. Replace the worn ball bearings on the factory’s conveyor belt:

    Spherical ball bearings @ 100 innovationer @ Tekniska Museet
    (image: Pelle Sten, license: CC BY 2.0)
  2. Upgrade the software that controls your computer-controlled routers (in manufacturing, a router is a tool used to hollow out a piece of hard material, like wood or metal):

    CNC Control Panel and Automated Machinery
    (image: Fagor Automation, license: CC BY-ND 2.0)

Given that we’ll only have 1 hour to do maintenance, how do we choose? We’ll start with a basic analysis, using the formula above. For the bearings on the conveyor belt, we’ll note that there’s only one conveyor belt in the whole factory. When it’s down, the company is effectively shut down. To bootstrap a comparison, we’ll roughly calculate the cost by spreading the company’s annual revenue evenly over the year and using that to estimate the cost of an incident. If an unscheduled bearing change in an emergency situation will take about 6 hours (¼ day) and the company makes $100 million a year, we can get a sense of how much an incident will cost:

(.25 days ÷ 365 days) × $100,000,000 annual revenue = $68,493.15

The automated routers are a different story: the company has several of these machines and if one malfunctions, the aggregate speed of production is reduced by 1/n (where n is the number of machines). There are 10 machines and so if 1 breaks down, the slowdown will be:

1 ÷ 10 = 10% reduction in line speed

Let’s say it takes one day to repair a router in the case of an unplanned outage. Now, we can estimate the cost of downtime:

(1 day ÷ 365 days) × (1/10) × $100,000,000 annual revenue = $27,397.26 per incident

At this point, if only one maintenance item could be accomplished, you’d be biased towards servicing the bearings ($68,493 > $27,397). By the way, the cost of the 1-hour maintenance window itself can be calculated as well:

((1 ÷ 24) days ÷ 365 days) × $100,000,000 annual revenue = $11,415.53 per 1-hour maintenance window

That’s a pretty expensive maintenance window! It would imply that incidents with an expected cost less than the cost of the window shouldn’t be given time for maintenance. However, there are many other things to consider when calculating the “cost” of downtime. Idling your workforce, missing deadlines, and many more factors will result in expenses both tangible and intangible (e.g., loss of goodwill with a key client).
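The three estimates above are easy to reproduce in a few lines of Python. This is only a sketch of the revenue-spreading shortcut used in this example: the function name `downtime_cost` and its parameters are my own labels, and it deliberately ignores the intangible costs just mentioned.

```python
ANNUAL_REVENUE = 100_000_000  # Widget Inc.'s annual revenue from the example
DAYS_PER_YEAR = 365


def downtime_cost(days_down, capacity_lost=1.0, annual_revenue=ANNUAL_REVENUE):
    """Spread annual revenue evenly over the year and charge for the lost share.

    days_down: length of the outage in days (6 hours = 0.25)
    capacity_lost: fraction of production lost (1.0 = whole factory stopped)
    """
    return (days_down / DAYS_PER_YEAR) * capacity_lost * annual_revenue


# The only conveyor belt is down for 6 hours, so the whole factory stops.
bearing_incident = downtime_cost(0.25)
# 1 of 10 routers is down for a full day, so 1/10 of capacity is lost.
router_incident = downtime_cost(1, capacity_lost=1 / 10)
# The 1-hour maintenance window itself.
window = downtime_cost(1 / 24)

print(f"Bearing incident:   ${bearing_incident:,.2f}")
print(f"Router incident:    ${router_incident:,.2f}")
print(f"Maintenance window: ${window:,.2f}")
```

Swapping in your own revenue figure and outage lengths gives a rough first ranking of what an hour of downtime is worth on each system.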

These cost of downtime calculations are a good start, but you may have noticed one critical piece of information missing: the relative probability of each kind of failure. You may be drawn to changing the bearings, given that a bearing failure on the conveyor belt costs 2.5 times as much as a router failure ($68,493 ÷ $27,397 = 2.5). But if a router’s rate of failure were 2.5 times that of the bearings, the two would exact an equal cost. One more thing to think about is the relative number of each item within your total infrastructure. Let’s say that there are 20 bearing units in the conveyor belt (and 10 routers, as previously noted). Our calculations need to take into account that there are twice as many bearing units that could fail.

We’ve come a long way in our analysis, but our equations are still missing some information:

  • Expected cost of all bearing failures = 20y × $68,493
  • Expected cost of all router failures = 10z × $27,397

Where y and z represent the rate of failures of the bearings and the routers, respectively. Even at this point, we can do some quick math to guide our decision. What would the failure rates of bearings and routers have to be for their total cost to be equal?

10z × $27,397 = 20y × $68,493

z × $273,970 = y × $1,369,860

z = ($1,369,860 ÷ $273,970) × y

z = 5y

For these two scenarios to be equally costly, a router would need to be 5 times as likely to fail as a bearing unit. Even without historical failure data, you can weigh this break-even ratio against your own experience with bearings and routers: it’s one more way to evaluate your options.
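The break-even algebra above can also be checked numerically. Here is a minimal sketch, using the rounded dollar figures from the text. The function names are mine, and the failure rates at the bottom are made-up placeholders for illustration, not data from Widget Inc.

```python
BEARING_UNITS = 20
ROUTERS = 10
COST_PER_BEARING_INCIDENT = 68_493  # rounded figures from the text
COST_PER_ROUTER_INCIDENT = 27_397


def expected_cost(units, failure_rate, cost_per_incident):
    """Number of units x per-unit failure rate x cost of one incident."""
    return units * failure_rate * cost_per_incident


def break_even_ratio():
    """Ratio z / y at which routers and bearings have the same expected cost."""
    return (BEARING_UNITS * COST_PER_BEARING_INCIDENT) / (
        ROUTERS * COST_PER_ROUTER_INCIDENT
    )


ratio = break_even_ratio()
print(f"Break-even z/y: {ratio:.2f}")

# Hypothetical rates for illustration only (failures per unit over some period).
y = 0.01  # assumed bearing failure rate
z = 0.08  # assumed router failure rate: 8x the bearing rate, above break-even
bearings_total = expected_cost(BEARING_UNITS, y, COST_PER_BEARING_INCIDENT)
routers_total = expected_cost(ROUTERS, z, COST_PER_ROUTER_INCIDENT)
print("Service the", "routers" if routers_total > bearings_total else "bearings", "first")
```

With these assumed rates the routers win, because 8 is above the break-even ratio of 5. Plug in your own estimates for y and z and the comparison may flip.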

Known Unknowns

When the chances of various failure scenarios are unknown, as in our example above, choosing maintenance tasks will be a judgement call. If you are just starting to track your breakdowns in a systematic way, you’ll need to estimate and then revise later as the evidence accumulates (see “Is This Normal?” for ideas and inspiration on data collection). Even if you have been collecting data, there will be some failures that have never happened, meaning you’ll be trying to prevent the unknown. If that seems like an amorphous thing to advocate for, you are backed by the collective experience of millions of technicians who know that an “ounce of prevention is worth a pound of cure.” It’s better to deal with a machine on your terms, not in the middle of something important like trying to meet a deadline. If you forgo regular maintenance, you’re basically letting the machine decide how it will inconvenience you. Personally, I’d rather be in control of the timing. Where to start? If you’ve been around your organization for any length of time, you’ll probably have an intuitive sense of where to devote your maintenance resources. Otherwise, do an inventory and rank your systems according to cost of downtime.
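Once you do start tracking breakdowns, turning the log into a ranking takes very little code. The sketch below is one possible way to do it, not a prescribed method: the breakdown log and the 12-month observation period are invented for illustration, while the unit counts and incident costs are the rounded Widget Inc. figures from earlier.

```python
from collections import Counter

# Hypothetical breakdown log: one entry per incident, naming the system that failed.
breakdown_log = ["router", "bearings", "router", "router"]
MONTHS_OBSERVED = 12  # assumed length of the log

# Units installed and cost per incident, taken from the Widget Inc. example.
inventory = {
    "bearings": {"units": 20, "cost_per_incident": 68_493},
    "router": {"units": 10, "cost_per_incident": 27_397},
}


def summarize(log, inventory, months):
    """Estimate a per-unit failure rate and an expected monthly cost per system."""
    incidents = Counter(log)
    summary = {}
    for name, info in inventory.items():
        rate = incidents[name] / (info["units"] * months)  # failures per unit per month
        summary[name] = {
            "rate_per_unit": rate,
            "expected_monthly_cost": info["units"] * rate * info["cost_per_incident"],
        }
    return summary


summary = summarize(breakdown_log, inventory, MONTHS_OBSERVED)
ranked = sorted(summary, key=lambda n: summary[n]["expected_monthly_cost"], reverse=True)
for name in ranked:
    print(f"{name}: ${summary[name]['expected_monthly_cost']:,.0f} expected per month")
```

Rerun it as the log grows, and the estimates revise themselves. A system with no recorded failures will show a rate of zero, which is exactly the “never happened yet” case where judgement still has to fill in.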

The Right Frequency

If you’ve ever changed the oil in your vehicle, congratulations, you’ve conducted a maintenance window. You set aside time (maybe when you passed an oil change shop and looked at the “Next Oil Change” sticker in shock) and your car was unusable while the oil was being changed. You weighed the inconvenience of spending time drinking bad coffee and reading old magazines in the waiting room of a service station against the cost of an engine failure.

Speaking of changing your oil, that’s a great example of where you have to decide how closely you adhere to a manufacturer’s guidelines. Although every mechanic I’ve talked with endorses frequent oil changes, they are quick to admit that your owner’s manual may recommend a frequency that is excessive. You might notice the difference if you drove your car for 2 million miles, but consumers (buying new) only keep their cars for an average of 6 years.

In my experience, I’ve seen recommended maintenance intervals range from opportunistic revenue generators to woefully inadequate (or non-existent). On the opportunistic side, my printer is always telling me to change the ink cartridges. I’m not surprised, given that the liquid inside is worth more than gold. Parts and service can be an important revenue stream for a company, so don’t be shocked when encountering aggressive replacement schedules. Given the economic incentives at play, be sure to deploy an appropriate amount of skepticism and double check the need against your own failure data. Also, keep in mind that routine maintenance doesn’t have to be expensive: I would do a quick visual inspection of our office infrastructure on a weekly basis that took no more than 10 minutes.

On the other side of the coin, you may find a manufacturer recommends a maintenance schedule that allows an unacceptable number of failures. Maybe their recommendations are geared towards the “average” customer, while your usage profile is more strenuous. I’ve also seen systems that don’t come with any maintenance advice at all. If you are dealing with a custom-made machine that you cobbled together, obviously the ideal maintenance schedule is something unknown that will need to be discovered. By you!

Go Ahead And Let It Fail

If you treat it as a learning opportunity, you don’t need to fear failure. This applies to life as well as routine maintenance. When Discovery Mining first started out, one of the most annoying things to maintain was the printer. The darn thing was always breaking down at the most inopportune times, usually right before an important sales meeting. Of course, when that happened, it was a “drop everything” emergency that required immediate attention.

I’ve talked before about “listening to machines” but this printer was a real chatterbox when it came to error messages. Just like being trapped in a boring conversation at a party, you eventually tuned it out. It wanted everything to be replaced, on a constant basis. At first I was responsive, but there didn’t seem to be a tight correlation between replacing parts and preventing breakdowns. I began to suspect the rampant “replace X, it will fail soon” error messages were more of a revenue-enhancing ploy by the printer company.

My solution was to get a second printer. This put an end to the crises: if one of the printers was malfunctioning, you could always use the other one. I also started to note (but not act on) the various error messages, instead letting each printer go until it couldn’t print any more. Doing this, I began to get a better sense of what was required to keep these printers up and running. That is, which preventative maintenance items were worth my time. The buffer provided by the second printer gave me the breathing room to learn what was really going on.

Adding a layer of redundancy is a general method to enhance (or even replace) your routine maintenance program. If the cost of failures is low and swapping is easy, then consider the very efficient “do nothing” protocol (patent pending). If you think about it, this is the scheme most people apply to their TV remote control. You don’t test the batteries in your TV remote every day (if you do, get help now!). When the batteries die, you simply replace them. Except for the pain of having to leave the couch, a dead battery in this context is no big deal. Another bonus is maximizing the use of your resources: you get the absolute most out of something by using it until the end.
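To put a rough number on why the second printer ended the crises, here is a small sketch. It assumes each copy fails independently of the others (a simplification: a shared power outage would take out both), and the 5% figure is a hypothetical placeholder, not something I measured.

```python
def chance_all_down(p_down, copies):
    """Chance that every copy is down at once, assuming independent failures."""
    return p_down ** copies


p = 0.05  # hypothetical: a single printer is out of action 5% of the time
for copies in (1, 2, 3):
    chance = chance_all_down(p, copies)
    print(f"{copies} printer(s): {chance:.4%} chance of having none available")
```

Each extra copy multiplies the chance of a total outage by p again, which is why a single spare buys so much breathing room when swapping is easy.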

Put The Routine In Your Maintenance

I’ll leave this discussion by noting that periodic maintenance requires discipline. If you’ve taken the time to study a system and determine that maintenance will head off trouble, but then fail to do it… This is the most tragic kind of failure: when you know but fail to act. Make it easy on yourself and set up automatic reminders. I also favored vendors that were good at keeping to a maintenance schedule without me badgering them. Find people like that, and you’ll vacation with confidence!


