The Big Idea

Toaster in Black and White
A toaster…
(image: Tom Hart, license: CC BY 2.0)

Exciting or mundane, every day your life is filled with purpose. You wake up, attempt to advance your goals, and then fall asleep again at night. All around you are machines, amazing inventions that can amplify your intentions; it is probably difficult to imagine your life without them. We’ve developed a symbiotic relationship with the systems at the heart of our advanced industrial civilization. When they work, you are able to accomplish amazing feats, experiencing things and enjoying a standard of living that would dumbfound your ancient ancestors. When they don’t…well, that’s why you are reading this, right?

When a broken machine is standing between you and your goals, it’s time to take action and get your hands dirty. Troubleshooting is about getting back to normal, about making something work again. Clever inventions enhance our lives in so many ways, but when they break down you can be left feeling powerless. I want to put you back in control by giving you the tools and mindset needed to have a healthy and productive relationship with the machines in your world. It doesn’t matter whether troubleshooting is your profession: we all depend on machines and are negatively impacted when they fail.

Why become better at troubleshooting? First off, it might be a matter of life and death. There are times when we literally place our lives in the hands of a machine: breathing from an artificial respirator in the hospital, or the smooth operation of the engines on an airplane that is cruising along at 37,000 feet. In these cases, the benefits of a quick resolution in the event of a breakdown are self-evident. Even when lives aren’t on the line, speedily getting a machine working again allows you to get on with living your life. Lastly, there’s nothing quite like that satisfying “aha!” moment when you get to the bottom of an issue and fix it yourself. Being effective in this way stirs the soul.

Maybe you’d like to be the hero at home by being able to fix things yourself. On the job, machines can either help or hinder your work objectives; understanding them from a troubleshooter’s perspective can be the key to your livelihood. With the right attitude and knowledge, you can easily save the day at the office, factory, farm, or shop as well. Tights and a cape are optional. For professional troubleshooters, what I will describe are the fundamental skills for keeping your job and moving up. If you manage troubleshooters (and, if you manage humans, you do), then knowing the core strategies will allow your team to do more with less. Solving problems that make others cry will be especially good for your reputation and your wallet. Don’t worry, I’m not going to ask for a cut of your inevitable pay raises. The fact that you learned it from me is payment enough.

The Crucible Inside The Cubicle

The origin of The Art Of Troubleshooting can be traced back to 2002, when I started a software company with three friends (it was called Discovery Mining, if you’re interested). Looking back, I can say that the most interesting part of my job as CTO was troubleshooting really complicated problems. And Discovery Mining had the perfect recipe for monster-sized issues: millions of lines of computer code, nearly 1,000 servers in multiple data centers, teams that were thousands of miles apart, and demanding clients spread across the globe who were always pushing the limits of our capabilities (sample quote: “I only tried to put 10 billion documents in a folder…do you think that would slow it down?”). As you can imagine, we had to deal with some very tricky problems.

Our company, a complicated web of interconnected systems, both human and machine-based, gave me a peek into the surprising complexity of our world. Surprising because simple machines, when interconnected, can behave in unexpected ways. Oh, and that “simple machine” is probably more complex than you thought, once you start to carefully observe it.

The methods described here were largely discovered in the cleansing fire of a startup experiencing tremendous growth. Some are profound, some are so simple they border on the obvious, but they all are the result of butting up against the same problems over and over again. Eventually, I began to take notice of what worked and what didn’t; from there I started to abstract some general principles of troubleshooting. While I feel our company’s high-pressure environment was unique in driving me to think deeply about troubleshooting, I’m not claiming I’m the first human being to ever use these problem-solving methods. Many were taught to me by colleagues. Others were discovered accidentally when nothing else worked.

Naming Names

Doing research for the book, I interviewed some great troubleshooters, asking them explicitly about their methods. I’ve also asked plenty of questions about fixing things in casual conversation. Yes, I’ve been known to bring it up at a party, but only after I’m buzzed: pressing a slightly intoxicated stranger for the finer details of their latest computer or car breakdown can be highly amusing. Over time, I began to notice that many of the strategies I had discovered through my own trials were also known to others.

Isar-II Nuclear Power Plant
…and a nuclear reactor. The underlying principles of fixing them are the same.
(image: Bjoern Schwarz, license: CC BY 2.0)

Discovering that the essence of what I knew was in use elsewhere, in a variety of occupations, was a big breakthrough. This confirmed my hunch that there was an opportunity to describe troubleshooting in general terms, across fields and industries. However, one thing I noticed was that even the great troubleshooters I met weren’t very aware of their own processes. That is, they were good at problem-solving, but had a hard time explaining why they were good. I’ve begun to speak using terms like “isolation,” “narrowing,” and “prerequisites for operation” (a category that includes troubleshooting’s most famous question: “Is It Plugged In?”). These concepts have come up when speaking with other troubleshooters, but rarely as conscious strategies being explicitly employed. Therefore, I think that just by bringing the core concepts to the surface and assigning names to them, we can raise the quality of troubleshooting in the world, because giving a name to something is often enough to bring it into your awareness and prompt you to adopt it. I believe that how we talk about things matters, so nothing would please me more than propagating a lexicon of terms that troubleshooters everywhere can use when discussing problem-solving.

The Goal

Unlike other works that may deal with troubleshooting a specific system or ones that have tips geared towards a particular field or industry, I’m interested in identifying the general principles needed to bring any system back to “working.” That is, an actual set of practical strategies that will help you fix anything from a toaster to a nuclear reactor. Human beings have been depending on machines for hundreds of years now, so it should be no surprise that almost every scheme you can think of to fix them has been tried! Of course, not every strategy is equal and so you’ll find the ones I’ve chosen to be tops for simplicity and effectiveness. These are the best of the best.

The other area in which I believe I can make a significant contribution to the troubleshooting arts is what I’ll call “the human side.” If there is a deeper lesson to be gleaned from my experience, it’s that all machine problems are human problems. To some people, that will be disconcerting, especially if you think that the “world of machines” is different, an area separate from humans. At first, I resisted this insight because I saw machines as a refuge from the petty goings-on of the human race. Machines might be unintelligent, but at least they weren’t malicious and inane! But woe to the troubleshooter who ignores the human side of the equation: your effectiveness will be severely limited. You can go so much further if you take into account this critical dimension of troubleshooting:

  • By becoming an expert in untangling the language people use to describe their problems.
  • By leveraging the psychology of problem-solving.
  • By getting inside the head of an operator experiencing a problem.
  • By knowing the economics behind what you produce and how that affects the resources available for repair.
  • By looking at the context in which troubleshooting takes place: not just the surrounding and supporting systems, but the team and the larger organization of which it is a part.

Lastly, and perhaps most importantly:

  • By understanding yourself, and how you best function as a troubleshooter.

Implicit in communicating all this hard-won wisdom is the belief that anyone can learn how to be a good troubleshooter. I certainly wasn’t born to fix machines; it’s a skill I’ve acquired over time. How much quicker my development would have been if I’d had access to the ideas I’ve collected for you! All the principles I’ve written about, I’ve personally seen in action. I know that troubleshooting can be learned: everything here I have either successfully taught to someone else or learned from others. If I thought that troubleshooting was just something you were destined to do, I wouldn’t have bothered writing this.

Further, I want troubleshooting to be recognized as a field in its own right. I hope that someday being acquainted with the principles here will be considered a basic life skill for anyone wanting to better themselves (like knowing how to drive a car or use a computer). The funny thing about troubleshooting is that everyone does it, but there is little formal training or even awareness of it as a discipline. I’ve never seen a class offered that was devoted to the topic in a general way. Consider this a first attempt at cataloging the entire field, a Linnaeus-inspired work to establish a framework for others to extend. I dub thee Homo Fixitus.

It’s beyond the scope of this work, but the principles here also contain lessons for problems in everyday life. If “all machine problems are human problems,” you will see applications far beyond your workshop. I leave it up to you to figure that out on your own…

Defending The Obvious

I include a lot of “simple” strategies, like asking “Is it plugged in?” However, I also point out that this simple strategy is representative of a bigger concept called “prerequisites for operation.” Even so, you may think, “All he’s doing is pointing out the obvious!” If all I accomplish here is to give you a fresh appreciation of what’s in front of you, I’ll take that as a compliment. Troubleshooting is about the solution staring you in the face, if you’d just get out of your head and focus on the situation. Unfortunately, what’s “obvious” is often ignored! If I can merely acquaint you with what’s already known to work, you’ll be well on your way to mastering the art of troubleshooting. On top of that, the original contribution I hope to make is to show you the rich and complicated implications of these “simple” and “obvious” strategies alongside a panoramic view of our advanced industrial civilization.

About The Terms I Use: Systems, Devices, and Machines

For the purpose of this work, the terms “system,” “device,” and “machine” are interchangeable. You may think of a mechanical contraption when you think of a machine, but here a “machine” can also be digital, or have no physical presence at all except as an abstract process (like a computer program). This broad definition encompasses a wide variety of things: assembly lines, computers, internal-combustion engines, network routers, airplanes, mobile phones, water heaters, nuclear reactors, software, etc. Basically, anything that accomplishes work and can malfunction.

Also, remember that “systems,” “devices,” and “machines” are themselves composed of smaller subsystems (often “machines” in their own right).

Now, Dive In

The Art Of Troubleshooting is divided into three main sections:

  • Strategies: recipes that help you get from “broken” to “fixed” in the shortest time possible.
  • Virtues: the mindset and behaviors of a good troubleshooter.
  • Cleaning Up: learning from a failure and preventing it from recurring.

*** Questions? Comments? Feel free to leave your feedback in the comments section below! ***

5 Comments

  1. I am definitely excited to read more. As a newbie to the game with no real work experience, it will be fun to learn about the broad concepts as I begin to actually apply them in practice. I have your first blog-related troubleshooting problem too. In Google Reader, words in this post are often getting smashed together for what seems like no reason at all.

  2. Thanks for letting me know about the problem with Google Reader. In GR, I was able to replicate the smashed text issue you noticed with both Firefox and Safari (troubleshooting principle at work: “duplicating the problem”).

    I'll investigate further and see what I can find out. In the meantime, I can tell you that the site looks fine in either browser when viewed at: http://www.artoftroubleshooting.com/

  3. In game development we call “duplicating the problem” reproducing.

    I very much look forward to reading more. This is wonderfully written and a thought-provoking approach. Thanks for sharing with us!

  4. I think I've figured out the missing spaces problem, which happens when there is a line break with no space after it. For example:

    yes yes blah[LINE BREAK]
    blah okay okay

    This will appear as “yes yes blahblah okay okay” in Google Reader (even though it looks fine when viewing the blog outside of GR). You can also see the lack of spaces in the raw feed at http://www.artoftroubleshooting.com/feeds/posts/default.

    I took the raw HTML, slapped it in a text editor and did some find/replace magic to add a space before every line break:

    yes yes blah[SPACE][LINE BREAK]
    blah okay okay

    After clearing my browser cache, the spaces now show up in Google Reader (as well as the feed). Victory!

  5. Nice and helpful information for me in this moment. Thanks
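
For anyone curious, the find/replace fix described in comment 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual procedure used on the blog: the function names pad_line_breaks and strip_line_breaks are made up for this example, and strip_line_breaks is a hypothetical stand-in for a feed reader that drops line breaks when rendering, not Google Reader's real behavior.

```python
import re


def pad_line_breaks(text: str) -> str:
    """Add a space before every line break that directly follows a
    non-whitespace character. Breaks already preceded by whitespace
    (including blank lines) are left alone."""
    # (?<=\S) requires a non-whitespace character right before the break;
    # (\r?\n) matches Unix or Windows line endings and is kept unchanged.
    return re.sub(r"(?<=\S)(\r?\n)", r" \1", text)


def strip_line_breaks(text: str) -> str:
    """Hypothetical stand-in for a reader that discards line breaks
    entirely and joins the lines end to end."""
    return "".join(text.splitlines())


if __name__ == "__main__":
    original = "yes yes blah\nblah okay okay"

    # Without the fix, the two words collide once the break is dropped.
    print(strip_line_breaks(original))

    # With the fix, the added space survives the dropped break.
    print(strip_line_breaks(pad_line_breaks(original)))
```

The positive lookbehind keeps the change minimal: a line that already ends in a space is untouched, so running the function twice gives the same result as running it once.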
