Is Troubleshooting A Science?

To Gillet, it was all sort of an experiment, and “as it turns out I had just enough to do it.” The results of the experiment are still inconclusive; Gillet guesses that if he attempted the crossing in the same manner ten times, that he would die on five of the attempts.

Dave Shively, The Pacific Alone

When I was a guest on the I Love Data Centers Podcast, host Sean Tario broached the topic of methodology, asking me about the “art” in The Art Of Troubleshooting. He thought the methodology of fixing things I laid out in my writing seemed more formal and rigid than what was implied by the seemingly looser term “art.” I replied that repair can be very rigorous and structured. Not everything can be fixed with a simple reboot, so there have been times when I’ve meticulously collected data, formed & tested hypotheses, and attacked a difficult problem with a disciplined regimen worthy of a Rocky montage.

I also brought up the more human aspects of repair, saying “there’s also a part of troubleshooting that’s more about experience and intuition…” (54:07). Those processes might sound fuzzier than they actually are: while hardened repair veterans seem to be guided by a divine auto-pilot, it’s actually just the deep internalization of their unique fix-it education. Over time, all the victories and setbacks, large and small, are eventually encoded into our unconscious habits.

During that interview, I used the S-word. Of course, I’m talking about science (what did you think I meant?). Since then, I’ve been thinking about the ways of the Troubleshooter; after all, method is very important to understanding how a discipline works. When I give advice about repair, I try to keep reason front and center. However, can we go further and say that troubleshooting is a “science” or that your approach to fixing things should be “scientific”?

Boldly Going Into The Unknown

Most people would rather not peer deep into the soul of a broken machine, meditating on its metaphysical qualities. Instead, they’d prefer to take the easy way out. If you could instantly tell someone exactly what’s causing a malfunction and how to fix it, surely they’d like that. After all, the quicker someone can get on with the rest of their life, the better.

Troubleshooting, in this abbreviated version, can simply mean following the prompt of a machine-generated suggestion. Acting on a “refill paper tray” alert or a “low battery” warning light are low-effort paths that reduce fixing to the simplicity of being aware and following instructions. This relentless march towards making fault-finding an automated process is vastly preferred by consumers, who acquired a product for the benefits it confers on their lives, not for the joy of repair.

What happens when the most common fix-it problems become readily known and easily solved? I’ll tell you: what’s left will be the frustrating and interesting troubleshooting conundrums that will lead you into the unknown. Out on this ledge of uncertainty, the best course of action is at best a contested matter of opinion. Luckily, this is what makes life interesting: if it were all settled and known, there would be nothing new to learn or figure out. When there’s no well-worn shop manual or checklist to follow, we reach out in the darkness for a framework to give us structure, a general direction to start heading in. At these moments of uncertainty, when we must choose from the myriad paths in front of us, we tend to fall back on the human virtues of the art of repair, like experience and intuition.

That might be fine for the best among us, but what about those with limited experience? What happens if your intuition is weak? I’ve tried to fill this gap by becoming a collector of the best troubleshooting strategies, which are general enough to insert your problem into, yet specific enough to nudge you in a direction that has worked for other fixers. However, maybe there’s a better way to repair, one based on science instead?

Science, Abused & Praised

In many of the contentious debates in the public sphere, you’ll notice there’s often a battle over who will claim the mantle of scientific approval. From nutrition to climate change to pandemics, there are points to be scored if you can convince people that the science is “settled” or “on your side.”

Unfortunately, for those advocates whose interest in science is merely a tool of persuasion, support for it can be weak and easily crumble in the wake of a contradicting discovery. When the ends (political or social goals) are considered more important than the means (science), the means are easily jettisoned. Cherry picking—emphasizing facts that benefit your cause—is common too. Please leave that tasty harvesting to the orchard.

It’s easy to see why there is a fight over who can claim scientific respectability. Science is generally well-regarded in the public’s mind; we associate human progress with the rise of science. In an era of spin, propaganda, clever omissions and paid advocacy (both hidden and overt), it’s advantageous to be associated with an impartial and verifiable truth. Saying that out loud, I’m getting excited and want in too! It seems we should at least consider that troubleshooting could be scientific…

But First, What Is It?

Let’s lay some groundwork. The dictionary has a few different definitions of “science,” but for the purposes of this article, I want to focus on the meaning that science is a body of knowledge acquired through the scientific method. From the big book of words:

science noun

3a: knowledge or a system of knowledge covering general truths or the operation of general laws especially as obtained and tested through scientific method

Merriam-Webster

If people ask “Can we use science to figure this out?” or “What does the science say?”, this definition is the one that is implied. When you say “science” to the average person, it probably conjures up images of chalkboards, test tubes, rats in cages, and people in white lab coats. That is, people using the scientific method to run experiments, and drawing conclusions from them.

That may seem obvious, but there are broader meanings of the word “science” that include much more expansive notions like a “state of knowing,” or learning around an “object of study.” Especially when it comes to debating who has “the science on their side,” I think what’s usually meant is that a conclusion has been arrived at using the scientific method. But can troubleshooters use the same tools as scientists?

Sidestepping universal solutions: as a Troubleshooter, you don’t need to fix everything, just the broken thing.
(image: Nate Bell / Unsplash)

Troubleshooting Heads One Way

Repair projects proceed toward the specific. What I mean is that troubleshooting is a relentless pursuit to learn the unique set of facts surrounding a particular problem. Behind a broad problem description like a “power outage,” “computer crash,” or “engine failure,” there can lie an infinite number of causes and remedies. This means that actually repairing a particular instance of these maladies requires an understanding of the situation’s specific underlying cause.

When troubleshooting, initial problem descriptions tend to focus on symptoms and an end result that is being thwarted. The person experiencing a malfunction is typically prevented from accomplishing something: their car won’t start and so they can’t drive to the store, their oven won’t get hot and so dinner is delayed, etc. But for any given symptom, there are an endless number of possible causes. That’s because there are only a relatively small number of possibilities for a machine’s parts to be correctly configured, while there are an infinite number of ways for it to be broken. Think of how many different possibilities could be behind a car not starting: a dead battery, empty fuel tank, faulty wiring, blown fuses, broken belts, etc.

You can see we already have a full plate here: we don’t need to add any more to the endless possibilities lurking behind a group of symptoms. However, there’s an added problem in simply communicating a malfunction to others: how do you begin to describe what you clearly don’t understand? We naturally grasp for a way to relate the problem to something we and others might know, often trying to place the problem in a known category (“power outage,” “computer crash,” “engine failure,” etc.). Using these problem groupings may get recognition from our fellow humans and give us some comfort that we’re headed in the right direction, but they’re still abstractions which will need to be turned into specifics.

Whatever the source of the haziness, the best fixers quickly recognize and drill down on these generalizations, working to discover the key details so that a solution can be found. This is inevitably a narrowing: you don’t need to fix all machines, you just need to fix this machine. This toaster is burning the toast. This computer program that calculates the payroll is crashing. This nuclear reactor is too hot. The broken system, screwed up in its own unique way, is an obstacle that impedes a specific goal: making perfect toast, paying employees on time, generating electricity, etc.

While completing a particular repair is a drive toward specificity, how you get there is not always a straight line. Troubleshooting requires the use of both induction and deduction, a clever movement between the general and the specific. You may start with some bare facts, like the symptoms of a car not starting. From there, you can “step up”, grouping and attempting to fit them into a broader form, like an “electrical problem”. Doing so may put familiar tools in your hand and nudge your feet down a known path (e.g., getting out your multimeter, testing continuity, checking the battery’s charge, looking for blown fuses, etc.).

However, if you aren’t successful, you’ll have to step “down” again and look for another model to guide your actions (for examples of effective mental models for repair, see the numerous strategies I’ve written about). Theory leads right into practice: you might have reams of possibilities knocking around in your noggin, but Troubleshooters aren’t rewarded for fixing things in their heads. Repair is an applied discipline, so abstractions are useful only to the extent that they rectify the malfunction at hand. “Does the machine work again?” provides a useful check to those who might favor theory over practice.

Science Goes The Other Direction

The output of science seems to run in the opposite direction, from the specific to the general. The most revered scientists are those that discover wide-ranging concepts that implicate huge swaths of reality. Think of Einstein’s theory of special relativity or Newton’s laws of motion: these are attempts to describe the workings of the entire universe. That’s as big as it gets!

To make these broad generalizations, scientists have to be very careful about the particulars of the things they study. Observing just one forest, one human, or one planet may not be enough to confidently say something about all forests, all humans, or all planets. That’s because just a single counterexample can bring down the whole edifice of a universal theory. If all swans are supposedly white, then you’ve got some explaining to do if you encounter a black one.

The underlying goal for most scientific experiments is to be able to take the specifics of a group of subjects and project them onto the larger world. Whether eggs are a good or bad dietary choice for just one person is of limited use to the rest of humanity (sorry, Bob). People want to know if eggs are good or bad for women. Or men (you’re welcome, Bob). Or senior citizens (how old are you, Bob?). Or babies. Or high-performance athletes (time to hit the gym, Bob).

Scientists are an observant bunch who always think that they are on the brink of a REALLY. BIG. DISCOVERY. Given that they are watchful and motivated, they are vulnerable to the dangers of heroically extrapolating from what they see. Imagine if science were simply a matter of reporting what you saw, with your observations immediately generalizable to all things and all times. Science would then just be a matter of reporting, like journalism (well, like good journalism!).

Firmly In Control, From A→B

Scientific protocols have safeguards to prevent hasty conclusions. Front and center are measures called “controls” that encourage scientific investigators to take a skeptical approach to their hunches that “A causes B”. Merely observing a link between a cause and its supposed effect is only the beginning. If your experimental hypothesis posits that A inevitably leads to B, you must also verify that the absence of A will not also cause B.

For example, if you want to see the effects of a new drug, you might give the drug to a group of ill rats and note the effect. However, you also need to withhold the drug from an additional group of similarly ill rats (the control group) and observe the results. Only by comparing the two groups can you know for sure that the drug alone is the source of the effects you observe. If both groups of rats recover from their illnesses at the same rate, then the drug’s role as the agent of change is in severe doubt. Controls establish a baseline, a benchmark for changes to be measured against.
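To make that comparison concrete, here is a minimal sketch in Python (with made-up recovery counts, purely for illustration) of how you might ask whether the treatment and control groups really differ, or whether chance alone could explain the gap. The two-proportion z-test used here is one standard tool for the job:

```python
import math

def two_proportion_z_test(recovered_a, total_a, recovered_b, total_b):
    """Compare recovery rates between two groups (e.g., drug vs. control).

    Returns the z-statistic and a two-sided p-value; a small p-value
    suggests the difference in rates is unlikely to be chance alone.
    """
    p_a = recovered_a / total_a
    p_b = recovered_b / total_b
    # Pooled rate under the null hypothesis ("the drug does nothing").
    pooled = (recovered_a + recovered_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical data: 40 of 50 treated rats recover vs. 25 of 50 controls.
z, p = two_proportion_z_test(40, 50, 25, 50)
print(f"z = {z:.2f}, p = {p:.4f}")
```

If both groups recovered at roughly the same rate, z would hover near zero and the p-value near one: the statistical way of saying the drug’s role as the agent of change is in severe doubt.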

The use of experimental controls is a cornerstone of the scientific method, the kind of thing you might learn in an elementary school lesson and then promptly take for granted for the rest of your life. Sure, there are other bulwarks to prevent the false conclusions of scientists from bothering us: there’s always peer review and the possibility of duplicating experimental results in another lab. However, I always thought it was neat that skepticism is built into the discipline of science, always encouraging investigators to be circumspect. Designing a controlled experiment at least prompts the thought, “I guess there’s a possibility that A might not cause B…”

Who’s in control of this experiment?
(image: Marion Post Wolcott / Library of Congress)

Your Life Is Out of Control

Before we draw the connection to troubleshooting, I want to show the difficulty of applying the scientific method to a common human problem: choosing how to act in the face of uncertainty and scarcity. In your endeavors as an individual human being, you are singularly involved in an amazing uncontrolled experiment. Take a conundrum young people face: selecting a profession to pursue. Let’s say you’re thinking about becoming a lawyer and wondering if that would be good for you. Can science answer this question? What would it take?

I guess we’d start by stating the question a bit more scientifically: “Will becoming a lawyer achieve X for me?” As for X, you’re free to pick any measurable metric you want: health, wealth, fame, defendants defended, cases won, homes bought, children begot, spouses divorced, etc. For our experiment, it would start easily enough: we’d follow your legal career from beginning to end, when hopefully you’d retire as Senior Partner from the Big Firm and get your gold watch (or is it a golden gavel?). We’d then assess what happened: did becoming a lawyer cause the desired outcome?

This longitudinal study of your life may be interesting and the tracked parameters could be completely objective, but it isn’t science. Not yet, at least. We also want to know: did choosing to become a lawyer, a choice you had among many professions, solely cause the various outcomes, good or bad? That’s because, when deciding to become a lawyer, you’re implicitly turning down other professions—which may be even better for you! The problem with just observing your career as a lawyer is that these alternatives are not being examined and compared.

Establishing causation from a single observation is problematic, because no variables are being manipulated. Merely noting the outcome of a process, without isolating and changing the inputs, invites speculation and disagreement on what would happen if you did—change the inputs! Luckily, isolating and measuring the effect from a suspected cause are exactly what a well-designed experiment tries to achieve. So, what would it take to get scientific with our career dilemma? Now, we’re talking! Let’s add some controls and play with some variables: we need to have you not become a lawyer (i.e., choose another career) and track those paths as well. In short, we want to compare the lawyer you with the not-lawyer you.

Our modest plan: we’d start by cloning you. Let’s take 1,000 of yourself and send yous on your merry way to becoming a lawyer. We would obviously need to send you to different law schools, as it would be very awkward to run into yourself around campus. For 1,000 more clones, we’d choose a variety of alternative careers: hippie jam band festival organizer, plumber, repo man, museum curator, pet detective, etc. Oh, and add another 1,000 clones to sit home and slack off, just in case doing nothing somehow leads to life satisfaction.

Over the years, we’d periodically note some key statistics for all of the various yous: height, weight, marital status, offspring, annual earnings, drug and alcohol use, etc. Finally, at the end of 40 years, we’d analyze all the data. Given that we’re going to send you back in a time machine, we would have the luxury of waiting until the very end to analyze all the results from all your clones (it’s going to take a while to interview them). Plus, you’d be free to change your mind, at any point, on what it means to have a “good career”: along the way, we could easily add data points to collect and questions to ask in our “exit” interviews. Great! Now, hop in, set the dial to the year you graduated from high school, accelerate to 88 miles per hour, and nudge your former self to make the best decision.

This all sounds like a bad sci-fi comedy, and that’s because we’ve chosen the wrong tool: the scientific method seems unsuitable for this particular dilemma. You have just a single life to live and this singular nature is not well-suited to simultaneously testing multiple career paths. You can’t ask, “Does science say becoming a lawyer is the best option for me?” and then forget to run a controlled experiment. That’s a bush-league move: you’d have to hand over your white lab coat if you made such an amateurish mistake. Minus controls, we can’t know the most basic of counterfactuals; therefore, science will have difficulty saying that becoming a lawyer was uniquely the cause of anything, good or bad, in your life.

Since we don’t have time machines, another huge problem with “living scientifically” is the omnipresent opportunity cost of simply being alive. You can’t go backwards in time for a do-over, so every moment expires all the possibilities of what could have been if you had taken a different path. Your teenage self that’s seeking career guidance needs it now. Knowledge gained through scientific experimentation may help you in the future, but it can’t be retroactively applied to the past. The creation of scientific insights has a price: setting up an experiment takes resources and time, a cost which you may be unwilling to bear. Lastly, goal-oriented actors (like humans) constantly adjust to incentives and feedback. An inanimate object, like a rock, doesn’t respond to its environment like we do (pro tip for scientists: check out those nifty rocks—at least they’ll do what you tell ’em). If you found out that you hated practicing law, you wouldn’t suffer through a 40-year legal career just to complete an “experiment”!

Of course, that doesn’t mean that we just throw up our hands; the question of a career path can be answered in many different reasonable ways that are feasible and affordable (and don’t require a time machine). You can read biographies of famous lawyers, interview practicing attorneys, get an internship at a law firm, attend a trial at your local courthouse, and tour law schools. You can research starting salaries, analyze job placement rates for various law schools, join a moot court team, and talk to recent law school graduates. You can take a test that matches your personality type with the most suitable occupations, get the opinion of an industrial psychologist, and ask your friends and family if they think it’s a good fit. And, if being a lawyer doesn’t work out, you can always choose to pursue something else. You can do all of this, but it’s not directly applying the scientific method to your question.

Let’s see, we could try this spare. Or…conduct a double-blind randomized controlled experiment, then wait for our paper to be peer reviewed and for the results to be replicated by labs around the world. Your call.
(image: John Collier, Jr. / Library of Congress)

The Clock Is Ticking

When it comes to repair and science, you’ve probably already drawn the parallels with our confused career-seeker. For starters, troubleshooting problems are usually singular in nature. You might have one broken car, or one spewing printer, or one errant computer; it’s rare to have 100 broken cars, 100 spewing printers, or 100 errant computers (the kind of numbers needed to run a controlled experiment and get a statistically significant result). Our resources are finite and we employ machines for our purposeful ends, so it’s rare to have copious extras on hand to test counterfactuals and provide adequate experimental controls. Also, just like our aspiring attorney, we have time constraints: we need things fixed now. That broken machine is preventing you from accomplishing something important, so the favored solutions will be expedient (i.e., the opposite of the slow, careful introspection needed for scientific exactness).

I do want to point out that there are scenarios where repair can successfully interact with the scientific method. This typically happens in large-scale industrial situations, where you’re dealing with masses of identical components. For example, imagine that you run a data center, with thousands of computers containing many more thousands of hard disk drives. You buy in bulk, so you have many identical drives of the exact same make and model being used in these servers. Plus, the sheer size of your deployment makes it economical for you to attempt and track repairs in-house. This is a great setup for a rolling, controlled experiment!

That’s because a data center with thousands of hard drives will have failures—constantly. I managed a server farm that would be considered modest by today’s standards, and not a week would pass without at least one disk giving up the ghost. You can see how this setup would be ideal to practice science: with each new failure, you could experiment with various types of repair (replace the platter? controller? connector? firmware?). If you were tracking operational data on your drives, you could begin to compare the effectiveness of these different kinds of repairs as you put the drives back in service and monitored their health. Finally, that pile of broken hard drives in the corner—there’s your control group!
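If you actually ran this rolling experiment, the bookkeeping could start out very simple: group each drive’s outcome by the type of repair attempted and compare the survival rates, with the unrepaired drives serving as the control. A minimal sketch, with an invented log format and field names chosen purely for illustration:

```python
from collections import defaultdict

# Hypothetical repair log: which component was swapped, and whether the
# drive survived 90 days after going back into service.
repair_log = [
    {"repair": "controller", "survived_90d": True},
    {"repair": "controller", "survived_90d": False},
    {"repair": "firmware",   "survived_90d": True},
    {"repair": "firmware",   "survived_90d": True},
    {"repair": "connector",  "survived_90d": False},
    {"repair": "none",       "survived_90d": False},  # the pile in the corner
]

# Tally outcomes by repair type; "none" is the control group.
outcomes = defaultdict(lambda: {"survived": 0, "total": 0})
for record in repair_log:
    stats = outcomes[record["repair"]]
    stats["total"] += 1
    stats["survived"] += int(record["survived_90d"])

for repair, stats in sorted(outcomes.items()):
    rate = stats["survived"] / stats["total"]
    print(f"{repair:>10}: {stats['survived']}/{stats['total']} survived ({rate:.0%})")
```

With enough failures flowing through (and a data center will supply them), the same logic as the rat experiment applies: a repair technique only earns its keep if its survival rate beats the baseline.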

Beyond end users, manufacturers have both the incentive and scale to pull off scientific troubleshooting studies. Again, if you’re dealing with masses of duplicates, you have an ideal situation for the segmentation needed to test alternative repair theories. Before a new product is born, the prototyping and testing phase offers ample opportunities to track breakdowns and the efficacy of various responses to them. When a product is released “into the wild”, real world usage by customers offers even more chances to collect and analyze failure data in a scientific way. Automobile makers are a great example of this setup: there might be millions of a particular model on the road, and a network of dealership service departments collecting data could be your “lab notebook”. Just like a control group in an experiment, the identical nature of mass-produced goods provides a clear lens to observe the result of various changes: over time, the completely homogeneous group of test subjects can highlight the effectiveness of different ways of dealing with a failure.

Another field that is conducive to a quasi-scientific troubleshooting approach is software. When I was writing code, there were many opportunities to use controls, just like in a real laboratory. I could take a malfunctioning program, attempt to fix it by making a single modification (note: the “change just one thing at a time” troubleshooting principle is straight out of the science playbook), and then test it against prior versions. Running these tests in the same environment (hardware & operating system version) provided consistency to the results; this would be analogous to making sure a lab was cleaned and set up the exact same way for each round of an experiment. Since I worked on a computer cluster, running thousands of tests simultaneously was both fast and easy. This way of testing software has all the elements of a controlled experiment: manipulating a single variable (making a change to the program), gauging its effects (examining the altered program’s output), then comparing the results to the control group (noting differences to prior versions of the program, without the modification). This methodology was great at isolating the consequences of a particular change—exactly what scientific experiments aim for!
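Here is a toy version of that harness. The payroll example and function names are invented for illustration; the point is the shape of the experiment: identical inputs, exactly one modified variable, and a comparison against the control version:

```python
# Control group: the unmodified baseline program.
def payroll_v1(hours, rate):
    return hours * rate  # no overtime handling

# Experimental group: the same program with a single change applied.
def payroll_v2(hours, rate):
    if hours > 40:  # the one modification: time-and-a-half past 40 hours
        return 40 * rate + (hours - 40) * rate * 1.5
    return hours * rate

# Same inputs, same environment: any difference in output can only
# come from the change we made.
test_cases = [(38, 20.0), (40, 20.0), (45, 20.0), (60, 15.0)]

for hours, rate in test_cases:
    control, modified = payroll_v1(hours, rate), payroll_v2(hours, rate)
    flag = "DIFFERS" if control != modified else "same"
    print(f"hours={hours:>2}: v1={control:8.2f} v2={modified:8.2f} [{flag}]")
```

Scale that loop up to thousands of test cases fanned out across a cluster, and you get the same isolation of cause and effect that a laboratory achieves with its controls.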

If you are fortunate to be working on a system that has genuine science-based troubleshooting data available, please take full advantage. But, I suspect that this won’t be the case for most repairs you undertake in your life. Most fixes, even those recommended by manufacturers, are discovered through trial and error, or are simply the application of logic to a Machine Model (“this is how it’s supposed to work, so do this when it breaks…”). It would be great if every fix could be verified through a controlled experiment, but again, scarcity has the last say. There is a cost to create truly scientific repair information, and likewise for you to seek out and employ it in your hour of need.

