“There are a thousand hacking at the branches of evil to one who is striking at the root.”
Henry David Thoreau
Machines don’t exist in nature. You’ve never turned on the Discovery Channel to see a program called “Stalking the Wild Internal Combustion Engine in North Africa.” Therefore, every system failure has its root cause in decisions made by a human being (or a group of people). Let that sink in. While I sympathize with the feelings of many an anti-social engineer, who would like nothing more than to build beautiful machines while being left alone, avoiding the human context will seriously impair your ability to be an effective troubleshooter.
The crisis is over. You fixed it, the client is happy, the production line is rolling again, the money spigot is flowing. You’re done, right? Wrong. There’s another step that will take you from good to great: taking what you’ve learned from the incident and feeding it back into your organization. If you want things to improve and prevent failures from recurring, you need to make learning from your malfunctions a part of your process. Don’t leave it to chance: this “last mile” needs to be a priority and needs adequate resources.
Let me introduce the field of Root Cause Analysis (RCA). Much ink has been spilled on this topic; you could fill an entire library with books about RCA. Beyond that, there are many management systems that integrate some aspects of RCA as part of a larger regime (Six Sigma, famously adopted by GE, for example). We won’t get into all that, because this chapter is simply about making you aware of the need for something like RCA, rather than advocating for a particular system. It’s up to you to choose and experiment with a particular one. Don’t worry: an MBA isn’t required, nor will you have to spend months in boring seminars to benefit from RCA. Later on, I’ll show you a very simple and powerful version of RCA that may be all you ever need.
First, let’s go over the basics that should be included in an effective root cause analysis process. Learning from a failure requires these essential steps:
- Precisely defining the problem and explaining why it needs to be solved. People need to understand why they’re being asked to take time away from their responsibilities to participate. Defining the problem is its own challenge: I’ve seen this part of the process take considerable time because people disagree about what the problem actually is. Resolving these disagreements is worth it, because they highlight the differing perspectives people bring to the process (e.g., customer service vs. engineering vs. legal). A good way to define a problem is in terms of its external impact: this deadline was missed, this client was inconvenienced, this person was injured, etc. Focusing on the impact of a problem adds emotional weight and provides the obvious reason for pursuing improvements. You’ll need this energy because the recommendations stemming from an RCA investigation can have a large upfront cost. In other words, it’s going to be more work for somebody.
- Gathering information: data, perspectives, assembling a timeline, quantifying the impact. What happened? When? Where? How much? These details are crucial to identifying causes and will make sure you’re fixing the right thing. I’ve had investigations end prematurely because of missing information: the only recommendation you can make in these types of cases is to start collecting data so you’ll be prepared next time. Finally, quantifying the effects of an incident can be a powerful force to drive change: people not directly involved can be amazingly ignorant of the effects of a failure. You’ll need to show them “how much” to make them care.
- Tracing the problem back to the root(s). Not superficial proximate causes. Not “the car ran out of gas.” If you’re doing it right, the problem will always lead back to…you guessed it: people, processes, or policies. There can be no other answer. We already know that all machines will eventually break down; when that happens, they’re just doing what’s expected. Someone designed that machine, someone chose to deploy it for a given purpose, someone was responsible (or not) for maintaining it and on and on. The root cause of a failure is never within the machine itself.
- Generating possible solutions to prevent the failure from happening in the future. I like to cast a wide net in this phase: the more options the better.
- Identifying your resources, then choosing a path. The options you’ve generated will have varying costs and benefits, so you’ll need to decide amongst them. Or, advocate for additional resources if you feel the problem is important enough.
- Follow-up: implement, measure, review, respond, evangelize. This is the phase where many RCA projects stumble and die. Not only do you have to actually implement the recommendations coming out of the process (that is, do some work), you should also agree on a way to measure the effectiveness of your efforts. The data you collect for this purpose needs to be evaluated and responded to: be ready to change course if reality doesn’t match your theory. Finally, spreading the good news is very important for ongoing participation by your co-workers. If you can show everyone that your RCA process is making their lives better, you won’t have to bribe them with cookies to attend your meetings (although that helps too).
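The “choosing a path” step above can be sketched in a few lines of Python. This is only an illustration, not a prescribed part of any RCA system: the `Option` type, the `choose` function, and every figure below are hypothetical. It assumes you can put a rough cost and benefit number on each option and that every cost is greater than zero. The sketch ranks options by benefit per unit of cost and takes them in that order until the budget runs out.

```python
from typing import List, NamedTuple


class Option(NamedTuple):
    """A candidate fix from the 'generate solutions' step (hypothetical type)."""
    name: str
    cost: float     # upfront cost, in whatever unit you budget in (must be > 0)
    benefit: float  # rough estimate of the value of prevented failures, same unit


def choose(options: List[Option], budget: float) -> List[Option]:
    """Greedy pick: best benefit-per-cost first, skipping anything that no longer fits."""
    chosen = []
    remaining = budget
    for opt in sorted(options, key=lambda o: o.benefit / o.cost, reverse=True):
        if opt.cost <= remaining:
            chosen.append(opt)
            remaining -= opt.cost
    return chosen


# Made-up figures, for illustration only.
options = [
    Option("Option A", cost=50.0, benefit=60.0),
    Option("Option B", cost=10.0, benefit=40.0),
    Option("Option C", cost=30.0, benefit=45.0),
    Option("Option D", cost=5.0, benefit=35.0),
]

for opt in choose(options, budget=45.0):
    print(opt.name)
```

A greedy ranking like this is crude, and the estimates feeding it will be rough, but writing the numbers down makes the trade-offs visible and gives you something concrete to bring to whoever controls the resources.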
Having the backing of your organization is important before starting an RCA program. I guess you could go rogue and dish out RCA-based vigilante justice on your own: a geeky Batman armed with only a spreadsheet and the question “Why?” The problem is that implementing what you learn from any RCA process will require resources. Therefore, be sure to enlist the help of someone with decision-making power. Your co-workers need to believe that This Is Important. RCA represents change, and people fear change. They will have to attend a meeting and you know that everyone loves attending meetings (although I personally think the 5 Whys process is fun). Finally, the recommendations will likely make more work for someone (at least in the short term). “Why are we doing this?” will inevitably be asked by someone. It’ll either be done in the spirit of “My time is precious and so why should I give it to you?” or “You’re cutting into my Facebook time and so why should I give it to you?” Whether you work with people who take pride in their work or unapologetic slackers, you should have an answer that will appeal to the best in them (for the rest, you’ll have to convince them the old-fashioned way: with carrots and sticks).
Where To Start
If you’ve got a reluctant manager, offer to start with just a single issue and see how it goes. There’s no need to commit to a full-blown RCA program before seeing the value with your own eyes. When deciding what issue to tackle first, consider starting small as well. As stated previously, the best topics are those that resulted in an impact that’s easy to point to: something that harmed your customers, employees, the bottom line, or your reputation. If you’re just starting to think critically about your operations, it’s better to focus on bad things that have actually happened, versus hypothetical risks because:
- If it happened once, it’s likely to happen again. A problem that has already occurred has nominated itself from amongst all the other probabilities. Go with the flow and take this gift.
- Nothing focuses the mind and rallies the troops like a threat to your organization. It also bestows upon you the moral high ground to make change.
Assemble The Players
There’s always the question of how many people to involve in your RCA meetings. I’d aim for a “medium” amount of participation because there are two counteracting forces at work:
- Why you want to include as many people as possible: you’ll need to have enough perspectives and information in the room so the problem can be understood and learning can happen. If you’re missing too many key people, you’ll be left with questions that can’t be answered. You’ll have frustrating moments where you’ll ask a question and the room will be silent: “So-and-so knows, but he’s not here…” If it’s really important, you’ll have to pause the meeting or suspend it until the right people are present.
- Why you want to include as few people as possible: among the “resources” you’ll have to manage during any RCA meeting are your participants’ interest and caring. If you have too many people in the room, you’ll dilute this precious human commodity. There’s a natural limit to the size of an effective meeting. As the number of people grows, the logistics of managing the attention span of a large group takes its toll. A smaller, energized team that is empowered to make improvements is preferable to a stadium-sized crowd.
So, you see, the right answer is somewhere in between: a big enough group to know, and a small enough group to care.
Define The Problem
It may sound trivial, but sometimes defining the problem will be the biggest challenge for an RCA meeting.
Scope: problems defined too narrowly, without the context of their larger meaning to your organization, will lack the psychological punch to get action. “Drive shaft with serial number #2319310 failed at 3:45pm in Chilling Unit #5” may be accurate, but won’t inspire action. This goes back to the earlier point about defining the problem in terms of its impact on your organization.
Problems defined too broadly may prevent the team from taking any action at all. Things like “our clients are unhappy” or “our industry is collapsing” are too amorphous. Those kinds of large-scale problems can be fascinating to debate, but your group will likely lack the resources to pursue a solution. Also, people will tune out if they think the problem is so large that they will be unable to make a difference.
Perspective: if you bring together people from different teams, you may encounter a subtle form of finger pointing. People can be parochial in viewing a problem only from their perspective. This isn’t bad, it’s just not always useful. Framing the problem in the spirit of blame (e.g., “the maintenance department screwed up”) may contain kernels of truth, but isn’t a perspective that will lead to the best insights.
The proper perspective for framing a problem is external (i.e., the customer’s). Everyone will share a basic level of empathy for this view because their job depends on having customers, and so it places all participants on the same side. Soliciting input from a broad cross-section of people involved in an incident ensures exposure to all the relevant viewpoints. I’ve done 5 Whys by myself, but it’s much more powerful in a group setting because of this cross-pollination of perspectives.
5 Whys: Tracing Consequences Back To Roots
There’s a famous proverb about a nail:
For Want of a Nail
For want of a nail the shoe was lost.
For want of a shoe the horse was lost.
For want of a horse the rider was lost.
For want of a rider the message was lost.
For want of a message the battle was lost.
For want of a battle the kingdom was lost.
And all for the want of a horseshoe nail.
In real life, the causal chain isn’t usually this poetic, but I think you get the point: seemingly small things can have large consequences. “For Want of a Nail” is a great segue to the 5 Whys method of root cause analysis, which not only shows the impact of small causes, but also uncovers hidden ones.
The 5 Whys process was developed by Sakichi Toyoda and helped Toyota Motor Corporation become the largest automotive company in the world. Taiichi Ohno, the architect of Toyota’s Just-in-Time production system, describes 5 Whys like this:
When confronted with a problem, have you ever stopped and asked why five times? It is difficult to do even though it sounds easy. For example, suppose a machine stopped functioning:
1. Why did the machine stop?
There was an overload and the fuse blew.
2. Why was there an overload?
The bearing was not sufficiently lubricated.
3. Why was it not lubricated sufficiently?
The lubrication pump was not pumping sufficiently.
4. Why was it not pumping sufficiently?
The shaft of the pump was worn and rattling.
5. Why was the shaft worn out?
There was no strainer attached and metal scrap got in.
Repeating why five times, like this, can help uncover the root problem and correct it. If this procedure were not carried through, one might simply replace the fuse or the pump shaft. In that case, the problem would recur within a few months.
To tell the truth, the Toyota production system has been built on the practice and evolution of this scientific approach. By asking why five times and answering it each time, we can get to the real cause of the problem, which is often hidden behind more obvious symptoms.
Taiichi Ohno, Toyota Production System: Beyond Large-Scale Production
The 5 Whys process invites you to look below the surface of “proximate” causes, because acting at this level rarely leads to lasting improvements. It may be satisfying to ask “Why won’t the car start?” and answer, “Because the battery is dead.” However, stopping at the proximate cause will lead you to believe that inanimate objects have minds of their own. “It’s the battery’s fault!” would be the inevitable conclusion of this kind of thinking, but of course yelling at batteries like a madman won’t change anything. You might chuckle at that imagery, but any time a machine is blamed for a failure it’s the same thing.
If people are the source of all system breakdowns, it’s reassuring to know that they are also the solution. The entire theme of this book, that all machine problems are human problems, struck me like lightning during a 5 Whys meeting.
The Washington Monument Story
Let’s look at another example of hidden causes uncovered by the 5 Whys method, in this oft-told story about the Washington Monument*:
Problem: the Washington Monument was crumbling.
Why was it crumbling? Because harsh chemicals were being used on the monument.
Why were the harsh chemicals being used? To clean off all the pigeon poop.
Why were there so many pigeons? The pigeons were attracted to the spiders (pigeons eat spiders).
Why were there so many spiders? The spiders were attracted to the gnats (spiders eat gnats).
Why were there so many gnats? The gnats were attracted to the artificial lights turned on at dusk.
Solution: Turn on the lights at a later time. That would attract fewer gnats, which would mean fewer spiders, which would mean fewer pigeons, ultimately lessening the need to use those harsh chemicals. Problem solved!
You can see here that we asked “Why?” five times in a row and came up with an astounding result! Who would have thought that turning the lights on at dusk would lead to a crumbling monument? Also, seeing the full causal chain that led to the problem gives you many possibilities for a solution. Let’s say that turning on the lights later was not an option for whatever reason (maybe you need them on for safety reasons). You could, separately or in combination, consider reducing the number of gnats, spiders, or pigeons. This is in addition to fixing the problem at the proximate cause level by finding a pigeon poop cleaner that isn’t corrosive. It’s good to have options!
*Note: This story of the Washington Monument convincingly shows the power of 5 Whys in discovering the root cause of a problem. However, I have been unable to confirm that it’s a true story. Based on my research, the story is widely cited all over the Internet. It truly is the king of examples for Root Cause Analysis and you’ll find it in many a PowerPoint presentation. The problem is, I couldn’t find a single reference to an original source verifying that the story is real. I’ve posted a question on Quora asking about the story’s origin, but until someone comes forth with evidence of its veracity, I’d treat it as a very good myth of Management Consulting.
Update (April 14, 2015): Joel Gross has done some intriguing research that sheds light on the origin of this mystery. In “5 Whys Folklore: The Truth Behind a Monumental Mystery,” he makes the case that the Washington Monument story has its origins in an unpublished National Park Service report by Dr. Don Messersmith (“Lincoln Memorial Lighting and Midge Study”). While sharing some elements with the real account, I’d say the popular parable is more legend than fact. However, several intriguing questions remain to be answered. First off, how did we get from that unpublished NPS report to the urban legend we have today? More importantly, should we stop using the Washington Monument story to introduce people to 5 Whys? As a teaching tool, I’ve found that this drama of causality has that “aha!” factor. People instantly get the message that root causes are often buried and require probing to uncover, as well as the power of unintended consequences.
Trace The Causes
To conduct a 5 Whys meeting, it’s best to use a large whiteboard to note the progress of your investigation. Start by writing down the problem on the left-hand side of the board. From there, you’ll proceed through the 5 levels of whys, branching off as appropriate for places where there were multiple causes. When that happens, be sure to get all of them before you move on to the next level of “Why?” Don’t get fixated on the number 5 either: there’s nothing magical about that 5th Why. In practice, I’ve found that root causes can appear as soon as the 2nd or 3rd Why. Don’t go beyond that if it’s not useful. You’ll know when you’ve reached the end of a line of inquiry because the next “Why?” would result in a question that’s silly or obvious. We once played “100 Whys” and ended up with “Why do people have to eat?”
Let’s look at an example of a 5 Whys inquiry in a table format. The problem is on the left and the 5 levels of “Whys?” take you to possible solutions on the right:
|Problem|1st Why|2nd Why|3rd Why|4th Why|5th Why|Solutions|
|---|---|---|---|---|---|---|
|Our client, XYZ Inc., is angry and is considering canceling their contract with us.|The delivery deadline for our client XYZ Inc. was missed by 7 days.|Water damaged the shipment and it had to be replaced.|A water reservoir filled up and spilled onto the factory floor.|Bob wasn’t at the controls to flush the reservoir as it became full.|Bob has other work duties he’s required to perform.|Increase staffing in general or deploy extra staff when the automated flushing mechanism breaks.|
| | | | |Variable fill rate means employees aren’t aware of how much water the reservoir is currently holding.|No alerting system in place to let staff know when overfilling is imminent.|Install reservoir alerting system.|
| | | | |The reservoir needed to be manually drained.|Automated flushing mechanism has been broken for the past 2 weeks.|Automated flushing mechanism should be repaired immediately when it breaks.|
| | | |There’s no physical separator between the shipping staging area and the water containment system.|Designers didn’t anticipate the installation of the reservoir near the staging area.|Operations department didn’t communicate all probable uses of the space to the architects.|Install a barrier to contain flooding. Also, operations needs to formally document its planned use for a space when starting a new construction project.|
| |XYZ Inc. wasn’t contacted about the delay until 5 days after the deadline was missed (they could have worked around the delay if they had been informed).|The customer service department has no visibility into shipping delays.| | | |Report shipping delays to the customer service department so clients can be contacted.|
You can see that we’ve started with a negative external effect: a client is angry with us. From there, we’ve identified two 1st-level Whys: a shipping delay caused by flooding and a client communication problem. Already we’ve learned something: either getting the shipment out on time or letting the client know about the delay could have saved our bacon! This is why it’s so important to begin the 5 Whys process with the overarching Bad Thing: if you had started with the flooding, you would have missed this other aspect of the problem. Also, look at all the options we have generated to prevent the problem in the future. We can attack the problem through staffing, an alerting system, faster-responding maintenance, installing a flood barrier, or better client communication. If losing XYZ Inc.’s business is something that absolutely cannot be allowed to happen, you might choose to make all of these improvements! Most of the time, however, you won’t be able to pursue all your good ideas: you’ll need to do a cost/benefit analysis to select among them. I’ve been in 5 Whys meetings that have resulted in a long list of great possibilities for improvements, but resources only allowed us to pursue the best one.
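If you’d rather keep the results in a file than on a whiteboard, the same investigation can be recorded as a small tree. The Python below is a minimal sketch, not part of the 5 Whys method itself: the `Cause` class and the `chains` function are names I’ve made up for illustration, and the cause text is abbreviated from the XYZ Inc. example. Each node is one answer to “Why?”, its children are the answers to the next “Why?”, and the end of each line of inquiry carries a proposed solution.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Cause:
    """One box on the whiteboard: an answer to the previous "Why?" (hypothetical type)."""
    text: str
    solution: Optional[str] = None  # set where a line of inquiry ends
    whys: List["Cause"] = field(default_factory=list)  # answers to the next "Why?"


def chains(node: Cause, path: Tuple[Cause, ...] = ()) -> Iterator[Tuple[Cause, ...]]:
    """Yield every path from the problem down to the end of a line of inquiry."""
    path = path + (node,)
    if not node.whys:
        yield path
    for child in node.whys:
        yield from chains(child, path)


problem = Cause("Client XYZ Inc. is angry and may cancel", whys=[
    Cause("Delivery deadline missed by 7 days", whys=[
        Cause("Water damaged the shipment", whys=[
            Cause("Reservoir spilled onto the factory floor", whys=[
                Cause("Bob wasn't at the controls", whys=[
                    Cause("Bob has other required duties",
                          solution="Increase staffing or deploy extra staff"),
                ]),
                Cause("Variable fill rate hides the water level", whys=[
                    Cause("No alerting system",
                          solution="Install reservoir alerting system"),
                ]),
                Cause("Reservoir needed manual draining", whys=[
                    Cause("Automated flushing broken for 2 weeks",
                          solution="Repair the flushing mechanism immediately"),
                ]),
            ]),
            Cause("No separator between staging area and water system", whys=[
                Cause("Designers didn't anticipate the reservoir", whys=[
                    Cause("Operations didn't communicate planned uses",
                          solution="Install a barrier; document planned use of space"),
                ]),
            ]),
        ]),
    ]),
    Cause("Client not told about the delay for 5 days", whys=[
        Cause("Customer service has no visibility into shipping delays",
              solution="Report shipping delays to customer service"),
    ]),
])

for chain in chains(problem):
    # Skip the problem itself and print the Whys that lead to each fix.
    print(" -> ".join(cause.text for cause in chain[1:]))
    print("   fix:", chain[-1].solution or "? (needs more information)")
```

Printing every chain from the overarching problem down to a fix makes it easy to check that no branch was dropped, and a line of inquiry with no solution yet shows up as a question mark rather than disappearing.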
It All Comes Back To Us
The end result of 5 Whys will typically be the lack of something: discipline (i.e., following proper procedures), training, knowledge, resources, setting of expectations, maintenance, etc. If you find yourself still describing the goings-on of inanimate objects, you haven’t gone deep enough. Keep asking “Why?” How the problem unfolded via cause and effect is interesting, but describing the situation like a news reporter is only the beginning of the process. Your systems were designed by people, purchased by people, installed by people, maintained by people and their output sold to other people (clients or customers). Any incident is the result of a long series of choices made by people. Recognizing the role of human choice in system failures is different from placing blame: the people who made the choices leading to an incident were probably acting on the best information available to them at the time. Simply recognize that new information has come to light and needs to be integrated into your processes.
The importance of management in drawing the right conclusions from any kind of root cause analysis cannot be overstated. If you accept that the origin of all incidents was somebody’s choice, you quickly realize that the “top chooser” (the CEO, owner, Board of Directors, etc.) is ultimately responsible but, as stated before, not necessarily to blame.
A Touchy Subject
Since you’re bound to name people as causes when you do 5 Whys, you’ll need to conduct the process with tact. In the example above, we can see that Bob could have saved the day if he had been at the controls to prevent the reservoir from overfilling. People love a scapegoat, so there will be the temptation to ignore the rest of the evidence and conclude that Bob was the root cause. However, don’t be fooled: someone hired Bob, put him in a position of responsibility, overloaded him with work, and designed a system that requires constant human supervision to work properly. That changes your perspective on the true cause of this incident, doesn’t it?
If Bob was truly negligent with respect to his job duties, then perhaps he should be fired. However, if you believe that replacing Bob will stop this type of accident from ever happening again, think twice. Bob’s contribution to this incident didn’t happen in a vacuum: his employer created a situation that ensured maximum damage when there was a human error. People are guaranteed to make mistakes, so any system where you are just one lapse away from a catastrophe is poorly designed. A combination of multiple backstops, human and automated, makes it far less likely that something like this will happen again. If you look at the outcome of this 5 Whys investigation, you’ll see that both types of fixes emerged as ideas for future improvements.
There will be cases where the right answer to an incident will be a change of personnel. When that happens, double-check that you have fostered the conditions for your employees to be successful.
Empathy For Those Affected
An advocate for the client should always be present at your 5 Whys meeting. This “client advocate” could be a project manager, the salesperson who owns the account of an affected client, the person who collects the bills, or someone from a different department who is the “customer” (in the case of internal projects). The advocate will bring the client’s point of view to the process and, through their presence, the client will have “a seat at the table.” The advocate will see first-hand that you care about improving and rectifying what went wrong. Salespeople especially will feel better after witnessing your efforts to improve a bad situation. In the opposite direction, I’ve seen salespeople develop empathy for the challenges faced by the engineers who are tasked to fix a problem. This can counteract the classic tension between those who sell promises and those who deliver on them. The salesperson who is included in the RCA process will be more vigilant about not over-promising to clients. Also, what great fodder for their next sales pitch, when they can explain what went wrong and give a detailed rundown on all the improvements you’ve made!
Some Cheese With Your Whine?
At the end of a 5 Whys meeting, it’s useful to ask for feedback on the solutions that were generated during the session. The participants of your meeting can point out flaws and prevent you from pursuing fixes that embody the old saying, “The cure is worse than the disease.” Also, people get real smart when you propose creating more work for them, and when I say “smart,” I mean whiny. It’s an art to tell the difference between principled objections and the “don’t make me change anything” type of grousing. I like to call bluffs and ask “What would you have to know or see to be persuaded this is the right fix?” and then design an experiment or collect data to satisfy their concerns. If you’re right, there’s no way they can object: they told you exactly what they needed to be convinced.
Make It A Habit
Root cause analysis should be done on a regular basis. Depending on your needs, it should be on the calendar for the same time every day/week/month/quarter. Don’t have any problems in your organization that are worthy of learning more about? Contact me, I’d like to invest. For the rest, strap in for a lifetime of RCA. Having RCA occur at regular intervals means that sometimes you’ll be focusing on minor problems. That’s fine, have the meeting anyway. As a manager, I was stunned at the things I learned about my company from digging into even the “smallest” of problems—and I had been there from day one! Remember too about the “tip of the iceberg”: little problems sometimes develop into larger problems. Ruthlessly examining your small incidents may prevent you from ever having to experience a large disaster.
Too Much Of A Good Thing
If your organization is constantly in “crisis mode,” you might have enough material for a thousand 5 Whys meetings. You can schedule extra sessions, but be aware of the law of diminishing returns. Taking employees out of their daily routine is costly, and the recommendations from each investigation, while typically having long-term benefits, will involve w-o-r-k. RCA is a great way to transform your workplace, but even the most nimble and well-managed organization will have an upper limit to how much change it can absorb at any given time.
Criticisms Of 5 Whys
No process is perfect. Let’s look at some problems you may encounter when using 5 Whys and how I think you can overcome them. Stewart Anderson, in his article “Root Cause Analysis: Addressing Some Limitations of the 5 Whys,” lays out the major pitfalls:
While many companies have successfully used the 5 Whys, the method has some inherent limitations. First, using 5 Whys doesn’t always lead to root cause identification when the cause is unknown. That is, if the cause is unknown to the person doing the problem solving, using 5 Whys may not lead to any meaningful answers. Second, an assumption underlying 5 Whys is that each presenting symptom has only one sufficient cause. This is not always the case and a 5 Whys analysis may not reveal jointly sufficient causes that explain a symptom. Third, the success of 5 Whys is to some degree contingent upon the skill with which the method is applied; if even one Why has a bad or meaningless answer, the whole procedure can be thrown off. Finally, the method isn’t necessarily repeatable; three different people applying 5 Whys to the same problem may come up with three totally different answers.
Other drawbacks to 5 Whys have been cited, including the method’s inability to distinguish between causal factors and root causes, and the lack of rigor where users aren’t required to test for sufficiency the root causes generated by the method.
Stewart Anderson, “Root Cause Analysis: Addressing Some Limitations of the 5 Whys”
It’s true that the 5 Whys process won’t, by itself, reveal an unknown cause to you. Anderson’s criticisms are fair, but for me 5 Whys is just a framework for learning more about an incident. If you have a diverse set of people participating in your 5 Whys process, you’ll at least capture all the knowledge available in your organization. That’s frequently more than enough to solve most problems. Also, I don’t know of any process that guarantees revelations of causes unknown. Anderson is right that results are contingent upon “the skill with which the method is applied,” but this objection would seem to apply to all systems designed to acquire knowledge. Even our vaunted Scientific Method benefits from a skilled and disciplined practitioner who can adhere to the principles and draw the right conclusions from an experiment.
For those unknowns that can’t be discovered by talking it through, a perfectly acceptable outcome of a 5 Whys meeting is the realization that you need more information to complete your analysis. You may resolve to start collecting data, set up a monitoring system, or conduct an experiment to test a theory about the cause. I’ve suspended 5 Whys meetings to allow for additional time to gather information we identified as relevant. This was especially true for situations where we’d have an incident and no one could answer the question “What is normal?” Reconvening later, after a chance to acquire the missing information, we were better able to tackle the problem.
If you get stuck, don’t be afraid to put a big question mark as a placeholder in your 5 Whys tree and move on with your meeting. Sometimes the answer will reveal itself later. Even if it doesn’t, in those situations with multiple root causes, you can still fill in the other branches. In the example above, even if the cause of the flooding was unknown and needed further investigation, we still had the opportunity to discover the client communication problem.
You can read the rest of Anderson’s article if you’re interested in learning more about improving the 5 Whys process. I hope you eventually reach a level of sophistication where those suggestions are useful, or perhaps you’ll branch out into other systems. In the meantime, 5 Whys remains a very accessible entry point to starting your RCA journey.
Beyond Solving Problems
The benefits of RCA go beyond learning about any one incident and spill over into improved morale. Having a good RCA program sends its own message:
- You take improvement seriously.
- You care enough about your product or service that major mistakes and incidents will be thoroughly reviewed (i.e., problems will be brought out in the open, not swept under the rug).
- You value your employees: you seek and act upon their feedback while making them integral to the improvement process.
- Stewart Anderson, “Root Cause Analysis: Addressing Some Limitations of the 5 Whys,” Quality Digest, December 17, 2009.
- Taiichi Ohno, Toyota Production System: Beyond Large-Scale Production (Portland: Productivity Press, 1988), pg. 17.